strip url's out from a href thru br

gford
Forum Newbie
Posts: 5
Joined: Fri Mar 31, 2006 4:46 pm

Post by gford »

Hi all

I have a large text file and I want to strip out the URLs in it. It has the following format:

<a href='http://subdomain.domain.com'>some text</a><br>
So I want to strip all occurrences of the above from my file.

I got it all read in, etc.; I'm just having problems matching/replacing the above pattern.

I don't quite grasp how to do this using ereg_replace. I assume that's the proper function for this.

Thanks in advance for any help.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

What have you tried?
gford
Forum Newbie
Posts: 5
Joined: Fri Mar 31, 2006 4:46 pm

Post by gford »

Well, this will be embarrassing, but OK. Here is what I have tried.

Code:

$text = preg_replace("/\<a href='http://subdomain(.+?)\](.+?)\<\/br\>/is","",$contents);
Am I even in the ballpark? :P
tr0gd0rr
Forum Contributor
Posts: 305
Joined: Thu May 11, 2006 8:58 pm
Location: Utah, USA

Regex

Post by tr0gd0rr »

Yes, you are very close.

You will need to escape the forward slashes after http. Also, the br should have no forward slash.

You can find online testers such as http://www.nmitchell.co.uk/code/regexp_test.htm. I highly recommend RegexBuddy ($30) which will save you loads of time. http://www.regexbuddy.com/
Burrito
Spockulator
Posts: 4715
Joined: Wed Feb 04, 2004 8:15 pm
Location: Eden, Utah

Post by Burrito »

You have a stray closing square bracket in there too (]).
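
For reference, here is a rough sketch of the first attempt with the three corrections mentioned so far applied (slashes after http: escaped, no slash in <br>, no stray ]). The sample $contents string is made up for illustration; the real one would come from the file.

```php
<?php
// Hypothetical sample input; the real $contents is read from the text file.
$contents = "keep this\n<a href='http://subdomain.domain.com'>some text</a><br>\nkeep this too";

// / is the delimiter, so the two slashes after http: are escaped.
// The stray ] is gone and <br> is written without a slash.
$text = preg_replace("/<a href='http:\/\/subdomain(.+?)<br>/is", "", $contents);

echo $text;
```

Be aware that the lazy (.+?) runs on to the next <br> if a link is not directly followed by one, so this form is only safe when every link in the file has exactly the shape shown in the question.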
gford
Forum Newbie
Posts: 5
Joined: Fri Mar 31, 2006 4:46 pm

Post by gford »

So let me try posting again. I don't do regex very often, so I'm not sure I want to plop down $30 to solve one problem. I was hoping the community could help solve this.

Code:

$text = preg_replace("/\<a href='http:\/\/subdomain(.+?)(.+?)\<\br\>/is","",$contents);
Closer?

Which characters need to be escaped? Any non-alphanumerics?
sweatje
Forum Contributor
Posts: 277
Joined: Wed Jun 29, 2005 10:04 pm
Location: Iowa, USA

Post by sweatje »

Here is what you may be looking for (expressed as a SimpleTest test):

Code:

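function testRemoveLinksFollowedByBr() {
    // Three links; only the first and third are followed by <br>.
    // The continuation lines stay flush left so the string is unchanged.
    $str = "<a href='http://subdomain.domain.com'>This goes</a><br>
<a href='http://subdomain.domain.com'>This will stay</a>
<a href='http://subdomain.domain.com'>This also goes</a><br>";

    $expect = "<a href='http://subdomain.domain.com'>This will stay</a>";

    // ~ is the delimiter, so the slashes in the URL and in </a> need no escaping.
    $replaced = preg_replace("~<a href='http://subdomain[^']+'>((?!</a>).)+</a><br>~ms", '', $str);

    // trim() drops the newlines left behind by the removed lines.
    $this->assertEqual($expect, trim($replaced));
}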
A couple of pointers that may help you with future regexes. First, if a character such as / appears in your expression, do not choose it as the delimiter. As you can see in mine, I used ~ instead.
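
To illustrate the delimiter point, here is a small sketch that writes the same pattern twice, once with / and once with ~ as the delimiter. The $html sample line and the variable names are assumptions made for the example, and the pattern is a simplified one, not the one from the test above.

```php
<?php
// Hypothetical sample line, same shape as the one in the question.
$html = "<a href='http://subdomain.domain.com'>some text</a><br>";

// With / as the delimiter, every literal / in the pattern needs a backslash.
$withSlash = preg_replace("/<a href='http:\/\/subdomain[^']+'>[^<]*<\/a><br>/", "", $html);

// With ~ as the delimiter, the same pattern reads like the HTML it matches.
$withTilde = preg_replace("~<a href='http://subdomain[^']+'>[^<]*</a><br>~", "", $html);

// Both calls remove the whole line.
var_dump($withSlash === $withTilde);
```

The two patterns are equivalent; the second is simply easier to read and harder to get wrong.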

Second, an explicit character class with a greedy match will often perform faster than an ungreedy match on any character, hence the [^']+. Similarly, you need to stop at any </a>, not just one which happens to have a <br> behind it, so ((?!</a>).)+ grabs anything up to the end of a link, and because it always stops there, it does not need to be ungreedy.
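
A short sketch of why stopping at every </a> matters, using the same three-line input as the test above. The variable names $lazy and $guarded are made up for the comparison.

```php
<?php
$str = "<a href='http://subdomain.domain.com'>This goes</a><br>\n"
     . "<a href='http://subdomain.domain.com'>This will stay</a>\n"
     . "<a href='http://subdomain.domain.com'>This also goes</a><br>";

// Ungreedy dot: starting at the second link, it can run past the </a>
// that has no <br> after it and keep going to the next </a><br>.
$lazy = preg_replace("~<a href='http://subdomain[^']+'>.+?</a><br>~s", '', $str);

// Lookahead version: each step first checks that </a> does not start here,
// so the match can never cross the end of a link.
$guarded = preg_replace("~<a href='http://subdomain[^']+'>((?!</a>).)+</a><br>~s", '', $str);

var_dump(trim($lazy));     // the "This will stay" link is gone as well
var_dump(trim($guarded));  // only the "This will stay" link remains
```

With the ungreedy dot, the second and third links are swallowed together as one match, which is exactly the case the lookahead guards against.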

HTH
gford
Forum Newbie
Posts: 5
Joined: Fri Mar 31, 2006 4:46 pm

Post by gford »

Hi Jason,

That's awesome. I definitely learned something good from this. I appreciate the post and the advice.