
strip url's out from a href thru br

Posted: Fri May 26, 2006 6:06 pm
by gford
Hi all

I have a large text file and I want to strip out the URLs in it. It has the following format:

<a href='http://subdomain.domain.com'>some text</a><br>
So I want to strip all above occurrences from my file.

I've got it all read in, etc.; I'm just having problems matching/replacing the above pattern.

I don't quite grasp how to do this using ereg_replace. I assume that's the proper function for this.

Thanks in advance for any help.

Posted: Fri May 26, 2006 6:12 pm
by feyd
what have you tried?

Posted: Fri May 26, 2006 6:21 pm
by gford
Well, this will be embarrassing, but OK. Here is what I have tried.

Code:

$text = preg_replace("/\<a href='http://subdomain(.+?)\](.+?)\<\/br\>/is","",$contents);
Am I even in the ballpark? :P

Regex

Posted: Fri May 26, 2006 8:35 pm
by tr0gd0rr
Yes, you are very close.

You will need to escape the forward slashes after http. Also, the br should have no forward slash.

You can find online testers such as http://www.nmitchell.co.uk/code/regexp_test.htm. I highly recommend RegexBuddy ($30) which will save you loads of time. http://www.regexbuddy.com/

Posted: Fri May 26, 2006 9:23 pm
by Burrito
You have a stray closing square bracket in there too (])

Posted: Sat May 27, 2006 7:42 am
by gford
So let me try posting again. I don't do regex very often, so I'm not sure I want to plop down $30 to solve one problem. I was hoping the community could help solve this.

Code:

$text = preg_replace("/\<a href='http:\/\/subdomain(.+?)(.+?)\<\br\>/is","",$contents);
Closer?

Which characters need to be escaped? Any non-alphanumerics?

Posted: Sat May 27, 2006 10:58 am
by sweatje
Here may be what you are looking for (expressed as a SimpleTest test):

Code:

function testRemoveLinksFollowedByBr() {
    // The continuation lines of this string stay flush left so that no
    // extra whitespace ends up inside the test data.
    $str = "<a href='http://subdomain.domain.com'>This goes</a><br>
<a href='http://subdomain.domain.com'>This will stay</a>
<a href='http://subdomain.domain.com'>This also goes</a><br>";

    $expect = "<a href='http://subdomain.domain.com'>This will stay</a>";

    $replaced = preg_replace("~<a href='http://subdomain[^']+'>((?!</a>).)+</a><br>~ms", '', $str);

    $this->assertEqual($expect, trim($replaced));
}
A couple of pointers that may help you with future regex. If you have a character in your expression, like /, then do not choose it for the delimiter. As you see in mine, I used ~ instead.
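
To make the delimiter point concrete, here is a rough sketch. The sample string and the variable names are made up for illustration, and the link text is matched with a simple [^<]* rather than the lookahead used in the test above. It also touches on the earlier question about which characters need escaping: preg_quote() can escape a literal string for you, including whatever delimiter you pass as its second argument.

```php
<?php
// Hypothetical input, modelled on the format described in the first post.
$html = "<a href='http://subdomain.domain.com'>some text</a><br>\nkeep me";

// With "/" as the delimiter, every literal slash has to be escaped.
$withSlash = "/<a href='http:\/\/subdomain[^']+'>[^<]*<\/a><br>/i";

// With "~" as the delimiter, the same pattern needs no escaping.
$withTilde = "~<a href='http://subdomain[^']+'>[^<]*</a><br>~i";

// preg_quote() escapes regex metacharacters in a literal string, plus the
// delimiter given as the second argument.
$host   = preg_quote('http://subdomain.domain.com', '~');
$quoted = "~<a href='" . $host . "'>[^<]*</a><br>~i";

$a = preg_replace($withSlash, '', $html);
$b = preg_replace($withTilde, '', $html);
$c = preg_replace($quoted, '', $html);

// All three patterns remove the link and leave the rest of the text alone.
var_dump($a === $b && $b === $c);
```

The three patterns are meant to be equivalent on this input; the only difference is how much escaping you have to type and read.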

Second, an explicit character class with a greedy match will often perform faster than an ungreedy match-any-character pattern, hence the [^']+. Similarly, you need to stop at any </a>, not just one which happens to have a <br> behind it, so the match ((?!</a>).)+ grabs anything up to the end of a link, and because it always stops there, you do not need it to be ungreedy.
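
Here is a small sketch of why stopping at any </a> matters. The two-link input and the variable names are assumptions chosen for illustration; the second pattern is the one from the test above, minus the m flag, which has no effect here because the pattern uses no ^ or $.

```php
<?php
// Hypothetical input: the first link has no <br> after it and should stay.
$str = "<a href='http://subdomain.domain.com'>This will stay</a>\n"
     . "<a href='http://subdomain.domain.com'>This goes</a><br>";

// Ungreedy dot: the match starts at the first link and keeps expanding
// until it reaches the first "</a><br>", which belongs to the second link.
$lazy = preg_replace("~<a href='http://subdomain.+?</a><br>~s", '', $str);

// Character class plus lookahead: the link text can never cross a "</a>",
// so the first link cannot be part of a match.
$tempered = preg_replace(
    "~<a href='http://subdomain[^']+'>((?!</a>).)+</a><br>~s",
    '',
    $str
);

var_dump($lazy);           // string(0) "" : both links were removed
var_dump(trim($tempered)); // only the "This will stay" link remains
```

So the ungreedy version is not just slower on long input, it can also remove links you wanted to keep.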

HTH

Posted: Sat May 27, 2006 12:17 pm
by gford
Hi Jason,

That's awesome. I definitely learned something good from this. I appreciate the post and the advice.