how to strip the href part from a link?
Posted: Fri Jul 17, 2009 8:59 am
by cipicip
Hi guys,
Assuming we have an HTML page to parse, how can we remove specific links but keep the text (or whatever else the link shows)? For instance, we have this code:
Code:
<div class="postbody"><a href="http://my.test.site.com">test this</a><b>» How do I change my settings?</b></div>
<div class="postbody"><a href="http://my.test.site.com">test that</a>If you are a registered user, all your settings are stored in the board database. To alter them, visit your User Control Panel; a link can usually be found at the <a href="http://www.site.com">top</a> of board pages. This system will allow you to change all your settings and preferences.</div>
<p class="gensmall"><a href="http://my.test.com">Top</a></p>
How can all links containing "site" in their URL be removed, keeping only the text inside (e.g. "<a href="http://my.test.site.com">test that</a>" => "test that"), while leaving the other links untouched?
Thanks.
Re: how to strip the href part from a link?
Posted: Fri Jul 17, 2009 9:24 am
by prometheuzz
Try:
Code:
$html = preg_replace('#<a\s[^>]*href="[^"]*site[^>]*+>([^<]*+)</a>#i', '$1', $html);
Although IMO a more robust solution would be to use an HTML parser: when a regex stumbles over some improperly formed HTML, it usually makes a mess of the entire file, whereas a true parser will recover from it in most cases.
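For anyone who wants to try the pattern quickly, here is a minimal sketch that runs it against a shortened version of the sample markup from the question. The variable names and the trimmed sample are my own assumptions, not part of the original posts.

```php
<?php
// Shortened sample based on the markup in the question (illustrative only).
$html = '<div class="postbody"><a href="http://my.test.site.com">test that</a> text</div>'
      . '<p class="gensmall"><a href="http://my.test.com">Top</a></p>';

// Same pattern as suggested above. preg_replace() returns the new string
// rather than modifying its input, so the result has to be assigned.
$pattern  = '#<a\s[^>]*href="[^"]*site[^>]*+>([^<]*+)</a>#i';
$stripped = preg_replace($pattern, '$1', $html);

echo $stripped, "\n";
```

The first link (its URL contains "site") should be reduced to its text, while the "Top" link is left as it was.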
Re: how to strip the href part from a link?
Posted: Fri Jul 17, 2009 9:45 am
by ridgerunner
Hey prometheuzz,
You've got my curiosity up. I'm familiar with how to use the DOM within JavaScript, but it sounds like you are talking about something else. What HTML parser tools do you use/recommend?
Thanks

Re: how to strip the href part from a link?
Posted: Fri Jul 17, 2009 10:02 am
by prometheuzz
Hey ridgerunner,
I must confess that I know very little about web-related stuff, so I can't recommend a parser that I know of or have personal experience with. The reason I sometimes mention that parsing HTML with regex can be dangerous is that the original poster often comes back with some dirty HTML, asking why my solution didn't work.
It's more of a personal motto: when parsing (and/or transforming) some language that can have a recursive nature (like HTML), use a dedicated parser and don't go hacking your way through with regex. By definition, a regex describes (as the name suggests) a regular language, which cannot express arbitrary recursion*, only nesting to a fixed depth.
Of course, anchor tags cannot be nested, so you should be okay using a little regex (that's why I posted an actual suggestion), but the HTML can still be improperly formed (missing closing tags or quotes), in which case the regex will make a mess of it and a parser should not.
Keep up the good postings!
Regards,
Bart.
* Yes, I know PHP has the ability to match recursively, which IMHO is not a feature and makes one's regexes usable only by people who actually know regex (not the masses!). That makes them even more of a maintainability nightmare.
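For readers who have not seen the recursive matching mentioned in the footnote, here is a minimal sketch using PCRE's `(?R)` construct to match balanced parentheses. The test string and variable names are assumptions chosen for illustration; nothing here is specific to HTML.

```php
<?php
// (?R) re-enters the whole pattern, so every "(" has to be closed by its own ")".
// [^()]++ consumes anything that is not a parenthesis, without backtracking.
$balanced = '/\((?:[^()]++|(?R))*\)/';

preg_match($balanced, 'f(a(b)c)d', $m);
echo $m[0], "\n"; // (a(b)c)
```

A pattern without recursion would need one hard-coded alternative per nesting level, which is the fixed-depth limit described above.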
Re: how to strip the href part from a link?
Posted: Fri Jul 17, 2009 10:09 am
by cipicip
Thanks a lot guys.
Re: how to strip the href part from a link?
Posted: Fri Jul 17, 2009 2:46 pm
by cipicip
And what if the link contains something other than text, for instance an image like this:
<a href="/path/to/something"><img src="/path/to/image.jpg"/></a>
How can the regex be modified to extract either text or whatever is in that link?
Thanks.
Re: how to strip the href part from a link?
Posted: Tue Aug 04, 2009 5:27 pm
by tr0gd0rr
Parsing your HTML using PHP 5's native DOMDocument class (not tested):
Code:
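$html = '<div class="postbody"><a...</a></p>'; // your string
$doc = new DOMDocument();
$doc->loadHTML($html);
// Copy the live node list to an array first: replacing nodes while
// iterating over the list itself would skip elements.
foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
    // strpos() returns 0 for a match at position 0, so compare against false.
    if (strpos($a->getAttribute('href'), 'site') !== false) {
        $text = $doc->createTextNode($a->nodeValue);
        // The DOM method is replaceChild(new, old); there is no replaceNode().
        $a->parentNode->replaceChild($text, $a);
    }
}
$newHtml = $doc->saveHTML();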
It is indeed quite similar to manipulating the DOM in JavaScript because they are both derived from the same DOM standard. As prometheuzz mentioned, parsing is more reliable than regex use because parsing makes a "best guess" to convert malformed HTML into objects.
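Regarding the earlier question about links that wrap an image rather than text: `nodeValue` only returns text, so an `<img>` inside the link would be lost. A possible variation, sketched below, moves the link's child nodes in front of it and then removes the empty link. The function name `unwrapLinks`, its parameters, and the wrapper `<div>` are hypothetical choices for this sketch, not anything from the posts above.

```php
<?php
// Hypothetical helper: replace every <a> whose href contains $needle with its
// own child nodes, so both text and nested elements such as <img> survive.
function unwrapLinks(string $html, string $needle): string
{
    $doc = new DOMDocument();
    // Wrap the fragment in a known container so it can be read back later.
    // @ suppresses libxml warnings about sloppy markup.
    @$doc->loadHTML('<div>' . $html . '</div>');

    // Copy the live node list first; changing the tree while iterating it skips items.
    foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
        if (strpos($a->getAttribute('href'), $needle) === false) {
            continue; // leave other links untouched
        }
        // Move each child in front of the anchor, then drop the empty anchor.
        while ($a->firstChild !== null) {
            $a->parentNode->insertBefore($a->firstChild, $a);
        }
        $a->parentNode->removeChild($a);
    }

    // Serialize only the children of the wrapper <div> (first child of <body>).
    $wrapper = $doc->getElementsByTagName('body')->item(0)->firstChild;
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$sample = '<p><a href="http://my.site.com/x"><img src="/i.jpg"/></a> and '
        . '<a href="http://my.test.com">Top</a></p>';
echo unwrapLinks($sample, 'site'), "\n";
```

Note that `loadHTML()` assumes ISO-8859-1 input unless the markup declares another encoding, so text with non-ASCII characters needs an encoding hint before it goes through this.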
Some HTML, whether malformed or merely unusual, is rendered fine by browsers but would produce unwanted results with the regex. For example:
Code:
using the regex suggested above, having
`do you like my <a href="javascript:alert('site>1')">text</a>?`
would turn into
`do you like my 1')">text?`
or the following:
`<a href="site">2 < 4</a>`
wouldn't be matched at all
And no matter how many cases you accounted for in the regex, there would always be a case that would make it fail.
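The two cases can be reproduced with a short script. This is only a sketch: it reuses the pattern posted earlier in the thread, and the variable names are my own.

```php
<?php
$pattern = '#<a\s[^>]*href="[^"]*site[^>]*+>([^<]*+)</a>#i';

// Case 1: the ">" inside the quoted href is taken as the end of the tag,
// so part of the attribute value leaks into the "link text".
$case1 = 'do you like my <a href="javascript:alert(\'site>1\')">text</a>?';
$out1  = preg_replace($pattern, '$1', $case1);
echo $out1, "\n"; // do you like my 1')">text?

// Case 2: a literal "<" in the link text stops ([^<]*+) before "</a>",
// so the pattern never matches and the string comes back unchanged.
$case2 = '<a href="site">2 < 4</a>';
$out2  = preg_replace($pattern, '$1', $case2);
var_dump($out2 === $case2); // bool(true)
```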
Re: how to strip the href part from a link?
Posted: Tue Aug 04, 2009 6:41 pm
by cipicip
tr0gd0rr wrote:Parsing your html using PHP5's native
DOMDocument class: (not tested) ....
And no matter how many cases you accounted for in the regex, there would always be a case that would make it fail.
Thanks for your reply. I already did something like this; I was just looking for a more elegant solution. The app I'm writing has not just this rule but several, all defined as regexes, which is why I was looking in this direction. Anyway, problem solved.
Many thanks guys.