
Getting value of an html attribute

Posted: Thu Feb 07, 2008 2:36 pm
by RobertGonzalez
So yesterday a good friend of mine (*cough* pickle *cough*) helped me with a regular expression that returns the value of an attribute inside a specific HTML tag. That was working like a charm until I decided to throw a few more variations into the mix. Now I need some savage help.

Quick note: I AM SO NOT A REGEX DUDE.

So here is what I need... say I have the following string (coming from a database):

Code: Select all

<?php
$string = 'Some text here for example <a href="image.png"  /><img src="../image-small.png"  /></a>
 
<p>And some more text so that you can <a href="morelinkage.php">See Links</a>.</p>
 
<p>And another example <a href="fields">Another one</a></p>';
?>
What I need to fetch from this string of text is every SRC and HREF value (and possibly any other HTML reference to a file) in the markup that does not contain a slash (either / forward or \ backward), has a dot extension (like me.php or kieran-huggins-rocks-velour.png), and does not have a dot prefix (like ./picture.png - I know this would more than likely be covered in the slash check).

I also need this to fetch the values I am looking for even if the markup is malformed or variant, so if the markup looks like <a href = mingleme.php> it would still catch it. SO....

In my example above I would expect to retrieve the following:
image.png
morelinkage.php
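
For readers who want the rules above in code form, here is a minimal sketch of them as a plain filter applied to attribute values that have already been extracted. The helper name is_bare_filename and the exact checks are assumptions based on the description, not something posted in the thread.

```php
<?php
// Hypothetical helper; the name and the exact rules are assumptions based on the post above.
function is_bare_filename(string $value): bool
{
    // Reject any value containing a forward slash or a backslash.
    if (strpbrk($value, '/\\') !== false) {
        return false;
    }
    // Require a dot extension, and refuse a leading dot or embedded whitespace.
    return preg_match('/^[^.\s]\S*\.\w+$/', $value) === 1;
}

// The four attribute values from the example string.
$candidates = ['image.png', '../image-small.png', 'morelinkage.php', 'fields'];
$kept = array_values(array_filter($candidates, 'is_bare_filename'));
print_r($kept); // image.png and morelinkage.php
```

Splitting the job this way (grab every value first, filter afterwards) is only one option; the replies below fold the filter into the regex itself.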

What is working for me (thanks to pickle), as long as there are no slashes or dots before the value, is:

Code: Select all

<?php
$pattern = '/<img.*?src[ ]*=["\' ]*([\w\.]*).*>/i'; // pickles
?>
But this stops working as soon as there is more than one HTML element that I need to search for. I'd post what I have tried so far, but I am not sure there is enough room in our database to house it.

Help me DevNet Regex gurus. You're my only hope.

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 2:44 pm
by Weirdan
It seems regexp greed is your problem. Try this:

Code: Select all

<(a|img).*?(src|href)[ ]*=["' ]*([\w\.]*).*?>
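
To illustrate the greed issue, here is a small sketch comparing the original greedy tail (.*>) with a lazy one (.*?>). The one-line sample string is made up for the demonstration; the first pattern is the one from the opening post, and the second differs only in the added question mark.

```php
<?php
// Hypothetical one-line sample with two image tags.
$html = '<img src="a.png" /><img src="b.png" />';

// Greedy tail: the final .* runs to the last ">" on the line, so both tags become one match.
preg_match_all('/<img.*?src[ ]*=["\' ]*([\w\.]*).*>/i', $html, $greedy);

// Lazy tail: .*? stops at the first ">" after the value, so each tag matches separately.
preg_match_all('/<img.*?src[ ]*=["\' ]*([\w\.]*).*?>/i', $html, $lazy);

print_r($greedy[1]); // one captured value: a.png
print_r($lazy[1]);   // two captured values: a.png, b.png
```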

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 2:50 pm
by VladSun
Also, I see that your text is a multiline one - I think you need to use /is .
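
A quick sketch of what the s modifier changes, using a made-up tag that is split across two lines. Without s the dot does not match a newline; the m modifier does not help here because it only changes how ^ and $ behave.

```php
<?php
// Hypothetical sample: a tag whose attribute sits on the next line.
$html = "<img\nsrc=\"photo.png\" />";

// No modifier beyond i: the dot stops at the newline, so nothing matches.
$plain = preg_match('/<img.*?src="([\w.]+)"/i', $html, $a);

// m only affects the ^ and $ anchors, so the result is the same.
$multiline = preg_match('/<img.*?src="([\w.]+)"/im', $html, $b);

// s lets the dot cross the newline, so the value is captured.
$dotall = preg_match('/<img.*?src="([\w.]+)"/is', $html, $c);

var_dump($plain, $multiline, $dotall); // int(0), int(0), int(1)
echo $c[1], "\n";                      // photo.png
```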

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 2:51 pm
by RobertGonzalez
That one only returns the first <img tag as a match, and if the src attribute value is ../something it returns the .. as the second match, not the name of the something.

And that is with using /im.
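
For anyone wondering where the .. comes from: the capture class [\w\.]* accepts dots but not slashes, so it takes the two leading dots and stops at the first /. The sketch below reproduces that with the pattern from the reply above (wrapped in delimiters), then shows one possible stricter form, written here as an assumption, that demands the closing quote right after a name.ext value.

```php
<?php
$tag = '<img src="../image-small.png"  />';

// Pattern from the earlier reply, wrapped in delimiters with the i flag.
preg_match('/<(a|img).*?(src|href)[ ]*=["\' ]*([\w\.]*).*?>/i', $tag, $loose);
echo $loose[3], "\n"; // prints the two leading dots only

// Sketch of a stricter variant: the closing quote must follow name.ext directly,
// so a value that starts with ../ produces no match at all.
$found = preg_match('/<(?:a|img)\s+(?:src|href)\s*=\s*["\'](\w+\.\w+)["\']/i', $tag, $strict);
var_dump($found); // int(0)
```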

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 3:06 pm
by VladSun

Code: Select all

$pattern = '/<(?:img|a)\s+(?:src|href)\s*=\s*["\']\s*(.+?)\s*["\']\s*[\/]?>/is';

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 3:18 pm
by RobertGonzalez
Sample:

Code: Select all

$string = '<p><img src = thisimage.gif>Some text here for example <a href="image.png"  /><img src="../image-small.png"  /></a></p>
 
<p>And some more text so that you can <a href= \'morelinkage.php\' />See Links.</a></p>
 
<p>And another example <a href="fields">Another one</a></p>
 
<p><img src =\'newimage.jpg\' /> Here is another image</p>';
Result:

Code: Select all

<pre>Array
(
    [0] => <a href="image.png"  />
    [1] => image.png
)
</pre>
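
The call that produced this result is not shown in the thread, so the following is only a guess at the cause: a two-element array like the one above is what preg_match returns, since it stops at the first hit, whereas preg_match_all collects every match. A minimal sketch with a shortened, made-up sample:

```php
<?php
// Shortened, hypothetical sample; the pattern is the one posted in the reply above.
$html    = '<a href="image.png" /><img src="../image-small.png" /><a href="fields">x</a>';
$pattern = '/<(?:img|a)\s+(?:src|href)\s*=\s*["\']\s*(.+?)\s*["\']\s*[\/]?>/is';

// preg_match stops after the first hit: full match at [0], captured value at [1].
preg_match($pattern, $html, $first);
print_r($first);

// preg_match_all keeps scanning; $all[1] lists every captured value.
preg_match_all($pattern, $html, $all);
print_r($all[1]); // image.png, ../image-small.png, fields
```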

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 3:21 pm
by VladSun

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 3:44 pm
by RobertGonzalez
Wow, for some reason in my sample code that did not work that way. Using your sample worked almost as expected. Thanks for that by the way, it is appreciated.

The only thing that I need it to do from here is to NOT match the "../filename.ext" string or the "fields" string. It should only match a "name.ext" format with or without quotes (single or double).

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 3:52 pm
by VladSun

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 3:59 pm
by RobertGonzalez
Hey Vlad, do you twitter? I'd really like to twitter you a beer right now. Thank you so much. That is exactly what I was looking for.

Edit | For those that are searching for a similar solution:

Code: Select all

<?php
header("Content-type: text/plain");
 
$string = '<p><img src = thisimage.gif>Some text here for example <a href="image.png"  /><img src="../image-small.png"  /></a></p>
 
<p>And some more text so that you can <a href= \'morelinkage.php\' />See Links.</a></p>
 
<p>And another example <a href="fields">Another one</a></p>
 
<p><img src =\'newimage.jpg\' /> Here is another image</p>';
 
$pattern = '/<(?:img|a)\s+(?:src|href)\s*=\s*["\']\s*(\w+\.\w+)?\s*["\']\s*[\/]?>/is'; 
 
preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);
print_r($matches);
?>

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 4:00 pm
by RobertGonzalez
Sorry, spoke just a little too soon.

I also need it to catch this one: <img src = thisimage.gif>

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 4:04 pm
by VladSun
Everah wrote: Hey Vlad, do you twitter? I'd really like to twitter you a beer right now. Thank you so much. That is exactly what I was looking for.
Sorry, I don't twitter :) But you can use DHL instead :)))))

PS: I used *.phps urls because for some reason the escaping slash for the single quotes disappears in the syntax highlighting in my (and your) posts.

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 4:06 pm
by VladSun
http://ipclassify.relef.net/3.php
http://ipclassify.relef.net/3.phps

pattern:

Code: Select all

$pattern = '/<(?:img|a)\s+(?:src|href)\s*=\s*["\']?\s*(\w+\.\w+)?\s*["\']?\s*[\/]?>/is';
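
As a sanity check, here is a sketch that runs both versions of the pattern (quotes required, then quotes optional) against the sample string from earlier in the thread. The variable names are mine; the patterns and the string are copied from the posts.

```php
<?php
// Sample string as posted in the thread.
$string = '<p><img src = thisimage.gif>Some text here for example <a href="image.png"  /><img src="../image-small.png"  /></a></p>

<p>And some more text so that you can <a href= \'morelinkage.php\' />See Links.</a></p>

<p>And another example <a href="fields">Another one</a></p>

<p><img src =\'newimage.jpg\' /> Here is another image</p>';

// Quotes required around the value (the earlier pattern).
$quoted   = '/<(?:img|a)\s+(?:src|href)\s*=\s*["\']\s*(\w+\.\w+)?\s*["\']\s*[\/]?>/is';
// Quotes optional (the pattern posted above).
$optional = '/<(?:img|a)\s+(?:src|href)\s*=\s*["\']?\s*(\w+\.\w+)?\s*["\']?\s*[\/]?>/is';

preg_match_all($quoted, $string, $a);
preg_match_all($optional, $string, $b);

print_r($a[1]); // image.png, morelinkage.php, newimage.jpg
print_r($b[1]); // thisimage.gif, image.png, morelinkage.php, newimage.jpg
```

One caveat worth knowing: \w does not include the hyphen, so a name such as kieran-huggins-rocks-velour.png would not be captured by either pattern. Swapping \w+ for a class like [\w-]+ is one possible adjustment, though that goes beyond what was posted here.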

Re: Getting value of an html attribute

Posted: Thu Feb 07, 2008 4:46 pm
by RobertGonzalez
Now that is perfect. Thanks Vlad.