Getting value of an html attribute

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

So yesterday a good friend of mine (*cough* pickle *cough*) helped me with a regular expression that returns the value of an attribute inside a specific HTML tag. That was working like a charm until I decided to throw a few more variations into the mix. Now I need some savage help.

Quick note: I AM SO NOT A REGEX DUDE.

So here is what I need... say I have the following string (coming from a database):

Code: Select all

<?php
$string = 'Some text here for example <a href="image.png"  /><img src="../image-small.png"  /></a>
 
<p>And some more text so that you can <a href="morelinkage.php">See Links</a>.</p>
 
<p>And another example <a href="fields">Another one</a></p>';
?>
What I need to fetch from this string is every SRC and HREF value (and possibly any other HTML reference to a file) in the markup that does not contain a slash (either / forward or \ backward), has a dot extension (like me.php or kieran-huggins-rocks-velour.png), and does not have a dot prefix (like ./picture.png, though I know the slash check would more than likely cover this).

I also need this to fetch the values I am looking for even if the markup is malformed or variant, so if the markup looks like <a href = mingleme.php> it would still catch it. SO....

In my example above I would expect to retrieve the following:
image.png
morelinkage.php
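
The three rules above (no slash, a dot extension, no dot prefix) can be written down as a plain check on an already-extracted value. This is only a sketch: the function name `isBareFilename` is made up for illustration and does not appear anywhere in the thread.

```php
<?php
// Hypothetical helper (not from the thread): true when $value looks like "name.ext".
function isBareFilename(string $value): bool
{
    // Rule 1: no forward or backward slash anywhere in the value.
    if (strpbrk($value, '/\\') !== false) {
        return false;
    }
    // Rule 3: no dot prefix (also rejects the empty string).
    if ($value === '' || $value[0] === '.') {
        return false;
    }
    // Rule 2: at least one ".ext" part with something on both sides of the dot.
    return preg_match('/^[^.]+(\.[^.]+)+$/', $value) === 1;
}

var_dump(isBareFilename('image.png'));          // bool(true)
var_dump(isBareFilename('../image-small.png')); // bool(false)
var_dump(isBareFilename('fields'));             // bool(false)
```

Separating the "find the attribute" step from the "is this value acceptable" step is a design choice of this sketch, not something the thread does; the thread folds both into one pattern.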

What is working for me (thanks to pickle) as long as there are no slashes or dots before the value, is:

Code: Select all

<?php
$pattern = '/<img.*?src[ ]*=["\' ]*([\w\.]*).*>/i'; // pickles
?>
But this stops working as soon as there is more than one HTML element that I need to search for. I'd post what I have tried so far, but I am not sure there is enough room in our database to house it.

Help me DevNet Regex gurus. You're my only hope.
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: Getting value of an html attribute

Post by Weirdan »

It seems regexp greed is your problem. Try this:

Code: Select all

<(a|img).*?(src|href)[ ]*=["' ]*([\w\.]*).*?>
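
To illustrate the greed issue, here is a minimal sketch on a made-up one-line input (`$html` is an assumption, not data from the thread). The only difference between the two patterns is the trailing `.*` versus `.*?`.

```php
<?php
// Made-up sample: two tags on one line.
$html = '<img src="a.png"><img src="b.png">';

// Greedy: the trailing .* runs to the end of the line and backs up to the LAST ">".
preg_match('/<img.*?src="([\w.]*)".*>/i', $html, $greedy);

// Lazy: the trailing .*? stops at the FIRST ">" that completes the match.
preg_match('/<img.*?src="([\w.]*)".*?>/i', $html, $lazy);

echo $greedy[0], "\n"; // <img src="a.png"><img src="b.png">
echo $lazy[0], "\n";   // <img src="a.png">

// With preg_match_all the greedy form finds one match, the lazy form two.
$greedyCount = preg_match_all('/<img.*?src="([\w.]*)".*>/i', $html);
$lazyCount   = preg_match_all('/<img.*?src="([\w.]*)".*?>/i', $html);
```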
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Post by VladSun »

Also, I see that your text is multiline, so I think you need to use the /is modifiers (s lets the dot match newlines).
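
A small sketch of what the modifiers change, using a made-up tag that is split across two lines (`$html` and `$base` are assumptions for illustration). The `m` modifier only affects `^` and `$`, while `s` is the one that lets `.` cross a newline.

```php
<?php
// Made-up input: the tag is split across two lines.
$html = "<img\nsrc=\"photo.png\">";
$base = '<img.*?src="([\w.]+)"';

$plain  = preg_match('/' . $base . '/i', $html);  // 0: the dot stops at the newline
$multi  = preg_match('/' . $base . '/im', $html); // 0: m does not change the dot
$dotall = preg_match('/' . $base . '/is', $html); // 1: s lets the dot match "\n"

var_dump($plain, $multi, $dotall);
```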
Last edited by VladSun on Thu Feb 07, 2008 2:52 pm, edited 1 time in total.
There are 10 types of people in this world, those who understand binary and those who don't

Post by RobertGonzalez »

That one only returns the first <img tag as a match, and if the src attribute value is ../something it returns the .. as the second match, not the name of the something.

And that is with using /im.
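
The `..` result can be reproduced in isolation. In this sketch Weirdan's pattern is wrapped in `/.../i` delimiters (the delimiters and the single-tag input are my additions). The character class `[\w\.]` accepts word characters and dots but not slashes, so the capture ends at the first `/`.

```php
<?php
// Weirdan's pattern from above, wrapped in delimiters for PHP.
$pattern = '/<(a|img).*?(src|href)[ ]*=["\' ]*([\w\.]*).*?>/i';

preg_match($pattern, '<img src="../image-small.png"  />', $m);

echo $m[1], "\n"; // img
echo $m[2], "\n"; // src
echo $m[3], "\n"; // ..
```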

Post by VladSun »

Code: Select all

$pattern = '/<(?:img|a)\s+(?:src|href)\s*=\s*["\']\s*(.+?)\s*["\']\s*[\/]?>/is';

Post by RobertGonzalez »

Sample:

Code: Select all

<?php
$string = '<p><img src = thisimage.gif>Some text here for example <a href="image.png"  /><img src="../image-small.png"  /></a></p>
 
<p>And some more text so that you can <a href= \'morelinkage.php\' />See Links.</a></p>
 
<p>And another example <a href="fields">Another one</a></p>
 
<p><img src =\'newimage.jpg\' /> Here is another image</p>';
?>
Result:

Code: Select all

<pre>Array
(
    [0] => <a href="image.png"  />
    [1] => image.png
)
</pre>
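
A result with only `[0]` and `[1]` is the shape `preg_match` returns, since it stops after the first hit. The thread does not show which function was called, so treat this as a guess. The sketch below runs VladSun's quoted-value pattern both ways on a shortened, made-up version of the sample string.

```php
<?php
$pattern = '/<(?:img|a)\s+(?:src|href)\s*=\s*["\']\s*(.+?)\s*["\']\s*[\/]?>/is';

// Shortened variant of the sample above (an assumption, not the exact thread data).
$string = '<p><img src = thisimage.gif><a href="image.png"  />'
        . '<img src="../image-small.png"  /></a></p>'
        . '<p><a href="fields">Another one</a></p>';

// preg_match: first match only, flat array of full match plus groups.
preg_match($pattern, $string, $first);
print_r($first);

// preg_match_all: every match; index 1 holds all captured values.
preg_match_all($pattern, $string, $all);
print_r($all[1]);
```

The unquoted `thisimage.gif` is skipped in both calls because this pattern requires a quote after the equals sign.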

Post by VladSun »


Post by RobertGonzalez »

Wow, for some reason in my sample code that did not work that way. Using your sample worked almost as expected. Thanks for that by the way, it is appreciated.

The only thing that I need it to do from here is to NOT match the "../filename.ext" string or the "fields" string. It should only match a "name.ext" format with or without quotes (single or double).

Post by VladSun »


Post by RobertGonzalez »

Hey Vlad, do you twitter? I'd really like to twitter you a beer right now. Thank you so much. That is exactly what I was looking for.

Edit | For those that are searching for a similar solution:

Code: Select all

<?php
header("Content-type: text/plain");
 
$string = '<p><img src = thisimage.gif>Some text here for example <a href="image.png"  /><img src="../image-small.png"  /></a></p>
 
<p>And some more text so that you can <a href= \'morelinkage.php\' />See Links.</a></p>
 
<p>And another example <a href="fields">Another one</a></p>
 
<p><img src =\'newimage.jpg\' /> Here is another image</p>';
 
// Capture bare name.ext values wrapped in single or double quotes.
$pattern = '/<(?:img|a)\s+(?:src|href)\s*=\s*["\']\s*(\w+\.\w+)?\s*["\']\s*[\/]?>/is';
 
preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);
print_r($matches);
?>

Post by RobertGonzalez »

Sorry, spoke just a little too soon.

I also need it to catch this one: <img src = thisimage.gif>

Post by VladSun »

Everah wrote:Hey Vlad, do you twitter? I'd really like to twitter you a beer right now. Thank you so much. That is exactly what I was looking for.
Sorry, I don't twitter :) But you can use DHL instead :)))))

PS: I used *.phps urls because for some reason the escaping slash for the single quotes disappears in the syntax highlighting in my (and your) posts.

Post by VladSun »

http://ipclassify.relef.net/3.php
http://ipclassify.relef.net/3.phps

pattern:

Code: Select all

$pattern = '/<(?:img|a)\s+(?:src|href)\s*=\s*["\']?\s*(\w+\.\w+)?\s*["\']?\s*[\/]?>/is';
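
For readers who want the pieces spelled out, here is the same pattern rewritten with the `x` modifier so each part can carry a comment. The `~` delimiter, the nowdoc syntax, and the trimmed sample string are my choices for this sketch, not part of VladSun's post.

```php
<?php
// Same logic as the pattern above; x mode ignores whitespace and allows # comments.
$pattern = <<<'RE'
~
<(?:img|a)\s+         # start of an img or a tag, then whitespace
(?:src|href)\s*=\s*   # attribute name and equals sign, spaces allowed around it
["']?\s*              # optional opening quote
(\w+\.\w+)?           # capture group 1: name.ext made of word characters only
\s*["']?\s*           # optional closing quote
/?>                   # optional self-closing slash, then end of tag
~isx
RE;

// Trimmed version of the sample string from earlier in the thread.
$string = <<<'HTML'
<p><img src = thisimage.gif>Some text <a href="image.png"  /><img src="../image-small.png"  /></a></p>
<p><a href= 'morelinkage.php' />See Links.</a></p>
<p><a href="fields">Another one</a></p>
<p><img src ='newimage.jpg' /> Here is another image</p>
HTML;

preg_match_all($pattern, $string, $matches);
print_r($matches[1]); // thisimage.gif, image.png, morelinkage.php, newimage.jpg
```

One caveat worth knowing: `\w` does not include the hyphen, so a hyphenated name such as kieran-huggins-rocks-velour.png would not be captured as written. Swapping `\w+` for `[\w-]+` is one possible adjustment, untested against the original data.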

Post by RobertGonzalez »

Now that is perfect. Thanks Vlad.