Talk me through this regex please!

Posted: Tue Jan 06, 2004 6:22 am
by Skittlewidth
About 6 months ago, I had some help with a regular expression I needed for extracting hyperlinks from an html page:

Code:

'!<a href=[\'"]([^\'"]+)["\']\s*>[^<]+</a>!mi'
It worked fine and did just what I wanted. Now I want to be able to extract only the contents of the src="..." attribute of an <img> tag.

I figure it must be in some way similar, but so far the alterations I've attempted haven't got me anywhere.

I don't want someone to write it for me. What I'd really like is for someone to talk me through the above regex and tell me what all the brackets and squiggles are actually doing, so I can figure out the changes myself!

Oh, and I have tried to read several tutorials on regular expressions, but I figured if I fully understood something that was relevant to what I was doing I would pick it up a bit quicker! :)

Thanks!

Posted: Tue Jan 06, 2004 7:37 am
by Weirdan
The pattern starts with "<a href=" (`!` is just the delimiter), followed by a single or double quote: [\'"]. The backslash escapes the single quote so that it doesn't end the string, and the square brackets define a character class, i.e. any one of the characters inside the brackets matches a single character. ([^\'"]+) means: capture a substring of one or more characters other than the quotes (the ^ negates the character class) for later use. It's the actual URL, and you need it separately from the surrounding tag pair. ["\']\s* is a single or double quote followed by an optional run of whitespace. > is a literal closing angle bracket. [^<]+ is a sequence of one or more characters other than a left angle bracket, i.e. the link text. </a> is treated literally.
m and i are modifiers: m is multiline mode (^ and $ also match at line breaks; this pattern uses neither anchor, so it has no effect here) and i makes the match case-insensitive.
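
The breakdown above can be checked piece by piece. This is a minimal sketch, assuming Python's re module as a stand-in for the thread's regex engine. Python takes no !...! delimiters, so the m and i modifiers become re.MULTILINE and re.IGNORECASE. LINK_PATTERN and sample_html are made-up names and data, not from the thread:

```python
import re

# The pattern from the thread, split into the pieces described above.
LINK_PATTERN = re.compile(
    r"<a href="      # literal start of the tag
    r"['\"]"         # character class: one single or double quote
    r"([^'\"]+)"     # capture group 1: the URL, one or more non-quote chars
    r"['\"]"         # the closing quote
    r"\s*>"          # optional whitespace, then a literal >
    r"[^<]+"         # the link text: one or more chars other than <
    r"</a>",         # literal closing tag
    re.MULTILINE | re.IGNORECASE,
)

# Hypothetical sample input, not taken from the thread.
sample_html = '<p><a href="page.html">A page</a> and <A HREF=\'other.html\' >Other</A></p>'

# With a single capture group, findall returns just the captured URLs.
urls = LINK_PATTERN.findall(sample_html)
print(urls)  # ['page.html', 'other.html']
```

The second link only matches because of the i modifier (upper-case tag) and the \s* piece (space before the >).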

Posted: Tue Jan 06, 2004 7:47 am
by Skittlewidth
Great! Thanks. I think I know where I went wrong then!

I may be back later... :D

Skittlewidth

Posted: Tue Jan 06, 2004 9:14 am
by Skittlewidth
Ok, I'm back. This doesn't seem to be as simple as I'd hoped!

If

Code:

'!<a href=[\'"]([^\'"]+)["\']\s*>[^<]+</a>!mi'
captures everything from <a href to the closing </a> and separates out the URL between the two quotes, then wouldn't the same thing for the src="..." part of an <img> tag be:

Code:

'!src=[\'"]([^\'"]+)["\']\s*>[^\'"]+"!mi'
where [^\'"]+ allows any sequence of characters except " or ', and the final " is treated literally as the closing part of the search?

It doesn't seem to work.... :(

Posted: Tue Jan 06, 2004 9:49 am
by redmonkey
'!src=[\'"]([^\'"]+)["\']\s*>[^\'"]+"!mi'

Since you asked to work it out on your own, I won't give you the working code. But I suggest you look back at the previous breakdown and work out what the part in bold is doing. Also consider the contents of your image tag (I very rarely see src as the last attribute defined within the tag).

Posted: Tue Jan 06, 2004 10:31 am
by Skittlewidth
'!src=[\'"]([^\'"]+)["\']\s*>[^\'"]+"!mi'

Ok, I see my first mistake: the bit in bold was originally dealing with the text between the <a href="blah.com"> and </a> tags. That text doesn't appear in an image tag, so that must be wrong.
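
To see that leftover tail fail in practice, here is a minimal sketch, assuming Python's re module as a stand-in for the thread's regex engine (the !...!mi wrapper becomes flags). ATTEMPT and img_tag are made-up names and data:

```python
import re

# The attempted pattern, with the tail left over from the <a> version.
ATTEMPT = re.compile(
    r"src=['\"]([^'\"]+)['\"]"   # src=, quote, captured value, quote
    r"\s*>[^'\"]+\"",            # tail: a > right away, non-quote text, then a literal "
    re.MULTILINE | re.IGNORECASE,
)

# Hypothetical tag: another attribute follows src, so no > comes right after it.
img_tag = '<img src="logo.png" width="100" height="50">'

match = ATTEMPT.search(img_tag)
print(match)  # None
```

After the closing quote of src the pattern insists on a >, but the tag continues with width, so nothing matches.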

Secondly, the src attribute ends with a " followed by the width attribute, so...
the expression probably ends with ["\']\s*width!mi : a single or double quote followed by optional whitespace and then the word "width".

or even leave width out of it altogether....

And it works!

Code:

'!src=[\'"]([^\'"]+)["\']\s*!mi'
Thanks for the pointers!
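
For reference, a minimal sketch of that final pattern in use, assuming Python's re module as a stand-in for the thread's regex engine (no !...! delimiters; the mi modifiers become flags). SRC_PATTERN and the page fragment are made-up names and data:

```python
import re

# The trimmed pattern: src=, a quote, the captured value, the closing quote,
# and any trailing whitespace.
SRC_PATTERN = re.compile(r"src=['\"]([^'\"]+)['\"]\s*", re.MULTILINE | re.IGNORECASE)

# Hypothetical page fragment with two image tags.
page = '<p><img src="logo.png" width="100"> <IMG SRC=\'photos/cat.jpg\'></p>'

sources = SRC_PATTERN.findall(page)
print(sources)  # ['logo.png', 'photos/cat.jpg']
```

Nothing is required after the closing quote any more, so it no longer matters which attribute follows src.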

Posted: Tue Jan 06, 2004 10:59 am
by redmonkey
So what happens if you have something like....

<script src="somescript.js"> within the content you are searching?
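
A minimal sketch of that case, assuming Python's re module as a stand-in for the thread's regex engine and the src-only pattern from the previous post. The names and the page fragment are made up:

```python
import re

# The src-only pattern: it never checks which tag the attribute belongs to.
SRC_PATTERN = re.compile(r"src=['\"]([^'\"]+)['\"]\s*", re.MULTILINE | re.IGNORECASE)

# Hypothetical page fragment containing a script tag and an image tag.
page = '<script src="somescript.js"></script><img src="logo.png" width="100">'

found = SRC_PATTERN.findall(page)
print(found)  # ['somescript.js', 'logo.png']
```

The script file is picked up alongside the image.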

For what it's worth I've just come up with this....

Code:

'/(<\s*?img.*?src\s*?=\s*?)([\'"])(.*?)\2(.*?)>/is'
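
A minimal sketch of how this pattern behaves, assuming Python's re module as a stand-in for the thread's regex engine (the /.../ delimiters are dropped, and the i and s modifiers become re.IGNORECASE and re.DOTALL). IMG_SRC and the page fragment are made-up names and data:

```python
import re

IMG_SRC = re.compile(
    r"(<\s*?img.*?src\s*?=\s*?)"  # group 1: from <img up to src=
    r"(['\"])"                    # group 2: the opening quote, whichever kind it is
    r"(.*?)"                      # group 3: the src value (lazy match)
    r"\2"                         # backreference: the same quote character closes it
    r"(.*?)>",                    # group 4: the rest of the tag, up to >
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical page fragment: a script tag followed by an image tag.
page = '<script src="somescript.js"></script><img alt="Logo" src=\'logo.png\' width="100">'

# Group 3 holds the src value of each match.
image_sources = [m.group(3) for m in IMG_SRC.finditer(page)]
print(image_sources)  # ['logo.png']
```

Requiring img before src is what skips the script tag. One caveat with this sketch: with the s modifier .*? can run past a >, so an <img> without a src followed by some other tag that has one would still produce a match.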

Posted: Tue Jan 06, 2004 11:27 am
by Skittlewidth
Point taken. :)

I'll have a look at that tomorrow. Right now my brain is fried!

Thanks for your help though!