Talk me through this regex please!

Skittlewidth
Forum Contributor
Posts: 389
Joined: Wed Nov 06, 2002 9:18 am
Location: Kent, UK

Post by Skittlewidth »

About 6 months ago, I had some help with a regular expression I needed for extracting hyperlinks from an HTML page:

'!<a href=[\'"]([^\'"]+)["\']\s*>[^<]+</a>!mi'

It worked fine and did just what I wanted. Now I want to extract only the contents of the src="..." attribute of an <img> tag.

I figure it must be in some way similar, but so far the alterations I've attempted haven't got me anywhere.

I don't want someone to write it for me. What I'd really like is for someone to talk me through the above regex and tell me what all the brackets and squiggles are actually doing, so I can figure out the changes myself!

Oh, and I have tried to read several tutorials on regular expressions, but I figured if I fully understood something that was relevant to what I was doing I would pick it up a bit quicker! :)

Thanks!
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

The pattern starts with the literal "<a href=" (the `!` is just a delimiter), followed by a single or double quote: [\'"] (the backslash escapes the single quote so that it doesn't end the PHP string). Square brackets define a character class, i.e. any one of the characters inside the brackets matches a single character.
([^\'"]+) means: capture a run of one or more characters that are not quotes (the leading ^ negates the character class) for later use. That is the actual URL, which you need separately from the surrounding tag pair.
["\']\s* is a single or double quote followed by optional whitespace. > and </a> are treated literally, and [^<]+ is a run of one or more characters other than a left angle bracket.
m and i are modifiers: m = multiline mode (it only changes how the ^ and $ anchors behave, and this pattern has none), i = case-insensitive.
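
To see the breakdown above in action, here is a minimal sketch that runs the pattern with preg_match_all. The sample HTML is made up for illustration and is not from the original page being parsed.

```php
<?php
// Hypothetical sample HTML (an assumption, not from the thread).
$html = '<p><a href="http://example.com/one">First</a> and '
      . "<a href='two.html' >Second</a></p>";

// The pattern from the question, as a single-quoted PHP string.
$pattern = '!<a href=[\'"]([^\'"]+)["\']\s*>[^<]+</a>!mi';

// $matches[0] holds each full match (the whole <a ...>...</a>),
// $matches[1] holds what the parentheses captured (the URL).
$count = preg_match_all($pattern, $html, $matches);

echo "Found $count link(s)\n";
print_r($matches[1]);
```

The second link uses single quotes and has a space before the ">", which is what the ["\'] class and \s* are there to allow.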

Post by Skittlewidth »

Great! Thanks. I think I know where I went wrong then!

I may be back later... :D

Skittlewidth

Post by Skittlewidth »

Ok, I'm back. This doesn't seem to be as simple as I'd hoped!

If

'!<a href=[\'"]([^\'"]+)["\']\s*>[^<]+</a>!mi'

captures everything from <a href to the closing </a> and separately captures the URL from between the two quotes, then wouldn't the equivalent for the src="..." part of an <img> tag be:

'!src=[\'"]([^\'"]+)["\']\s*>[^\'"]+"!mi'

where [^\'"]+ allows any sequence of characters except " or ' and the final " is treated literally as the closing part of the search?

It doesn't seem to work.... :(
redmonkey
Forum Regular
Posts: 836
Joined: Thu Dec 18, 2003 3:58 pm

Post by redmonkey »

'!src=[\'"]([^\'"]+)["\']\s*>[^\'"]+"!mi'

Since you asked to work it out on your own, I won't give you the working code. But I suggest you look back at the previous breakdown and then work out what the part in bold is doing. Also consider the contents of your image tag: very rarely do I see the src attribute defined last within the tag.

Post by Skittlewidth »

'!src=[\'"]([^\'"]+)["\']\s*>[^\'"]+"!mi'

Ok, I see my first mistake: the bit in bold was originally dealing with the text between the <a href="blah.com"> and </a> tags. An image tag has no such text, so that must be wrong.

Secondly, the src attribute ends with a " followed by the width attribute, so the expression probably ends with ["\']\s*width!mi : a single or double quote followed by optional whitespace and then the word "width".

Or I could even leave width out of it altogether...

And it works!

'!src=[\'"]([^\'"]+)["\']\s*!mi'

Thanks for the pointers!
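
For anyone following along, a small sketch comparing the earlier attempt with the final pattern. The <img> tag here is a made-up example where src is followed by other attributes, as described above.

```php
<?php
// Hypothetical markup for illustration; not taken from the thread.
$html = '<img src="logo.png" width="120" height="40" alt="Logo">';

// Earlier attempt: after the closing quote it still demands ">" and more text.
$attempt = '!src=[\'"]([^\'"]+)["\']\s*>[^\'"]+"!mi';

// Final version: stops right after the closing quote and optional whitespace.
$working = '!src=[\'"]([^\'"]+)["\']\s*!mi';

$attemptHits = preg_match($attempt, $html);      // number of matches: 0 or 1
$workingHits = preg_match($working, $html, $m);  // $m[1] is the captured value

echo "attempt: $attemptHits, working: $workingHits\n";
if ($workingHits === 1) {
    echo $m[1], "\n";
}
```

The attempt fails on this tag because the character after the closing quote and whitespace is the "w" of width, not the ">" the pattern insists on.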

Post by redmonkey »

So what happens if you have something like....

<script src="somescript.js"> within the content you are searching?

For what it's worth I've just come up with this....

'/(<\s*?img.*?src\s*?=\s*?)([\'"])(.*?)\2(.*?)>/is'
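
A rough sketch of the difference, using made-up markup that contains both a <script> and an <img> tag. The variable names are only for illustration.

```php
<?php
// Hypothetical content containing a script tag and an image tag.
$html = '<script src="somescript.js"></script>' . "\n"
      . '<img alt="Logo" src="logo.png" width="120">';

// Pattern that matches any src attribute, whatever tag it is in.
$loose = '!src=[\'"]([^\'"]+)["\']\s*!mi';

// Pattern anchored to "<img". Group 2 captures the opening quote and
// \2 requires the same quote character to close the value (group 3).
$strict = '/(<\s*?img.*?src\s*?=\s*?)([\'"])(.*?)\2(.*?)>/is';

preg_match_all($loose, $html, $a);
preg_match_all($strict, $html, $b);

print_r($a[1]); // src values found by the loose pattern
print_r($b[3]); // src values found by the img-only pattern
```

One caveat worth knowing: because of the s modifier, .*? can run past the end of an <img> tag that has no src attribute and pick up a src from a later tag, so this is still a heuristic rather than a real HTML parser.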

Post by Skittlewidth »

Point taken. :)

I'll have a look at that tomorrow. Right now my brain is fried!

Thanks for your help though!