<img[\s]+(.*)>

Posted: Tue May 15, 2007 1:15 am
by nwp
I am trying to extract all the attributes of a tag using only regex.

Code: Select all

$str = '<img src="loc" name = "my_name" title = "my_title">
<img src="loc2" name = "my_name2" title = "my_title2">';
preg_match_all('/<img[\s]+(.*)>/', $str, $m, PREG_SET_ORDER);
print_r($m);
and this outputs

Code: Select all

Array
(
    [0] => Array
        (
            [0] => <img src="loc" name = "my_name" title = "my_title">
            [1] => src="loc" name = "my_name" title = "my_title"
        )
    [1] => Array
        (
            [0] => <img src="loc2" name = "my_name2" title = "my_title2">
            [1] => src="loc2" name = "my_name2" title = "my_title2"
        )
)
Up to this point everything is fine, but if I change $str to

Code: Select all

$str = '<img src="loc"
name = "my_name" title = "my_title">
<img src="loc2" name = "my_name2" title = "my_title2">';
It just outputs

Code: Select all

Array
(
    [0] => Array
        (
            [0] => <img src="loc2" name = "my_name2" title = "my_title2">
            [1] => src="loc2" name = "my_name2" title = "my_title2"
        )
)
and it drops the first img, even though I've used \s here.

Posted: Tue May 15, 2007 2:29 am
by Chris Corbyn
.*?

Not .*

<img[\s]+(.*?)>

Posted: Tue May 15, 2007 3:17 am
by stereofrog
Also, to let . match newlines, add the s modifier:

Code: Select all

/<img\b(.*?)>/si
Note that this won't work for e.g. <img src=blah onclick="if(a>b)do(c)">, because the lazy match stops at the > inside the attribute value.
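To make the two suggestions above concrete, here is a minimal sketch that runs nwp's multi-line input through three variants of the pattern. The variable names ($without_s, $greedy_s, $lazy_s) are mine and purely illustrative.

```php
<?php
// nwp's second input: the first tag is split across two lines.
$str = '<img src="loc"
name = "my_name" title = "my_title">
<img src="loc2" name = "my_name2" title = "my_title2">';

// Original pattern: "." does not match a newline by default, so the
// first tag (which contains one) can never reach its closing ">".
preg_match_all('/<img[\s]+(.*)>/', $str, $without_s, PREG_SET_ORDER);

// s modifier only: "." now matches newlines, but the greedy ".*" runs
// on to the LAST ">" and swallows both tags in a single match.
preg_match_all('/<img[\s]+(.*)>/s', $str, $greedy_s, PREG_SET_ORDER);

// s modifier plus lazy ".*?": each match stops at the first ">".
preg_match_all('/<img[\s]+(.*?)>/s', $str, $lazy_s, PREG_SET_ORDER);

echo count($without_s), "\n"; // 1 (only the second tag)
echo count($greedy_s), "\n";  // 1 (both tags merged into one match)
echo count($lazy_s), "\n";    // 2 (one match per tag)
```

So for this input both changes are needed: the s modifier to get past the newline, and the lazy quantifier to keep the tags separate.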

Posted: Tue May 15, 2007 4:22 am
by nwp
Thanks, it worked.

Code: Select all

/<img[\s]+(.*?)>/si
But what's the difference between (.*) and (.*?)? I've read the regex tutorial here but didn't understand it.

Posted: Tue May 15, 2007 6:23 am
by Chris Corbyn
nwp wrote:But what's the difference between (.*) and (.*?)? I've read the regex tutorial here but didn't understand it.
.* is greedy. That's the technical term.

.* says "Get anything any number of times".
(.*)> says "get anything any number of times, then get >".

That means that "anything any number of times" could include a ">" character, provided it can find another ">" further ahead in the string. Greedy.

.*? The "?" makes the quantifier ungreedy (lazy): it matches as few characters as possible, so "any number of times" becomes "any number of times, stopping at the first > that lets the match succeed".

This is a concept which commonly confuses people and takes a while to get your head around.

Take this string "Bumble Bee".

/^.*e/

Reading that pattern you may think it matches "Bumble" but it doesn't. The .* is greedy so it keeps going until it gets to the last available "e" which is incidentally the last character in our string. So in fact it matches "Bumble Bee".

/^.*?e/

Now it's ungreedy. It matches everything until it sees "e" so we have our expected "Bumble" ;)
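The "Bumble Bee" example can be checked directly. This is just a small sketch; the variable names are mine.

```php
<?php
$subject = 'Bumble Bee';

// Greedy: ".*" grabs as much as it can, then backs off to the LAST "e".
preg_match('/^.*e/', $subject, $greedy);

// Lazy: ".*?" grabs as little as it can, so it stops at the FIRST "e".
preg_match('/^.*?e/', $subject, $lazy);

echo $greedy[0], "\n"; // Bumble Bee
echo $lazy[0], "\n";   // Bumble
```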

Posted: Tue May 15, 2007 8:25 am
by feyd
Here's the general form I've used in many places

Code: Select all

#<\s*[^<>\s]+(?:\s+[a-z-]+(?:\s*=\s*(["'`]?).*?\\1)?)*[^>]*>#is

Posted: Tue May 15, 2007 9:00 am
by stereofrog
Yes, good expression. I use this one from Friedl's book

Code: Select all

$html_tag = <<<REGEXP
	~<
		\s*
		(\w+)
		\b
		(?:
			" [^"]* "
			|
			' [^']* '
			|
			[^>"']
		)*
	>
	~sx
REGEXP;
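As a rough check of what the quote-aware alternation buys you, the sketch below compares it with the simpler /<img\b(.*?)>/si pattern on the onclick tag from earlier in the thread. I have collapsed the expression onto one line (so the x modifier is not needed); the variable names are mine.

```php
<?php
// Same expression as above, written on one line in a single-quoted string.
$html_tag = '~<\s*(\w+)\b(?:"[^"]*"|\'[^\']*\'|[^>"\'])*>~s';

// A tag with ">" inside a quoted attribute value.
$tag = '<img src=blah onclick="if(a>b)do(c)">';

// Simple lazy pattern: stops at the first ">", which is inside the quotes.
preg_match('/<img\b(.*?)>/si', $tag, $simple);

// Quote-aware pattern: a quoted string is consumed as a unit, so the
// ">" inside it cannot end the tag.
preg_match($html_tag, $tag, $quote_aware);

echo $simple[0], "\n";      // <img src=blah onclick="if(a>
echo $quote_aware[0], "\n"; // <img src=blah onclick="if(a>b)do(c)">
echo $quote_aware[1], "\n"; // img
```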

Posted: Tue May 15, 2007 11:10 am
by nwp
Thanks a lot, d11wtq, for the information on '?'.
But I didn't understand the ~sx.
What's the job of the ~?

Posted: Tue May 15, 2007 11:23 am
by stereofrog
~ is just a delimiter. In PCRE you're not limited to /; you can use any character that is not alphanumeric, a backslash, or whitespace.

s = "dot matches newline" modifier, x = "extended" (whitespace and # comments in the regexp are ignored).
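A small sketch of the delimiter choice and the two modifiers, using a made-up subject string (the variable names are mine as well):

```php
<?php
$subject = "a/b\nc";

// Same pattern, different delimiters. With "/" as the delimiter, a literal
// slash in the pattern must be escaped; with "~" or "#" it need not be.
$with_slash = preg_match('/a\/b/', $subject);
$with_tilde = preg_match('~a/b~', $subject);
$with_hash  = preg_match('#a/b#', $subject);

// s: "." also matches the newline between "b" and "c".
$dot_default = preg_match('~b.c~', $subject);
$dot_with_s  = preg_match('~b.c~s', $subject);

// x: whitespace and "#" comments inside the pattern are ignored.
$extended = preg_match('~
    a / b   # the part before the newline
    \s      # the newline itself
    c
~x', $subject);

var_dump($with_slash, $with_tilde, $with_hash); // int(1) three times
var_dump($dot_default, $dot_with_s);            // int(0), then int(1)
var_dump($extended);                            // int(1)
```

One thing to watch with x: the delimiter character must not appear inside a comment, since it would still end the pattern.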

Posted: Tue May 15, 2007 11:23 am
by nwp
And how can I extract from this string
hello@hello
string@string
name@name
nwp@nwp
txt@txt
I want to match (.*)@(.*), but only when both (.*) groups are the same.
I want to do it using only regex.
Is there something like an if statement or variables in regex?

Posted: Tue May 15, 2007 11:26 am
by stereofrog
re-read d11's post about *?
he explained that pretty well

Posted: Tue May 15, 2007 11:27 am
by nwp
nwp wrote:I want to match (.*)@(.*), but only when both (.*) groups are the same.
Sorry, I may have overlooked it, but I really didn't find anything about it.

Posted: Tue May 15, 2007 12:28 pm
by stereofrog
Please post an exact input string and what you want to match.

Posted: Tue May 15, 2007 1:15 pm
by nwp
nwp wrote:And how can I extract from this string
hello@hello
string@string
name@name
nwp@nwp
txt@txt
I want to match (.*)@(.*), but only when both (.*) groups are the same.
I want to extract hello and hello if and only if the strings on both sides of @ are the same,
e.g. stra@strb would not match,
but str@str would extract both of the 'str'.

Posted: Tue May 15, 2007 1:30 pm
by stereofrog

Code: Select all

/^(.+)@\1$/
\1 is called a backreference; look it up in the manual.
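Here is a minimal sketch of that backreference in use. One assumption on my part: the input is a single string with one entry per line, so I have added the m modifier to make ^ and $ match at every line boundary. If you test each line separately, the pattern works as posted. The sample data and variable names are mine.

```php
<?php
// One entry per line; "stra@strb" is included as a deliberate non-match.
$lines = "hello@hello\nstring@string\nstra@strb\nnwp@nwp";

// (.+) captures the left-hand side and \1 demands the exact same text
// again on the right. m lets ^ and $ anchor at each line, not just at
// the start and end of the whole string.
preg_match_all('/^(.+)@\1$/m', $lines, $m, PREG_SET_ORDER);

foreach ($m as $match) {
    echo $match[1], "\n"; // the text that appears on both sides of "@"
}
```

This prints hello, string and nwp; the stra@strb line is skipped because \1 would have to be "stra" again.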