<img[\s]+(.*)>
Posted: Tue May 15, 2007 1:15 am
by nwp
I am trying to extract all the attributes of a tag using only regex.
Code:
$str = '<img src="loc" name = "my_name" title = "my_title">
<img src="loc2" name = "my_name2" title = "my_title2">';
preg_match_all('/<img[\s]+(.*)>/', $str, $m, PREG_SET_ORDER);
print_r($m);
and this outputs
Code:
Array
(
[0] => Array
(
[0] => <img src="loc" name = "my_name" title = "my_title">
[1] => src="loc" name = "my_name" title = "my_title"
)
[1] => Array
(
[0] => <img src="loc2" name = "my_name2" title = "my_title2">
[1] => src="loc2" name = "my_name2" title = "my_title2"
)
)
Up to this point everything is fine, but if I change $str to
Code:
$str = '<img src="loc"
name = "my_name" title = "my_title">
<img src="loc2" name = "my_name2" title = "my_title2">';
It just outputs
Code:
Array
(
[0] => Array
(
[0] => <img src="loc2" name = "my_name2" title = "my_title2">
[1] => src="loc2" name = "my_name2" title = "my_title2"
)
)
and it drops the first img, even though I've used \s here.
Posted: Tue May 15, 2007 2:29 am
by Chris Corbyn
.*?
Not .*
<img[\s]+(.*?)>
Posted: Tue May 15, 2007 3:17 am
by stereofrog
Also, to let . match newlines, add the /s modifier.
Note that this still won't work for e.g. <img src=blah onclick="if(a>b)do(c)">, because the > inside the attribute value ends the match early.
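A minimal sketch of both suggestions applied to the multi-line input from the first post. The variable names $m and $bad are only for illustration:

```php
<?php
// Input from the first post: the first tag spans two lines.
$str = '<img src="loc"
name = "my_name" title = "my_title">
<img src="loc2" name = "my_name2" title = "my_title2">';

// .*? stops at the first ">", and the s modifier lets "." match the newline.
preg_match_all('/<img\s+(.*?)>/s', $str, $m, PREG_SET_ORDER);

print_r($m); // both tags are captured now

// The caveat above: a ">" inside an attribute value ends the match early.
preg_match('/<img\s+(.*?)>/s', '<img src=blah onclick="if(a>b)do(c)">', $bad);
echo $bad[1], "\n"; // src=blah onclick="if(a
```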
Posted: Tue May 15, 2007 4:22 am
by nwp
Thanks, it worked.
But what's the difference between (.*) and (.*?)? I've read the regex tutorial here but didn't understand it.
Posted: Tue May 15, 2007 6:23 am
by Chris Corbyn
nwp wrote: But what's the difference between (.*) and (.*?)? I've read the regex tutorial here but didn't understand it.
.* is greedy. That's the technical term.
.* says "Get anything any number of times".
(.*)> says "get anything any number of times, then get >".
That means "anything any number of times" can itself include a ">" character, provided another ">" can be found further ahead in the string. Greedy.
.*? The "?" makes the pattern ungreedy (also called lazy), so "any number of times" becomes "as few times as possible", which here means it stops at the first > it sees.
This concept commonly confuses people and takes a while to get your head around.
Take this string "Bumble Bee".
/^.*e/
Reading that pattern you may think it matches "Bumble" but it doesn't. The .* is greedy so it keeps going until it gets to the last available "e" which is incidentally the last character in our string. So in fact it matches "Bumble Bee".
/^.*?e/
Now it's ungreedy. It matches as little as possible before an "e", so it stops at the first one and we get our expected "Bumble".
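To check this yourself, here is a small sketch that runs both patterns on the same string. The variable names are just placeholders:

```php
<?php
$subject = 'Bumble Bee';

preg_match('/^.*e/', $subject, $greedy);  // greedy: runs to the last "e"
preg_match('/^.*?e/', $subject, $lazy);   // ungreedy: stops at the first "e"

echo $greedy[0], "\n"; // Bumble Bee
echo $lazy[0], "\n";   // Bumble
```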

Posted: Tue May 15, 2007 8:25 am
by feyd
Here's the general form I've used in many places
Code:
#<\s*[^<>\s]+(?:\s+[a-z-]+(?:\s*=\s*(["'`]?).*?\\1)?)*[^>]*>#is
Posted: Tue May 15, 2007 9:00 am
by stereofrog
Yes, good expression. I use this one from Friedl's book
Code:
$html_tag = <<<REGEXP
~<
\s*
(\w+)
\b
(?:
" [^"]* "
|
' [^']* '
|
[^>"']
)*
>
~sx
REGEXP;
Posted: Tue May 15, 2007 11:10 am
by nwp
Thanks a lot, d11wtq, for the information on '?'.
But I didn't understand the ~sx.
What's the job of the ~?
Posted: Tue May 15, 2007 11:23 am
by stereofrog
~ is just a delimiter. In PCRE you're not limited to /; you can use any character that is not alphanumeric, a backslash, or whitespace.
s = "dot matches newline" modifier, x = "extended" (ignore whitespace and allow # comments in the regexp).
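A rough sketch of the delimiter and the two modifiers in action. The test string and $pattern are made up for illustration:

```php
<?php
$text = "a\nb";

var_dump(preg_match('/a.b/', $text));   // int(0): "." does not match the newline
var_dump(preg_match('~a.b~s', $text));  // int(1): "~" delimiter, s lets "." match "\n"
var_dump(preg_match('#a.b#s', $text));  // int(1): "#" works as a delimiter too

// x ignores whitespace in the pattern and allows "#" comments up to end of line.
$pattern = '~
    a   # literal a
    .   # any character, including newline because of s
    b   # literal b
~sx';
var_dump(preg_match($pattern, $text));  // int(1)
```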
Posted: Tue May 15, 2007 11:23 am
by nwp
And how can I extract from this string
hello@hello
string@string
name@name
nwp@nwp
txt@txt
I want to extract with something like (.*)@(.*), but only where both (.*) groups are the same.
I want to do it using only regex.
Is there something like an if statement or variables in regex?
Posted: Tue May 15, 2007 11:26 am
by stereofrog
re-read d11's post about *?
he explained that pretty well
Posted: Tue May 15, 2007 11:27 am
by nwp
nwp wrote: I want to extract with something like (.*)@(.*), but only where both (.*) groups are the same.
Sorry, I may have overlooked it, but I really didn't find anything about that.
Posted: Tue May 15, 2007 12:28 pm
by stereofrog
Please post an exact input string and what you want to match.
Posted: Tue May 15, 2007 1:15 pm
by nwp
nwp wrote:And how can I extract from this string
hello@hello
string@string
name@name
nwp@nwp
txt@txt
I want to extract with something like (.*)@(.*), but only where both (.*) groups are the same.
I want to extract hello and hello if and only if the strings on both sides of the @ are the same,
e.g. stra@strb would not match,
but str@str would extract both of the 'str'.
Posted: Tue May 15, 2007 1:30 pm
by stereofrog
\1 is called a backreference; look it up in the manual.
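A minimal sketch of the idea, assuming one address per line. The sample string is put together from the examples above, and the ^, $ and m modifier are my own additions to keep each line separate:

```php
<?php
$str = "hello@hello\nstring@string\nstra@strb\nnwp@nwp";

// (.+) captures the left side; \1 must then match exactly the same text.
// With the m modifier, ^ and $ anchor at the start and end of each line.
preg_match_all('/^(.+)@\1$/m', $str, $m);

print_r($m[1]); // hello, string, nwp (stra@strb is skipped)
```

$m[1] holds the captured left-hand side of each matching line; the right-hand side is identical by construction.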