<img[\s]+(.*)>

nwp
Forum Contributor
Posts: 105
Joined: Sun Feb 04, 2007 12:25 pm

Post by nwp »

I am trying to extract all the attributes of a tag using only regex.

Code: Select all

$str = '<img src="loc" name = "my_name" title = "my_title">
<img src="loc2" name = "my_name2" title = "my_title2">';
preg_match_all('/<img[\s]+(.*)>/', $str, $m, PREG_SET_ORDER);
print_r($m);
And this outputs:

Code: Select all

Array
(
    [0] => Array
        (
            [0] => <img src="loc" name = "my_name" title = "my_title">
            [1] => src="loc" name = "my_name" title = "my_title"
        )
    [1] => Array
        (
            [0] => <img src="loc2" name = "my_name2" title = "my_title2">
            [1] => src="loc2" name = "my_name2" title = "my_title2"
        )
)
Up to this point everything is fine, but if I change $str to

Code: Select all

$str = '<img src="loc"
name = "my_name" title = "my_title">
<img src="loc2" name = "my_name2" title = "my_title2">';
it just outputs

Code: Select all

Array
(
    [0] => Array
        (
            [0] => <img src="loc2" name = "my_name2" title = "my_title2">
            [1] => src="loc2" name = "my_name2" title = "my_title2"
        )
)
It drops the first img, even though I've used \s here.
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

.*?

Not .*

<img[\s]+(.*?)>
stereofrog
Forum Contributor
Posts: 386
Joined: Mon Dec 04, 2006 6:10 am

Post by stereofrog »

Also, to allow newlines, add the s modifier:

Code: Select all

/<img\b(.*?)>/si
Note that this won't work for e.g. <img src=blah onclick="if(a>b)do(c)">, because the match stops at the first ">", which here is inside the attribute value.
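
To see both points in action, here is a minimal sketch, assuming PHP's preg functions. The sample strings and variable names are made up for illustration.

```php
<?php
// Two tags; the first one is split across two lines.
$str = "<img src=\"loc\"\nname = \"my_name\" title = \"my_title\">\n"
     . "<img src=\"loc2\" name = \"my_name2\" title = \"my_title2\">";

// Without s, "." cannot cross the newline, so the first tag is skipped.
preg_match_all('/<img\b(.*?)>/i', $str, $without_s, PREG_SET_ORDER);

// With s, "." also matches newlines, so both tags are found.
preg_match_all('/<img\b(.*?)>/si', $str, $with_s, PREG_SET_ORDER);

echo count($without_s), "\n"; // 1
echo count($with_s), "\n";    // 2

// The limitation: a ">" inside a quoted value ends the lazy match early.
$tricky = '<img src=blah onclick="if(a>b)do(c)">';
preg_match('/<img\b(.*?)>/si', $tricky, $m);
var_dump($m[1]); // captured text stops right before the ">" in "a>b"
```

The last capture is cut off in the middle of the onclick value, which is why a pattern that knows about quotes is needed for such tags.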
nwp

Post by nwp »

Thanks, it worked.

Code: Select all

/<img[\s]+(.*?)>/si
But what's the difference between (.*) and (.*?)? I've read the regex tutorial here but didn't understand it.
Chris Corbyn

Post by Chris Corbyn »

nwp wrote: But what's the difference between (.*) and (.*?)? I've read the regex tutorial here but didn't understand it.
.* is greedy. That's the technical term.

.* says "match anything (except a newline, by default), any number of times".
(.*)> says "match anything, any number of times, then match >".

That means "anything, any number of times" can include a ">" character, provided another ">" can be found further ahead in the string. Greedy.

.*? The "?" makes the pattern ungreedy (also called lazy), so "any number of times" becomes "as few times as possible, until > is seen".

This is a concept which commonly confuses people and takes a while to get your head around.

Take this string "Bumble Bee".

/^.*e/

Reading that pattern, you might think it matches "Bumble", but it doesn't. The .* is greedy, so it keeps going until it gets to the last available "e", which incidentally is the last character in our string. So in fact it matches "Bumble Bee".

/^.*?e/

Now it's ungreedy. It matches everything until it sees the first "e", so we have our expected "Bumble" ;)
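
The comparison above can be checked with a short script. This is only a sketch, assuming PHP's preg_match; the second half applies the same idea to two made-up img tags on one line.

```php
<?php
$subject = 'Bumble Bee';

// Greedy: .* grabs as much as it can, then gives back only enough to find an "e".
preg_match('/^.*e/', $subject, $greedy);

// Lazy: .*? takes as little as it can, so it stops at the first "e".
preg_match('/^.*?e/', $subject, $lazy);

echo $greedy[0], "\n"; // Bumble Bee
echo $lazy[0], "\n";   // Bumble

// The same effect with two tags on one line (illustrative input).
$tags = '<img src="a"><img src="b">';
preg_match('/<img\s+(.*)>/', $tags, $g);  // greedy runs to the last ">"
preg_match('/<img\s+(.*?)>/', $tags, $l); // lazy stops at the first ">"

echo $g[1], "\n"; // src="a"><img src="b"
echo $l[1], "\n"; // src="a"
```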
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Here's the general form I've used in many places

Code: Select all

#<\s*[^<>\s]+(?:\s+[a-z-]+(?:\s*=\s*(["'`]?).*?\\1)?)*[^>]*>#is
stereofrog

Post by stereofrog »

Yes, good expression. I use this one from Friedl's book

Code: Select all

$html_tag = <<<REGEXP
	~<
		\s*
		(\w+)
		\b
		(?:
			" [^"]* "
			|
			' [^']* '
			|
			[^>"']
		)*
	>
	~sx
REGEXP;
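
As a rough usage sketch (the test string and the comments inside the pattern are my additions, and a nowdoc is used here instead of a heredoc so nothing in the pattern gets interpolated), this shows the expression coping with a ">" inside a quoted attribute value:

```php
<?php
// Same expression as above; x allows the whitespace and # comments.
$html_tag = <<<'REGEXP'
~<
    \s*
    (\w+)            # tag name
    \b
    (?:
        " [^"]* "    # a double-quoted value, which may contain >
        |
        ' [^']* '    # a single-quoted value, which may contain >
        |
        [^>"']       # any other character that is not > or a quote
    )*
>
~sx
REGEXP;

$tricky = '<img src=blah onclick="if(a>b)do(c)">';
preg_match($html_tag, $tricky, $m);

echo $m[0], "\n"; // the whole tag, including the quoted "a>b"
echo $m[1], "\n"; // img
```

Because quoted values are consumed as a unit, the ">" inside onclick no longer ends the match.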
nwp

Post by nwp »

Thanks a lot, d11wtq, for the information on '?'.
But I didn't understand the ~sx.
What's the job of the ~?
stereofrog

Post by stereofrog »

~ is just a delimiter. In PCRE you're not limited to /; you can use any character that is not alphanumeric, a backslash, or whitespace.

s = "dot matches newline" modifier, x = "extended" (ignore whitespace and comments in regexp).
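
A small sketch of the delimiter and modifier points, assuming PHP's preg_match; the subject string and variable names are invented for the example.

```php
<?php
$subject = "<IMG src=\"a\"\nalt=\"b\">";

// The same expression with three different delimiters.
$patterns = [
    '/<img\b(.*?)>/si',
    '~<img\b(.*?)>~si',
    '#<img\b(.*?)>#si',
];
$results = [];
foreach ($patterns as $p) {
    $results[] = preg_match($p, $subject);
}
print_r($results); // every entry is 1

// With x, whitespace and # comments inside the pattern are ignored.
$verbose = '~
    <img \b     # tag name; i makes it case-insensitive
    (.*?)       # attributes, lazily; s lets . cross newlines
    >
~six';
echo preg_match($verbose, $subject), "\n"; // 1
```

Picking a delimiter that does not occur in the pattern saves escaping; with / as the delimiter, a pattern for a closing tag would need every slash written as \/.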
nwp

Post by nwp »

And how can I extract from this string?
hello@hello
string@string
name@name
nwp@nwp
txt@txt
I want to extract with something like (.*)@(.*), but both of the (.*) must be the same.
I want to do it using only regex.
Is there something like an if statement or variables in regex?
stereofrog

Post by stereofrog »

Re-read d11's post about *?
He explained that pretty well.
nwp

Post by nwp »

nwp wrote: I want to extract with something like (.*)@(.*), but both of the (.*) must be the same.
Sorry, I may have overlooked it, but I really didn't find anything about that.
stereofrog

Post by stereofrog »

Please post an exact input string and what you want to match.
nwp

Post by nwp »

nwp wrote: And how can I extract from this string?
hello@hello
string@string
name@name
nwp@nwp
txt@txt
I want to extract with something like (.*)@(.*), but both of the (.*) must be the same.
I want to extract hello and hello if and only if the strings on both sides of @ are the same,
e.g. stra@strb would not match,
but str@str would extract both of the 'str'.
stereofrog

Post by stereofrog »

Code: Select all

/^(.+)@\1$/
\1 is called a backreference; look it up in the manual. If the input contains several lines, add the m modifier so that ^ and $ match at each line.
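
For a multi-line input like the one posted earlier, a sketch might look like this. It assumes preg_match_all with the m modifier added, and the input string is made up from the examples in this thread.

```php
<?php
$input = "hello@hello\nstra@strb\nstr@str\nnwp@nwp";

// m makes ^ and $ match at every line; \1 must repeat exactly what group 1 captured.
preg_match_all('/^(.+)@\1$/m', $input, $matches);

print_r($matches[0]); // hello@hello, str@str, nwp@nwp
print_r($matches[1]); // hello, str, nwp
```

The line stra@strb is skipped because nothing that (.+) can capture before the @ is repeated exactly after it.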