Why is regex so hard to understand!?

nickvd
DevNet Resident
Posts: 1027
Joined: Thu Mar 10, 2005 5:27 pm
Location: Southern Ontario

Post by nickvd »

I'm having a tough time grasping regex. I am implementing an editor for my boss so our clients can edit their pages directly from the browser (I'm using FCKeditor). The page has to be set up as follows:

Code: Select all

this text is un-editable<br/>
<div class="editor" id="firstEditBox">
   This Can be editted
</div>
More un-editable<br/>
<div class="editor" id="secondEditBox">
   This also, can be editted
</div>
no edits here
We want certain parts of the page, i.e. the layout, to remain the same, but other portions to be freely editable by the client.

I'm using the following code to try to find each instance of the div tags that denote an editable area. However, as it is now, the pattern only finds one (though I thought preg_match_all() would find all occurrences). It only finds the second of the two in the above example (if I invalidate the second, it finds the first).

Code: Select all

$pattern = '/<div[^>].+class="editor".+id="([\w]+[^\"]+)".*>(.*)<\/div>/i';
$try = '<div class="editor" id="introText">I can edit this text</div>text text<div class="editor" id="text">I can edit this text</div>';
preg_match_all($pattern, $try, $mat);

/* outputs:
Array
(
    [0] => Array
        (
            [0] => 
I can edit this text
text text
I can edit this text

        )

    [1] => Array
        (
            [0] => text
        )

    [2] => Array
        (
            [0] => I can edit this text
        )

)*/
One other note: it will NOT work with newlines. To be precise, the pattern WON'T work on the above HTML, but if it were all on one line in the variable, it would work (with the logic error outlined above).

I'm not sure what the best way to achieve what I'm looking for is: using file() to read each line into an array, or reading the whole thing using fread(). For now, I'll be the only one using this, so I can format the div tags in any way that will work, but I'd love for it to find any of the tags I use no matter how they're formatted, as long as they have class="editor" and id="<any>" ... ANY help would be appreciated.

Thanks in advance...
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Use the ungreedy marks for the space between divs and inside the opening div. For instance, .* should be .*?
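
A minimal sketch of the difference, assuming a simplified pattern (not the one from the first post) and a made-up two-div test string. It only illustrates why a greedy .* swallows everything up to the last </div> while .*? stops at the first one:

```php
<?php
// Two editable regions on one line (hypothetical test data).
$html = '<div class="editor" id="a">one</div> filler <div class="editor" id="b">two</div>';

// Greedy: .* runs to the end of the string, then backtracks to the LAST </div>,
// so both regions end up inside a single match.
preg_match_all('#<div[^>]*>(.*)</div>#i', $html, $greedy);

// Lazy: .*? grows one character at a time and stops at the FIRST </div>.
preg_match_all('#<div[^>]*>(.*?)</div>#i', $html, $lazy);

echo "greedy matches: ", count($greedy[1]), "\n";
echo "lazy matches:   ", count($lazy[1]), "\n";
print_r($lazy[1]);
```

With the greedy version the single capture contains the filler text and the second opening tag as well, which is the symptom described in the first post.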

Post by nickvd »

feyd wrote: use the ungreedy marks for the space between divs and inside the opening div.. For instance, .* should be .*?
That didn't help me any :( However, I spent some time googling and was able to come up with

Code: Select all

$pattern = '#<div.+class=[\'|"]editor[\'|"].+id=[\'|"]([^\'"]+)[\'"][^>]*>(.*)</div>#i';
The above pattern works perfectly on multiple editor blocks, and on multiple lines; however, it chokes if it finds an <[tag] id="<any>"> within the div tags, as you can see below:

Code: Select all

$try = '<div class=\'editor\' style="blah;" id="first">FIRST <span id="asdfsdf">asdfsd</span> MATCH</div>
<div class="editor" id="second">SECOND MATCH</div>
<div class="editor" id="third">THIRD MATCH</div>';


$pattern = '#<div.+class=[\'|"]editor[\'|"].+id=[\'|"]([^\'"]+)[\'"][^>]*>(.*)</div>#i';
preg_match_all($pattern, $try, $mat, PREG_SET_ORDER);

/*
Returns
Array
(
    [0] => Array
        (

            [0] => <div class='editor' style="blah;" id="first">FIRST <span id="asdfsdf">asdfsd</span> MATCH</div>
            [1] => asdfsdf
            [2] => asdfsd</span> MATCH
        )

    [1] => Array
        (
            [0] => <div class="editor" id="second">SECOND MATCH</div>
            [1] => second
            [2] => SECOND MATCH
        )

    [2] => Array
        (
            [0] => <div class="editor" id="third">THIRD MATCH</div>
            [1] => third
            [2] => THIRD MATCH
        )

)

Should Return:
Array
(
    [0] => Array
        (
            [0] => <div class='editor' style="blah;" id="first">FIRST <span id="asdfsdf">asdfsd</span> MATCH</div>
            [1] => first
            [2] => FIRST <span id="asdfsdf">asdfsd</span> MATCH
        )

    [1] => Array
        (
            [0] => <div class="editor" id="second">SECOND MATCH</div>
            [1] => second
            [2] => SECOND MATCH
        )

    [2] => Array
        (
            [0] => <div class="editor" id="third">THIRD MATCH</div>
            [1] => third
            [2] => THIRD MATCH
        )

)*/
Again, any help would be greatly appreciated!

Post by feyd »

What did you try as far as ungreedy marks go? Because I know it works as you are expecting...
Chris Corbyn
Breakbeat Nuttzer
Posts: 13098
Joined: Wed Mar 24, 2004 7:57 am
Location: Melbourne, Australia

Post by Chris Corbyn »

Regexps are only difficult to understand if you jump straight in and don't 'build' the regexp from a very basic one into a more complex one...
It's something that just comes with practice ;-) Yeah... I don't see any "?" following your ".*" -> .* consumes everything ahead of it (pretty much) because it's greedy.
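
One way to picture "building" a pattern is to grow it one idea at a time and check each stage against the same input. This is only a sketch: the step labels, the $steps array and the test string are made up for illustration, and the patterns assume double-quoted attributes with class appearing before id.

```php
<?php
// Hypothetical test string: an editable div containing a tag with its own id.
$html = '<div class="editor" id="first">FIRST <span id="x">inner</span> MATCH</div>';

// Each step adds one idea to the previous pattern.
$steps = [
    'any opening div'      => '#<div[^>]*>#i',
    '... with the class'   => '#<div[^>]*\sclass="editor"[^>]*>#i',
    '... capture the id'   => '#<div[^>]*\sclass="editor"[^>]*\sid="([^"]+)"[^>]*>#i',
    '... capture the body' => '#<div[^>]*\sclass="editor"[^>]*\sid="([^"]+)"[^>]*>(.*?)</div>#is',
];

$results = [];
foreach ($steps as $label => $pattern) {
    // Keep the match array (or null) so every stage can be inspected.
    $results[$label] = preg_match($pattern, $html, $m) ? $m : null;
    printf("%-22s %s\n", $label, $results[$label] ? 'matched' : 'no match');
}
```

Because [^>]* cannot cross a closing angle bracket, the id capture stays inside the opening div tag instead of wandering into the inner span, which is where a bare .+ goes wrong.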

Post by nickvd »

Okay, now I'm trying

Code: Select all

$pattern = '#<div.+class=[\'|"]editor[\'|"].+id=[\'|"]([^\'"]+)[\'"][^>]*?>(.*?)</div>#i';
and it's not working at all...
To be exact, my intent is to use file_get_contents() on an entire page such as:

Code: Select all

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
	<head>
		<title>FCKeditor - Sample</title>
		<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	</head>
	<body>
   Hey there, welcome to my page...<br/>
<hr/>
   <div class="editor" id="introText">
      I can edit this text!
   </div>
<hr/>
wow, more text that's not edittable
<hr/>
   <div class="editor" id="moreText">
      But here is some more text that can be changed!
   </div>
	</body>
</html>
and pull just the <div class="editor" ...> blocks out of it.

I can get the pattern to work if I just use three div tags, one on each line, with no whitespace or newlines inside them (or nested divs, which is my other major roadblock). However, every pattern I've tried fails on the actual page itself...
pickle
Briney Mod
Posts: 6445
Joined: Mon Jan 19, 2004 6:11 pm
Location: 53.01N x 112.48W

Post by pickle »

This may not be much help, but why bring in the uneditable text into the editor? I use FCKeditor for my CMS, and I only put part of the page in the editor. The way my CMS works, like most, is that there's a navigation shell around the editable content. I don't put that part into the editor, just what the users can edit.
Real programmers don't comment their code. If it was hard to write, it should be hard to understand.

Post by nickvd »

I'm not putting the whole page into the editor. I'm loading the whole page into a variable so I can pull out the editable parts (within tags that match: <div class="editor" id="editorId">*EDITME*</div>).

Some of our clients' pages use a templating system (?page=contact etc.) but some don't.

Post by feyd »

Your pattern seems odd. What do you think [\'|"] means? .+ is greedy, by the way. Your pattern doesn't work with multiple-line parts because you aren't asking it to. Use the 's' modifier for that. Try this...

Code: Select all

$pattern = '#<\s*div[^>]*?\s+class\s*=\s*(["\']?)editor\\1[^>]*?\s+id\s*=\s*(["\']?)(.*?)\\2[^>]*?>\s*(.*?)\s*<\s*/\s*div\s*>#is';
I'm not planning on explaining how that works, if it does :)
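
For anyone puzzling over the pieces, here is a rough sketch of three of the ideas in isolation: the literal pipe inside a character class, the 's' modifier, and the quote backreference. The test strings and variable names are invented for the demonstration; the small patterns are simplified stand-ins, not the full pattern above.

```php
<?php
// 1) Inside [...], | is a literal pipe, not "or". So ['|"] also accepts a bare |.
$pipeInClass  = preg_match('#^[\'|"]$#', '|');   // class contains ' | "
$pipeNotThere = preg_match('#^[\'"]$#', '|');    // class contains only ' "

// 2) Without the s modifier, . refuses to match a newline.
$multiline = "<div class=\"editor\" id=\"intro\">\n   I can edit this text!\n</div>";
$withoutS  = preg_match('#<div[^>]*>(.*?)</div>#i', $multiline);
$withS     = preg_match('#<div[^>]*>(.*?)</div>#is', $multiline, $m);

// 3) (["']?) remembers which quote opened the value; \1 demands the same one back.
$quotePattern = '#class=(["\']?)editor\1#';
$sameQuotes   = preg_match($quotePattern, 'class="editor"');
$mixedQuotes  = preg_match($quotePattern, 'class="editor\'');

var_dump($pipeInClass, $pipeNotThere, $withoutS, $withS, $sameQuotes, $mixedQuotes);
```

Only the calls with the pipe in the class, the s modifier, and matching quotes should report a match.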

Beware... if you have nested <div> that will match, this will not work.
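
If nested divs turn out to matter, one alternative to regex is to let an HTML parser find the matching close tag. The following is a sketch only, swapping in PHP's DOM extension (DOMDocument and DOMXPath) in place of the pattern above. It assumes the DOM extension is available and that class is exactly "editor"; the sample markup and the $regions name are hypothetical.

```php
<?php
// Hypothetical page fragment: an editable div that contains another div.
$html = '<div class="editor" id="outer">Hello <div class="note">nested</div> world</div>'
      . '<div class="plain" id="skip">not editable</div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every div whose class attribute is exactly "editor" and that has an id.
$xpath   = new DOMXPath($doc);
$regions = [];
foreach ($xpath->query('//div[@class="editor"][@id]') as $div) {
    // Rebuild the inner HTML by serialising each child node in turn.
    $inner = '';
    foreach ($div->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $regions[$div->getAttribute('id')] = $inner;
}

print_r($regions);
```

The parser keeps track of which closing tag belongs to which div, so the inner div comes back as part of the editable content rather than ending the match early.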

Post by nickvd »

Thanks feyd, I haven't tried it yet... but is it possible to get a pattern that will match, but still let me have nested <div> tags, as long as they don't have class="editor"?

<edit>

I tested the pattern, and it works with multiple lines, and with inner tags that have id attributes (it still pukes on inner </div> tags, but I can understand that, and work around it for now),

but it will not work with a full HTML page... code below:

Code: Select all

$file = htmlspecialchars(file_get_contents("mainpage.php"));
echo "<pre>".print_r($file, TRUE)."</pre><hr/>";

$file2 = '
<div class="editor" id="introText">
   I can e <span id="fsdfsdf">blah </span>this text!
<hr/>
Wow, more text that not edittable
<hr/>
</div>
<div class="editor" id="moreText">
   But <a href="test">yay?</a>
      This is spanned
   is some more text that can be changed!
</div>
';
/*$pattern = '/<div[^>].+class="editor".+(id="([\w]+[^\"]+))?".*>(.*?)<\/div>?/i';
$pattern = '#<div.+class=[\'|"]editor[\'|"].+id=[\'|"]([^\'"]+)[\'|"][^>]*>([\s\S]*?)[</div>]#i';
$pattern = '#</?\w+((\s+\w+(\s*=\s*(?:"(.|\n)*?"|\'(.|\n)*?\'|[^\'">\s]+))?)+\s*|\s*)/?>#i';
$pattern = '#<div.+class=[\'|"]editor[\'|"].+id=[\'|"]([^\'"]+)[\'"][^>]*?>(.*?)</div>#i';*/

$pattern = '#<\s*div[^>]*?\s+class\s*=\s*(["\']?)editor\\1[^>]*?\s+id\s*=\s*(["\']?)(.*?)\\2[^>]*\s*>(.*?)\s*<\s*/\s*div\s*>#is';
if (preg_match_all($pattern, $file2, $mat, PREG_SET_ORDER))
   echo '<pre>'.htmlspecialchars(print_r($mat, TRUE)).'</pre>';
else
   echo 'Failure';

Code: Select all

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
	<head>
		<title>FCKeditor - Sample</title>
		<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	</head>
	<body>
   Hey there, welcome to my page...<br/>
<hr/>
   <div class="editor" id="introText">
      I can edit this text!
   </div>
<hr/>
wow, more text that's not edittable
<hr/>
   <div class="editor" id="moreText">
      But here is some more text that can be changed!
   </div>
	</body>
</html>

Post by feyd »

Code: Select all

<?php

$file2 = '
<div class="editor" id="introText">
   I can e <span id="fsdfsdf">blah </span>this text!
<hr/>
Wow, more text that not edittable
<hr/>
</div>
<div class="editor" id="moreText">
   But <a href="test">yay?</a>
      This is spanned
   is some more text that can be changed!
</div>
';
/*$pattern = '/<div[^>].+class="editor".+(id="([\w]+[^\"]+))?".*>(.*?)<\/div>?/i';
$pattern = '#<div.+class=[\'|"]editor[\'|"].+id=[\'|"]([^\'"]+)[\'|"][^>]*>([\s\S]*?)[</div>]#i';
$pattern = '#</?\w+((\s+\w+(\s*=\s*(?:"(.|\n)*?"|\'(.|\n)*?\'|[^\'">\s]+))?)+\s*|\s*)/?>#i';
$pattern = '#<div.+class=[\'|"]editor[\'|"].+id=[\'|"]([^\'"]+)[\'"][^>]*(.*?)</div>#i';*/

$pattern = '#<\s*div[^>]*?\s+class\s*=\s*(["\']?)editor\\1[^>]*?\s+id\s*=\s*(["\']?)(.*?)\\2[^>]*\s*>\s*(.*?)\s*<\s*/\s*div\s*>#is';
if (preg_match_all($pattern, $file2, $mat, PREG_SET_ORDER))
   echo print_r($mat, TRUE);
else
   echo 'Failure';

?>

Code: Select all

Array
(
    [0] => Array
        (
            [0] => <div class="editor" id="introText">
   I can e <span id="fsdfsdf">blah </span>this text!
<hr/>
Wow, more text that not edittable
<hr/>
</div>
            [1] => "
            [2] => "
            [3] => introText
            [4] => I can e <span id="fsdfsdf">blah </span>this text!
<hr/>
Wow, more text that not edittable
<hr/>
        )

    [1] => Array
        (
            [0] => <div class="editor" id="moreText">
   But <a href="test">yay?</a>
      This is spanned
   is some more text that can be changed!
</div>
            [1] => "
            [2] => "
            [3] => moreText
            [4] => But <a href="test">yay?</a>
      This is spanned
   is some more text that can be changed!
        )

)
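
Since groups 3 and 4 hold the id and the content, the PREG_SET_ORDER result can be folded into an id => content map. A sketch under assumptions: the pattern is the one from the script above with the forum's entity encoding undone, and the sample page and the $regions name are made up for the example.

```php
<?php
// Hypothetical page fragment with one double-quoted and one single-quoted class.
$html = '
<div class="editor" id="introText">
   I can edit this text!
</div>
<hr/>
<div class=\'editor\' id="moreText">More text</div>';

// Pattern as posted above (entities decoded).
$pattern = '#<\s*div[^>]*?\s+class\s*=\s*(["\']?)editor\\1[^>]*?\s+id\s*=\s*(["\']?)(.*?)\\2[^>]*\s*>\s*(.*?)\s*<\s*/\s*div\s*>#is';

$regions = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[1] and $m[2] are the captured quote characters;
        // $m[3] is the id and $m[4] the trimmed inner content.
        $regions[$m[3]] = $m[4];
    }
}

print_r($regions);
```

The nested-div caveat from earlier in the thread still applies: the lazy (.*?) stops at the first closing div it meets.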