preg_replace ?

Posted: Mon Feb 09, 2009 9:38 pm
by t2birkey
I have the entire webpage stored in a string variable before the end product is output. I want to first search through my page and check whether there are any div layers that have certain words within them. An example would be:

Code: Select all

<div name="content1" style="whatever"><h1>sample title</h1><p>this is a paragraph that i want to find something from.</p></div>
Now I want to search through my page before it is inserted, remove this div layer, and replace it with a new div layer that may look like <div class="error"><h1>404</h1><p>this is a warning message because you used a bad word and I don't like it.</p></div>

I am having problems selecting the entire div layer. Sometimes I find a pattern that selects it, but it seems to select too much of the body of the page.

Thanks in advance.

Re: preg_replace ?

Posted: Mon Feb 09, 2009 9:39 pm
by Citizen
What code are you using?

Re: preg_replace ?

Posted: Mon Feb 09, 2009 9:56 pm
by t2birkey
I actually made that example up on the fly.. But an example might be:

Code: Select all

$body = preg_replace('/<div[^>].*>.*paragraph.*<.div>/', $error_div, $body);
I am looking for a div layer that says paragraph in it. In the real code the words I am looking for are not as common as "paragraph". I have tried many variations. I have found some tutorials on this like posting.php?mode=reply&f=1&t=94987. But I just keep running into problems!

Another thing I tried was

Code: Select all

$body = preg_replace('#<\s*div[^>].*?>.*?paragraph.*<\s*/\s*div\s*>#si', $error_div, $body);
Don't ask me what this one means.. I got it from someone else and edited it slightly, but I still can't figure out what it means.. I have no idea what \s means or #si and some of the other parts to this.
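
For readers with the same question, here is a rough reading of the pieces in that second pattern. The sample strings and variable names below are made up for illustration; only the regex syntax itself comes from the pattern above.

```php
<?php
// '#' at both ends is the pattern delimiter (an alternative to '/'),
// so '/' can appear inside the pattern without escaping.
// The letters after the closing delimiter are modifiers:
//   s = '.' also matches newlines, i = case-insensitive matching.
// '\s' matches one whitespace character, '\s*' zero or more of them.
// '.*' is greedy (as much as possible), '.*?' is lazy (as little as possible).

$text = "<DIV>one\ntwo</div>";

// Without 's', '.' stops at the newline, so there is no match.
$withoutS = preg_match('#<div>.*</div>#i', $text);

// With 's', '.' crosses the newline and the pattern matches.
$withS = preg_match('#<div>.*</div>#si', $text);

// Greedy runs to the LAST closing tag, lazy stops at the FIRST one.
preg_match('#<b>.*</b>#', '<b>x</b> and <b>y</b>', $greedy);
preg_match('#<b>.*?</b>#', '<b>x</b> and <b>y</b>', $lazy);

echo $greedy[0], "\n"; // <b>x</b> and <b>y</b>
echo $lazy[0], "\n";   // <b>x</b>
```

The greedy `.*` near the end of the pattern in the post is one plausible reason it selects "too much of the body": it extends to the last closing div on the page.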

Re: preg_replace ?

Posted: Tue Feb 10, 2009 5:30 am
by mintedjo
I think due to the recursive nature of HTML markup it would be difficult to do this correctly using regex alone.
Unless you want to look at using something else (I'm not sure what) that takes a more Backus-Naur (grammar-based) approach, you might want to think about just replacing the individual words that you don't like, rather than replacing the whole div element.
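
A small sketch of why nesting trips up a regex. The markup and the '[removed]' placeholder are invented for the demonstration; they are not from the original page.

```php
<?php
// An outer div that contains an inner div with the unwanted word.
$html = '<div id="outer"><div id="inner">bad</div><p>keep me</p></div>';

// Lazy matching starts at the first "<div" (the outer one) but stops at the
// first "</div>" it can find, which belongs to the inner div.
$result = preg_replace('#<div[^>]*>.*?bad.*?</div>#s', '[removed]', $html);

// $result is now '[removed]<p>keep me</p></div>'
// The outer opening tag is gone, its closing tag is left behind.
echo $result, "\n";
```

A greedy version has the opposite problem and can swallow unrelated divs that follow. A regex has no way to count which closing tag pairs with which opening tag.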

Re: preg_replace ?

Posted: Tue Feb 10, 2009 4:10 pm
by t2birkey
Yes, that would be much easier but I need it to be done like this. Is there a way to read HTML code similar to XML? I could parse through each element, like html, to body, to each div child within the body tag. I was thinking of maybe using explode to split on the body tag and then split from div to div.. but I am afraid that when I try to split on the next </div> it might not be the correct end div.

Re: preg_replace ?

Posted: Tue Feb 10, 2009 4:13 pm
by John Cartwright
t2birkey wrote: Yes, that would be much easier but I need it to be done like this. Is there a way to read HTML code similar to XML?

Code: Select all

 
$xmlDoc = new DOMDocument();
$xmlDoc->load($html);
:)

Re: preg_replace ?

Posted: Tue Feb 10, 2009 4:27 pm
by t2birkey
Thank you John. I will give this a try.

Re: preg_replace ?

Posted: Tue Feb 10, 2009 5:07 pm
by t2birkey
Okay.. I have loaded the HTML string into the object. What next? I tried dumping the $xmlDoc object but it is empty. I also tried some functions like getElementsByTagName.. but the object is still empty. I'm guessing there was a problem with loading the HTML data.

Code: Select all

object(DOMDocument)#1 (0) { }

Re: preg_replace ?

Posted: Wed Feb 11, 2009 4:47 am
by mintedjo
I think I had some trouble trying to dump DOMNodes.
If your HTML is valid then the code John gave will load it fine.
Have you tried

Code: Select all

$xmlDoc = new DOMDocument();
$xmlDoc->load($html);
echo $xmlDoc->saveXML();
Because if that doesn't work the chances are your HTML is wrong.
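
Another possibility worth checking, offered as an assumption since the full script is not shown in the thread: DOMDocument::load() expects a path or URL to an XML file, while DOMDocument::loadHTML() parses markup that is already held in a string. In some PHP versions var_dump() on a DOMDocument also prints an empty-looking object even after a successful load, so saveHTML() or a node count is a more reliable check. A minimal sketch with a made-up sample string:

```php
<?php
// Sample markup standing in for the real page (hypothetical content).
$html = '<html><body><div name="content1"><h1>sample title</h1>'
      . '<p>this is a paragraph</p></div></body></html>';

$doc = new DOMDocument();

// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);   // parses a string, unlike load() which reads a file
libxml_clear_errors();

// Inspect the result through the DOM API rather than var_dump().
$divCount = $doc->getElementsByTagName('div')->length;
$firstHeading = $doc->getElementsByTagName('h1')->item(0)->textContent;

echo "divs found: $divCount\n";
echo "first heading: $firstHeading\n";
echo $doc->saveHTML();
```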

Re: preg_replace ?

Posted: Wed Feb 11, 2009 10:22 am
by John Cartwright
Take a look at HTMLpurifier, which can cleanse your html to produce valid markup.

Re: preg_replace ?

Posted: Wed Feb 11, 2009 3:47 pm
by t2birkey
HTMLpurifier looks like a great project. I think I will start using it in some of my other projects. I ran a test query on their demo to clean up my HTML and tried dumping the xmlDoc variable once again. The result was the same (object(DOMDocument)#1 (0) { }). I also tried mintedjo's suggestion of echo $xmlDoc->saveXML(); on the clean HTML. Nothing was echoed.

Any further steps/ideas? I am open to taking this in a new direction. I really would like to get an understanding of the preg_replace function's syntax for search terms.. but obviously the main goal is to get it working first, then to make it efficient. :P

Finally, John Cartwright: Thanks for the continued support/ideas

Re: preg_replace ?

Posted: Wed Feb 11, 2009 4:17 pm
by t2birkey
Within the HTML Purifier code I saw a function which used preg_match to find the inner workings of the URI. This is the syntax that I do not fully understand nor do I know where to find documentation for it.

Code: Select all

 
            '(([^:/?#"<>]+):)?'. // 2. Scheme
            '(//([^/?#"<>]*))?'. // 4. Authority
            '([^?#"<>]*)'.       // 5. Path
            '(\?([^#"<>]*))?'.   // 7. Query
            '(#([^"<>]*))?'.     // 8. Fragment
 

I can make out bits, mostly because the comments hint at what they are.
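
One way to see what each piece captures is to join the fragments into a single pattern and run it on a sample URL. This is only a sketch: the '!' delimiters and the sample URL are assumptions made here, and the group numbers are counted from the opening parentheses. By that count the fragment text lands in group 9, with group 8 holding the fragment including its leading '#'.

```php
<?php
// The fragments from the post, joined with '!' as the delimiter so that
// neither '/' nor '#' inside the pattern needs escaping.
$pattern = '!'
    . '(([^:/?#"<>]+):)?'   // group 1 "http:",         group 2 scheme
    . '(//([^/?#"<>]*))?'   // group 3 "//example.com", group 4 authority
    . '([^?#"<>]*)'         // group 5 path
    . '(\?([^#"<>]*))?'     // group 6 "?id=7",         group 7 query
    . '(#([^"<>]*))?'       // group 8 "#top",          group 9 fragment
    . '!';

// [^...] means "any one character NOT in this list"; '+' is one or more,
// '*' is zero or more, and a trailing '?' makes the whole group optional.
preg_match($pattern, 'http://example.com/docs/page.php?id=7#top', $m);

echo 'scheme:    ', $m[2], "\n"; // http
echo 'authority: ', $m[4], "\n"; // example.com
echo 'path:      ', $m[5], "\n"; // /docs/page.php
echo 'query:     ', $m[7], "\n"; // id=7
echo 'fragment:  ', $m[9], "\n"; // top
```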

Re: preg_replace ?

Posted: Fri Feb 13, 2009 7:49 am
by mintedjo
preg_match searches for parts of strings that match a specified regular expression.
I would explain the meaning of all the expressions above but you can probably work them out if you read some stuff from the webpages below.

preg_match documentation
http://us2.php.net/manual/en/function.preg-match.php
For lots of useful info on regular expressions
http://www.regular-expressions.info

Back to the topic at hand, using preg_match alone isn't a good way to solve your problem at all. It will be more difficult than just finding out why the HTML won't load into a DOMDocument :-P
If you can get the document loaded you can check all the DOM elements individually (using a preg_match if you want) and then replace them with a new DOM element if they contain naughty words.

EDIT: only just realised you said you can make out bits because of the comments... so what I posted is mostly irrelevant. I deleted all the nonsense :-D
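
The load-check-replace idea described above could look roughly like this. It is a sketch under assumptions, not a tested solution for the original page: replaceFlaggedDivs() is a hypothetical helper name, the banned-word list and the error text are placeholders, and it uses loadHTML() because the page is held in a string.

```php
<?php
// Hypothetical helper: replaces every <div> whose text contains a banned
// word with a fixed error <div>, then returns the serialized document.
function replaceFlaggedDivs(string $html, array $bannedWords): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);  // keep warnings about sloppy HTML quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Build one case-insensitive pattern such as /\b(?:foo|bar)\b/i
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $bannedWords);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    // Copy the live node list into a plain array first, because replacing
    // nodes while walking a live DOMNodeList can skip elements.
    $divs = iterator_to_array($doc->getElementsByTagName('div'));

    foreach ($divs as $div) {
        // textContent includes the text of nested elements, so an outer div
        // is flagged when any div inside it contains a banned word.
        if (!preg_match($pattern, $div->textContent)) {
            continue;
        }
        $error = $doc->createElement('div');
        $error->setAttribute('class', 'error');
        $error->appendChild($doc->createElement('h1', '404'));
        $error->appendChild($doc->createElement('p', 'This content was removed.'));
        $div->parentNode->replaceChild($error, $div);
    }

    return $doc->saveHTML();
}

$page = '<div name="content1"><h1>sample title</h1><p>this is a paragraph</p></div>'
      . '<div id="ok"><p>fine text</p></div>';
$clean = replaceFlaggedDivs($page, ['paragraph']);
echo $clean;
```

Because the parser pairs opening and closing tags itself, the "which </div> is the right one" question from earlier in the thread does not come up. Note that saveHTML() returns a whole document, including a doctype and html/body wrappers if the input lacked them.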

Re: preg_replace ?

Posted: Fri Feb 13, 2009 1:18 pm
by t2birkey
Actually your post was a good start. I will look into http://www.regular-expressions.info/ but I would also like to see more webpages which explain regular expression searching. There seem to be many different ways of searching with regular expressions and I would enjoy reading all about them. I see this function as very useful if I could understand all its uses.

Re: preg_replace ?

Posted: Mon Feb 16, 2009 4:04 am
by mintedjo
Cool.
Googling will bring up lots of stuff about regular expressions but if you need help with anything specific I'm sure me or somebody else on here will be willing to help as much as possible.