fgomez
Forum Commoner
Posts: 61 Joined: Mon Sep 26, 2005 11:23 pm
Location: Washington, DC
Post
by fgomez » Tue Dec 19, 2006 5:25 pm
Hello,
I'm working on a PHP script to parse a text file that looks something like this:
20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
20040817113851<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
I need a pattern that matches everything from the first "<!DOC" to the first "</html>".
I am using the following, but it returns no results. Can anyone see what's wrong with this?
Code: Select all
$pattern_html = "#<!DOC[^(<!DOC)]+</html>#s" ;
preg_match_all($pattern_html, $source, $matches) ;
feyd
Neighborhood Spidermoddy
Posts: 31559 Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA
Post
by feyd » Tue Dec 19, 2006 5:41 pm
The following may be of use.
Code: Select all
[feyd@home]>php -r "$f = file_get_contents('test.txt'); preg_match_all('#^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(?s:(<!DOCTYPE.*))(?=^\d{14}<!DOCTYPE)#m',$f,$m,PREG_OFFSET_CAPTURE); var_dump($m);"
array(8) {
[0]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(111) "20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
"
[1]=>
int(0)
}
}
[1]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(4) "2004"
[1]=>
int(0)
}
}
[2]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(2) "08"
[1]=>
int(4)
}
}
[3]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(2) "13"
[1]=>
int(6)
}
}
[4]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(2) "09"
[1]=>
int(8)
}
}
[5]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(2) "37"
[1]=>
int(10)
}
}
[6]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(2) "57"
[1]=>
int(12)
}
}
[7]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(97) "<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
"
[1]=>
int(14)
}
}
}
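The lookahead at the end of that pattern only succeeds when another timestamped record follows, which is presumably why the dump shows a single match. Here is a minimal sketch of one possible variation that also accepts the end of the input, so the last record is picked up too. The sample data and the names `$source`, `$pattern` and `$records` are made up for illustration, and the timestamp is kept as one group instead of six.

```php
<?php
// Sample input modelled on the file in the first post (contents are illustrative).
$source = <<<'TXT'
20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
first message
</html>
20040817113851<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
second message
</html>
TXT;

// m: ^ matches at every line start; s: . also matches newlines.
// The lazy .*? stops at the next "14 digits + <!DOCTYPE" line,
// or at the very end of the input (\z) for the last record.
$pattern = '#^(\d{14})(<!DOCTYPE.*?)(?=^\d{14}<!DOCTYPE|\z)#ms';

preg_match_all($pattern, $source, $records, PREG_SET_ORDER);

foreach ($records as $record) {
    // $record[1] is the timestamp, $record[2] the HTML that follows it.
    echo $record[1], ' => ', strlen(trim($record[2])), " bytes\n";
}
```

PREG_SET_ORDER groups the results per record, which tends to be easier to loop over than the default per-group layout shown in the dump above.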
Kieran Huggins
DevNet Master
Posts: 3635 Joined: Wed Dec 06, 2006 4:14 pm
Location: Toronto, Canada
Contact:
Post
by Kieran Huggins » Tue Dec 19, 2006 6:01 pm
Mine's smaller:
Code: Select all
preg_replace("/.*(<!DOC.*?<\/html>).*/s","$1", $page);
Cheers,
Kieran
John Cartwright
Site Admin
Posts: 11470 Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto
Contact:
Post
by John Cartwright » Wed Dec 20, 2006 12:03 am
Kieran Huggins wrote: mine's smaller:
Code: Select all
preg_replace("/.*(<!DOC.*?<\/html>).*/s","$1", $page);
Cheers,
Kieran
But it is too greedy, and it won't work when the file contains more than one <html> ... </html> block, for whatever reason there are several.
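A quick way to see the greediness in question: the `.*?` inside the group is lazy, but the leading `.*` is not, so it swallows as much as it can before the group gets a chance to match. A small sketch, using two shortened stand-in documents (the data and variable names are hypothetical):

```php
<?php
// Two small stand-in documents (hypothetical data, shortened from the sample file).
$page = "20040813093757<!DOCTYPE first>\n<html>\nfirst\n</html>\n"
      . "20040817113851<!DOCTYPE second>\n<html>\nsecond\n</html>\n";

// The leading .* is greedy, so it backtracks only as far as the LAST "<!DOC".
$greedy_lead = preg_replace('/.*(<!DOC.*?<\/html>).*/s', '$1', $page);

// Making the leading .* lazy stops at the FIRST "<!DOC" instead.
$lazy_lead = preg_replace('/.*?(<!DOC.*?<\/html>).*/s', '$1', $page);

var_dump($greedy_lead, $lazy_lead);
```

Either way the whole input is replaced by a single block, so this approach cannot return every block in a file that holds several of them.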
by Kieran Huggins » Wed Dec 20, 2006 12:34 am
Sure it does, I tested it. It's ungreedy!
.*? is the un-greed-i-fier... thingy.
Honest!
Cheers,
Kieran
by John Cartwright » Wed Dec 20, 2006 9:44 am
"Ungreedy" was the wrong word; it was late when I posted. This is what I get, which is obviously not what the OP was looking for.
Code: Select all
Array
(
[0] => Array
(
[0] =>
20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
20040817113851<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
)
[1] => Array
(
[0] => <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
)
)
by Kieran Huggins » Wed Dec 20, 2006 9:59 am
not sure I understand... is my solution incorrect somehow?
Cheers,
Kieran
by fgomez » Wed Dec 20, 2006 6:35 pm
I wrote:
Can anyone see what's wrong with this?
Code: Select all
$pattern_html = "#<!DOC[^(<!DOC)]+</html>#s" ;
What's wrong with it is that [^(<!DOC)] is not the appropriate way to negate the "<!DOC" pattern, though I'm not sure what is. A negated character class excludes single characters: [^Z] means no "Z"s please, and [^(<!DOC)] just means none of (, <, !, D, O, C or ). I'm not sure how to negate a whole sequence of characters.
The code below is what ended up working for me:
Code: Select all
$pattern_html = "#<!DOC.+?</html>#s" ; //this should give you the entire HTML message
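On the open question of negating a sequence rather than a single character: one common approach is a negative lookahead, (?!<!DOC), checked before each character is consumed. The sketch below compares the three patterns; the sample data and variable names are hypothetical, and the lookahead version is just one way to do it, not something taken from the posts above.

```php
<?php
// Sample input modelled on the file in the first post (contents are illustrative).
$source = <<<'TXT'
20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
first message
</html>
20040817113851<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
second message
</html>
TXT;

// 1. Character class: forbids the single characters ( < ! D O C ),
//    so it gets stuck at the "C" of PUBLIC and can never reach "</html>".
$n_class = preg_match_all('#<!DOC[^(<!DOC)]+</html>#s', $source, $class_matches);

// 2. Lazy quantifier: take as little as possible up to the next "</html>".
$n_lazy = preg_match_all('#<!DOC.+?</html>#s', $source, $lazy_matches);

// 3. Negative lookahead: consume a character only if "<!DOC" does not start there.
$n_guard = preg_match_all('#<!DOC(?:(?!<!DOC).)+?</html>#s', $source, $guard_matches);

// A damaged input where the first block has no closing tag (hypothetical).
$broken = "<!DOCTYPE one>\n<!DOCTYPE two>\n</html>";
preg_match('#<!DOC.+?</html>#s', $broken, $lazy_broken);
preg_match('#<!DOC(?:(?!<!DOC).)+?</html>#s', $broken, $guard_broken);

var_dump($n_class, $n_lazy, $n_guard, $lazy_broken[0], $guard_broken[0]);
```

On well-formed input the lazy and lookahead patterns return the same blocks. They differ only when a block is missing its closing tag: the lazy pattern runs on into the next block, while the lookahead version refuses to cross a second "<!DOC".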
RobertGonzalez
Site Administrator
Posts: 14293 Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA
Post
by RobertGonzalez » Wed Dec 20, 2006 7:29 pm
That last pattern seems to have worked the best (yes, I needed a break, so I tested each of the posted patterns). I don't know why I am posting, other than that I am tired and need to take my mind off my own code for a moment or I will explode. Thanks.