pull out everything between <html> tags
Posted: Tue Dec 19, 2006 5:25 pm
by fgomez
Hello,
I'm working on a PHP script to parse a text file that looks something like this:
20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
20040817113851<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
I need a pattern that matches everything from the first "<!DOC" to the first "</html>".
I am using the following, but it returns no results. Can anyone see what's wrong with this?
Code:
$pattern_html = "#<!DOC[^(<!DOC)]+</html>#s" ;
preg_match_all($pattern_html, $source, $matches) ;
Posted: Tue Dec 19, 2006 5:41 pm
by feyd
The following may be of use.
Code:
[feyd@home]>php -r "$f = file_get_contents('test.txt'); preg_match_all('#^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(?s:(<!DOCTYPE.*))(?=^\d{14}<!DOCTYPE)#m',$f,$m,PREG_OFFSET_CAPTURE); var_dump($m);"
array(8) {
[0]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(111) "20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
"
[1]=>
int(0)
}
}
[1]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(4) "2004"
[1]=>
int(0)
}
}
[2]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(2) "08"
[1]=>
int(4)
}
}
[3]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(2) "13"
[1]=>
int(6)
}
}
[4]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(2) "09"
[1]=>
int(8)
}
}
[5]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(2) "37"
[1]=>
int(10)
}
}
[6]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(2) "57"
[1]=>
int(12)
}
}
[7]=>
array(1) {
[0]=>
array(2) {
[0]=>
string(97) "<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
"
[1]=>
int(14)
}
}
}
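For anyone who wants the timestamp and the HTML of every record (including the last one, which has no following record for a lookahead to find), here is a minimal sketch. It assumes every record starts a line with a 14-digit timestamp, as in the sample from the first post. The variable names and the "first message" / "second message" bodies are made up for illustration.

```php
<?php
// Sample data in the shape shown in the first post (message bodies are placeholders).
$source = <<<'TXT'
20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
first message
</html>
20040817113851<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
second message
</html>
TXT;

// m: ^ matches at the start of every line; s: . also matches newlines.
// Group 1 is the 14-digit timestamp, group 2 the document up to the nearest </html>.
$pattern = '#^(\d{14})(<!DOC.+?</html>)#ms';

// PREG_SET_ORDER groups the results per record instead of per capture group.
preg_match_all($pattern, $source, $records, PREG_SET_ORDER);

foreach ($records as $record) {
    echo $record[1], ' => ', strlen($record[2]), " bytes\n";
}
```

The lazy `.+?` is what keeps each record separate here, so no lookahead to the next timestamp is needed.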
Posted: Tue Dec 19, 2006 6:01 pm
by Kieran Huggins
mine's smaller
Code:
preg_replace("/.*(<!DOC.*?<\/html>).*/s","$1", $page);
Cheers,
Kieran
Posted: Wed Dec 20, 2006 12:03 am
by John Cartwright
Kieran Huggins wrote: mine's smaller
Code:
preg_replace("/.*(<!DOC.*?<\/html>).*/s","$1", $page);
Cheers,
Kieran
But it is too greedy, and won't work with multiple <html> </html> blocks, if for whatever reason there are multiple blocks.
Posted: Wed Dec 20, 2006 12:34 am
by Kieran Huggins
sure it does, I tested it - it's ungreedy!
.*? is the un-greed-i-fier.. thingy.
Honest!
Cheers,
Kieran
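To make the greedy versus ungreedy point concrete, here is a small sketch comparing `.*` and `.*?` on two adjacent blocks. The shortened `$page` string is an assumption standing in for the real file.

```php
<?php
// Two adjacent blocks, shortened from the sample in the first post.
$page = "<!DOC a>\n<html>\none\n</html>\n<!DOC b>\n<html>\ntwo\n</html>\n";

// Greedy: .* runs to the end of the input and backtracks to the LAST </html>.
preg_match('#<!DOC.*</html>#s', $page, $greedy);

// Lazy: .*? grows one character at a time and stops at the FIRST </html>.
preg_match('#<!DOC.*?</html>#s', $page, $lazy);

echo substr_count($greedy[0], '</html>'), "\n"; // 2: both blocks in one match
echo substr_count($lazy[0], '</html>'), "\n";   // 1: only the first block
```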
Posted: Wed Dec 20, 2006 9:44 am
by John Cartwright
"Greedy" was the wrong word; it was late when I posted. This is what I get, which is obviously not what the OP was looking for.
Code:
Array
(
[0] => Array
(
[0] =>
20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
20040817113851<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
)
[1] => Array
(
[0] => <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
)
)
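A possible explanation for the mismatch, sketched under the assumption that the input holds two blocks: the inner `.*?` is lazy, but the leading `.*` in the preg_replace pattern is still greedy, so the call keeps a single block (the last one) rather than all of them. The shortened `$page` string is made up for the demonstration.

```php
<?php
// Two records, shortened; "1" and "2" stand in for the timestamps.
$page = "1<!DOC a>\n<html>\none\n</html>\n2<!DOC b>\n<html>\ntwo\n</html>\n";

// Same pattern as in the post above, written with # delimiters so / needs no escaping.
$result = preg_replace('#.*(<!DOC.*?</html>).*#s', '$1', $page);

// The leading .* swallows as much as it can, so the capture group
// starts at the LAST "<!DOC" in the input and the first block is dropped.
var_dump($result === "<!DOC b>\n<html>\ntwo\n</html>"); // bool(true)
```

So the pattern does what it says for a page with one block, but it cannot return every block of a multi-record file in one call.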
Posted: Wed Dec 20, 2006 9:59 am
by Kieran Huggins
not sure I understand... is my solution incorrect somehow?
Cheers,
Kieran
Posted: Wed Dec 20, 2006 6:35 pm
by fgomez
I wrote:
Can anyone see what's wrong with this?
Code:
$pattern_html = "#<!DOC[^(<!DOC)]+</html>#s" ;
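What's wrong with it is that [^(<!DOC)] is not the appropriate way to negate the "<!DOC" pattern, though I'm not sure what is. Negating a single character in that fashion is correct ([^Z] means no "Z"s please!), but inside the brackets every character is negated individually, so [^(<!DOC)] excludes each of "(", "<", "!", "D", "O", "C" and ")" rather than the sequence "<!DOC". I'm not sure how to negate a sequence of characters.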
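For what it's worth, one common way to negate a sequence in PCRE is a negative lookahead repeated in front of every character. The sketch below compares it with the character class; the shortened `$source` string is an assumption standing in for the real file.

```php
<?php
// Two records, shortened; "1" and "2" stand in for the timestamps.
$source = "1<!DOC a>\n<html>\none\n</html>\n2<!DOC b>\n<html>\ntwo\n</html>\n";

// [^(<!DOC)] is a character class: it rejects each of ( < ! D O C ) on its own,
// so the match stalls at the "<" of "<html>" and never reaches "</html>".
$class = preg_match_all('#<!DOC[^(<!DOC)]+</html>#s', $source, $m1);

// (?!<!DOC) is a negative lookahead: "the text at this point does not start with <!DOC".
// Checking it before every single character rejects the sequence as a whole.
$lookahead = preg_match_all('#<!DOC(?:(?!<!DOC).)+?</html>#s', $source, $m2);

var_dump($class);     // int(0)
var_dump($lookahead); // int(2)
```

With data like the sample, the lookahead is more than is strictly needed, since a lazy quantifier alone already stops at the first `</html>`.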
The code below is what ended up working for me:
Code:
$pattern_html = "#<!DOC.+?</html>#s" ; //this should give you the entire HTML message
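A minimal sketch of that pattern in use with preg_match_all, assuming `$source` holds the file contents in the shape shown in the first post (the message bodies here are placeholders):

```php
<?php
// Placeholder data in the same shape as the file from the first post.
$source = <<<'TXT'
20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
first message
</html>
20040817113851<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
second message
</html>
TXT;

// s: . also matches newlines; .+? is lazy, so each match ends at the nearest </html>.
$pattern_html = '#<!DOC.+?</html>#s';

// preg_match_all returns the number of full matches, or false if the regex engine fails.
$count = preg_match_all($pattern_html, $source, $matches);

if ($count === false) {
    echo 'regex error code: ', preg_last_error(), "\n";
} else {
    echo $count, " documents found\n";
    foreach ($matches[0] as $html) {
        echo strlen($html), " bytes\n";
    }
}
```

Each entry of `$matches[0]` is one complete document, from `<!DOC` through `</html>`, with the timestamp left out.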
Posted: Wed Dec 20, 2006 7:29 pm
by RobertGonzalez
That last pattern seems to have worked the best (yes, I needed a break, so I tested each of the posted patterns). I don't know why I am posting, other than that I am tired and need to take my mind off my own code for a moment or I will explode. Thanks.