pull out everything between <html> tags

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

fgomez
Forum Commoner
Posts: 61
Joined: Mon Sep 26, 2005 11:23 pm
Location: Washington, DC

Post by fgomez »

Hello,

I'm working on a PHP script to parse a text file that looks something like this:
20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
20040817113851<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
I need a pattern that matches everything from the first "<!DOC" to the first "</html>".

I am using the following, but it returns no results. Can anyone see what's wrong with this?

Code: Select all

$pattern_html = "#<!DOC[^(<!DOC)]+</html>#s" ;
preg_match_all($pattern_html, $source, $matches) ;
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

The following may be of use.

Code: Select all

[feyd@home]>php -r "$f = file_get_contents('test.txt'); preg_match_all('#^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(?s:(<!DOCTYPE.*))(?=^\d{14}<!DOCTYPE)#m',$f,$m,PREG_OFFSET_CAPTURE); var_dump($m);"
array(8) {
  [0]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(111) "20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
"
      [1]=>
      int(0)
    }
  }
  [1]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(4) "2004"
      [1]=>
      int(0)
    }
  }
  [2]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(2) "08"
      [1]=>
      int(4)
    }
  }
  [3]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(2) "13"
      [1]=>
      int(6)
    }
  }
  [4]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(2) "09"
      [1]=>
      int(8)
    }
  }
  [5]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(2) "37"
      [1]=>
      int(10)
    }
  }
  [6]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(2) "57"
      [1]=>
      int(12)
    }
  }
  [7]=>
  array(1) {
    [0]=>
    array(2) {
      [0]=>
      string(97) "<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
"
      [1]=>
      int(14)
    }
  }
}
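The dump above contains a single match because the lookahead needs another timestamped record after the current one, so the last record in the file is never matched. A minimal sketch of one way around that, assuming a lazy body and a lookahead that also accepts end of input. The shortened DOCTYPE, the single 14-digit timestamp group, and the variable names are illustrative assumptions, not part of the original post:

```php
<?php
// Sample input modeled on the file in the first post (DOCTYPE shortened).
$f = "20040813093757<!DOCTYPE HTML>\n<html>\nfirst\n</html>\n"
   . "20040817113851<!DOCTYPE HTML>\n<html>\nsecond\n</html>\n";

// Assumed variation on the pattern above: one group for the 14-digit
// timestamp, a lazy body, and a lookahead that also accepts end of input (\z).
$pattern = '#^(\d{14})(?s:(<!DOCTYPE.*?))(?=^\d{14}<!DOCTYPE|\z)#m';

// PREG_SET_ORDER groups the results per record instead of per capture group.
$count = preg_match_all($pattern, $f, $m, PREG_SET_ORDER);

foreach ($m as $record) {
    // $record[1] is the timestamp, $record[2] is the HTML that follows it.
    echo $record[1], ' => ', strlen($record[2]), " bytes\n";
}
```

With the lazy `.*?`, each body ends at the next timestamp line rather than running on to the last one, which matters once the file holds more than two records.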
Kieran Huggins
DevNet Master
Posts: 3635
Joined: Wed Dec 06, 2006 4:14 pm
Location: Toronto, Canada

Post by Kieran Huggins »

mine's smaller :oops: :

Code: Select all

preg_replace("/.*(<!DOC.*?<\/html>).*/s","$1", $page);
Cheers,
Kieran
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

Kieran Huggins wrote: mine's smaller :oops: :

Code: Select all

preg_replace("/.*(<!DOC.*?<\/html>).*/s","$1", $page);
Cheers,
Kieran
But it is too greedy, and won't work with multiple <html> </html> statements, for whatever reason there are multiple statements.
Kieran Huggins
DevNet Master
Posts: 3635
Joined: Wed Dec 06, 2006 4:14 pm
Location: Toronto, Canada

Post by Kieran Huggins »

sure it does, I tested it - it's ungreedy!

.*? is the un-greed-i-fier.. thingy.

Honest!

Cheers,
Kieran
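For anyone following along, here is a small sketch of the difference between greedy `.*` and lazy `.*?`. The two-block sample string and the variable names are made up for illustration:

```php
<?php
// Hypothetical sample with two HTML blocks, modeled on the file in the first post.
$source = "<html>one</html>\n<html>two</html>\n";

// Greedy: .* runs to the end of the input, then backs up to the LAST </html>.
preg_match('#<html>.*</html>#s', $source, $greedy);

// Lazy: .*? grows one character at a time and stops at the FIRST </html>.
preg_match('#<html>.*?</html>#s', $source, $lazy);

var_dump($greedy[0] === "<html>one</html>\n<html>two</html>"); // bool(true)
var_dump($lazy[0] === "<html>one</html>");                      // bool(true)
```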
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

Ungreedy was the wrong word; it was late when I posted. This is what I get, which is obviously not what the OP was looking for.

Code: Select all

Array
(
    [0] => Array
        (
            [0] => 
20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
20040817113851<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
        )

    [1] => Array
        (
            [0] => <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
        )

)
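The shape of that result (the whole file in [0], one block in [1]) is what you would expect if the pattern was run through preg_match_all, though that is an assumption about how the test was done. A rough sketch of why, using a shortened DOCTYPE and made-up block contents so the two blocks can be told apart:

```php
<?php
// Sample input modeled on the file in the first post (DOCTYPE shortened).
$page = "20040813093757<!DOCTYPE HTML>\n<html>\nfirst\n</html>\n"
      . "20040817113851<!DOCTYPE HTML>\n<html>\nsecond\n</html>\n";

$pattern = '/.*(<!DOC.*?<\/html>).*/s';

// The leading and trailing .* consume the whole input, so there is one match.
$count = preg_match_all($pattern, $page, $matches);

// The greedy leading .* backtracks from the end, so group 1 is the LAST block.
$kept = preg_replace($pattern, '$1', $page);

echo $count, "\n"; // 1
echo $kept, "\n";  // the block containing "second"
```

So the `.*?` inside the parentheses is lazy, but the `.*` in front of it is not, and that is the part that decides which block gets captured.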
Kieran Huggins
DevNet Master
Posts: 3635
Joined: Wed Dec 06, 2006 4:14 pm
Location: Toronto, Canada

Post by Kieran Huggins »

not sure I understand... is my solution incorrect somehow?

Cheers,
Kieran
fgomez
Forum Commoner
Posts: 61
Joined: Mon Sep 26, 2005 11:23 pm
Location: Washington, DC

Post by fgomez »

I wrote:
Can anyone see what's wrong with this?

Code: Select all

$pattern_html = "#<!DOC[^(<!DOC)]+</html>#s" ;
What's wrong with it is that [^(<!DOC)] is not the appropriate way to negate the "<!DOC" pattern, though I'm not sure what is. It's correct to negate a single character in that fashion ([^Z] means no "Z"s please!), but inside brackets every character counts on its own, so [^(<!DOC)] just means "one character that is not (, <, !, D, O, C, or )". I'm not sure how to negate a whole sequence of characters.
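One common way to negate a sequence in PCRE is a negative lookahead applied before each character. This is only a sketch of that technique, not something posted in the thread; the shortened DOCTYPE and the variable names are assumptions for illustration:

```php
<?php
// Sample input modeled on the file in the first post (DOCTYPE shortened).
$source = "20040813093757<!DOCTYPE HTML>\n<html>\nfirst\n</html>\n"
        . "20040817113851<!DOCTYPE HTML>\n<html>\nsecond\n</html>\n";

// Character class: each repetition matches ONE character that is not any of
// ( < ! D O C ). It stops at the "<" of "<html>", so "</html>" never follows.
$class_count = preg_match_all('#<!DOC[^(<!DOC)]+</html>#s', $source, $by_class);

// Negative lookahead: consume a character only if "<!DOC" does not start there.
$look_count = preg_match_all('#<!DOC(?:(?!<!DOC).)+</html>#s', $source, $by_look);

echo $class_count, "\n"; // 0
echo $look_count, "\n";  // 2
```

The lookahead version can never run across the start of the next record, so it stays inside one block even though its quantifier is greedy.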

The code below is what ended up working for me:

Code: Select all

$pattern_html = "#<!DOC.+?</html>#s" ; //this should give you the entire HTML message
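A quick usage sketch of that pattern against the sample from the first post. The nowdoc stands in for the real text file, and the loop is only one assumed way of consuming the results:

```php
<?php
// Stand-in for the contents of the text file described in the first post.
$source = <<<'TXT'
20040813093757<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
20040817113851<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
blah blah blah
</html>
TXT;

$pattern_html = "#<!DOC.+?</html>#s";

// preg_match_all returns the number of full matches; $matches[0] holds them.
$count = preg_match_all($pattern_html, $source, $matches);

foreach ($matches[0] as $i => $html) {
    // Each $html runs from "<!DOC" to the nearest following "</html>".
    echo "Message ", $i + 1, ": ", strlen($html), " bytes\n";
}
```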
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

That last pattern seems to have worked the best (yes, I needed a break so I tested each of the posted patterns). Don't know why I am posting other than I am tired and need to take my mind off my own code at the moment or I will explode. Thanks.