Preg_match_all to get <div> tag contents

Squibbles1077
Forum Newbie
Posts: 8
Joined: Sat Jul 04, 2009 2:00 pm
Location: US

Preg_match_all to get <div> tag contents

Post by Squibbles1077 »

Hi all,
I posted this in the PHP-Code section but I think it's more of a regex problem than a code one, so here it is:

Code: Select all

<?php
$data = file_get_contents('http://www.mywebsite.com/');
preg_match_all ("/<div class=\"main\">([^`]*?)<\/div>/", $data, $matches);
//testing the array $matches
echo sizeof($matches);
echo sprintf('<pre>%s</pre>', print_r($matches, true));
?>
is what I'm using to get the contents of a <div class="main"> tag in some HTML. However, it's not putting anything in the array $matches. It creates a 2D array, but none of the dimensions contain anything. The output of the tests is:
2

Array
(
[0] => Array
(
)

[1] => Array
(
)

)



Thanks for any input.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Preg_match_all to get <div> tag contents

Post by prometheuzz »

Squibbles1077 wrote:[code snipped] is what I'm using to get the contents of a <div class="main"> tag in some HTML. However, it's not putting anything in the array $matches. It creates a 2D array, but none of the dimensions contain anything.
Hard to say why things are going wrong: you didn't post the actual html (or a link to it) you're trying to match.
Furthermore, is there a reason for not using a traditional html parser instead? Parsing/matching (x)html with regex is not the way to go.
Squibbles1077
Forum Newbie
Posts: 8
Joined: Sat Jul 04, 2009 2:00 pm
Location: US

Re: Preg_match_all to get <div> tag contents

Post by Squibbles1077 »

What I'm trying to do is get the content between a <div class="main"> tag (<div class="main">content</div>). I spent a good hour or so on Google trying to find examples of how to get <div> content with PHP, and preg_match was the only solution I could find. I'm new to PHP so I don't know its specifics. What I was going to do was

$a = explode('<', $website);
$found = array();
// then loop over the pieces; explode() strips the '<', so compare without it
for ($i = 0; $i < sizeof($a); $i++)
{
    if (substr($a[$i], 0, 16) == 'div class="main"')
    {
        // add it to a new array
        $found[] = $a[$i];
    }
}

but that seemed like a really slow way to go about it. Is there a better way that I'm unaware of?
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Preg_match_all to get <div> tag contents

Post by prometheuzz »

Squibbles1077 wrote:What I'm trying to do is get the content between a <div class="main"> tag (<div class="main">content</div>).
I figured that much.
Your regex already does that. Here's your code, with a slightly more permissive opening tag (<div.*?>), run against the "<div class="main">content</div>" example string:

Code: Select all

$data = 'aaa <div class="main">content</div> bbb';
preg_match_all ("/<div.*?>([^`]*?)<\/div>/", $data, $matches);
//testing the array $matches
echo sizeof($matches);
echo sprintf('<pre>%s</pre>', print_r($matches, true));
As you can see, it does exactly what you wanted.
That's why I asked you to post the text you're trying to match or post the url of the page you're trying to match.
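
Without the real page, one way to narrow this down is to run the strict pattern from the first post against a few variations of the opening tag. A minimal sketch follows; the sample strings and the countMatches() helper are made up for illustration, since the actual HTML was never posted.

```php
<?php
// Hypothetical sample markup; the real page was never posted in the thread.
$samples = [
    'exact'        => '<div class="main">content</div>',
    'extra_attr'   => '<div id="x" class="main">content</div>',
    'single_quote' => "<div class='main'>content</div>",
    'uppercase'    => '<DIV CLASS="main">content</DIV>',
];

// The strict pattern from the first post.
$strict = '/<div class="main">([^`]*?)<\/div>/';

// Returns the number of matches, or -1 if PCRE reported an error.
function countMatches(string $pattern, string $subject): int
{
    $n = preg_match_all($pattern, $subject, $m);
    return $n === false ? -1 : $n;
}

foreach ($samples as $label => $html) {
    printf("%-12s => %d match(es)\n", $label, countMatches($strict, $html));
}
```

If the live page uses any of the non-matching forms, preg_match_all() finds nothing and $matches holds two empty arrays, which is the output shown in the first post. It is also worth checking that file_get_contents() did not return false.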
Squibbles1077 wrote:... what I was going to do was [explode() loop snipped] but that seemed like a really slow way to go about it. Is there a better way that I'm unaware of?


If you're already concerned about speed*, then choosing a regex-solution is not the way to go. A "simple" solution using only basic string operations will almost always be faster than a regex solution.
Like I said in my first reply: use a proper html parser for this. Especially if you're concerned about speed.

* Is there a reason to be concerned about speed at this early stage? Is the code you're writing going to be executed thousands of times a second? IMO, you should write code that is easy to follow [and therefore to maintain]. Worry about performance when you have properly profiled your application/code.
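
For reference, here is roughly what the parser route could look like with PHP's bundled DOM extension. This is only a sketch: the function name mainDivContents() and the sample markup are invented here, and it assumes the dom extension is enabled.

```php
<?php
// Returns the inner HTML of every <div> whose class list contains "main".
function mainDivContents(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Treat "main" as one of possibly several space-separated class names.
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " main ")]';

    $results = [];
    foreach ($xpath->query($query) as $div) {
        $inner = '';
        foreach ($div->childNodes as $child) {
            $inner .= $doc->saveHTML($child); // serialize each child node
        }
        $results[] = $inner;
    }
    return $results;
}

// Hypothetical sample input with a nested div.
$html = '<html><body><div class="main">Hello <div>nested</div> world</div></body></html>';
print_r(mainDivContents($html));
```

Nested divs, extra attributes and attribute order are handled by the parser, so none of that needs to be encoded in a pattern.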
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Preg_match_all to get <div> tag contents

Post by ridgerunner »

Below is a PHP script which extracts and prints the contents of all <div class="main">contents</div> tags, where the contents may contain arbitrarily nested DIVs. The script uses a regex which takes advantage of the recursive capabilities of the PHP/PCRE engine. The regex is fully commented.

Code: Select all

<?php // File: MatchAllDivMain.php
 
// Read html file to be processed into $data variable
$data = file_get_contents('test.html');
 
// Commented regex to extract contents from <div class="main">contents</div>
//  where "contents" may contain nested <div>s.
//  Regex uses PCRE's recursive (?1) sub expression syntax to recurse into group 1
$pattern_long = '{           # recursive regex to capture contents of "main" DIV
<div\s+class="main"\s*>              # match the "main" class DIV opening tag
  (                                   # capture "main" DIV contents into $1
    (?:                               # non-cap group for nesting * quantifier
      (?: (?!<div[^>]*>|</div>). )++  # possessively match all non-DIV tag chars
    |                                 # or 
      <div[^>]*>(?1)</div>            # recursively match nested <div>xyz</div>
    )*                                # loop however deep as necessary
  )                                   # end group 1 capture
</div>                               # match the "main" class DIV closing tag
}six';  // single-line (dot matches all), ignore case and free spacing modes ON
 
// short version of same regex
$pattern_short = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';
 
$matchcount = preg_match_all($pattern_long, $data, $matches);
// $matchcount = preg_match_all($pattern_short, $data, $matches);
echo("<pre>\n");
if ($matchcount > 0) {
    echo("$matchcount matches found.\n");
//  print_r($matches);
    for($i = 0; $i < $matchcount; $i++) {
        echo("\nMatch #" . ($i + 1) . ":\n");
        echo($matches[1][$i]); // print 1st capture group for match number i
    }
} else {
    echo('No matches');
}
echo("\n</pre>");
?>
Hope this helps! :)
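
To try the pattern without a test.html file, the short version can be wrapped in a function and fed an inline string. A small sketch, where extractMainDivs() and the sample markup are illustrative placeholders; it also checks for a false return, which preg_match_all() gives when PCRE reports an error (for example the backtrack limit on a large page).

```php
<?php
// Returns the contents of each <div class="main">...</div>, nested divs included.
function extractMainDivs(string $html): array
{
    // Short version of the recursive pattern from the script above.
    $pattern = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';
    $count = preg_match_all($pattern, $html, $matches);
    if ($count === false) {
        // A false return means a PCRE error, not "no matches".
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $matches[1];
}

// Hypothetical sample: one "main" div with two levels of nesting, then a second one.
$html = '<div class="main">a<div>b<div>c</div></div>d</div><p>x</p><div class="main">second</div>';
print_r(extractMainDivs($html));
```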
broomstick
Forum Newbie
Posts: 2
Joined: Tue Sep 22, 2009 1:55 pm

Re: Preg_match_all to get <div> tag contents

Post by broomstick »

ridgerunner wrote:Follows is a PHP script which extracts and prints the contents of all <div class="main">contents</div> tags where the contents may contain arbitrarily nested DIVs. This script utilizes a regex which takes advantage of the advanced recursive capabilities of the PHP/PCRE engine. The regex is fully commented.


That definitely helped me! Thank you!

Now here's another question...

Here's the code:

Code: Select all

 
<?php
  // Read html file to be processed into $data variable
  $data = file_get_contents('http://www.music.umn.edu/marchingband/index.php');
  
  // Regex to extract contents from <div id="pepBandEvents">contents</div>
  //  where "contents" may contain nested <div>s.
  //  Uses PCRE's recursive (?1) sub expression syntax to recurse into group 1
  $pattern_short = '{<div\s+id="pepBandEvents"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';
  
  $matchcount = preg_match_all($pattern_short, $data, $matches);
  echo("<div id=\"pepBandEvents\">\n");
  if ($matchcount > 0) {
      for ($i = 0; $i < $matchcount; $i++) {
          // print 1st capture group for match number i
          echo($matches[1][$i]);
      }
  } else {
      echo('No matches');
  }
  echo("</div>\n");
?>
 
In the code grabbed from http://www.music.umn.edu/marchingband/index.php, there are some locally referenced links. How would I go about inserting "http://www.music.umn.edu/marchingband/" right before the locally referenced URL?
broomstick
Forum Newbie
Posts: 2
Joined: Tue Sep 22, 2009 1:55 pm

Re: Preg_match_all to get <div> tag contents

Post by broomstick »

Nevermind! I figured it out!

Code: Select all

echo str_replace("<a href=\"","<a href=\"http://www.music.umn.edu/marchingband/",($matches[1][$i]));
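
One caveat with the str_replace() approach: it also prefixes links that are already absolute. A hedged alternative is sketched below; absolutizeLinks() is a made-up helper, and it deliberately leaves alone any href that starts with a scheme, a slash, or a #.

```php
<?php
// Prefixes $base to relative href values in <a> tags (double-quoted hrefs only).
function absolutizeLinks(string $html, string $base): string
{
    // Group 1: everything up to and including href="
    // Lookahead: skip values starting with a scheme (http:, mailto:), "/" or "#"
    // Group 2: the rest of the value plus the closing quote
    $pattern = '~(<a\s+(?:[^>]*?\s)?href=")(?![a-z][a-z0-9+.-]*:|/|#)([^"]*")~i';
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($base): string {
            return $m[1] . $base . $m[2];
        },
        $html
    );
    return $result ?? $html; // fall back to the input if PCRE reports an error
}

// Hypothetical sample links: relative, absolute, and in-page anchor.
$sample = '<a href="events.php">E</a> <a href="http://example.com/">X</a> <a href="#top">T</a>';
echo absolutizeLinks($sample, 'http://www.music.umn.edu/marchingband/'), "\n";
```

Root-relative links such as /images/x.gif are left untouched here because they would need the host name rather than the directory prefix.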
AlDm
Forum Newbie
Posts: 1
Joined: Sun Nov 22, 2009 3:43 pm

Re: Preg_match_all to get <div> tag contents

Post by AlDm »

ridgerunner wrote:Follows is a PHP script which extracts and prints the contents of all <div class="main">contents</div> tags where the contents may contain arbitrarily nested DIVs. This script utilizes a regex which takes advantage of the advanced recursive capabilities of the PHP/PCRE engine. The regex is fully commented.

That's a perfect solution, but I want to ask: if we have <div id="someid" class="main"> Content goes here </div>, how can we grab it by its class="main" attribute? I mean, the regex should ignore any attributes between <div and class="main" and still find the right div tag.
The regex as written says that "<div" is followed by whitespace and then class="main". The same applies to the end of the opening tag: if we have <div class="main" style="some style codes">, will the regex still work? I'd be very glad if you could answer these questions. Thanks in advance.
Finally, I want to say again that your code is perfect.
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Re: Preg_match_all to get <div> tag contents

Post by josh »

Hey prometheuzz, what are the advantages of [^`] over . (the dot)?

I usually scrape tags with arbitrary attributes like this

'#<div(.*?)>(.*?)</div>#'

Then I always know my match will be in $2
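
For what it's worth, the practical difference between the two is newlines: without the s modifier the dot does not match a newline, while a negated class like [^`] matches anything except the one excluded character. A quick sketch with made-up sample strings:

```php
<?php
// Hypothetical sample: div content spread over several lines.
$html = "<div>\nline one\nline two\n</div>";

// The dot stops at newlines unless the s (dot-all) modifier is set.
$dotNoS   = preg_match('#<div>(.*?)</div>#', $html);
$dotWithS = preg_match('#<div>(.*?)</div>#s', $html);

// [^`] matches any character except a backtick, newlines included.
$negClass = preg_match('#<div>([^`]*?)</div>#', $html);

// The price of the trick: content containing a backtick no longer matches.
$withTick = preg_match('#<div>([^`]*?)</div>#', '<div>a`b</div>');

var_dump($dotNoS, $dotWithS, $negClass, $withTick);
```

So [^`] is a workaround for "dot matches everything"; the s modifier does the same job without the backtick side effect.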

The recursive regex solution is interesting. Normally I just want the plain text and haven't cared much about matching the right closing </div>; I would probably use SimpleXML if I had to do that. Where can I learn more about the recursive regex technique, though?
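
As a starting point, the PCRE pattern documentation has a section on recursive patterns, and the PHP manual covers them in its regex syntax reference. The classic small example is balanced parentheses, sketched here with (?R), which recurses into the whole pattern; the sample string is made up.

```php
<?php
// "(" followed by any mix of non-parenthesis runs and nested balanced groups, then ")".
$pattern = '/\((?:[^()]++|(?R))*\)/';

// Hypothetical sample with one nested group and one flat group.
preg_match_all($pattern, 'f(a(b)c) + g(d)', $groups);

print_r($groups[0]);
```

The div pattern in this thread has the same shape: an opening token, a loop over "plain content or a nested instance of myself", and a closing token. It uses (?1) instead of (?R) so that only group 1 is repeated rather than the whole pattern.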
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Preg_match_all to get <div> tag contents

Post by ridgerunner »

AlDm wrote:...
if we have <div id="someid" class="main"> Content goes here </div>, how can we grab it by its class="main" attribute? ... if we have <div class="main" style="some style codes">, will the regex still work?
Yes, the solution I provided <div\s+class="main"\s*> was very specific and did not allow for any other attributes in the DIV opening tag. This is easily adjusted to allow any other attributes to appear before and/or after the class="main" attribute like so:

Code: Select all

<div\s+[^>]*?class="main"[^>]*>
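
Dropped into the short version of the recursive pattern, that opening tag could be used as sketched below. The function name extractMainDivsAnyAttrs() and the sample markup are placeholders for illustration. Note that this is still a textual match: it would also accept an attribute such as data-class="main", and it would not match class="main big".

```php
<?php
// Same recursive pattern as before, but the opening tag tolerates other attributes.
function extractMainDivsAnyAttrs(string $html): array
{
    $pattern = '{<div\s+[^>]*?class="main"[^>]*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';
    $count = preg_match_all($pattern, $html, $matches);
    if ($count === false) {
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $matches[1];
}

// Hypothetical samples: attribute before and attribute after class="main".
$html = '<div id="someid" class="main"> Content goes here </div>'
      . '<div class="main" style="color:red">styled</div>';
print_r(extractMainDivsAnyAttrs($html));
```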
Guldstrand
Forum Newbie
Posts: 20
Joined: Mon Jul 26, 2004 11:16 pm
Location: Sweden

Re: Preg_match_all to get <div> tag contents

Post by Guldstrand »

I have a similar question.
I want to parse:

ALT-tag: alt="Heroes - Säsong 3 (6 disc)"
Price: <div class="price">&nbsp;299 kr</div>

..from 10 different movies in a loop

HTML-output:

Code: Select all

        <div class="buy-container">
            <div class="product"><a href="http://*****/heroes_-_s%c3%a4song_3_%286_disc%29-7172084"><img src="/media-dynamic/images/product/00/07/17/20/84/6/bf049543-0b19-4c2c-942b-8928e8ebf468.jpg" alt="Heroes - Säsong 3 (6 disc)" title="Heroes - Säsong 3 (6 disc)"></a></div>
            <div class="price">&nbsp;299 kr</div>
            <img class="icon" src="/media-dynamic/images/format/2-199-big.gif" alt="DVD" title="DVD">
            <a href="http://*****/add?product-id=7172084&referer=%2ffilm%2ftv-serierna%2fnyheter%2f" title="Köp" rel="noindex nofollow"><img src="/media-static/images/button/sv/buy.gif" alt="Köp" title="Köp"></a>
        </div>
Can someone please help me with a regexp for the code above!? :oops:

Thanks in advance and happy new year to you all!
Guldstrand
Forum Newbie
Posts: 20
Joined: Mon Jul 26, 2004 11:16 pm
Location: Sweden

Re: Preg_match_all to get <div> tag contents

Post by Guldstrand »

*bump* :oops:
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Preg_match_all to get <div> tag contents

Post by ridgerunner »

Guldstrand wrote:I have a similar question.
I want to parse: ...
Parse? Your question is not clear. Please explain exactly what you wish to accomplish. i.e. What is your subject text? Do you wish to match something? Or replace something?

need a more detailed and precise question please...
Guldstrand
Forum Newbie
Posts: 20
Joined: Mon Jul 26, 2004 11:16 pm
Location: Sweden

Re: Preg_match_all to get <div> tag contents

Post by Guldstrand »

Well.. as I wrote in my first post, I want to parse the alt-tag and the "price" from the HTML-output posted above.
What's not clear about that? :?
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Preg_match_all to get <div> tag contents

Post by ridgerunner »

Guldstrand wrote:... What's not clear about that? :?
A regex can match or replace and it can capture sub strings. Nothing more. You use the word "parse" - what do you mean by that? (It may be obvious to you but it is not obvious to me) Also you talk about 10 movies and provide some HTML, but the HTML does not contain 10 of anything. Is the price always a whole number or will some have fractional parts? The bottom line is that when you talk about writing a regex, you need to be very explicit and precise when describing exactly what you wish to match and that which you don't want to match.

That said, I think this script may do what you are looking for:

Code: Select all

<?php
$data = file_get_contents('test.txt');
$re = '%<div class="product"[^>]*><a[^>]*><img[^>]*?alt="([^"]+)"[^>]*></a></div>\s*<div class="price">(?:&nbsp;|\s)*(\d+(?:\.\d+)?)%i';
$nmovies = preg_match_all($re, $data, $matches, PREG_SET_ORDER);
for ($i = 0; $i < $nmovies; $i++) {
    print 'Movie ' . ($i+1) . " of " . $nmovies . "\n";
    print ' Title = ' . $matches[$i][1] . "\n";
    print ' Price = ' . $matches[$i][2] . "\n";
}
?>
This regex solution assumes that the other "movies" look pretty much like this one. (i.e. same <div class="product"><a><img></a></div>\n<div class="price"></div> HTML structure) It captures prices having whole numbers (111) or whole+fractional numbers (123.45).
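
To check the regex without a test.txt file, it can be run against the snippet posted earlier in the thread. In this sketch extractMovies() is a hypothetical wrapper, and the sample is a trimmed copy of that snippet with the host left elided as in the post.

```php
<?php
// Trimmed copy of the HTML posted above (host elided as in the post).
$data = <<<'HTML'
        <div class="buy-container">
            <div class="product"><a href="http://*****/heroes_-_s%c3%a4song_3_%286_disc%29-7172084"><img src="/media-dynamic/images/product/00/07/17/20/84/6/bf049543-0b19-4c2c-942b-8928e8ebf468.jpg" alt="Heroes - Säsong 3 (6 disc)" title="Heroes - Säsong 3 (6 disc)"></a></div>
            <div class="price">&nbsp;299 kr</div>
        </div>
HTML;

// Returns one ['title' => ..., 'price' => ...] entry per product/price pair found.
function extractMovies(string $html): array
{
    $re = '%<div class="product"[^>]*><a[^>]*><img[^>]*?alt="([^"]+)"[^>]*></a></div>\s*<div class="price">(?:&nbsp;|\s)*(\d+(?:\.\d+)?)%i';
    preg_match_all($re, $html, $matches, PREG_SET_ORDER);
    $movies = [];
    foreach ($matches as $m) {
        $movies[] = ['title' => $m[1], 'price' => $m[2]];
    }
    return $movies;
}

print_r(extractMovies($data));
```

Prices written with a decimal comma would only have their whole part captured, since the pattern expects a dot before the fractional digits.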

Hope this helps! :)