regex - extract data from web page

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

rupam_jaiswal
Forum Newbie
Posts: 22
Joined: Thu Jun 05, 2008 12:54 am

regex - extract data from web page

Post by rupam_jaiswal »

Hi,
My html looks like this

Code: Select all

 
<meta name="description" content="New info! Code: http://www.example/index.html Code: http://testing.com/fil" />
<!-- message -->
      <div id="post_message_510223" class="vb_postbit"><font color="green"><font size="3">Temp</font></font><br />
<br />
<br />
<img src="http://sample/test.jpg" border="0" alt="" onload="NcodeImageResizer.createOn(this);" /><br />
<br />
<br />
info!<br />
<br />
 
<div style="margin:20px; margin-top:5px">
   <div class="smallfont" style="margin-bottom:2px">Code:</div>
   <pre class="alt2" dir="ltr" style="
      margin: 0px;
      padding: 6px;
      border: 1px inset;
      width: 470px;
      height: 34px;
      text-align: left;
      overflow: auto">http://www.sample1.com/part1.html
      http://www.sample1.com/part1.html
      http://www.sample1.com/part1.html</pre>
</div><br />
 
<div class="smallfont" style="margin-bottom:2px">Code:</div>
   <pre class="alt2" dir="ltr" style="
      margin: 0px;
      padding: 6px;
      border: 1px inset;
      width: 470px;
      height: 1490px;
      text-align: left;
      overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar</pre>
 
</div></div>
 
I want all the values that come after Code:</div> and sit between the <pre> tags,
e.g. http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
and
http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar

Please note that the meta tag at the start also contains the string "Code:", and I don't want the values from it.
Thanks in advance.
Regards
Last edited by Benjamin on Fri Jun 12, 2009 1:14 am, edited 1 time in total.
Reason: Added [code=html] tags. Disabled Links
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: regex - extract data from web page

Post by prometheuzz »

Use an html parser for this. Using regex to parse (x)html is asking for trouble.
I've heard good things about: http://simplehtmldom.sourceforge.net/
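
If installing a third-party library is not an option, PHP's bundled DOM extension (DOMDocument and DOMXPath) can do the same job. This is a different tool from the simplehtmldom library linked above. What follows is only a minimal sketch, assuming ext-dom is enabled; the function name extractCodeBlocks and the shortened sample markup are made up for illustration.

```php
<?php
// Sketch: collect the contents of every <pre> that directly follows
// a <div> whose text is exactly "Code:". Requires PHP's DOM extension.

// Shortened stand-in for the page in the question (hypothetical sample).
$html = <<<'SAMPLE_HTML'
<meta name="description" content="New info! Code: http://www.example/index.html" />
<div id="post_message_510223">
  <div style="margin:20px">
    <div class="smallfont">Code:</div>
    <pre class="alt2">http://www.sample1.com/part1.html
      http://www.sample1.com/part1.html</pre>
  </div><br />
  <div class="smallfont">Code:</div>
  <pre class="alt2">http://www.sample1.com/part1/sample_code.part01.rar</pre>
</div></div>
SAMPLE_HTML;

function extractCodeBlocks(string $html): array
{
    // Collect parser warnings internally instead of printing them;
    // real-world pages (like this one, with a stray </div>) are rarely clean.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // A <div> whose whole text is "Code:", then its first following
    // sibling element, but only if that element is a <pre>.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//div[normalize-space(.)="Code:"]/following-sibling::*[1][self::pre]');

    $blocks = [];
    foreach ($nodes as $pre) {
        // Split each block into individual URLs on any run of whitespace.
        $blocks[] = preg_split('/\s+/', trim($pre->textContent));
    }
    return $blocks;
}

foreach (extractCodeBlocks($html) as $i => $urls) {
    echo "Block #" . ($i + 1) . ":\n  " . implode("\n  ", $urls) . "\n";
}
```

The "Code:" text inside the meta tag is ignored here because it lives in an attribute, not in a <div> element.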
PM2008
Forum Newbie
Posts: 7
Joined: Mon Dec 29, 2008 10:47 am

Re: regex - extract data from web page

Post by PM2008 »

You want to extract everything between the pairs of <pre....> and </pre>. Let's make it more general and case insensitive. I am using a local file "file.html", but you can substitute a URL in its place, like "http://xxx.yyy.zzz...". It will work exactly the same way.

Use the sen and stex commands. The sen command counts the number of instances of a search string. The stex command extracts text around a search string. Use regular expressions with the -r option. Make the search string case insensitive with the -c option. The search string is <pre&\>&</pre\> . The search string is enclosed in ^^.

In regular expressions, & means any number of any characters. The character > is part of the search string, but it is a regular expression special character, so we will escape it with a backslash.

Here is the little script.

Code: Select all

# Script extract.txt
var str html ; cat "file.html" > $html
while ( { sen -c -r "^<pre&\>&</pre\>^" $html } > 0 )
do
    var str out ; stex -c -r "^<pre&\>&</pre\>^" $html > $out
    stex -c -r "^<pre&\>^]" $out > null ; stex -c -r "[^</pre\>^" $out > null
    echo $out
done
The script is written in biterscripting. To try it:

1. Save the script as C:\extract.txt.

2. Download biterscripting from http://www.biterscripting.com.

3. Start biterscripting. Run the extract script by entering the following command.

script "C:\extract.txt"

Patrick
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: regex - extract data from web page

Post by prometheuzz »

PM2008 wrote:You want to extract everything between the pairs of <pre....> and </pre>. Let's make it more general and case insensitive. I am using a local file "file.html", but you can substitute a URL in its place, like "http://xxx.yyy.zzz...". It will work exactly the same way.

Use the sen and stex commands. The sen command counts the number of instances of a search string. The stex command extracts text around a search string. Use regular expressions with the -r option. Make the search string case insensitive with the -c option. The search string is <pre&\>&</pre\> . The search string is enclosed in ^^.

In regular expressions, & means any number of any characters. The character > is part of the search string, but it is a regular expression special character,
No, that is not correct. In PCRE engines (PHP's preg_ functions are PCRE: Perl Compatible Regular Expressions), neither '<' nor '>' is a special character.
PM2008 wrote:so we will escape it with a backslash.

Err, this is a PHP forum. And it doesn't look like biterscripting uses a PCRE implementation for its string matching (as far as I can tell), so I don't see how this post will help the OP very much...
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: regex - extract data from web page

Post by ridgerunner »

To answer the original thread topic question (albeit a bit too late), using a regex to extract the desired data is both fitting and appropriate. There is no need to resort to an HTML parser when a simple regex will do the trick. The data to be extracted has been specifically defined to be everything between <pre> tags, that immediately follow a <div> tag containing the text: "Code:". Here's a simple regex that efficiently captures the desired text into the first and only capture group:

Code: Select all

Code:</div>\s*<pre[^>]*>(.*?)</pre>
The 's' "single line" mode (or "dot matches newline" mode) as well as the 'i' "ignore case" modifiers are set.
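
For readers who want to see what each piece of the pattern does, here is the same regex spelled out with the 'x' (extended) modifier added, so whitespace in the pattern is ignored and # starts a comment. The '~' delimiter and the short $sample string are choices made for this sketch only, not part of the original regex.

```php
<?php
// Same pattern as above, annotated. '~' is used as the delimiter because
// '#' introduces comments once the 'x' modifier is on.
$pattern = '~
    Code:</div>   # literal text: the end of the "Code:" label div
    \s*           # optional whitespace or newlines before the pre tag
    <pre[^>]*>    # the opening pre tag, with any attributes
    (.*?)         # group 1: lazily capture everything up to
    </pre>        # the first closing pre tag
~six';

// Hypothetical cut-down input: one "Code:" inside a meta attribute
// (must be skipped) and one real "Code:" label followed by a pre block.
$sample = "<meta content=\"Code: http://www.example/index.html\" />\n"
        . "<div class=\"smallfont\">Code:</div>\n"
        . "<pre class=\"alt2\">http://www.sample1.com/part1.html</pre>";

preg_match_all($pattern, $sample, $matches);
print_r($matches[1]);
```

The meta tag is skipped because its "Code:" is not followed by </div>, so only the <pre> contents end up in $matches[1].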

Here is a PHP command line script that prints out the matches in the given HTML test data:

Code: Select all

<?php // File: test.php - find all text within <pre> tags which follow: "Code:</div>"
 
// here is the regex which captures the wanted data into $1 capture group
$pattern = '#Code:</div>\s*<pre[^>]*>(.*?)</pre>#si';
 
// begin HTML text to be searched
$htmltext = <<<HEREDOC_STRING
 
<meta name="description" content="New info! Code: http://www.example/index.html Code: http://testing.com/fil" />
<!-- message -->
      <div id="post_message_510223" class="vb_postbit"><font color="green"><font size="3">Temp</font></font><br />
<br />
<br />
<img src="http://sample/test.jpg" border="0" alt="" onload="NcodeImageResizer.createOn(this);" /><br />
<br />
<br />
info!<br />
<br />
 
<div style="margin:20px; margin-top:5px">
   <div class="smallfont" style="margin-bottom:2px">Code:</div>
   <pre class="alt2" dir="ltr" style="
     margin: 0px;
     padding: 6px;
     border: 1px inset;
     width: 470px;
     height: 34px;
     text-align: left;
     overflow: auto">http://www.sample1.com/part1.html
      http://www.sample1.com/part1.html
      http://www.sample1.com/part1.html</pre>
</div><br />
 
<div class="smallfont" style="margin-bottom:2px">Code:</div>
   <pre class="alt2" dir="ltr" style="
     margin: 0px;
     padding: 6px;
     border: 1px inset;
     width: 470px;
     height: 1490px;
     text-align: left;
     overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar</pre>
 
</div></div>
HEREDOC_STRING;
// end of HTML text to be searched
 
// call preg_match_all() to put all captured matches into $matches array
//   $matchcount is the number of matches found
$matchcount = preg_match_all($pattern, $htmltext, $matches);
if ($matchcount > 0) {
    echo("$matchcount matches found\n");
    for ($i = 0; $i < $matchcount; $i++) {
        echo("\nMatch #" . ($i + 1) . ":\n");
        echo($matches[1][$i] . "\n");
    }
} else {
    echo('No matches');
}
?>
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: regex - extract data from web page

Post by prometheuzz »

ridgerunner wrote:To answer the original thread topic question (albeit a bit too late), using a regex to extract the desired data is both fitting and appropriate. There is no need to resort to an HTML parser when a simple regex will do the trick. The data to be extracted has been specifically defined to be everything between <pre> tags, that immediately follow a <div> tag ...
I respectfully disagree.
Yes, the format of the example is perhaps as easy to describe as that, but since HTML is so often not well formed, choosing an HTML parser to get specific parts from the contents of such a file is IMO in many cases the "better" way to solve things. I don't see the advantage of using regex.
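
One concrete case where the two approaches can diverge, as a rough illustration only: a '>' inside a quoted attribute value is legal HTML, but it ends the [^>]* part of the pattern early. The input line below is invented for this sketch, and PHP's bundled DOM extension stands in for "an HTML parser".

```php
<?php
// Hypothetical input: the pre tag carries an attribute whose value contains '>'.
$html = '<div>Code:</div><pre title="a > b">http://www.sample1.com/part1.html</pre>';

// Regex approach: [^>]* stops at the '>' inside the attribute value,
// so the rest of the tag leaks into the captured group.
preg_match('#Code:</div>\s*<pre[^>]*>(.*?)</pre>#si', $html, $m);
$viaRegex = $m[1];

// Parser approach: the quoted attribute is read as a whole, so the
// element's text is just the URL.
libxml_use_internal_errors(true); // keep any parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$viaDom = $doc->getElementsByTagName('pre')->item(0)->textContent;

echo "regex:  $viaRegex\n";
echo "parser: $viaDom\n";
```

On this input the regex capture starts with the tail of the attribute ( b"> ) before the URL, while the parser returns the URL alone.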