regex - extract data from web page
Posted: Fri Jun 12, 2009 12:50 am
by rupam_jaiswal
Hi,
My html looks like this
Code: Select all
<meta name="description" content="New info! Code: http://www.example/index.html Code: http://testing.com/fil" />
<!-- message -->
<div id="post_message_510223" class="vb_postbit"><font color="green"><font size="3">Temp</font></font><br />
<br />
<br />
<img src="http://sample/test.jpg" border="0" alt="" onload="NcodeImageResizer.createOn(this);" /><br />
<br />
<br />
info!<br />
<br />
<div style="margin:20px; margin-top:5px">
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 34px;
text-align: left;
overflow: auto">http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html</pre>
</div><br />
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 1490px;
text-align: left;
overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar</pre>
</div></div>
I want all the values that come after Code:</div> and lie between the pre tags,
e.g. http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
and
http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar
Please note that the meta tag at the start also contains the string Code:, and I don't want the values from it.
Thanks in advance
Regards
Re: regex - extract data from web page
Posted: Fri Jun 12, 2009 2:28 am
by prometheuzz
Use an html parser for this. Using regex to parse (x)html is asking for trouble.
I've heard good things about:
http://simplehtmldom.sourceforge.net/
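As a rough illustration of the parser route, here is a minimal sketch that uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than the library linked above. The $html string is a shortened stand-in for the page in the first post, and the XPath expression is just one possible way to say "the <pre> right after a <div> whose text is Code:".

```php
<?php
// Shortened stand-in for the page from the first post (assumption: same structure).
$html = <<<EOT
<meta name="description" content="New info! Code: http://www.example/index.html" />
<div id="post_message_510223" class="vb_postbit">
<div style="margin:20px; margin-top:5px">
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr">http://www.sample1.com/part1.html
http://www.sample1.com/part1.html</pre>
</div><br />
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr">http://www.sample1.com/part1/sample_code.part01.rar</pre>
</div>
EOT;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// A <div> whose whole text is "Code:", then its first following element
// sibling, kept only if that sibling is a <pre>.
$query = '//div[normalize-space(.)="Code:"]/following-sibling::*[1][self::pre]';

$blocks = [];
foreach ($xpath->query($query) as $pre) {
    $blocks[] = trim($pre->textContent);
}
print_r($blocks);
```

The meta tag is skipped without any extra work, because the query only looks at div elements and never at attribute values.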
Re: regex - extract data from web page
Posted: Tue Jun 23, 2009 12:09 pm
by PM2008
You want to extract everything between the pairs of <pre....> and </pre>. Let's make it more general and case insensitive. I am using a local file "file.html", but you can substitute a URL in its place, like "http://xxx.yyy.zzz...". It will work exactly the same way.
Use the sen and stex commands. The sen command counts the number of instances of a search string. The stex command extracts around a search string. Use regular expressions with the -r option. Make the search string case insensitive with the -c option.
The search string is <pre&\>&</pre\> . The search string is enclosed in ^^.
In regular expressions, & means any number of any characters. The character > is part of the search string. But it is a regular expression special character, so we will escape it with a backslash.
Here is the little script.
Code: Select all
# Script extract.txt
var str html ; cat "file.html" > $html
while ( { sen -c -r "^<pre&\>&</pre\>^" $html } > 0 )
do
var str out ; stex -c -r "^<pre&\>&</pre\>^" $html > $out
stex -c -r "^<pre&\>^]" $out > null ; stex -c -r "[^</pre\>^" $out > null
echo $out
done
Script is in biterscripting. To try,
1. Save the script as C:\extract.txt.
2. Download biterscripting from
http://www.biterscripting.com.
3. Start biterscripting. Run the extract script by entering the following command.
script "C:\extract.txt"
Patrick
Re: regex - extract data from web page
Posted: Thu Jun 25, 2009 10:45 am
by prometheuzz
PM2008 wrote:You want to extract everything between the pairs of <pre....> and </pre>. Let's make it more general and case insensitive. I am using a local file "file.html", but you can substitute a URL in its place, like "http://xxx.yyy.zzz...". It will work exactly the same way.
Use the sen and stex commands. The sen command counts the number of instances of a search string. The stex command extracts around a search string. Use regular expressions with the -r option. Make the search string case insensitive with the -c option.
The search string is <pre&\>&</pre\> . The search string is enclosed in ^^.
In regular expressions, & means any number of any characters. The character > is part of the search string. But it is a regular expression special character,
No, that is not correct. In PCRE engines (PHP's preg_ functions use PCRE: Perl Compatible Regular Expressions), neither '<' nor '>' is a special character.
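A quick way to check this in PHP is to run the same pattern with and without backslashes in front of the angle brackets. This is only a small sketch with a made-up $subject string. Both forms should match the same text, since on their own '<' and '>' are literals in PCRE; they only take part in longer constructs such as (?<= or (?<name>.

```php
<?php
// Made-up sample input for the comparison.
$subject = '<pre class="alt2">data</pre>';

// Same pattern twice: once with plain angle brackets, once with escaped ones.
$plain   = preg_match('#<pre[^>]*>(.*?)</pre>#', $subject, $m1);
$escaped = preg_match('#\<pre[^\>]*\>(.*?)\</pre\>#', $subject, $m2);

// Each line prints bool(true) when both patterns match and capture the same text.
var_dump($plain === 1);
var_dump($escaped === 1);
var_dump($m1[1] === $m2[1]);
```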
PM2008 wrote:so we will escape it with a backslash.
Here is the little script.
Code: Select all
# Script extract.txt
var str html ; cat "file.html" > $html
while ( { sen -c -r "^<pre&\>&</pre\>^" $html } > 0 )
do
var str out ; stex -c -r "^<pre&\>&</pre\>^" $html > $out
stex -c -r "^<pre&\>^]" $out > null ; stex -c -r "[^</pre\>^" $out > null
echo $out
done
Script is in biterscripting. To try,
1. Save the script as C:\extract.txt.
2. Download biterscripting from
http://www.biterscripting.com.
3. Start biterscripting. Run the extract script by entering the following command.
script "C:\extract.txt"
Patrick
Err, this is a PHP forum. And as far as I can tell, biterscripting doesn't use a PCRE implementation for its string matching, so I don't see how this post will help the OP very much...
Re: regex - extract data from web page
Posted: Thu Jul 09, 2009 1:26 pm
by ridgerunner
To answer the original thread topic question (albeit a bit too late), using a regex to extract the desired data is both fitting and appropriate. There is no need to resort to an HTML parser when a simple regex will do the trick. The data to be extracted has been specifically defined as everything between <pre> tags that immediately follow a <div> tag containing the text "Code:". Here's a simple regex that efficiently captures the desired text into the first and only capture group:
Code: Select all
Code:</div>\s*<pre[^>]*>(.*?)</pre>
The 's' "single line" mode (or "dot matches newline" mode) as well as the 'i' "ignore case" modifiers are set.
Here is a PHP command line script that prints out the matches in the given HTML test data:
Code: Select all
<?php // File: test.php - find all text within <pre> tags which follow: "Code:</div>"
// here is the regex which captures the wanted data into $1 capture group
$pattern = '#Code:</div>\s*<pre[^>]*>(.*?)</pre>#si';
// begin HTML text to be searched
$htmltext = <<<HEREDOC_STRING
<meta name="description" content="New info! Code: http://www.example/index.html Code: http://testing.com/fil" />
<!-- message -->
<div id="post_message_510223" class="vb_postbit"><font color="green"><font size="3">Temp</font></font><br />
<br />
<br />
<img src="http://sample/test.jpg" border="0" alt="" onload="NcodeImageResizer.createOn(this);" /><br />
<br />
<br />
info!<br />
<br />
<div style="margin:20px; margin-top:5px">
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 34px;
text-align: left;
overflow: auto">http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html</pre>
</div><br />
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 1490px;
text-align: left;
overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar</pre>
</div></div>
HEREDOC_STRING;
// end of HTML text to be searched
// call preg_match_all() to put all captured matches into $matches array
// $matchcount is the number of matches found
$matchcount = preg_match_all($pattern, $htmltext, $matches);
if ($matchcount > 0) {
    echo("$matchcount matches found\n");
    for ($i = 0; $i < $matchcount; $i++) {
        echo("\nMatch #" . ($i + 1) . ":\n");
        echo($matches[1][$i] . "\n");
    }
} else {
    echo('No matches');
}
?>
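Each captured block still holds several URLs separated by line breaks, so a follow-up step may be wanted to get one URL per array entry. The sketch below assumes $captured is one entry of $matches[1] from the script above. The third URL, with &amp; in it, is a hypothetical addition to show that the regex returns raw markup, so HTML entities still need decoding.

```php
<?php
// Stand-in for one entry of $matches[1]; the last URL is hypothetical.
$captured = "http://www.sample1.com/part1.html\r\n"
          . "http://www.sample1.com/part1.html\n\n"
          . "http://www.sample1.com/get.php?a=1&amp;b=2";

// \R matches any line-break sequence; PREG_SPLIT_NO_EMPTY drops empty pieces.
$lines = preg_split('/\R+/', trim($captured), -1, PREG_SPLIT_NO_EMPTY);

// Trim stray whitespace, then decode entities such as &amp; left in the markup.
$urls = array_map('html_entity_decode', array_map('trim', $lines));

print_r($urls);
```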
Re: regex - extract data from web page
Posted: Fri Jul 10, 2009 3:16 pm
by prometheuzz
ridgerunner wrote:To answer the original thread topic question (albeit a bit too late), using a regex to extract the desired data is both fitting and appropriate. There is no need to resort to an HTML parser when a simple regex will do the trick. The data to be extracted has been specifically defined to be everything between <pre> tags, that immediately follow a <div> tag ...
I respectfully disagree.
Yes, the format of the example can perhaps be described that clearly, but since html is so often not well formed, choosing an html parser to get specific parts from the contents of such a file is IMO in many cases the "better" way to solve things. I don't see the advantage of using regex.
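To make the point concrete, here is a small sketch with a hypothetical variation of the markup, where the label is wrapped in a <b> tag. It runs the pattern posted earlier in the thread next to a query using PHP's bundled DOM extension. The variation and the XPath expression are my own assumptions, not something taken from the OP's page.

```php
<?php
// Hypothetical variation: "Code:" is now wrapped in <b> inside the div.
$html = '<div><div class="smallfont"><b>Code:</b></div>' . "\n"
      . '<pre class="alt2">http://www.sample1.com/part1.html</pre></div>';

// The pattern from earlier in the thread expects "Code:</div>" literally.
$regexHits = preg_match_all('#Code:</div>\s*<pre[^>]*>(.*?)</pre>#si', $html, $m);

libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The DOM query compares the div's text content, so inner tags do not matter.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[normalize-space(.)="Code:"]/following-sibling::*[1][self::pre]');

echo "regex matches: $regexHits\n";
echo "DOM matches: {$nodes->length}\n";
```

With this input the regex finds nothing while the DOM query still finds the <pre>. The regex can of course be extended to cover this case, but each new variation needs another adjustment.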