quoting an easy web extraction test

kapil1089theking
Forum Commoner
Posts: 46
Joined: Wed May 28, 2008 1:51 pm
Location: Kolkata, India

quoting an easy web extraction test

Post by kapil1089theking »

I have a URL, http://url.com/some.htm
I want to log all the links in this HTML file to a text file.

Only the links.

How can I do that?
AbraCadaver
DevNet Master
Posts: 2572
Joined: Mon Feb 24, 2003 10:12 am
Location: The Republic of Texas
Contact:

Re: quoting an easy web extraction test

Post by AbraCadaver »

Read the file, use a regex to capture the links, then write them out to your text file. Alternatively, you could load the page with the DOM extension, loop through the anchors, extract the URLs, and write those out to a text file.

http://us2.php.net/manual/en/book.dom.php
mysql_function(): WARNING: This extension is deprecated as of PHP 5.5.0, and will be removed in the future. Instead, the MySQLi or PDO_MySQL extension should be used. See also MySQL: choosing an API guide and related FAQ for more information.
tr0gd0rr
Forum Contributor
Posts: 305
Joined: Thu May 11, 2006 8:58 pm
Location: Utah, USA

Re: quoting an easy web extraction test

Post by tr0gd0rr »

Here is a quick and dirty way that I just tested:

Code: Select all

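<?php
// Fetch the page (requires allow_url_fopen to be enabled).
$html = file_get_contents('http://forums.devnetwork.net/viewtopic.php?f=39&t=110110');
if ($html === false) {
    exit("Could not fetch the page\n");
}
// Group 1 is the opening quote, group 2 the href value up to the matching quote.
preg_match_all('/<a\s[^>]*?href\s*=\s*(\'|")(.*?)\1/si', $html, $matches, PREG_SET_ORDER);
$links = '';
foreach ($matches as $match) {
    $links .= $match[2] . "\n";
}
// Escape the output so the URLs display as text inside <pre>.
echo "<pre>" . htmlspecialchars($links) . "</pre>";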
That echoes this string:

Code: Select all

./index.php?sid=155659f42576cef48964d51018cf8222
./ucp.php?mode=login&sid=155659f42576cef48964d51018cf8222
./ucp.php?mode=register&sid=155659f42576cef48964d51018cf8222
<snip>
./viewtopic.php?p=582433&sid=155659f42576cef48964d51018cf8222#p582433
http://url.com/some.htm
#wrapheader
<snip>
./viewforum.php?f=59&sid=155659f42576cef48964d51018cf8222
./viewforum.php?f=39&sid=155659f42576cef48964d51018cf8222
http://www.phpbb.com/
You'd probably want to add some code that runs parse_url() on the original URL and resolves relative links into absolute ones. You may also want to discard links that are only fragments, such as #wrapheader.
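
The parse_url() step suggested above could look roughly like the sketch below. resolveLink() is a hypothetical helper name, not a built-in, and the code assumes PHP 7.1 or later. It only covers the common cases (absolute, protocol-relative, root-relative and path-relative links, plus ./ and ../ segments), so treat it as a starting point rather than a full RFC 3986 resolver.

```php
<?php
// Hypothetical helper: resolve one href against the URL of the page it came from.
// Returns null for links that should be discarded (empty or fragment-only).
function resolveLink(string $base, string $href): ?string
{
    $href = trim($href);
    // Discard empty links and pure fragments such as "#wrapheader".
    if ($href === '' || $href[0] === '#') {
        return null;
    }
    // Anything that already carries a scheme is returned unchanged.
    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href;
    }
    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // the base URL itself is not usable
    }
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath = $parts['path'] ?? '/';

    if (strpos($href, '//') === 0) {
        // Protocol-relative link: reuse the scheme of the base URL.
        return $parts['scheme'] . ':' . $href;
    }
    if ($href[0] === '/') {
        $path = $href;                       // root-relative
    } elseif ($href[0] === '?') {
        $path = $basePath . $href;           // query-only link
    } else {
        // Path-relative: append to the directory part of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }

    // Keep the query string and fragment out of the dot-segment cleanup.
    $cut  = strcspn($path, '?#');
    $tail = substr($path, $cut);
    $path = substr($path, 0, $cut);

    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '.') {
            continue;
        }
        if ($segment === '..') {
            array_pop($out);
            continue;
        }
        $out[] = $segment;
    }
    return $origin . '/' . ltrim(implode('/', $out), '/') . $tail;
}

// Example run over a few of the links from the output above.
$base = 'http://forums.devnetwork.net/viewtopic.php?f=39&t=110110';
foreach (['./index.php?sid=1', '#wrapheader', 'http://www.phpbb.com/'] as $href) {
    $url = resolveLink($base, $href);
    if ($url !== null) {
        echo $url, "\n";
    }
}
```

Returning null for the links to drop keeps the filtering decision in one place, so the calling loop only has to skip nulls before writing to the text file.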

Re: quoting an easy web extraction test

Post by AbraCadaver »

I thought they could do their own research, but since we're posting code:

Code: Select all

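<?php
$html = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$html->loadHTMLFile('http://url.com/some.htm');
$tags = $html->getElementsByTagName('a');

$links = '';
foreach ($tags as $tag) {
    // Skip anchors that have no href attribute (getAttribute() returns '').
    if ($tag->hasAttribute('href')) {
        $links .= $tag->getAttribute('href') . "\n";
    }
}
file_put_contents('somefile.txt', $links);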

Re: quoting an easy web extraction test

Post by kapil1089theking »

Hi Shawn,

It works with the URL viewtopic.php?f=39&t=110110,
but I need it to work with https://mobile.bet365.com/wap?task=upda ... w!&login=F

When I write

$html->loadHTMLFile('https://mobile.bet365.com/wap?task=upda ... w!&login=F');

it shows these errors:

Notice: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: Unable to find the wrapper "https" - did you forget to enable it when you configured PHP? in C:\wamp\www\paisa\web extractor.php on line 3

Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: I/O warning : failed to load external entity "https://mobile.bet365.com/wap?task=upda ... w!&login=F" in C:\wamp\www\paisa\web extractor.php on line 3

Re: quoting an easy web extraction test

Post by AbraCadaver »

You probably need to enable extension=php_openssl.??? in php.ini.
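
Before changing php.ini, it may help to confirm that the missing wrapper really is the problem. The sketch below is one way to check; wrapperAvailable() is a hypothetical helper name, and it relies only on the built-in parse_url() and stream_get_wrappers() functions. It assumes PHP 7.0 or later for the type declarations.

```php
<?php
// Hypothetical helper: is there a registered stream wrapper for this URL's scheme?
// loadHTMLFile() and file_get_contents() can only open schemes that PHP has a wrapper for.
function wrapperAvailable(string $url): bool
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    return is_string($scheme)
        && in_array(strtolower($scheme), stream_get_wrappers(), true);
}

// Placeholder URLs for illustration only.
foreach (['http://example.com/', 'https://example.com/'] as $url) {
    echo $url, ': ', (wrapperAvailable($url) ? 'wrapper available' : 'no wrapper'), "\n";
}
```

If the https line reports no wrapper, the "Unable to find the wrapper" notice is expected, and calling the check before loadHTMLFile() lets the script fail with a clearer message.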

Re: quoting an easy web extraction test

Post by kapil1089theking »

But why does it work with some URLs and not with others?

Re: quoting an easy web extraction test

Post by AbraCadaver »

Because some are https and some are http.

Re: quoting an easy web extraction test

Post by kapil1089theking »

Does that mean I have to install OpenSSL for https?

Can anyone help me install OpenSSL on WAMP 2.0?

Re: quoting an easy web extraction test

Post by AbraCadaver »

kapil1089theking wrote: that means for https I have to install openssl?

Can anyone help me installing openssl on wamp 2.0
I think all you have to do is uncomment extension=php_openssl.dll in php.ini and restart Apache. Make sure that your PHP directory is in your PATH, or copy the OpenSSL DLLs that ship with PHP (libeay32.dll and ssleay32.dll) to \windows\system.

It's been years since I used Windows.
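
To see whether the php.ini change took effect, a small diagnostic like the one below may be useful. opensslStatus() is a hypothetical helper name; it only wraps the built-in php_ini_loaded_file() and extension_loaded() functions and assumes PHP 7.0 or later. Run it through the web server rather than the command line, since some Windows bundles are reported to use a different php.ini for each.

```php
<?php
// Hypothetical helper: report which php.ini was read and whether openssl is loaded.
function opensslStatus(): array
{
    $ini = php_ini_loaded_file(); // false when no php.ini was loaded
    return [
        'php_ini'        => $ini === false ? '(none)' : $ini,
        'openssl_loaded' => extension_loaded('openssl'),
    ];
}

$status = opensslStatus();
echo 'Loaded php.ini: ', $status['php_ini'], "\n";
echo 'openssl extension loaded: ', ($status['openssl_loaded'] ? 'yes' : 'no'), "\n";
```

If the reported file is not the one that was edited, or openssl still shows as not loaded after restarting Apache, the extension line is probably still commented out in the file that PHP actually reads.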