quoting an easy web extraction test
Posted: Fri Dec 11, 2009 12:53 pm
by kapil1089theking
I have a URL:
http://url.com/some.htm
I want to log all the links in this HTML file to a text file.
Only the links.
How can I do that?
Re: quoting an easy web extraction test
Posted: Fri Dec 11, 2009 2:03 pm
by AbraCadaver
Read the file, use a regex to capture the links, then write them out to your text file. Alternatively, you could load it with the DOM extension, loop through the anchors, extract the URLs, and write them out to a text file.
http://us2.php.net/manual/en/book.dom.php
Re: quoting an easy web extraction test
Posted: Fri Dec 11, 2009 2:08 pm
by tr0gd0rr
Here is a quick and dirty way that I just tested:
Code:
<?php
// Fetch the page (remote URLs need allow_url_fopen enabled).
$html = file_get_contents('http://forums.devnetwork.net/viewtopic.php?f=39&t=110110');
// Group 1 is the opening quote, group 2 the href value up to the matching quote.
preg_match_all('/<a\s[^>]*?href\s*=\s*(\'|")(.*?)\1/si', $html, $matches, PREG_SET_ORDER);
$links = '';
foreach ($matches as $match) {
    $links .= $match[2] . "\n";
}
echo "<pre>" . $links;
It echoes this string:
Code:
./index.php?sid=155659f42576cef48964d51018cf8222
./ucp.php?mode=login&sid=155659f42576cef48964d51018cf8222
./ucp.php?mode=register&sid=155659f42576cef48964d51018cf8222
<snip>
./viewtopic.php?p=582433&sid=155659f42576cef48964d51018cf8222#p582433
http://url.com/some.htm
#wrapheader
<snip>
./viewforum.php?f=59&sid=155659f42576cef48964d51018cf8222
./viewforum.php?f=39&sid=155659f42576cef48964d51018cf8222
http://www.phpbb.com/
You'd probably want to add some code that runs parse_url() on the original url and resolves relative and absolute links. You may also want to discard those that are simply hashes like #wrapheader.
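The clean-up step could be sketched roughly like this. Note that resolve_link() is a made-up helper name, not a PHP built-in, and it assumes the base URL is absolute. It only covers the common cases (fragments, absolute links, root-relative and directory-relative paths), not full RFC 3986 resolution.

```php
<?php
// Hypothetical helper (not a PHP built-in): resolve $href against $base.
// Returns null for links that should be discarded.
function resolve_link($base, $href)
{
    // Discard empty hrefs and pure fragments such as "#wrapheader".
    if ($href === '' || $href[0] === '#') {
        return null;
    }
    // Anything that already has a scheme is returned unchanged.
    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href;
    }
    $parts = parse_url($base);
    if (!is_array($parts) || !isset($parts['scheme'], $parts['host'])) {
        return null; // assumption: the base must be an absolute URL
    }
    // Protocol-relative link: reuse the scheme of the base.
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href;
    }
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if ($href[0] === '/') {
        $path = $href;
    } else {
        // Relative link: append it to the directory of the base path.
        $basePath = isset($parts['path']) ? $parts['path'] : '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }
    // Keep the query string and fragment out of the dot-segment clean-up.
    $cut    = strcspn($path, '?#');
    $suffix = (string) substr($path, $cut);
    $path   = substr($path, 0, $cut);
    // Collapse "." and ".." segments.
    $out = array();
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }
    return $root . implode('/', $out) . $suffix;
}

// Usage: keep only the links that resolve to something.
$base = 'http://forums.devnetwork.net/viewtopic.php?f=39&t=110110';
foreach (array('./index.php?sid=1', '#wrapheader', 'http://www.phpbb.com/') as $href) {
    $abs = resolve_link($base, $href);
    if ($abs !== null) {
        echo $abs, "\n";
    }
}
```

In the loops above you would call resolve_link() on each href before appending it to $links and skip anything that comes back null.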
Re: quoting an easy web extraction test
Posted: Fri Dec 11, 2009 2:23 pm
by AbraCadaver
I thought they could do their own research, but since we're posting code:
Code:
$html = new DOMDocument();
$html->loadHTMLFile('http://url.com/some.htm');
// Collect the href of every <a> element, one per line.
$tags = $html->getElementsByTagName('a');
$links = '';
foreach ($tags as $tag) {
    $links .= $tag->getAttribute('href') . "\n";
}
file_put_contents('somefile.txt', $links);
Re: quoting an easy web extraction test
Posted: Fri Dec 11, 2009 3:00 pm
by kapil1089theking
Hi Shawn,
It works with the URL
viewtopic.php?f=39&t=110110
but I need it to work with
https://mobile.bet365.com/wap?task=upda ... w!&login=F
When I write
$html->loadHTMLFile('https://mobile.bet365.com/wap?task=upda ... w!&login=F');
it shows these errors:
Notice: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: Unable to find the wrapper "https" - did you forget to enable it when you configured PHP? in C:\wamp\www\paisa\web extractor.php on line 3
Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: I/O warning : failed to load external entity "https://mobile.bet365.com/wap?task=upda ... w!&login=F" in C:\wamp\www\paisa\web extractor.php on line 3
Re: quoting an easy web extraction test
Posted: Fri Dec 11, 2009 3:40 pm
by AbraCadaver
You probably need to enable extension=php_openssl.??? in php.ini.
Re: quoting an easy web extraction test
Posted: Fri Dec 11, 2009 3:57 pm
by kapil1089theking
But why does it work with some URLs and not with others?
Re: quoting an easy web extraction test
Posted: Fri Dec 11, 2009 4:47 pm
by AbraCadaver
Because some are https and some are http.
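To check this on your own install, something along these lines should show which schemes PHP can open. can_fetch() is a hypothetical helper name; stream_get_wrappers() and extension_loaded() are standard PHP functions. Without the openssl extension the https wrapper is typically not registered, which is what the notice is complaining about.

```php
<?php
// Hypothetical helper (not a PHP built-in): true if a stream wrapper
// is registered for the scheme of $url.
function can_fetch($url)
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if (!is_string($scheme)) {
        return false; // no scheme, so not an absolute URL
    }
    return in_array(strtolower($scheme), stream_get_wrappers(), true);
}

var_dump(extension_loaded('openssl'));            // is the OpenSSL extension loaded?
var_dump(can_fetch('http://url.com/some.htm'));   // plain http
var_dump(can_fetch('https://url.com/some.htm'));  // https depends on the openssl extension
```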
Re: quoting an easy web extraction test
Posted: Fri Dec 11, 2009 4:52 pm
by kapil1089theking
Does that mean I have to install OpenSSL for https?
Can anyone help me install OpenSSL on WAMP 2.0?
Re: quoting an easy web extraction test
Posted: Fri Dec 11, 2009 5:35 pm
by AbraCadaver
kapil1089theking wrote: Does that mean I have to install OpenSSL for https?
Can anyone help me install OpenSSL on WAMP 2.0?
I think all you have to do is uncomment extension=php_openssl.dll in php.ini and restart Apache. Make sure that your PHP directory is in your PATH, or copy libeay32.dll and ssleay32.dll to \windows\system32.
It's been years since I used Windows.