quoting an easy web extraction test
Moderator: General Moderators
-
kapil1089theking
- Forum Commoner
- Posts: 46
- Joined: Wed May 28, 2008 1:51 pm
- Location: Kolkata, India
- Contact:
quoting an easy web extraction test
I have a URL: http://url.com/some.htm
I want to log all the links in this HTML file to a text file.
Only the links.
How do I do that?
- AbraCadaver
- DevNet Master
- Posts: 2572
- Joined: Mon Feb 24, 2003 10:12 am
- Location: The Republic of Texas
- Contact:
Re: quoting an easy web extraction test
Read the file, then use a regex to capture the links, then write out to your text file. Alternately, you could load it up with the DOM extension and loop through the anchors and extract the URLs and write out to a text file.
http://us2.php.net/manual/en/book.dom.php
mysql_function(): WARNING: This extension is deprecated as of PHP 5.5.0, and will be removed in the future. Instead, the MySQLi or PDO_MySQLextension should be used. See also MySQL: choosing an API guide and related FAQ for more information.
Re: quoting an easy web extraction test
Here is a quick and dirty way that I just tested:
Code: Select all
<?php
$html = file_get_contents('http://forums.devnetwork.net/viewtopic.php?f=39&t=110110');
// Group 1 is the opening quote; group 2 is the href value up to the matching closing quote.
preg_match_all('/<a\s[^>]*?href\s*=\s*(\'|")([^>]*?)\1/si', $html, $matches, PREG_SET_ORDER);
$links = '';
foreach ($matches as $match) {
    $links .= $match[2] . "\n";
}
echo "<pre>" . $links;
echoes this string:
Code: Select all
./index.php?sid=155659f42576cef48964d51018cf8222
./ucp.php?mode=login&sid=155659f42576cef48964d51018cf8222
./ucp.php?mode=register&sid=155659f42576cef48964d51018cf8222
<snip>
./viewtopic.php?p=582433&sid=155659f42576cef48964d51018cf8222#p582433
http://url.com/some.htm
#wrapheader
<snip>
./viewforum.php?f=59&sid=155659f42576cef48964d51018cf8222
./viewforum.php?f=39&sid=155659f42576cef48964d51018cf8222
http://www.phpbb.com/
You'd probably want to add some code that runs parse_url() on the original URL and resolves relative links into absolute ones. You may also want to discard those that are simply hashes like #wrapheader.
- AbraCadaver
- DevNet Master
- Posts: 2572
- Joined: Mon Feb 24, 2003 10:12 am
- Location: The Republic of Texas
- Contact:
Re: quoting an easy web extraction test
I thought they could do their own research, but since we're posting code:
Code: Select all
$html = new DOMDocument();
$html->loadHTMLFile('http://url.com/some.htm');
$tags = $html->getElementsByTagName('a');
$links = '';
foreach ($tags as $tag) {
    $links .= $tag->getAttribute('href') . "\n";
}
file_put_contents('somefile.txt', $links);
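Both snippets in this thread log the raw href values, so relative links such as ./index.php?... and bare fragments such as #wrapheader end up in the text file unchanged. The following is a minimal sketch of the parse_url() clean-up suggested earlier in the thread. The function names resolve_link() and absolute_links() are made up for illustration, and the resolution logic only covers the common cases; it is not a complete RFC 3986 implementation.

```php
<?php
// Hypothetical helpers for illustration; neither function appears in the thread.

// Resolve $href against the absolute URL $base.
// Returns null for values not worth logging (empty, fragment-only, unparsable).
function resolve_link($base, $href)
{
    $href = trim($href);
    if ($href === '' || $href[0] === '#') {
        return null; // e.g. "#wrapheader"
    }
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        return null; // the base itself must be an absolute URL
    }
    if (strpos($href, '//') === 0) {
        return $b['scheme'] . ':' . $href; // protocol-relative link
    }
    $scheme = parse_url($href, PHP_URL_SCHEME);
    if ($scheme === false) {
        return null; // parse_url() gave up on a malformed value
    }
    if ($scheme !== null) {
        return $href; // already absolute
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = isset($b['path']) ? $b['path'] : '/';
    if ($href[0] === '/') {
        return $root . $href; // root-relative
    }
    if ($href[0] === '?') {
        return $root . $basePath . $href; // query-only: keep the base path
    }
    // Path-relative: start from the directory of the base document.
    $dir  = substr($basePath, 0, strrpos($basePath, '/') + 1);
    $cut  = strcspn($href, '?#'); // offset where the query or fragment starts
    $path = $dir . substr($href, 0, $cut);
    $rest = substr($href, $cut);
    $out = array();
    foreach (explode('/', $path) as $seg) {
        if ($seg === '.') {
            continue;
        }
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out); // never pop the leading empty segment
            }
            continue;
        }
        $out[] = $seg;
    }
    return $root . implode('/', $out) . $rest;
}

// Resolve a list of raw href values, dropping rejects and duplicates.
function absolute_links($base, array $hrefs)
{
    $seen = array();
    foreach ($hrefs as $href) {
        $url = resolve_link($base, $href);
        if ($url !== null) {
            $seen[$url] = true;
        }
    }
    return array_keys($seen);
}
```

To combine this with the DOM loop, one option is to collect each getAttribute('href') value into an array instead of a string, then write implode("\n", absolute_links($pageUrl, $hrefs)) to the text file, where $pageUrl and $hrefs are placeholder names for the page address and the collected values. Links with other schemes (mailto:, javascript:) pass through unchanged in this sketch, so filter on the scheme if only http and https links are wanted.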
-
kapil1089theking
- Forum Commoner
- Posts: 46
- Joined: Wed May 28, 2008 1:51 pm
- Location: Kolkata, India
- Contact:
Re: quoting an easy web extraction test
Hi Shawn,
It works with the URL viewtopic.php?f=39&t=110110,
but I need it to work with https://mobile.bet365.com/wap?task=upda ... w!&login=F
When I write
$html->loadHTMLFile('https://mobile.bet365.com/wap?task=upda ... w!&login=F');
it shows these errors:
Notice: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: Unable to find the wrapper "https" - did you forget to enable it when you configured PHP? in C:\wamp\www\paisa\web extractor.php on line 3
Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: I/O warning : failed to load external entity "https://mobile.bet365.com/wap?task=upda ... w!&login=F" in C:\wamp\www\paisa\web extractor.php on line 3
- AbraCadaver
- DevNet Master
- Posts: 2572
- Joined: Mon Feb 24, 2003 10:12 am
- Location: The Republic of Texas
- Contact:
Re: quoting an easy web extraction test
You probably need to enable the extension=php_openssl.??? in php.ini.
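The "Unable to find the wrapper" notice means PHP has no https:// stream wrapper registered. The http:// wrapper is built in, while https:// is only available once the openssl extension is loaded. A quick way to confirm this before editing anything is sketched below; https_wrapper_status() is a made-up name, but the functions it calls are standard PHP.

```php
<?php
// Hypothetical diagnostic helper; the name is not from the thread.
// Reports whether this PHP build can currently open https:// URLs.
function https_wrapper_status()
{
    return array(
        // true once the openssl extension is enabled in php.ini
        'openssl_loaded' => extension_loaded('openssl'),
        // true if https:// is among the registered stream wrappers
        'https_wrapper'  => in_array('https', stream_get_wrappers(), true),
        // path of the php.ini file actually in use, or false if none was loaded
        'php_ini'        => php_ini_loaded_file(),
    );
}
```

Calling var_dump(https_wrapper_status()); from a page served by the same Apache install shows which php.ini to edit. On WAMP setups the web server and the command line may read different php.ini files, so run the check through the browser rather than the CLI.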
-
kapil1089theking
- Forum Commoner
- Posts: 46
- Joined: Wed May 28, 2008 1:51 pm
- Location: Kolkata, India
- Contact:
Re: quoting an easy web extraction test
But why does it work with some URLs and not with others?
- AbraCadaver
- DevNet Master
- Posts: 2572
- Joined: Mon Feb 24, 2003 10:12 am
- Location: The Republic of Texas
- Contact:
Re: quoting an easy web extraction test
Because some are https and some are http.
-
kapil1089theking
- Forum Commoner
- Posts: 46
- Joined: Wed May 28, 2008 1:51 pm
- Location: Kolkata, India
- Contact:
Re: quoting an easy web extraction test
Does that mean I have to install OpenSSL for https?
Can anyone help me install OpenSSL on WAMP 2.0?
- AbraCadaver
- DevNet Master
- Posts: 2572
- Joined: Mon Feb 24, 2003 10:12 am
- Location: The Republic of Texas
- Contact:
Re: quoting an easy web extraction test
kapil1089theking wrote:that means for https I have to install openssl?
Can anyone help me installing openssl on wamp 2.0
I think all you have to do is uncomment extension=php_openssl.dll in php.ini and restart Apache. Make sure that your PHP directory is in your path, or copy libeay32.dll and ssleay32.dll from the PHP directory to \windows\system. It's been years since I used Windows.
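For reference, the php.ini change described here would look roughly like the fragment below on a Windows build of that era. The location of php.ini depends on the WAMP install and is not given in the thread. On PHP 7.2 and later the same line is written as extension=openssl.

```ini
; Before: the line is disabled by the leading semicolon
;extension=php_openssl.dll

; After: remove the semicolon, save the file, then restart Apache
extension=php_openssl.dll
```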