Php link exporter
Posted: Sun Nov 22, 2009 6:59 pm
by The_L
This is my code:
taggrab.class.php
Code: Select all
<?php
class tagSpider
{
    var $ch;     // this will hold our cURL handle
    var $html;   // this is where we dump the HTML we get
    var $binary; // set for binary type transfer
    var $url;    // this is the URL we are going to do a pass on

    function __construct()
    {
        $this->html = "";
        $this->binary = 0;
        $this->url = "";
    }

    function fetchPage($url)
    {
        $this->url = $url;
        if (isset($this->url)) {
            $this->ch = curl_init(); // start cURL instance
            curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1); // this tells cURL to return the data
            curl_setopt($this->ch, CURLOPT_URL, $this->url); // set the URL to download
            curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
            curl_setopt($this->ch, CURLOPT_BINARYTRANSFER, $this->binary); // tell cURL if the data is binary data or not
            $this->html = curl_exec($this->ch); // grabs the webpage from the internet
            curl_close($this->ch); // closes the connection
        }
    }

    function parse_array($beg_tag, $close_tag) // this function takes the grabbed HTML and picks out the pieces we want
    {
        preg_match_all("($beg_tag.*$close_tag)siU", $this->html, $matching_data); // match data between specified tags
        return $matching_data[0];
    }
}
?>
tag-example.php
Code: Select all
<?php
// Include our tag grab class
require("taggrab.class.php"); // class for spider
// Enter the URL you want to run
$urlrun="some url";
// Specify the start and end tags you want to grab data between
$stag="<a href=";
$etag="</a>";
// Make a title spider
$tspider = new tagSpider();
// Pass URL to the fetch page function
$tspider->fetchPage($urlrun);
// Enter the tags into the parse array function
$linkarray = $tspider->parse_array($stag, $etag);
echo "<h2>Links present on page: ".$urlrun."</h2><br />";
// Loop to pump out the results
foreach ($linkarray as $result) {
    echo $result;
    echo "<br/>";
}
?>
The script works just fine, but I need an input field that will define the $urlrun var in the second file. I tried almost everything, but all I get is errors. Can someone help me with this?
Thanks.
Re: Php link exporter
Posted: Mon Nov 23, 2009 4:36 am
by The_L
Can anyone take a look?? :/
Re: Php link exporter
Posted: Mon Nov 23, 2009 4:58 am
by Apollo
Exactly what kind of error message do you get?
At first glance, the only problem I see is missing begin and end delimiter chars (to separate expression from modifiers) in your regular expression:
The_L wrote:preg_match_all("($beg_tag.*$close_tag)siU",
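A side note on the delimiters: as far as I know, PHP's preg functions also accept a matching bracket pair such as ( and ) as the delimiters, which would explain why the class works as posted. The sketch below compares the quoted pattern with a # delimited version that also escapes the tags with preg_quote(). The sample HTML is made up for illustration and is not from the thread.

```php
<?php
// Hypothetical sample input, not taken from the thread.
$html = '<p><a href="http://example.com/a">A</a> and <a href="http://example.com/b">B</a></p>';
$beg_tag = '<a href=';
$close_tag = '</a>';

// Bracket-style delimiters: "(" opens the pattern, ")" closes it, "siU" are the modifiers.
preg_match_all("($beg_tag.*$close_tag)siU", $html, $with_brackets);

// Same idea with "#" delimiters and the tags escaped so they are matched literally.
$pattern = '#' . preg_quote($beg_tag, '#') . '.*' . preg_quote($close_tag, '#') . '#siU';
preg_match_all($pattern, $html, $with_hash);

// Both calls should find the same two anchors.
var_dump($with_brackets[0] === $with_hash[0]);
```

Escaping with preg_quote() mainly matters once the tags contain characters like . or ? that have a special meaning in a regex.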
Re: Php link exporter
Posted: Mon Nov 23, 2009 5:28 am
by The_L
Both files work fine. I just want to add an input field and a button for
Code: Select all
// Enter the URL you want to run
$urlrun="some url";
Re: Php link exporter
Posted: Mon Nov 23, 2009 6:07 am
by Apollo
Then what's the problem? Just a simple form would do I guess?
Code: Select all
$urlrun = isset($_POST['urlrun']) ? trim($_POST['urlrun']) : ''; // avoids an "undefined index" notice on first load
if ($urlrun === '') die("<form method='post'><input type='text' name='urlrun'> <input type='submit'></form>");
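A slightly more defensive take on the same form, as a rough sketch only. The helper name submittedUrl() is made up for this example, and restricting the input to http/https is an assumption about what the spider should be allowed to fetch.

```php
<?php
// Hypothetical helper: returns a validated http(s) URL from the form data, or null.
function submittedUrl(array $post)
{
    $url = (isset($post['urlrun']) && is_string($post['urlrun'])) ? trim($post['urlrun']) : '';
    if ($url === '' || filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }
    // Assumption: only http and https pages should be fetched.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, array('http', 'https'), true) ? $url : null;
}

$urlrun = submittedUrl($_POST);
if ($urlrun === null) {
    // Nothing usable was posted yet, so show the form.
    echo "<form method='post'><input type='text' name='urlrun'> <input type='submit'></form>";
} else {
    // Escape the value before echoing it back into the page.
    echo "<h2>Links present on page: " . htmlspecialchars($urlrun, ENT_QUOTES, 'UTF-8') . "</h2>";
    // fetchPage($urlrun) and parse_array(...) would follow here, as in tag-example.php.
}
```

Validating first means the script never passes an empty or malformed value to cURL, and escaping avoids printing raw user input into the page.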
Re: Php link exporter
Posted: Mon Nov 23, 2009 6:55 am
by timWebUK
Can you not create an HTML form then POST the URL, and process it using your PHP file? Or have I missed something...
Re: Php link exporter
Posted: Mon Nov 23, 2009 2:42 pm
by The_L
Great, it's just perfect, THANKS! Instead of opening a new topic I'll ask here again...
this part:
How should I make it list only URLs that begin with http://youtube.com/ and http://google.com/ (for example)? I tried:
Code: Select all
$stag="<a href=http://google.com/";
$etag="</a>";
But it won't list anything...
And when I try this:
Code: Select all
$stag="<a href=";
$etag="" class="bbc_link new_win" target="_blank">";
Then I get this error:
Code: Select all
Parse error: syntax error, unexpected T_CLASS in *host path*/test/tag-example.php on line 12
Re: Php link exporter
Posted: Mon Nov 23, 2009 2:48 pm
by iankent
Even if your regex is correct (which I can't guarantee as I'm no regex expert!), you'll probably find most google/youtube links won't start with http://google.com/ etc. but will instead be http://www.google.com/ (or google.com.au, google.co.uk etc). You may want to match http://*.google.* instead (no idea what that is as a regex, sorry - really must learn!)
edit:
Actually, if you want to match google links and be sure that it's definitely from google, you'd need to match the TLD part against a list of valid ones, or better still against a list of google-owned ones. Just matching http://*.google.* would also match http://something.google.anothersite.com/, which you may want to exclude.
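One way to do the "list of owned domains" check without a regex is to compare the host part of each URL against an allow-list. This is only a sketch: hostIsAllowed() and the three domains in $allowedDomains are made up for the example, and a real list would be much longer.

```php
<?php
// Hypothetical allow-list; a real list of Google-owned domains would be much longer.
$allowedDomains = array('google.com', 'google.co.uk', 'youtube.com');

// True when the URL's host is an allowed domain or a subdomain of one.
function hostIsAllowed($url, array $allowedDomains)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no usable host in this URL
    }
    $host = strtolower($host);
    foreach ($allowedDomains as $domain) {
        $suffix = '.' . $domain;
        if ($host === $domain || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

var_dump(hostIsAllowed('http://www.google.com/search', $allowedDomains));
var_dump(hostIsAllowed('http://something.google.anothersite.com/', $allowedDomains));
```

Because the comparison is anchored to the end of the host name, something.google.anothersite.com is rejected while www.google.com and images.google.com pass.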
The_L wrote:
And when I try this:
Code: Select all
$stag="<a href=";
$etag="" class="bbc_link new_win" target="_blank">";
Then I get this error:
Code: Select all
Parse error: syntax error, unexpected T_CLASS in *host path*/test/tag-example.php on line 12
You can't put a " inside "" without escaping it. I.e., on the line:
$etag="" class="bbc_link new_win" target="_blank">";
you're opening the double quotes and then closing them straight away, so class= etc. is being treated as PHP. It should be this:
Code: Select all
$etag="\" class=\"bbc_link new_win\" target=\"_blank\">";
Alternatively, you could enclose it in single quotes, which would allow the double quotes to be included as normal.
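A quick check that the two spellings give the same string (the variable names $escaped and $single are just for this example):

```php
<?php
// Escaped double quotes inside a double-quoted string.
$escaped = "\" class=\"bbc_link new_win\" target=\"_blank\">";
// The same text in single quotes, where " needs no escaping.
$single = '" class="bbc_link new_win" target="_blank">';

var_dump($escaped === $single);
```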
hth
Re: Php link exporter
Posted: Mon Nov 23, 2009 3:05 pm
by The_L
iankent wrote:even if your regex is correct (which I can't guarantee as I'm no regex expert!), you'll probably find most google/youtube links won't start with http://google.com/ etc. but will instead be http://www.google.com/ (or google.com.au, google.co.uk etc). You may want to match http://*.google.* instead (no idea what that is as a regex, sorry - really must learn!)
Wouldn't it be easier just to insert all variations of the google site? Like:
google.com/
http://google.com/
http://www.google.com/
http://www.google.com/
The problem is that I don't know how to write an "OR" command xD
As for the second problem, it works. Just to make it clear: before every " (which is not part of the code) I should put a \ ?
Re: Php link exporter
Posted: Mon Nov 23, 2009 3:13 pm
by iankent
When you say an OR command, what do you mean? If you want to match against a list of possible items you can use an array, for example:
Code: Select all
$possibilities = array('http://google.com/', 'http://www.google.com/', 'http://www.google.co.uk/', 'http://google.co.uk/');
foreach($possibilities as $possibility) {
// run your existing regexp here
}
But that's a bit of a messy solution, and it's almost guaranteed you won't account for every google URL available. What if you come across the URL http://images.google.com/, should that match or not? You can do it either way, but a decent regex will be a lot more accurate and a lot faster, and means you don't have to manually type out every possible google URL variation you can think of. It's just a matter of learning regex well enough or finding somebody willing to help. Personally I don't have a clue lol.
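For what it's worth, the "OR" in a regex is the | character. A rough sketch of turning the array above into a single pattern is below. The sample HTML is made up, and the pattern assumes the page writes its href values in double quotes.

```php
<?php
// Hypothetical sample HTML, not taken from the thread.
$html = '<a href="http://www.google.com/x">G</a> <a href="http://example.org/">E</a> '
      . '<a href="http://youtube.com/watch?v=1">Y</a>';

$possibilities = array('http://google.com/', 'http://www.google.com/', 'http://youtube.com/');

// Escape each prefix, then join them with "|" so the regex reads "prefix1 OR prefix2 OR ...".
$quoted = array();
foreach ($possibilities as $possibility) {
    $quoted[] = preg_quote($possibility, '#');
}
// Assumes href values are wrapped in double quotes; group 1 captures the whole URL.
$pattern = '#<a\s+href="((?:' . implode('|', $quoted) . ')[^"]*)"#i';

preg_match_all($pattern, $html, $matches);
print_r($matches[1]);
```

$matches[1] then holds only the URLs themselves, without the surrounding tag or link text.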
The_L wrote:
As for the second problem, it works. Just to make it clear: before every " (which is not part of the code) I should put a \ ?
Correct - if you're putting a value in a string using double quotes, e.g. "blah", any 'special characters' inside it need to be escaped with a backslash. So \n is newline, \r is carriage-return, \t is tab, \\ is a backslash, \" is ". There are others but I can't remember them :p
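A tiny illustration of the difference, with variable names chosen just for this example:

```php
<?php
// Double quotes interpret escape sequences; single quotes keep most of them literally.
$newline = "\n";  // one character: a newline
$literal = '\n';  // two characters: a backslash and the letter n
$backslashDouble = "\\"; // a single backslash
$backslashSingle = '\\'; // also a single backslash

var_dump(strlen($newline));
var_dump(strlen($literal));
var_dump($backslashDouble === $backslashSingle);
```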
Re: Php link exporter
Posted: Mon Nov 23, 2009 3:26 pm
by The_L
Hehe, you are really clearing up PHP for me...
iankent wrote:But that's a bit of a messy solution, and it's almost guaranteed you won't account for every google URL available. What if you come across the URL http://images.google.com/, should that match or not? You can do it either way, but a decent regex will be a lot more accurate and a lot faster, and means you don't have to manually type out every possible google URL variation you can think of. It's just a matter of learning regex well enough or finding somebody willing to help. Personally I don't have a clue lol.
I guess you got it wrong. When I said google.com in my first post I meant it as an example, so it doesn't really have to be google. It should work for any ordinary site, so I'm guessing that the "http://www...", "www...", "http://..." and "justurl.com" combinations are just fine...
Re: Php link exporter
Posted: Mon Nov 23, 2009 3:32 pm
by iankent
The_L wrote:I guess you got it wrong. When I said google.com in my first post I meant it as an example, so it doesn't really have to be google. It should work for any ordinary site, so I'm guessing that the "http://www...", "www...", "http://..." and "justurl.com" combinations are just fine...
Ah I see, you don't want to match google/youtube, you want to match any URL you come across as long as it's a URL?
Here's a good tip for you (but be careful around licensing etc. if you're going to sell/redistribute your code) - have a look in the status.net source code (http://status.net/ - it's like twitter), and there's a handy function that uses a single regex to match almost all recognised URLs. You could also have a look in the phpbb source code which I'm sure will contain similar useful regexes!
Re: Php link exporter
Posted: Mon Nov 23, 2009 3:49 pm
by The_L
Hehe, I'm not gonna sell anything. To make it all clear: I want to export forum post links, so when someone posts a lot of links I want to get them without any text, simply just the links. So when someone posts tons of youtube links I just want to copy them. I don't think this should be so confusing, it should be simple.

Re: Php link exporter
Posted: Mon Nov 23, 2009 4:03 pm
by iankent
The_L wrote:Hehe, I'm not gonna sell anything. To make it all clear: I want to export forum post links, so when someone posts a lot of links I want to get them without any text, simply just the links. So when someone posts tons of youtube links I just want to copy them. I don't think this should be so confusing, it should be simple.

If that's all you want to do then the status.net regex should do exactly what you need. Just looked it up, this regex should find all the matches you need
Code: Select all
$regex = '#'.
'(?:^|[\s\(\)\[\]\{\}\\\'\\\";]+)(?![\@\!\#])'.
'('.
'(?:'.
'(?:'. //Known protocols
'(?:'.
'(?:(?:https?|ftps?|mms|rtsp|gopher|news|nntp|telnet|wais|file|prospero|webcal|irc)://)'.
'|'.
'(?:(?:mailto|aim|tel|xmpp):)'.
')'.
'(?:[\pN\pL\-\_\+\%\~]+(?::[\pN\pL\-\_\+\%\~]+)?\@)?'. //user:pass@
'(?:'.
'(?:'.
'\[[\pN\pL\-\_\:\.]+(?<![\.\:])\]'. //[dns]
')|(?:'.
'[\pN\pL\-\_\:\.]+(?<![\.\:])'. //dns
')'.
')'.
')'.
'|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'. //IPv4
'|(?:'. //IPv6
'\[?(?:(?:(?:[0-9A-Fa-f]{1,4}:){7}(?:(?:[0-9A-Fa-f]{1,4})|:))|(?:(?:[0-9A-Fa-f]{1,4}:){6}(?::|(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})|(?::[0-9A-Fa-f]{1,4})))|(?:(?:[0-9A-Fa-f]{1,4}:){5}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){4}(?::[0-9A-Fa-f]{1,4}){0,1}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){3}(?::[0-9A-Fa-f]{1,4}){0,2}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){2}(?::[0-9A-Fa-f]{1,4}){0,3}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:)(?::[0-9A-Fa-f]{1,4}){0,4}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?::(?::[0-9A-Fa-f]{1,4}){0,5}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})))\]?(?<!:)'.
')|(?:'. //DNS
'(?:[\pN\pL\-\_\+\%\~]+(?:\:[\pN\pL\-\_\+\%\~]+)?\@)?'. //user:pass@
'[\pN\pL\-\_]+(?:\.[\pN\pL\-\_]+)*\.'.
//tld list from http://data.iana.org/TLD/tlds-alpha-by-domain.txt, also added local, loc, and onion
'(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN--0ZWM56D|测试|XN--11B5BS3A9AJ6G|परीक्षा|XN--80AKHBYKNJ4F|испытание|XN--9T4B11YI5A|테스트|XN--DEBA0AD|טעסט|XN--G6W251D|測試|XN--HGBK6AJ7F53BBA|آزمایشی|XN--HLCJ6AYA9ESC7A|பரிட்சை|XN--JXALPDLP|δοκιμή|XN--KGBECHTV|إختبار|XN--ZCKZAH|テスト|YE|YT|YU|ZA|ZM|ZW|local|loc|onion)'.
')(?![\pN\pL\-\_])'.
')'.
'(?:'.
'(?:\:\d+)?'. //:port
'(?:/[\pN\pL$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"@]*)?'. // /path
'(?:\?[\pN\pL\$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"@\/]*)?'. // ?query string
'(?:\#[\pN\pL$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"\@/\?\#]*)?'. // #fragment
')(?<![\?\.\,\#\,])'.
')'.
'#ixu';
courtesy of status.net
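If the goal is only to pull the link targets out of a fetched page, a different technique from the regex above is to let PHP's DOM extension parse the HTML and read each href attribute. The sketch below is not the status.net approach: extractLinks() is a made-up helper name, the sample markup is invented, and it assumes the DOM extension is enabled.

```php
<?php
// Hypothetical helper: returns the href value of every <a> element in $html.
function extractLinks($html)
{
    if (trim($html) === '') {
        return array(); // nothing to parse
    }
    $doc = new DOMDocument();
    // Real-world pages are rarely valid markup, so keep libxml warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    // Drop duplicates but keep the original order.
    return array_values(array_unique($links));
}

// Hypothetical sample input standing in for the fetched forum page.
$sample = '<p><a href="http://youtube.com/watch?v=1" class="bbc_link new_win">one</a> text '
        . '<a href="http://example.com/page.html">two</a></p>';
foreach (extractLinks($sample) as $link) {
    echo $link, "\n";
}
```

In tag-example.php the $sample string would be replaced by the HTML that fetchPage() stored in $tspider->html.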

Re: Php link exporter
Posted: Mon Nov 23, 2009 4:12 pm
by The_L
Hmm... I tried it, but it does not work for me...
Can you please just show me how to add this:
Code: Select all
$possibilities = array('http://google.com/', 'http://www.google.com/', 'http://www.google.co.uk/', 'http://google.co.uk/');
foreach($possibilities as $possibility) {
// run your existing regexp here
}
into
Code: Select all
// Specify the start and end tags you want to grab data between
$stag="<a href=";
$etag=".html\" class=\"bbc_link new_win\" target=\"_blank\">";
:/
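One possible way to combine the two, as a sketch under a few assumptions: grabBetween() is a made-up stand-in for tagSpider::parse_array() so the example runs on its own, the sample HTML is invented, and the page is assumed to write href values in double quotes right after <a href=.

```php
<?php
// Stand-in for tagSpider::parse_array(), with the tags escaped for use in a regex.
function grabBetween($html, $beg_tag, $close_tag)
{
    $pattern = '#' . preg_quote($beg_tag, '#') . '.*' . preg_quote($close_tag, '#') . '#siU';
    preg_match_all($pattern, $html, $matching_data);
    return $matching_data[0];
}

// Hypothetical sample page; in tag-example.php this would be the fetched HTML.
$html = '<a href="http://www.google.com/a.html" class="bbc_link new_win" target="_blank">a</a>'
      . '<a href="http://example.org/b.html" class="bbc_link new_win" target="_blank">b</a>'
      . '<a href="http://google.co.uk/c.html" class="bbc_link new_win" target="_blank">c</a>';

$possibilities = array('http://google.com/', 'http://www.google.com/', 'http://www.google.co.uk/', 'http://google.co.uk/');
$etag = '.html" class="bbc_link new_win" target="_blank">';

$linkarray = array();
foreach ($possibilities as $possibility) {
    // Build the start tag for this prefix, then run the existing matching step.
    $stag = '<a href="' . $possibility;
    $linkarray = array_merge($linkarray, grabBetween($html, $stag, $etag));
}

foreach ($linkarray as $result) {
    echo htmlspecialchars($result), "<br/>";
}
```

With the real class, the call inside the loop would be $tspider->parse_array($stag, $etag) instead of grabBetween().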