
Re: Php link exporter

Posted: Mon Nov 23, 2009 4:25 pm
by iankent
You don't need to use the $stag and $etag bits now. The regex takes care of working out which bits are links and which aren't, so you can completely ignore the fact that it's in HTML.

You need to run the regex against the page's HTML source, which will pick out all of the URLs in the HTML that follow the standard URL format. In your parse_array() function you should be able to do something like this:

Code: Select all

 
preg_match_all($regex, $this->html, $matching_data);
return $matching_data;
 
where $regex is the one from my earlier post

I think that should return an array of URLs, though I haven't tested it. Let me know if you have any problems with it!
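One detail worth checking when testing this: preg_match_all() itself returns the number of matches and fills $matching_data with a nested array, where index 0 holds the full matches and index 1 holds whatever the first capture group matched. The status.net regex appears to wrap the URL in one outer capture group, so index 1 is probably the list you want. A minimal sketch, using a much simpler stand-in pattern and a made-up HTML string (both are assumptions for illustration, not the real regex or page):

```php
<?php
// Stand-in pattern for illustration only: one capture group around a bare http(s) URL.
$regex = '#(?:^|[\s"\'(]+)(https?://[^\s"\'<>]+)#i';
$html  = '<p>See <a href="http://example.com/a.html">a</a> and http://example.org/b</p>';

$count = preg_match_all($regex, $html, $matching_data);

// $matching_data[0] holds the full matches (including the leading quote or space),
// $matching_data[1] holds only what the first capture group matched.
$urls = $matching_data[1];
print_r($urls);
```

Looping over $matching_data directly would walk over those sub-arrays rather than over the individual URLs.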

Re: Php link exporter

Posted: Mon Nov 23, 2009 4:37 pm
by The_L
omg I'm so confused now... are you talking about the $possibilities code???? O.0

Re: Php link exporter

Posted: Mon Nov 23, 2009 4:44 pm
by iankent
The_L wrote:omg I'm so confused now... are you talking about the $possibilities code???? O.0
You don't need to use the array of possibilities, I suggested that when I thought you were looking for google/youtube links, and only to suggest it as a bad idea lol.

To achieve what you want, personally I'd forget about the tag matching and use the regex above to do some URL matching instead. You can then filter the results for unwanted URLs. It's far quicker, more accurate and does what you want with little fuss. You don't need to understand the complicated regex (I certainly don't lol), but it does the job!
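The filtering step could look roughly like this. It is only a sketch: filter_urls() and the list of blocked hosts are hypothetical names made up for the example, and it assumes you already have a flat array of URL strings.

```php
<?php
// Hypothetical helper: drop URLs whose host is on an unwanted list.
function filter_urls(array $urls, array $blocked_hosts)
{
    $kept = array();
    foreach ($urls as $url) {
        $host = parse_url($url, PHP_URL_HOST);
        // parse_url() gives null or false when there is no usable host; skip those too.
        if (!is_string($host) || in_array(strtolower($host), $blocked_hosts, true)) {
            continue;
        }
        $kept[] = $url;
    }
    return $kept;
}

$urls = array(
    'http://site.com/page.html',
    'http://www.google.com/search?q=x',
    'http://youtube.com/watch',
);
$wanted = filter_urls($urls, array('www.google.com', 'youtube.com'));
print_r($wanted);
```

The comparison is on the exact host name, so "google.com" and "www.google.com" would need separate entries in the list.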

To use the regex, in tag-example.php you can drop lines 9-11, and change line 20 to:

Code: Select all

$linkarray = $tspider->parse_array();
You'd then change your parse_array() function in taggrab.class.php to:

Code: Select all

 
function parse_array() // takes the grabbed HTML and picks out the URLs we want
{
    $regex = "get this from my earlier post - too long to include again";
    preg_match_all($regex, $this->html, $matching_data); // match every URL in the HTML
    return $matching_data;
}
 
When you then run the function it should return an array containing all of the links in the HTML source, so the rest of your tag-example.php page should work as you expect

hth

Re: Php link exporter

Posted: Mon Nov 23, 2009 4:56 pm
by The_L
Looks cool at first sight...but i got few errors....

Code: Select all

Warning: preg_match_all() [function.preg-match-all]: Delimiter must not be alphanumeric or backslash in /*path*/test/taggrab.class.php on line 42
and

Code: Select all

Warning: Invalid argument supplied for foreach() in /*path*/test/tag-example.php on line 21

:/

Re: Php link exporter

Posted: Mon Nov 23, 2009 5:15 pm
by iankent
The second error is because of the first one, I think :P It's trying to loop over an array that wasn't created. The problem seems to be with the regex. I'll try it out now and see what happens.
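For reference, PHP raises that first warning whenever the pattern string starts with a letter, digit or backslash, because the first character of the string is taken as the delimiter. The placeholder text in the parse_array() example would trigger it too if pasted in as-is, since it starts with "g". A quick sketch with made-up patterns (the @ is only there to silence the warning for the demo):

```php
<?php
$html = 'visit http://example.com today';

// No delimiter: the pattern starts with "h", so PHP warns and returns false.
$bad = @preg_match_all('https?://\S+', $html, $m1);

// Same pattern wrapped in "#" delimiters, the way the status.net regex is.
$good = preg_match_all('#https?://\S+#', $html, $m2);

var_dump($bad);
var_dump($good);
```

So the first thing to check is that $regex really begins with '#' and ends with '#ixu'.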

Re: Php link exporter

Posted: Mon Nov 23, 2009 5:20 pm
by iankent
It may not have copied properly the first time - I tried copying and pasting the posted one into my editor and it doesn't work, so here it is again straight from the status.net source:

Code: Select all

 
  $regex = '#'.
    '(?:^|[\s\(\)\[\]\{\}\\\'\\\";]+)(?![\@\!\#])'.
    '('.
        '(?:'.
            '(?:'. //Known protocols
                '(?:'.
                    '(?:(?:https?|ftps?|mms|rtsp|gopher|news|nntp|telnet|wais|file|prospero|webcal|irc)://)'.
                    '|'.
                    '(?:(?:mailto|aim|tel|xmpp):)'.
                ')'.
                '(?:[\pN\pL\-\_\+\%\~]+(?::[\pN\pL\-\_\+\%\~]+)?\@)?'. //user:pass@
                '(?:'.
                    '(?:'.
                        '\[[\pN\pL\-\_\:\.]+(?<![\.\:])\]'. //[dns]
                    ')|(?:'.
                        '[\pN\pL\-\_\:\.]+(?<![\.\:])'. //dns
                    ')'.
                ')'.
            ')'.
            '|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'. //IPv4
            '|(?:'. //IPv6
                '\[?(?:(?:(?:[0-9A-Fa-f]{1,4}:){7}(?:(?:[0-9A-Fa-f]{1,4})|:))|(?:(?:[0-9A-Fa-f]{1,4}:){6}(?::|(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})|(?::[0-9A-Fa-f]{1,4})))|(?:(?:[0-9A-Fa-f]{1,4}:){5}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){4}(?::[0-9A-Fa-f]{1,4}){0,1}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){3}(?::[0-9A-Fa-f]{1,4}){0,2}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){2}(?::[0-9A-Fa-f]{1,4}){0,3}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:)(?::[0-9A-Fa-f]{1,4}){0,4}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?::(?::[0-9A-Fa-f]{1,4}){0,5}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})))\]?(?<!:)'.
            ')|(?:'. //DNS
                '(?:[\pN\pL\-\_\+\%\~]+(?:\:[\pN\pL\-\_\+\%\~]+)?\@)?'. //user:pass@
                '[\pN\pL\-\_]+(?:\.[\pN\pL\-\_]+)*\.'.
                //tld list from http://data.iana.org/TLD/tlds-alpha-by-domain.txt, also added local, loc, and onion
                '(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN--0ZWM56D|测试|XN--11B5BS3A9AJ6G|परीक्षा|XN--80AKHBYKNJ4F|испытание|XN--9T4B11YI5A|테스트|XN--DEBA0AD|טעסט|XN--G6W251D|測試|XN--HGBK6AJ7F53BBA|آزمایشی|XN--HLCJ6AYA9ESC7A|பரிட்சை|XN--JXALPDLP|δοκιμή|XN--KGBECHTV|إختبار|XN--ZCKZAH|テスト|YE|YT|YU|ZA|ZM|ZW|local|loc|onion)'.
            ')(?![\pN\pL\-\_])'.
        ')'.
        '(?:'.
            '(?:\:\d+)?'. //:port
            '(?:/[\pN\pL$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"@]*)?'. // /path
            '(?:\?[\pN\pL\$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"@\/]*)?'. // ?query string
            '(?:\#[\pN\pL$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"\@/\?\#]*)?'. // #fragment
        ')(?<![\?\.\,\#\,])'.
    ')'.
    '#ixu';
 
If that doesn't work, it might be a problem with some of the characters it uses. In that case, either have a look in the status.net source code (get it from their website, the file is \lib\util.php), or email me via the forum and I'll send back a copy of the code as an attachment :)

Re: Php link exporter

Posted: Mon Nov 23, 2009 5:25 pm
by The_L
And on the end whats wrong with my code?? :/

Re: Php link exporter

Posted: Mon Nov 23, 2009 5:26 pm
by iankent
The_L wrote:And on the end whats wrong with my code?? :/
What do you mean? This error?

Code: Select all

Warning: Invalid argument supplied for foreach() in /*path*/test/tag-example.php on line 21
If so,
iankent wrote: The second error is because of the first one, I think :P It's trying to loop over an array that wasn't created.
so once the parse_array function is returning an array correctly (i.e., the regex is working), line 21 will start working again
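While the regex is still being sorted out, the foreach warning can be avoided by checking the value before looping. A small sketch, where $linkarray stands in for whatever parse_array() handed back (here deliberately set to null to mimic the failure):

```php
<?php
// Hypothetical stand-in for a failed parse_array() call.
$linkarray = null;

// Fall back to an empty array so foreach always gets something it can loop over.
$links = is_array($linkarray) ? $linkarray : array();

$printed = 0;
foreach ($links as $link) {
    echo $link, "\n";
    $printed++;
}
```

This only hides the symptom; the real fix is still getting the regex to run.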

Re: Php link exporter

Posted: Mon Nov 23, 2009 5:44 pm
by The_L
So how can i fix that?

Re: Php link exporter

Posted: Mon Nov 23, 2009 5:49 pm
by iankent
The_L wrote:So how can i fix that?
you need to make the regex work, i.e.
iankent wrote:once the parse_array function is returning an array correctly (i.e., the regex is working), line 21 will start working again
please read my replies carefully :)

The error you're getting with preg_match_all() is because the regex hasn't copied and pasted correctly into your code. Try again using the last version I posted to see if it works.

Re: Php link exporter

Posted: Mon Nov 23, 2009 5:59 pm
by The_L
I still get errors...maybe i'm doing it wrong can you paste whole code here?

Re: Php link exporter

Posted: Mon Nov 23, 2009 6:00 pm
by iankent
The_L wrote:I still get errors...maybe i'm doing it wrong can you paste whole code here?
What errors are you getting? Is it the preg_match_all() error you had earlier?

edit: I have already posted the code you need along with instructions as to which lines you need to replace or change.

Re: Php link exporter

Posted: Mon Nov 23, 2009 6:06 pm
by The_L
You can take a look here...

Re: Php link exporter

Posted: Mon Nov 23, 2009 6:16 pm
by iankent
The_L wrote:You can take a look here...
I assume line 42 is the preg_match_all line? If so, it's a problem with the value of $regex. If you've copied and pasted the parse_array() function from my example and copied and pasted the $regex value from my earlier post, then it should work.

If it doesn't... it's because the string is being corrupted by the forum for some reason, so:
  • Check the status.net source code yourself. You can get it at http://www.status.net/. The file you need is util.php. Copy and paste the $regex line from there yourself. It's easy to spot, it's the only regex that big.
  • Or, send me a PM with your e-mail address or send an email via the forum. I'll e-mail you an attachment containing the $regex string that you need.
I won't be here much longer tonight, but if you can't get it working I'll have another look tomorrow.

Re: Php link exporter

Posted: Tue Nov 24, 2009 6:07 pm
by The_L
Great, I managed to do it right. Only one problem... in the source code of the page the link is there, but as linked text (like "google"), not as a plain URL (like http://google.com):

Code: Select all

 
<a href="http://site.com/blalbalbalblbla.html" class="bbc_link new_win" target="_blank">some text</a><br />
 
When i insert this code:

Code: Select all

 
$stag="<a href=\"http://site.com";
$etag="</a>";
 
The PHP script exports the link text... I mean, not the URL -.-

And when i insert:

Code: Select all

 
$stag="<a href=\"http://site.com";
$etag="\" class=\"bbc_link new_win\" target=\"_blank\">";
 
I get empty space as result...
Why that? Can't i extract real url??? ("http://site.com/blalbalbalblbla.html") not text...
How can i do that???
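For reference, one way to get at the href value itself is to capture whatever sits between href=" and the closing quote. This is only a sketch against the sample line above; the pattern is an assumption written for that markup and is not part of taggrab.class.php.

```php
<?php
$html = '<a href="http://site.com/blalbalbalblbla.html" class="bbc_link new_win" target="_blank">some text</a><br />';

// Capture whatever sits between href=" and the next double quote.
preg_match_all('#<a\s[^>]*href="([^"]*)"#i', $html, $matches);

// Index 1 holds the captured URLs, without the surrounding tag.
$hrefs = $matches[1];
print_r($hrefs);
```

It expects double-quoted href attributes, so single-quoted or unquoted ones would need a looser pattern.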