Php link exporter

iankent
Forum Contributor
Posts: 333
Joined: Mon Nov 16, 2009 4:23 pm
Location: Wales, United Kingdom

Re: Php link exporter

Post by iankent »

You don't need the $stag and $etag bits now. The regex works out which parts are links and which aren't, so you can completely ignore the fact that it's in HTML.

You need to run the regex against the page's HTML source, which will return all of the URLs in the HTML that follow HTTP standards. In your parse_array you should be able to do something like this:

Code:

preg_match_all($regex, $this->html, $matching_data);
return $matching_data[1]; // index 1 = first capture group, i.e. the URLs without the leading separator
 
where $regex is the one from my earlier post

I think that should return an array of URLs, though I haven't tested it. Let me know if you have any problems with it!
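For reference, preg_match_all() fills its third argument with one sub-array per capture group rather than a flat list of matches. The sketch below uses a much shorter, hypothetical stand-in pattern (not the status.net regex) and made-up sample HTML to show where the URLs end up; it is only an illustration of the return structure.

```php
<?php
// Hypothetical stand-in pattern: a leading separator (not captured)
// followed by a captured URL. The real status.net regex is far longer.
$regex = '#(?:^|[\s"\']+)(https?://[^\s"\'<>]+)#i';

// Made-up sample HTML for illustration only.
$html = '<a href="http://example.com/a.html">one</a> <a href="http://example.org/">two</a>';

$count = preg_match_all($regex, $html, $matching_data);

// $matching_data[0] holds the full matches, including the leading separator.
// $matching_data[1] holds only what the first capture group matched.
$urls = $matching_data[1];

print_r($urls);
```

Looping over $matching_data itself would iterate over the sub-arrays, so a caller that expects plain URL strings would want index 1 here.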
The_L
Forum Commoner
Posts: 64
Joined: Sun Nov 22, 2009 6:53 pm


Post by The_L »

OMG, I'm so confused now... are you talking about the $possibilities code???? O.0

Post by iankent »

The_L wrote:OMG, I'm so confused now... are you talking about the $possibilities code???? O.0
You don't need to use the array of possibilities. I suggested that when I thought you were looking for Google/YouTube links, and even then only to point out that it's a bad idea lol.

To achieve what you want, personally I'd forget about the tag matching and use the regex above to do some URL matching instead. You can then filter the results for unwanted URLs. Far quicker, more accurate, and it does what you want with little fuss. You don't need to understand the complicated regex (I certainly don't lol), but it does the job!
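As a rough sketch of the "filter the results for unwanted URLs" step, something like the following could work. The sample URLs, the $unwanted_hosts list and the filter_urls() helper are all hypothetical and not part of taggrab.class.php; the only built-ins used are parse_url(), array_filter(), in_array() and array_values().

```php
<?php
// Hypothetical sample of what the URL matching step might return.
$urls = [
    'http://site.com/page.html',
    'http://google.com/search?q=php',
    'http://www.youtube.com/watch?v=abc',
    'http://site.com/other.html',
];

// Hypothetical list of hosts to drop.
$unwanted_hosts = ['google.com', 'www.youtube.com'];

// Hypothetical helper: keeps only URLs whose host is not on the unwanted list.
function filter_urls(array $urls, array $unwanted_hosts)
{
    $kept = array_filter($urls, function ($url) use ($unwanted_hosts) {
        $host = parse_url($url, PHP_URL_HOST);
        // Keep the URL only if it has a host and that host is not unwanted.
        return is_string($host) && !in_array(strtolower($host), $unwanted_hosts, true);
    });
    return array_values($kept); // re-index from 0
}

$filtered = filter_urls($urls, $unwanted_hosts);
print_r($filtered);
```

Matching on the exact host keeps the example short; a real filter might also want to handle subdomains.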

To use the regex, in tag-example.php you can drop lines 9-11, and change line 20 to:

Code:

$linkarray = $tspider->parse_array();
You'd then change your parse_array() function in taggrab.class.php to:

Code:

function parse_array() // takes the grabbed HTML and picks out the pieces we want
{
    $regex = "get this from my earlier post - too long to include again";
    preg_match_all($regex, $this->html, $matching_data); // match every URL in the HTML
    return $matching_data[1]; // index 1 = first capture group, i.e. the URLs themselves
}
 
When you then run the function it should return an array containing all of the links in the HTML source, so the rest of your tag-example.php page should work as you expect

hth

Post by The_L »

Looks cool at first sight... but I got a few errors...

Code:

Warning: preg_match_all() [function.preg-match-all]: Delimiter must not be alphanumeric or backslash in /*path*/test/taggrab.class.php on line 42
and

Code:

Warning: Invalid argument supplied for foreach() in /*path*/test/tag-example.php on line 21

:/

Post by iankent »

The second error is because of the first one, I think :P It's trying to loop over an array that wasn't created. The problem seems to be with the regex. I'll try it out now and see what happens.
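For context on the first warning: PCRE patterns in PHP must start with a delimiter character that is not a letter, a digit or a backslash, so a pattern string whose first character is alphanumeric is rejected and preg_match_all() returns false. The sketch below is an assumption about how one might guard against that; find_matches() is a hypothetical helper, not part of taggrab.class.php, and the patterns are short stand-ins.

```php
<?php
// Hypothetical helper, not part of taggrab.class.php.
function find_matches($regex, $html)
{
    // @ hides the warning so the failure can be handled here instead.
    $result = @preg_match_all($regex, $html, $matching_data);
    if ($result === false) {
        // The pattern could not be used; return an empty array so foreach still works.
        return [];
    }
    return $matching_data[0];
}

$html = 'see http://example.com/ for details';

// Starts with the letter "h", so "h" is taken as the delimiter and rejected.
$bad = find_matches('https?://\S+', $html);

// Same pattern wrapped in "#" delimiters.
$good = find_matches('#https?://\S+#', $html);

foreach ($good as $url) {
    echo $url, "\n";
}
```

Returning an empty array on failure means a later foreach simply does nothing instead of raising the second warning.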

Post by iankent »

It may not have copied properly the first time - I tried copying and pasting the posted one into my editor and it doesn't work, so here it is again straight from the status.net source:

Code:

 
  $regex = '#'.
    '(?:^|[\s\(\)\[\]\{\}\\\'\\\";]+)(?![\@\!\#])'.
    '('.
        '(?:'.
            '(?:'. //Known protocols
                '(?:'.
                    '(?:(?:https?|ftps?|mms|rtsp|gopher|news|nntp|telnet|wais|file|prospero|webcal|irc)://)'.
                    '|'.
                    '(?:(?:mailto|aim|tel|xmpp):)'.
                ')'.
                '(?:[\pN\pL\-\_\+\%\~]+(?::[\pN\pL\-\_\+\%\~]+)?\@)?'. //user:pass@
                '(?:'.
                    '(?:'.
                        '\[[\pN\pL\-\_\:\.]+(?<![\.\:])\]'. //[dns]
                    ')|(?:'.
                        '[\pN\pL\-\_\:\.]+(?<![\.\:])'. //dns
                    ')'.
                ')'.
            ')'.
            '|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'. //IPv4
            '|(?:'. //IPv6
                '\[?(?:(?:(?:[0-9A-Fa-f]{1,4}:){7}(?:(?:[0-9A-Fa-f]{1,4})|:))|(?:(?:[0-9A-Fa-f]{1,4}:){6}(?::|(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})|(?::[0-9A-Fa-f]{1,4})))|(?:(?:[0-9A-Fa-f]{1,4}:){5}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){4}(?::[0-9A-Fa-f]{1,4}){0,1}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){3}(?::[0-9A-Fa-f]{1,4}){0,2}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){2}(?::[0-9A-Fa-f]{1,4}){0,3}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:)(?::[0-9A-Fa-f]{1,4}){0,4}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?::(?::[0-9A-Fa-f]{1,4}){0,5}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})))\]?(?<!:)'.
            ')|(?:'. //DNS
                '(?:[\pN\pL\-\_\+\%\~]+(?:\:[\pN\pL\-\_\+\%\~]+)?\@)?'. //user:pass@
                '[\pN\pL\-\_]+(?:\.[\pN\pL\-\_]+)*\.'.
                //tld list from http://data.iana.org/TLD/tlds-alpha-by-domain.txt, also added local, loc, and onion
                '(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN--0ZWM56D|测试|XN--11B5BS3A9AJ6G|परीक्षा|XN--80AKHBYKNJ4F|испытание|XN--9T4B11YI5A|테스트|XN--DEBA0AD|טעסט|XN--G6W251D|測試|XN--HGBK6AJ7F53BBA|آزمایشی|XN--HLCJ6AYA9ESC7A|பரிட்சை|XN--JXALPDLP|δοκιμή|XN--KGBECHTV|إختبار|XN--ZCKZAH|テスト|YE|YT|YU|ZA|ZM|ZW|local|loc|onion)'.
            ')(?![\pN\pL\-\_])'.
        ')'.
        '(?:'.
            '(?:\:\d+)?'. //:port
            '(?:/[\pN\pL$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"@]*)?'. // /path
            '(?:\?[\pN\pL\$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"@\/]*)?'. // ?query string
            '(?:\#[\pN\pL$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"\@/\?\#]*)?'. // #fragment
        ')(?<![\?\.\,\#\,])'.
    ')'.
    '#ixu';
 
If that doesn't work, it might be a problem with some of the characters it uses. In that case, either have a look in the status.net source code (get it from their website; the file is \lib\util.php), or email me via the forum and I'll send back a copy of the code as an attachment :)

Post by The_L »

And in the end, what's wrong with my code?? :/

Post by iankent »

The_L wrote:And in the end, what's wrong with my code?? :/
What do you mean? This error?

Code:

Warning: Invalid argument supplied for foreach() in /*path*/test/tag-example.php on line 21
If so,
iankent wrote: The second error is because of the first one, I think :P It's trying to loop over an array that wasn't created.
So once the parse_array function is returning an array correctly (i.e., the regex is working), line 21 will start working again.

Post by The_L »

So how can I fix that?

Post by iankent »

The_L wrote:So how can I fix that?
You need to make the regex work, i.e.
iankent wrote:once the parse_array function is returning an array correctly (i.e., the regex is working), line 21 will start working again
Please read my replies carefully :)

The error you're getting with preg_match_all() is because the regex hasn't copied and pasted correctly into your code. Try again using the last version I posted to see if it works.

Post by The_L »

I still get errors... maybe I'm doing it wrong. Can you paste the whole code here?

Post by iankent »

The_L wrote:I still get errors... maybe I'm doing it wrong. Can you paste the whole code here?
What errors are you getting? Is it the preg_match_all() error you had earlier?

edit: I have already posted the code you need along with instructions as to which lines you need to replace or change.

Post by The_L »

You can take a look here...

Post by iankent »

The_L wrote:You can take a look here...
I assume line 42 is the preg_match_all line? If so, it's a problem with the value of $regex. If you've copied and pasted the parse_array() function from my example and copied and pasted the $regex value from my earlier post, then it should work.

If it doesn't... it's because the string is being corrupted by the forum for some reason, so:
  • Check the status.net source code yourself. You can get it at http://www.status.net/. The file you need is util.php. Copy and paste the $regex line from there yourself. It's easy to spot: it's the only regex that big.
  • Or, send me a PM with your e-mail address or send an email via the forum. I'll e-mail you an attachment containing the $regex string that you need.
I won't be here much longer tonight, but if you can't get it working I'll have another look tomorrow.

Post by The_L »

Great, I managed to do it right. Only one problem... in the source code of the page there is a link, but in the form of linked text (google), not a plain URL (http://google.com):

Code:

 
<a href="http://site.com/blalbalbalblbla.html" class="bbc_link new_win" target="_blank">some text</a><br />
 
When I insert this code:

Code:

 
$stag="<a href=\"http://site.com";
$etag="</a>";
 
The PHP script exports the hyperlinked text... I mean, not the URL -.-

And when I insert:

Code:

 
$stag="<a href=\"http://site.com";
$etag="\" class=\"bbc_link new_win\" target=\"_blank\">";
 
I get an empty space as the result...
Why is that? Can't I extract the real URL ("http://site.com/blalbalbalblbla.html") and not the text?
How can I do that?