
Regex help..

Posted: Sun Nov 29, 2009 2:35 pm
by The_L
Hello, I want to ask something about regex here. iankent and I made this script that extracts rapidshare links from forum pages. It works very well when the URL looks like
http://rapidshare.com/doesnotmatter.rar.html
but when it comes to
http://rapidshare.com/doesnotmatter.rar (something other than html)
it won't print the result.
Here is the script and here is the downloadable source code (PHP file).

Can anyone make it work for .rar, .zip and some other types of link?

Thanks :P

Re: Regex help..

Posted: Sun Nov 29, 2009 3:35 pm
by ridgerunner
I'd take a look at it but you need to provide it in something other than a .rar file for me to see it.

Can't you just post the script here in a CODE tag?

Re: Regex help..

Posted: Sun Nov 29, 2009 3:42 pm
by The_L
Lol, the whole script? Well, OK :D

Code: Select all

 
<?php
// begin of smf elit 2  
require_once('forum/SSI.php');
unset($context['linktree']);
$context['menu_buttons'] = array(
    'home' => array(
        'title' => 'Home',
        'href' => $scripturl,
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => true,
    ),
    'forum' => array(
        'title' => 'Forum',
        'href' => 'http://www.forum-racunara.com/forum/',
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => false,
    ),
    'download' => array(
        'title' => 'Download',
        'href' => 'http://link2.com/otherpage.php',
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => false,
    ),
    'ihosting' => array(
        'title' => 'I-Hosting',
        'href' => 'http://link2.com/otherpage.php',
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => false,
    ),
    'tools' => array(
        'title' => 'Tools',
        'href' => 'http://link2.com/otherpage.php',
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => false,
    ),
    'chat' => array(
        'title' => 'Chat',
        'href' => 'http://link2.com/otherpage.php',
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => false,
    ),
    'kontakt' => array(
        'title' => 'Kontakt',
        'href' => 'http://link3.com/otherpage.php',
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => false,
        'is_last' => true,
    ),
);
template_header();
// end of smf elit 2
 
 
//Make a title spider
$tspider = new tagSpider();
 
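// input box: show the form until a URL has been submitted
$urlrun = isset($_POST['urlrun']) ? trim($_POST['urlrun']) : ''; // avoids an undefined index notice on the first page load
if ($urlrun === '') die("<form method='post'><input type='text' name='urlrun'> <input type='submit'></form>");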
 
//Pass URL to the fetch page function
$tspider->fetchPage($urlrun);
 
// Enter the tags into the parse array function
$linkarray = $tspider->parse_array();
print_r($linkarray);
 
echo "<h2>Links present on page: ".htmlspecialchars($urlrun)."</h2><br />"; // escape the user-supplied URL before printing it
echo "<table align=\"center\" bordercolor=\"#006600\" bgcolor=\"#CCCCCC\">";
$search = array('http://rapidshare.com','hotfile.com');
foreach ($linkarray as $link) {
    foreach($search as $search_term) { // this line had gone
        if(strpos($link, $search_term) !== false) { // strpos() returns false when nothing is found; test $search_term, not $search
            $link = rtrim($link, "\"");
            echo "
               <tr>
               <td>$link</td>
               </tr>";
        }
    } // this line had gone
}
echo "</table>";
exit;
 
class tagSpider
{
 
var $ch; // this will hold our curl instance (the methods below use $this->ch)
var $html; // this is where we dump the html we get
var $binary; // set for binary type transfer
var $url; // this is the url we are going to do a pass on
 
 
 
function __construct() // a method named after the class is no longer run as the constructor in PHP 8
{
    $this->html = "";
    $this->binary = 0;
    $this->url = "";
}
 
 
function fetchPage($url)
{
 
 
    $this->url = $url;
    if (isset($this->url)) {
 
        
        $this->ch = curl_init (); // start cURL instance
        curl_setopt ($this->ch, CURLOPT_RETURNTRANSFER, 1); // this tells cUrl to return the data
        curl_setopt ($this->ch, CURLOPT_URL, $this->url); // set the url to download
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
        curl_setopt($this->ch, CURLOPT_BINARYTRANSFER, $this->binary); // tell cURL if the data is binary data or not
        $this->html = curl_exec($this->ch); // grabs the webpage from the internet
        curl_close ($this->ch); // closes the connection
        
        
        // $this->html = file_get_contents($url);
    }
}
 
 
function parse_array() // this function takes the grabbed html and picks out the pieces we want
{
    $regex = '#'.
    '(?:^|[\s\(\)\[\]\{\}\\\'\\\";]+)(?![\@\!\#])'.
    '('.
        '(?:'.
            '(?:'. //Known protocols
                '(?:'.
                    '(?:(?:https?|ftps?|mms|rtsp|gopher|news|nntp|telnet|wais|file|prospero|webcal|irc)://)'.
                    '|'.
                    '(?:(?:mailto|aim|tel|xmpp):)'.
                ')'.
                '(?:[\pN\pL\-\_\+\%\~]+(?::[\pN\pL\-\_\+\%\~]+)?\@)?'. //user:pass@
                '(?:'.
                    '(?:'.
                        '\[[\pN\pL\-\_\:\.]+(?<![\.\:])\]'. //[dns]
                    ')|(?:'.
                        '[\pN\pL\-\_\:\.]+(?<![\.\:])'. //dns
                    ')'.
                ')'.
            ')'.
            '|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'. //IPv4
            '|(?:'. //IPv6
                '\[?(?:(?:(?:[0-9A-Fa-f]{1,4}:){7}(?:(?:[0-9A-Fa-f]{1,4})|:))|(?:(?:[0-9A-Fa-f]{1,4}:){6}(?::|(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})|(?::[0-9A-Fa-f]{1,4})))|(?:(?:[0-9A-Fa-f]{1,4}:){5}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){4}(?::[0-9A-Fa-f]{1,4}){0,1}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){3}(?::[0-9A-Fa-f]{1,4}){0,2}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){2}(?::[0-9A-Fa-f]{1,4}){0,3}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:)(?::[0-9A-Fa-f]{1,4}){0,4}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?::(?::[0-9A-Fa-f]{1,4}){0,5}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})))\]?(?<!:)'.
            ')|(?:'. //DNS
                '(?:[\pN\pL\-\_\+\%\~]+(?:\:[\pN\pL\-\_\+\%\~]+)?\@)?'. //user:pass@
                '[\pN\pL\-\_]+(?:\.[\pN\pL\-\_]+)*\.'.
                //tld list from http://data.iana.org/TLD/tlds-alpha-by-domain.txt, also added local, loc, and onion
                '(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN--0ZWM56D|测试|XN--11B5BS3A9AJ6G|परीक्षा|XN--80AKHBYKNJ4F|испытание|XN--9T4B11YI5A|테스트|XN--DEBA0AD|טעסט|XN--G6W251D|測試|XN--HGBK6AJ7F53BBA|آزمایشی|XN--HLCJ6AYA9ESC7A|பரிட்சை|XN--JXALPDLP|δοκιμή|XN--KGBECHTV|إختبار|XN--ZCKZAH|テスト|YE|YT|YU|ZA|ZM|ZW|local|loc|onion)'.
            ')(?![\pN\pL\-\_])'.
        ')'.
        '(?:'.
            '(?:\:\d+)?'. //:port
            '(?:/[\pN\pL$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"@]*)?'. // /path
            '(?:\?[\pN\pL\$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"@\/]*)?'. // ?query string
            '(?:\#[\pN\pL$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"\@/\?\#]*)?'. // #fragment
        ')(?<![\?\.\,\#\,])'.
    ')'.
    '#ixu';
    
    preg_match_all($regex, $this->html, $matching_data); // collect every URL-like string found in the fetched html
    
    return $matching_data[1];
}
}
 
Keep in mind that the first 65 lines are just something I added for the SMF stuff :D

Re: Regex help..

Posted: Sun Nov 29, 2009 4:00 pm
by iankent
Also a reminder that the regex doesn't seem to work if copied and pasted from the forum, so if you want to test the code you'll need to download the .rar, unfortunately.

Re: Regex help..

Posted: Sun Nov 29, 2009 11:09 pm
by ridgerunner
<off-topic>
iankent wrote:... the regex doesn't seem to work if copy and pasted from the forum ...
Yes, very strange. It appears that when you post text into a CODE tag of type TEXT, any escaped single quotes are stripped of their escape (backslash). i.e. if you post: " \'single quoted\' " it is converted to " 'single quoted' ". Here is a test post:

Code: Select all

'No escape'
\'One escape\'
\\'Two escapes\\'
\\\'Three escapes\\\'
\\\\'Four escapes\\\\'
 
I would consider this to be a pretty serious forum BUG. Any text within a CODE=TEXT tag should not be modified!
</off-topic>
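
For anyone testing a pasted copy, one quick check is to print what PHP actually stores for a single-quoted piece of the pattern. This is only a sketch: the variable name $intact is made up for illustration, and the fragment is a shortened version of the leading character class from the regex in the script above.

```php
<?php
// In a single-quoted PHP string only \\ and \' are escape sequences;
// every other backslash is kept literally.
$intact = '[\s\\\'\\\";]+';   // escapes as written in the source file

// Prints the 11 characters PHP really stores: [\s\'\\";]+
echo $intact, "\n";

// The class should match a single quote and a double quote.
var_dump(preg_match('#' . $intact . '#', "'"));
var_dump(preg_match('#' . $intact . '#', '"'));
```

If the backslash directly in front of the single quote goes missing from this fragment, the string literal ends early and PHP stops with a parse error before the regex is ever used, which would fit the "doesn't work when pasted" symptom.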

Re: Regex help..

Posted: Thu Dec 03, 2009 1:08 pm
by The_L
Can anyone check this?? :/

Re: Regex help..

Posted: Thu Dec 03, 2009 5:24 pm
by ridgerunner
The_L wrote:Hello, I want to ask something about regex here. iankent and I made this script that extracts rapidshare links from forum pages. It works very well when the URL looks like
http://rapidshare.com/doesnotmatter.rar.html
but when it comes to
http://rapidshare.com/doesnotmatter.rar (something other than html)
it won't print the result.
Here is the script and here is the downloadable source code (PHP file).

Can anyone make it work for .rar, .zip and some other types of link?

Thanks :P

The script is rather long and I don't have time to figure out everything it does. I did, however, take a look at the one (long) regex in there, and it does in fact match both URLs: http://rapidshare.com/doesnotmatter.rar.html and http://rapidshare.com/doesnotmatter.rar.

Not sure what else to say about why one prints and the other does not.
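
One way to narrow this down is to run the matching step and the filtering step separately on a tiny hand-written page. This is a sketch only: the sample text and the function name filterHosts() are invented here, and the pattern is a much simpler stand-in for the long regex, not a replacement for it.

```php
<?php
// Hypothetical helper: keep only the links that contain one of the host strings.
function filterHosts(array $links, array $hosts)
{
    $kept = array();
    foreach ($links as $link) {
        foreach ($hosts as $host) {
            if (strpos($link, $host) !== false) {
                $kept[] = rtrim($link, "\"");
                break; // do not list the same link twice
            }
        }
    }
    return $kept;
}

// Made-up sample text containing both URL shapes from the question.
$html = 'a http://rapidshare.com/doesnotmatter.rar.html b '
      . 'c http://rapidshare.com/doesnotmatter.rar d '
      . 'e http://example.com/other.zip f';

// Simplified stand-in pattern: a scheme followed by anything up to
// whitespace, a quote or an angle bracket.
preg_match_all('#https?://[^\s"\'<>]+#i', $html, $m);

$found = filterHosts($m[0], array('http://rapidshare.com', 'hotfile.com'));
print_r($found);
```

If both rapidshare links show up here but not in the real run, the difference is more likely in the fetched HTML than in the file extension. For example, a link written with a www. prefix does not contain the search string 'http://rapidshare.com', so the strpos() filter would skip it.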

Re: Regex help..

Posted: Fri Dec 04, 2009 4:29 am
by The_L
:..(((
Can anyone give me a similar regex that would work? :/
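
A much shorter pattern may be enough if only a couple of hosts matter. The following is a sketch under assumptions, not a drop-in replacement for the regex in the script: the function name extractHostLinks() is hypothetical, the host names come from the $search array, the optional www. prefix is a guess about how such links appear in real pages, and the anonymous function needs PHP 5.3 or later.

```php
<?php
// Hypothetical helper: pull out links to the given hosts, whatever the file extension.
function extractHostLinks($html, array $hosts)
{
    // Escape each host for use inside the pattern, then join them as alternatives.
    $alternatives = implode('|', array_map(function ($h) {
        return preg_quote($h, '#');
    }, $hosts));

    // Scheme, optional www., one of the hosts, then a path that stops at
    // whitespace, quotes or angle brackets.
    $pattern = '#https?://(?:www\.)?(?:' . $alternatives . ')/[^\s"\'<>]*#i';

    preg_match_all($pattern, $html, $matches);
    return array_values(array_unique($matches[0]));
}

// Made-up sample markup for illustration.
$html = '<a href="http://rapidshare.com/files/1/a.rar">a</a> '
      . '<a href="http://www.hotfile.com/dl/2/b.zip.html">b</a> '
      . '<a href="http://example.com/c.rar">c</a>';

print_r(extractHostLinks($html, array('rapidshare.com', 'hotfile.com')));
```

Because the path stops at a quote character, links taken from href attributes come back without the trailing quote, so the rtrim() step from the script is not needed with this pattern.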