Regex help..

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

The_L
Forum Commoner
Posts: 64
Joined: Sun Nov 22, 2009 6:53 pm

Regex help..

Post by The_L »

Hello, I want to ask something about regex here... iankent and I made this script that exports rapidshare links from forum pages. It works very well when the URL looks like
http://rapidshare.com/doesnotmatter.rar.html
but when it comes to
http://rapidshare.com/doesnotmatter.rar (something other than html)
it won't print the result...
Here is the script and here is the downloadable source code... (php file)

Can anyone make it work for .rar, .zip and some other types of link?

Thanks.. :P
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Regex help..

Post by ridgerunner »

I'd take a look at it, but you need to provide it in something other than a .rar file for me to see it.

Can't you just post the script here in a CODE tag?
The_L
Forum Commoner
Posts: 64
Joined: Sun Nov 22, 2009 6:53 pm

Re: Regex help..

Post by The_L »

Lol, the whole script??? Well... okay :D

Code:

 
<?php
// begin of smf elit 2  
require_once('forum/SSI.php');
unset($context['linktree']);
$context['menu_buttons'] = array(
    'home' => array(
        'title' => 'Home',
        'href' => $scripturl,
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => true,
    ),
    'forum' => array(
        'title' => 'Forum',
        'href' => 'http://www.forum-racunara.com/forum/',
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => false,
    ),
    'download' => array(
        'title' => 'Download',
        'href' => 'http://link2.com/otherpage.php',
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => false,
    ),
    'ihosting' => array(
        'title' => 'I-Hosting',
        'href' => 'http://link2.com/otherpage.php',
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => false,
    ),
    'tools' => array(
        'title' => 'Tools',
        'href' => 'http://link2.com/otherpage.php',
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => false,
    ),
    'chat' => array(
        'title' => 'Chat',
        'href' => 'http://link2.com/otherpage.php',
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => false,
    ),
    'kontakt' => array(
        'title' => 'Kontakt',
        'href' => 'http://link3.com/otherpage.php',
        'show' => true,
        'sub_buttons' => array(
        ),
        'active_button' => false,
        'is_last' => true,
    ),
);
template_header();
// end of smf elit 2
 
 
//Make a title spider
$tspider = new tagSpider();
 
// input box
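$urlrun = isset($_POST['urlrun']) ? trim($_POST['urlrun']) : ''; // avoid an undefined-index notice on the first page load
if ($urlrun === '') die("<form method='post'><input type='text' name='urlrun'> <input type='submit'></form>"); 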
 
//Pass URL to the fetch page function
$tspider->fetchPage($urlrun);
 
// Enter the tags into the parse array function
$linkarray = $tspider->parse_array();
print_r($linkarray);
 
echo "<h2>Links present on page: ".htmlspecialchars($urlrun)."</h2><br />"; // escape the user-supplied URL before printing it
echo "<table align=\"center\" bordercolor=\"#006600\" bgcolor=\"#CCCCCC\">";
$search = array('http://rapidshare.com','hotfile.com');
foreach ($linkarray as $link) {
    foreach($search as $search_term) { // this line had gone
        if (strpos($link, $search_term) !== false) { // use $search_term, not $search; strpos() returns false when the term is absent
            $link = rtrim($link, "\"");
            echo "
               <tr>
               <td>$link</td>
               </tr>";
        }
    } // this line had gone
}
echo "</table>";
exit;
 
class tagSpider
{
 
var $ch; // this will hold our curl instance (fetchPage() stores the handle in $this->ch)
var $html; // this is where we dump the html we get
var $binary; // set for binary type transfer
var $url; // this is the url we are going to do a pass on
 
 
 
function __construct() // PHP 5+ constructor; a method named after the class is not treated as a constructor in PHP 8
{
    $this->html = "";
    $this->binary = 0;
    $this->url = "";
}
 
 
function fetchPage($url)
{
 
 
    $this->url = $url;
    if (isset($this->url)) {
 
        
        $this->ch = curl_init (); // start cURL instance
        curl_setopt ($this->ch, CURLOPT_RETURNTRANSFER, 1); // this tells cUrl to return the data
        curl_setopt ($this->ch, CURLOPT_URL, $this->url); // set the url to download
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
        curl_setopt($this->ch, CURLOPT_BINARYTRANSFER, $this->binary); // tell cURL if the data is binary data or not
        $this->html = curl_exec($this->ch); // grabs the webpage from the internet
        curl_close ($this->ch); // closes the connection
        
        
        // $this->html = file_get_contents($url);
    }
}
 
 
function parse_array() // this function takes the grabbed html and picks out the pieces we want
{
    $regex = '#'.
    '(?:^|[\s\(\)\[\]\{\}\\\'\\\";]+)(?![\@\!\#])'.
    '('.
        '(?:'.
            '(?:'. //Known protocols
                '(?:'.
                    '(?:(?:https?|ftps?|mms|rtsp|gopher|news|nntp|telnet|wais|file|prospero|webcal|irc)://)'.
                    '|'.
                    '(?:(?:mailto|aim|tel|xmpp):)'.
                ')'.
                '(?:[\pN\pL\-\_\+\%\~]+(?::[\pN\pL\-\_\+\%\~]+)?\@)?'. //user:pass@
                '(?:'.
                    '(?:'.
                        '\[[\pN\pL\-\_\:\.]+(?<![\.\:])\]'. //[dns]
                    ')|(?:'.
                        '[\pN\pL\-\_\:\.]+(?<![\.\:])'. //dns
                    ')'.
                ')'.
            ')'.
            '|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'. //IPv4
            '|(?:'. //IPv6
                '\[?(?:(?:(?:[0-9A-Fa-f]{1,4}:){7}(?:(?:[0-9A-Fa-f]{1,4})|:))|(?:(?:[0-9A-Fa-f]{1,4}:){6}(?::|(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})|(?::[0-9A-Fa-f]{1,4})))|(?:(?:[0-9A-Fa-f]{1,4}:){5}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){4}(?::[0-9A-Fa-f]{1,4}){0,1}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){3}(?::[0-9A-Fa-f]{1,4}){0,2}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:){2}(?::[0-9A-Fa-f]{1,4}){0,3}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:[0-9A-Fa-f]{1,4}:)(?::[0-9A-Fa-f]{1,4}){0,4}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?::(?::[0-9A-Fa-f]{1,4}){0,5}(?:(?::(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})?)|(?:(?::[0-9A-Fa-f]{1,4}){1,2})))|(?:(?:(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})(?:\.(?:25[0-5]|2[0-4]\d|[01]?\d{1,2})){3})))\]?(?<!:)'.
            ')|(?:'. //DNS
                '(?:[\pN\pL\-\_\+\%\~]+(?:\:[\pN\pL\-\_\+\%\~]+)?\@)?'. //user:pass@
                '[\pN\pL\-\_]+(?:\.[\pN\pL\-\_]+)*\.'.
                //tld list from http://data.iana.org/TLD/tlds-alpha-by-domain.txt, also added local, loc, and onion
                '(?:AC|AD|AE|AERO|AF|AG|AI|AL|AM|AN|AO|AQ|AR|ARPA|AS|ASIA|AT|AU|AW|AX|AZ|BA|BB|BD|BE|BF|BG|BH|BI|BIZ|BJ|BM|BN|BO|BR|BS|BT|BV|BW|BY|BZ|CA|CAT|CC|CD|CF|CG|CH|CI|CK|CL|CM|CN|CO|COM|COOP|CR|CU|CV|CX|CY|CZ|DE|DJ|DK|DM|DO|DZ|EC|EDU|EE|EG|ER|ES|ET|EU|FI|FJ|FK|FM|FO|FR|GA|GB|GD|GE|GF|GG|GH|GI|GL|GM|GN|GOV|GP|GQ|GR|GS|GT|GU|GW|GY|HK|HM|HN|HR|HT|HU|ID|IE|IL|IM|IN|INFO|INT|IO|IQ|IR|IS|IT|JE|JM|JO|JOBS|JP|KE|KG|KH|KI|KM|KN|KP|KR|KW|KY|KZ|LA|LB|LC|LI|LK|LR|LS|LT|LU|LV|LY|MA|MC|MD|ME|MG|MH|MIL|MK|ML|MM|MN|MO|MOBI|MP|MQ|MR|MS|MT|MU|MUSEUM|MV|MW|MX|MY|MZ|NA|NAME|NC|NE|NET|NF|NG|NI|NL|NO|NP|NR|NU|NZ|OM|ORG|PA|PE|PF|PG|PH|PK|PL|PM|PN|PR|PRO|PS|PT|PW|PY|QA|RE|RO|RS|RU|RW|SA|SB|SC|SD|SE|SG|SH|SI|SJ|SK|SL|SM|SN|SO|SR|ST|SU|SV|SY|SZ|TC|TD|TEL|TF|TG|TH|TJ|TK|TL|TM|TN|TO|TP|TR|TRAVEL|TT|TV|TW|TZ|UA|UG|UK|US|UY|UZ|VA|VC|VE|VG|VI|VN|VU|WF|WS|XN--0ZWM56D|测试|XN--11B5BS3A9AJ6G|परीक्षा|XN--80AKHBYKNJ4F|испытание|XN--9T4B11YI5A|테스트|XN--DEBA0AD|טעסט|XN--G6W251D|測試|XN--HGBK6AJ7F53BBA|آزمایشی|XN--HLCJ6AYA9ESC7A|பரிட்சை|XN--JXALPDLP|δοκιμή|XN--KGBECHTV|إختبار|XN--ZCKZAH|テスト|YE|YT|YU|ZA|ZM|ZW|local|loc|onion)'.
            ')(?![\pN\pL\-\_])'.
        ')'.
        '(?:'.
            '(?:\:\d+)?'. //:port
            '(?:/[\pN\pL$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"@]*)?'. // /path
            '(?:\?[\pN\pL\$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"@\/]*)?'. // ?query string
            '(?:\#[\pN\pL$\[\]\,\!\(\)\.\:\-\_\+\/\=\&\;\%\~\*\$\+\'\"\@/\?\#]*)?'. // #fragment
        ')(?<![\?\.\,\#\,])'.
    ')'.
    '#ixu';
    
    preg_match_all($regex, $this->html, $matching_data); // collect every URL found in the fetched html
    
    return $matching_data[1];
}
}
 
Keep in mind that the first 65 lines are just something I added for the SMF stuff :D
iankent
Forum Contributor
Posts: 333
Joined: Mon Nov 16, 2009 4:23 pm
Location: Wales, United Kingdom

Re: Regex help..

Post by iankent »

Also, a reminder that the regex doesn't seem to work if it is copied and pasted from the forum, so if you want to test the code you'll need to download the rar, unfortunately.
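
A quick way to tell whether a pasted copy of a pattern is still valid is to check what preg_match() returns, since it returns false when the pattern cannot be used. The sketch below is only an illustration: the helper name pattern_is_usable() and both sample patterns are made up, and the second pattern assumes the kind of damage where mangled characters leave a stray quantifier behind.

```php
<?php
// Hypothetical helper (not part of the original script): returns true when
// PCRE can compile and run $pattern against an empty subject.
function pattern_is_usable($pattern)
{
    // preg_match() returns false on failure and raises a warning,
    // which is silenced here with @ so only the return value matters.
    return @preg_match($pattern, '') !== false;
}

// A well-formed pattern.
var_dump(pattern_is_usable('#https?://[^\s"]+#i'));

// A damaged pattern: "??" at the start of an alternative is a quantifier
// with nothing in front of it, so compilation fails.
var_dump(pattern_is_usable('#(?:COM|??|NET)#i'));
```

Running the same check on the full regex from the script, once from the downloaded file and once from the forum copy, should show which copy is broken.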
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Regex help..

Post by ridgerunner »

<off-topic>
iankent wrote:... the regex doesn't seem to work if copy and pasted from the forum ...
Yes, very strange. It appears that when you post text into a CODE tag of type TEXT, any escaped single quotes are stripped of their escape (backslash). For example, if you post " \'single quoted\' ", it is converted to " 'single quoted' ". Here is a test post:

Code:

'No escape'
\'One escape\'
\\'Two escapes\\'
\\\'Three escapes\\\'
\\\\'Four escapes\\\\'
 
I would consider this to be a pretty serious forum BUG. Any text within a CODE=TEXT tag should not be modified!
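
To illustrate why a lost backslash matters for PHP code in particular, here is a small sketch. The variable names are made up for the example. In a single-quoted PHP string, \' is the only way to embed a single quote and \\ gives a single backslash, so removing a backslash changes where the string literal ends.

```php
<?php
// Illustration only; $intact and $three are hypothetical names.

// \' inside single quotes yields a literal single quote.
$intact = '[\'"]';            // the four characters [ ' " ]
echo $intact, "\n";
echo strlen($intact), "\n";

// \\\' is an escaped backslash followed by an escaped quote.
$three = '\\\'';              // the two characters \ '
echo $three, "\n";
echo strlen($three), "\n";

// If the backslash in '[\'"]' is stripped, the source text becomes '['"]'
// and the string literal ends right after the [ character. What follows is
// then read as PHP code rather than as part of the pattern.
```

This is why a regex that builds its pattern from single-quoted strings containing \' is likely to break when copied from a post that has lost those backslashes.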
</off-topic>
The_L
Forum Commoner
Posts: 64
Joined: Sun Nov 22, 2009 6:53 pm

Re: Regex help..

Post by The_L »

Can anyone check this?? :/
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Regex help..

Post by ridgerunner »

The_L wrote: Hello, I want to ask something about regex here... iankent and I made this script that exports rapidshare links from forum pages. It works very well when the URL looks like
http://rapidshare.com/doesnotmatter.rar.html
but when it comes to
http://rapidshare.com/doesnotmatter.rar (something other than html)
it won't print the result...
Here is the script and here is the downloadable source code... (php file)

Can anyone make it work for .rar, .zip and some other types of link?

Thanks.. :P
The script is rather long and I don't have time to figure out everything it does. I did, however, take a look at the one (long) regex in there, and it does in fact match both URLs: http://rapidshare.com/doesnotmatter.rar.html and
http://rapidshare.com/doesnotmatter.rar.

Not sure what else to say about why one prints and the other does not.
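
If the goal is only to pull out links for a couple of known hosts, a much smaller pattern may be easier to debug than the general URL regex. The following is a rough sketch, not the script's own method: the function name extract_host_links(), the sample HTML and the hotfile URL in it are made up for illustration, and it assumes the links are plain http or https URLs that end at whitespace, a quote or an angle bracket.

```php
<?php
// Sketch only: extract_host_links() and the sample HTML are hypothetical.
function extract_host_links($html, array $hosts)
{
    // Quote each host name so the dots are matched literally.
    $alternatives = array();
    foreach ($hosts as $host) {
        $alternatives[] = preg_quote($host, '#');
    }

    // Scheme, optional "www.", one of the hosts, then a path that stops at
    // whitespace, quotes or angle brackets. No file extension is assumed,
    // so .rar, .zip and .rar.html are all treated the same way.
    $regex = '#https?://(?:www\.)?(?:' . implode('|', $alternatives) . ')/[^\s"\'<>]+#i';

    if (preg_match_all($regex, $html, $matches) === false) {
        return array(); // the pattern could not be run
    }

    // Drop duplicates and reindex the result from 0.
    return array_values(array_unique($matches[0]));
}

$html = '<a href="http://rapidshare.com/doesnotmatter.rar.html">one</a> '
      . '<p>http://rapidshare.com/doesnotmatter.rar</p> '
      . '<a href="http://hotfile.com/dl/doesnotmatter.zip">two</a>';

print_r(extract_host_links($html, array('rapidshare.com', 'hotfile.com')));
```

Because the path part stops at quotes and angle brackets, no rtrim() on a trailing double quote should be needed. If this finds the .rar link on a page where the original script does not, that would point at the surrounding markup or the filtering loop rather than the file extension.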
The_L
Forum Commoner
Posts: 64
Joined: Sun Nov 22, 2009 6:53 pm

Re: Regex help..

Post by The_L »

:..(((
Can anyone give me a similar regex that would work? :/