regex from hell

stuffe
Forum Newbie
Posts: 4
Joined: Thu Nov 22, 2007 7:23 am

Post by stuffe »

Hey there. I'm writing a simple web proxy in PHP. It works by requesting a page and rewriting all the URLs in that page so they point back at the proxy script.

E.g. "www.google.com" becomes "proxy.php?http://www.google.com"
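To make the intended mapping concrete, here is a minimal sketch. The helper name `proxify_url` and the hard-coded `proxy.php` default are assumptions for illustration only and are not part of the script in this post; it also assumes that a bare host name should default to `http://`.

```php
<?php
// Hypothetical helper: prefix an absolute URL with the proxy script.
function proxify_url($url, $proxy = 'proxy.php')
{
    // Assumption: default to http:// when the URL has no scheme.
    if (!preg_match('~^[a-zA-Z][a-zA-Z0-9+.-]*://~', $url)) {
        $url = 'http://' . $url;
    }
    return $proxy . '?' . $url;
}

echo proxify_url('www.google.com'), "\n"; // proxy.php?http://www.google.com
```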

Here's my script so far:

PROXY.PHP:

<?php
// Fetch the page named in the query string, e.g. proxy.php?http://google.com
$source_data = @file_get_contents($_SERVER['QUERY_STRING'], FALSE);

$url_parsed = parse_url($_SERVER['QUERY_STRING']);

// Directory of the requested page, used as the base for relative links
$path = isset($url_parsed['path']) ? $url_parsed['path'] : '/';
$dir = ltrim(substr($path, 0, strrpos($path, '/') + 1), '/');

// Deep link (absolute URL), e.g. http://google.com
$source_data = preg_replace(
    '/(?<!type)=(["\']?)([a-zA-Z]{3,5}:\/\/[a-zA-Z0-9_.]{2,}\.[a-zA-Z]{2,5}[\/]?)\1/',
    "=$1" . $_SERVER['PHP_SELF'] . "?$2$1",
    $source_data
);

// Relative link, e.g. pic.jpg
$source_data = preg_replace(
    '/(?<!type)=(["\']?)([a-zA-Z0-9\/%_-]+\.[a-zA-Z0-9]{2,5})[\/]?\1/',
    "=$1" . $_SERVER['PHP_SELF'] . "?" . $url_parsed['scheme'] . "://" . $url_parsed['host'] . "/" . $dir . "$2$1",
    $source_data
);

// Relative path, e.g. /pics/userpics
$source_data = preg_replace(
    '/(?<!type)=(["\']?)[\/]?([a-zA-Z0-9%_-]+[\/]?)+\1/',
    "=$1" . $_SERVER['PHP_SELF'] . "?" . $url_parsed['scheme'] . "://" . $url_parsed['host'] . "/" . $dir . "$2$1",
    $source_data
);

echo $source_data;
?>
As you can see, it's pretty complicated and it's just acting strange, so I really hope someone can help.

You can try it like this: PROXY.PHP?http://google.com (remember the http://).

I have used a test page you can try it on: http://glbyvej.dk/links.htm

Btw, (?<!type) is there to make sure the regex doesn't treat <script type="text/javascript"> or <link type="text/css"> as links.
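A small sketch of what that negative lookbehind does. The pattern here is a simplified stand-in, not one of the three patterns from the script, and the sample HTML is made up for the demonstration.

```php
<?php
// Same lookbehind as in the script, applied to a simplified attribute pattern.
$pattern = '/(?<!type)=(["\']?)([^"\'\s>]+)\1/';

$html = '<script type="text/javascript" src="app.js"></script>';

preg_match_all($pattern, $html, $matches);

// Only the src value is captured. type="text/javascript" is skipped
// because its "=" is directly preceded by the letters "type".
print_r($matches[2]); // contains only "app.js"
```

Note that the lookbehind skips every attribute whose name ends in "type", which is probably fine for a proxy but worth keeping in mind.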

I really hope someone can help, thanks.
Last edited by stuffe on Thu Nov 22, 2007 8:23 am, edited 1 time in total.
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

So what's your question?
stuffe
Forum Newbie
Posts: 4
Joined: Thu Nov 22, 2007 7:23 am

Post by stuffe »

Can you find the error?
I spent a lot of time looking at it and I just can't find it.
I realize it's pretty complex code when you didn't write it yourself.
Maybe someone has a different URL-finding regex; that would be OK as well.
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

"its just acting strange"
I don't want to have to figure out what your problem is, and I doubt people will help unless you can tell us :wink:

So what isn't working? What are you expecting to happen that isn't?

Adding debugging code can also be very helpful, e.g. var_dump() your variables to see whether they contain what you expect, and include the output here.

EDIT| You are also missing a call to parse_url() to set $url_parsed. Try turning on error_reporting(E_ALL);
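A minimal sketch of that kind of debugging, assuming the test URL from this thread. The `$query` variable is a stand-in for `$_SERVER['QUERY_STRING']` so the snippet runs on its own. Also worth noting: the `@` in front of file_get_contents() suppresses errors, so removing it while debugging may reveal more.

```php
<?php
// Report every notice and warning while debugging.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Stand-in for $_SERVER['QUERY_STRING'] (assumption for this sketch).
$query = 'http://glbyvej.dk/links.htm';

// Check that parse_url() returns the parts the replacements rely on.
$url_parsed = parse_url($query);
var_dump($url_parsed); // expect the keys scheme, host and path
```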
Last edited by John Cartwright on Thu Nov 22, 2007 8:16 am, edited 1 time in total.
stuffe
Forum Newbie
Posts: 4
Joined: Thu Nov 22, 2007 7:23 am

Post by stuffe »

I want "http://google.com" to be replaced by "proxy.php?http://google.com".
It's hard to explain what's wrong.
Try visiting my test page, http://glbyvej.dk/links.htm,
then save the script as proxy.php and run "http://localhost/proxy.php?http://glbyvej.dk/links.htm".
You will see that many of the links are not how they are supposed to be.
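One thing that may contribute to this, sketched under assumptions: in the "Relative path" pattern from the first post, the second capture group is repeated with `+`, and a repeated group only keeps its last iteration. The `$variant` pattern and the sample HTML below are hypothetical and only illustrate that effect; this is not a complete fix for the script.

```php
<?php
// The "Relative path" pattern from the first post, unchanged.
$original = '/(?<!type)=(["\']?)[\/]?([a-zA-Z0-9%_-]+[\/]?)+\1/';

// Hypothetical variant: an outer group captures the whole path,
// while the repeated inner group is non-capturing.
$variant = '/(?<!type)=(["\']?)[\/]?((?:[a-zA-Z0-9%_-]+[\/]?)+)\1/';

$html = '<a href="/pics/userpics">';

preg_match($original, $html, $m1);
preg_match($variant, $html, $m2);

echo $m1[2], "\n"; // userpics (only the last path segment)
echo $m2[2], "\n"; // pics/userpics
```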
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

I edited my last post, in case you didn't notice.
stuffe
Forum Newbie
Posts: 4
Joined: Thu Nov 22, 2007 7:23 am

Post by stuffe »

Sorry about that. It's fixed now.
Anyway, try comparing the source code of these two pages and you will see what I mean.

http://glbyvej.dk/links.htm
http://www.glbyvej.dk/proxy.php?http:// ... /links.htm
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

There are some injection possibilities via PHP_SELF in the code. Particularly, path exposure via error creation. There's also potential for someone to request any captured data bits.
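A hedged sketch of one way to reduce the PHP_SELF risk: PHP_SELF includes any extra path info the client appends after the script name, whereas SCRIPT_NAME does not, and escaping the value before printing it into HTML helps further. The helper name `safe_self` and the simulated `$server` array are assumptions for illustration, not part of the script above.

```php
<?php
// Hypothetical helper: return a script path that is safe to echo into HTML.
function safe_self(array $server)
{
    // SCRIPT_NAME excludes extra path info appended to the request.
    $self = isset($server['SCRIPT_NAME']) ? $server['SCRIPT_NAME'] : '';
    // Escape quotes and angle brackets before the value reaches the page.
    return htmlspecialchars($self, ENT_QUOTES, 'UTF-8');
}

// Simulated request to /proxy.php/"><script>alert(1)</script>
$server = array(
    'PHP_SELF'    => '/proxy.php/"><script>alert(1)</script>',
    'SCRIPT_NAME' => '/proxy.php',
);

echo safe_self($server), "\n"; // /proxy.php
```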