Need help with reg expression matching.

Posted: Sun Jan 25, 2009 1:02 am
by flexy123
Regular expressions always give me headaches...

OK, here's the problem:

I have a string which contains a URL, for example:

...
http://news.google.com/news/url?sa=T&ct=us/3-0&fd=R&url=http://www.washingtonpost.com/wp-dyn/co ... 02814.html%3Fhpid%3Dsec-health&cid=12936257725&ei=Rg18SYnNHAZWK8uCvCw&usg=AFZQHAO3t-H7mPEcZUQKAWDxLzkA
...

I need to extract only the bold part, i.e. an expression or piece of code that parses the string and returns the URL I have marked in bold (the value of the url= parameter).

What I marked in red is always the same and can serve as markers; the other stuff is dynamic. I am only interested in the bold URL; the rest can be stripped from the result.

G.

Re: Need help with reg expression matching.

Posted: Sun Jan 25, 2009 2:09 am
by requinix
parse_str can do exactly what you need. Just make sure there's only one bit that looks like a URL in that string and you'll be fine.
You can use parse_url first to strip everything up to the start of the query string, but it won't clean up the text that follows the URL. That only affects the last key/value pair, and since you don't care about "usg" it shouldn't be a problem.

Code:

$text = <<<TEXT
Reg expressions always give me headaches.....
 
Ok, here the problem:
 
i have string which contains a URL, example:
 
...
http://news.google.com/news/url?sa=T&ct=us/3-0&fd=R&url=http://www.washingtonpost.com/wp-dyn/content/article/2009/01/23/AR2009012302814.html%3Fhpid%3Dsec-health&cid=12936257725&ei=Rg18SYnNHAZWK8uCvCw&usg=AFZQHAO3t-H7mPEcZUQKAWDxLzkA
...
 
I need to extract only the bold part, so i need an expression/code which parses the string and returns the URL which i have marked in bold.
 
What i marked in red is always the same and can serve as markers, the other stuff is dynamic. I am only interested in the bold URL, the rest can be stripped in the result.
 
G.
TEXT;
 
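// Grab everything after the first "?" in the text (the query string plus any trailing text).
$url = parse_url($text, PHP_URL_QUERY);
// Split it into key => value pairs; the wanted address ends up in $GET['url'].
parse_str($url, $GET);
print_r($GET);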
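
If the string is just the link itself rather than a whole block of text, the same idea can be sketched roughly as below. The sample link is a made-up stand-in (the real one in the post is truncated), and the names $link, $params and $target are only illustrative.

```php
<?php
// Hypothetical sample link; hosts and parameter values are invented for illustration.
$link = 'http://news.google.com/news/url?sa=T&fd=R&url=http://www.example.com/article.html%3Fhpid%3Dsec-health&cid=123&usg=abc';

// Take only the query string (everything after the first "?").
$query = parse_url($link, PHP_URL_QUERY);

// Split the query string into key => value pairs; values are URL-decoded.
parse_str($query, $params);

// The wrapped address is the value of the "url" key, if it is present.
$target = isset($params['url']) ? $params['url'] : null;

echo $target, "\n";
```

Note that parse_str URL-decodes the value, so an encoded %3F in the wrapped address comes back as a plain "?".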

Re: Need help with reg expression matching.

Posted: Sun Jan 25, 2009 7:15 am
by flexy123
Thank you, but I am still having a problem.

Let's say I am pulling a web site, and the site contains arbitrary content, with

...
<A href="http://news.google.com/news/url?sa=T&ct ... &fd=R&url=http://www.washingtonpost.com/wp-dyn/co ... 01062.html&cid=1243573618&ei=xGJ8Sf5325QHAifCIAg&usg=AF643w5AmoH1CMw2_UJ5753S643A">
...
<A href="http://news.otherurl.com/news/url?sa=T& ... &fd=R&url=http://www.blah.com/wp-dyn/content/arti ... 01062.html&cid=1243573618&ei=xGJ2Sf5325QHAifCIAg&usg=AF643w5AmoH1CMw2_UJ5753S643A">
...
and similar URLs embedded all throughout the content.

What I want is to replace each occurrence of such a string throughout the whole page with the URL which is in red, the part after the url=.

<A href="http://news.otherurl.com/news/url?sa=T& ... &fd=R&url=http://www.blah.com/wp-dyn/content/arti ... 01062.html&cid=1243573618&ei=xGJ2Sf5325QHAifCIAg&usg=AF643w5AmoH1CMw2_UJ5753S643A">

becomes ---->

<A href="http://www.blah.com/wp-dyn/content/arti ... 62.html">

So I would need a regexp which finds the links in the site/string, parses each one, extracts the "url=xxxxxxxxx" part, and replaces each link with the red part; the rest of the link is not of interest.

Any help is much appreciated!
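
One possible way to approach the replacement described above is to let a regexp find each href value and hand it to parse_url/parse_str for the actual unwrapping. This is only a sketch under some assumptions: the hrefs are double-quoted, the sample HTML and hosts are made up, unwrap_link is a hypothetical helper name, and anonymous functions need PHP 5.3 or later.

```php
<?php
// Sample page fragment; hosts and paths are invented stand-ins for the truncated links.
$html = '<p><A href="http://news.google.com/news/url?sa=T&fd=R&url=http://www.example.com/a.html&cid=1&usg=x">one</A> '
      . '<a href="http://www.example.org/plain.html">two</a></p>';

// Hypothetical helper: return the wrapped address if the link carries a url= parameter,
// otherwise return the link unchanged.
function unwrap_link($href)
{
    $query = parse_url($href, PHP_URL_QUERY);
    if (!is_string($query)) {
        return $href; // no query string at all
    }
    parse_str($query, $params);
    return (isset($params['url']) && is_string($params['url'])) ? $params['url'] : $href;
}

// Match href="..." (case-insensitive) and rewrite only the part between the quotes.
$result = preg_replace_callback(
    '/(href=")([^"]*)(")/i',
    function ($m) {
        return $m[1] . unwrap_link($m[2]) . $m[3];
    },
    $html
);

echo $result, "\n";
```

Links without a url= parameter are left alone, so ordinary links on the page survive. The sketch does no HTML entity handling, so hrefs written with &amp; or single quotes would need extra care.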