Reading from a file and preg_match()

Posted: Tue May 01, 2007 2:42 pm
by Lethality

Code:

$file = 'somefile.log';
$fh = fopen($file, 'r');
$data = fread($fh, filesize($file));
fclose($fh);

I have a file with a lot of text and I want to take out all strings starting with "http://" and store those strings in another file. I've been trying to figure out how to do this and I've been messing around with explode(), but with no success so far.

Basically:
Open file.
Read content.
Gather all links.
Store them in a new file.

Does anyone know how?

Posted: Tue May 01, 2007 2:59 pm
by feyd
file() or fopen()+loop+fread()/fgets() or file_get_contents(); preg_match_all()...

You'll need to find the RFC compliant pattern for URLs, which is floating around here somewhere...
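
A rough sketch of the file_get_contents() + preg_match_all() route. The helper name extract_http_links() and the sample text are made up for illustration, and the pattern is a deliberately simple one (everything from "http://" up to the next whitespace, quote, or angle bracket), not the RFC-compliant pattern mentioned above.

```php
<?php
// Hypothetical helper: returns every substring that starts with "http://"
// and runs up to the next whitespace, quote, or angle bracket.
function extract_http_links(string $text): array
{
    // "~" is used as the delimiter so the slashes need no escaping.
    preg_match_all('~http://[^\s"\'<>]+~i', $text, $matches);
    return $matches[0];
}

// Made-up sample standing in for the contents of the log file.
$sample = "GET http://example.com/a.html ok\nno link here\nref=\"http://example.org/b\" done";
print_r(extract_http_links($sample));
```

With a real file, the same helper could be fed file_get_contents('somefile.log'), and the result written out with file_put_contents().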

Posted: Tue May 01, 2007 3:55 pm
by Lethality
Thanks, I suppose this is what I'm looking for.

I tried this

Code:

$subject = $data;
$pattern = "/^(http:\/\/)?([^\/]+)/i";
preg_match($pattern, substr($subject,3), $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
I'm not sure if the pattern is correct, but print_r($matches) only prints "Array ( )" and no URL.
What's up with that?

Posted: Tue May 01, 2007 4:00 pm
by arturm
Your regular expression is totally wrong.

Go to: http://www.cuneytyilmaz.com/prog/jrx/
This is a very good tool for testing your regular expressions.

Posted: Tue May 01, 2007 5:17 pm
by Lethality
arturm wrote:Your regular expression is totally wrong.

Go to: http://www.cuneytyilmaz.com/prog/jrx/
This is a very good tool for testing your regular expressions.
Well, thanks, it's a nice tool. Though I'd appreciate it if anyone could post the right one. I've been looking (probably not hard enough), but I cannot find the one for URLs.

Posted: Tue May 01, 2007 5:35 pm
by superdezign
Well, it's good practice for learning regex. Think about it. What do all urls have in common?

They don't ALWAYS have http://, so that should be optional (but if it isn't there, maybe you should add it). They have more than one name with a period, and end in com, net, org, info, edu, etc. They CAN have a slash, but if they have anything after the .com (or whatever), a slash has to be there first.
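
One possible reading of those hints as a pattern, offered only as a loose sketch. It assumes a URL is an optional http:// or https:// prefix, one or more dot-terminated names, an ending of two or more letters, and an optional path that must begin with a slash. The variable name $loose_url and the sample text are made up, and the pattern is far from RFC-compliant.

```php
<?php
// Assumption: optional scheme, dot-separated names, a 2+ letter ending,
// and an optional path that has to start with "/".
$loose_url = '~(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?~i';

// Made-up sample text.
$text = 'see http://www.example.com/path and example.org.';
preg_match_all($loose_url, $text, $m);
print_r($m[0]);
```

Because the scheme is optional, this also picks up bare names such as example.org, which may or may not be what you want in a log file.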

You'll get it.

Posted: Tue May 01, 2007 5:57 pm
by Lethality
OK, I figured out that this one works fine: "((www|http)(\W+\S+[^).,:;?\]\} \r\n$]+))"
I tested it with that tool, and it works.

Though I'm getting this error when I'm trying to run it with my script.

Code:

Warning: preg_match() [function.preg-match]: Unknown modifier ')'

Posted: Wed May 02, 2007 4:13 am
by Lethality
Does anyone know? It works with the tool, but the preg_match function gives me that Unknown modifier error.

Posted: Wed May 02, 2007 4:38 am
by stereofrog
Don't overcomplicate things. Does the following work for you?

Code:

$lines = preg_grep(
   '~^http://~',
   file('somefile.log')
);
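
To see what that call keeps, here is a small sketch with a made-up array standing in for the result of file('somefile.log'). preg_grep() returns whole array elements that match and preserves their original keys, so with the ^ anchor only lines that begin with "http://" survive.

```php
<?php
// Stand-in for file('somefile.log', FILE_IGNORE_NEW_LINES); the lines are made up.
$all_lines = [
    'http://example.com/one',
    'visit http://example.com/two',
    'http://example.org/three',
];

// preg_grep() returns the elements that match, keeping their original keys.
$lines = preg_grep('~^http://~', $all_lines);
print_r($lines);
```

Note that the second line is dropped because its link is not at the start of the line, so this approach fits logs with one URL per line rather than links embedded in running text.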

Posted: Wed May 02, 2007 8:23 am
by arturm
You have to escape some of the special characters like ( . ?

Code:

preg_match("/((www|http)(\W+\S+[^\)\.,:;\?\]\} \r\n$]+))/i",$subject);
It should work like that.
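
A note on the warning itself: it appears to come from the missing delimiters rather than from unescaped characters. The preg functions expect the pattern to be wrapped in delimiters, and when the string starts with "(", PHP treats that as a bracket-style delimiter and looks for the closing ")". It counts the ")" inside the character class too, so the pattern seems to end one character early and the final ")" is read as a modifier. The sketch below keeps the original pattern body unescaped and only adds "~" delimiters; the sample subject is made up.

```php
<?php
// Made-up sample subject.
$subject = 'log entry: http://www.example.com/page, next entry';

// Same pattern body as in the thread, wrapped in "~" delimiters.
// Single quotes keep the backslashes intact; \r and \n are then PCRE escapes.
$pattern = '~((www|http)(\W+\S+[^).,:;?\]\} \r\n$]+))~i';

if (preg_match($pattern, $subject, $matches)) {
    echo $matches[0], "\n";
}
```

Inside a character class, characters such as ) . ? are already literal, so escaping them is harmless but optional.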

Posted: Thu May 03, 2007 9:25 am
by Lethality
Thanks for the help, it works now. I used the new pattern.

Code:

$subject = $theData;
$pattern ="/((www|http)(\W+\S+[^\)\.,:;\?\]\}\r\n$]+))/i";
preg_match($pattern, $theData, $matches);
echo "URL: {$matches[0]}\n";

$file1 = "url.log";
$fh1 = fopen($file1, 'w') or die("Can't open file");
$urls = print_r ($matches[0], true);
fwrite($fh1, $urls);
fclose($fh1);
This prints and writes only one of the URLs. Is there a way to get the size of the array and print out all the links that exist in the file?
So far it only gets one link.

Posted: Thu May 03, 2007 9:49 am
by arturm
If you use preg_match_all() instead of preg_match(), it will collect all the URLs from the file into the matches array.

If you want to print them, you can use print_r() or foreach.
Look at http://www.php.net for documentation on how to do it.
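
A minimal sketch of that change, assuming the pattern posted earlier in the thread (the version with a space inside the character class) and made-up sample data in $theData. preg_match_all() returns the number of matches and puts the full matches in $matches[0].

```php
<?php
// Made-up sample standing in for the file contents.
$theData = "a http://www.example.com/x b\nwww.example.net/y c\nhttp://example.org/z";

// Pattern from the thread, written in single quotes.
$pattern = '/((www|http)(\W+\S+[^\)\.,:;\?\]\} \r\n$]+))/i';

// $matches[0] holds every full match; the return value is the number found.
$count = preg_match_all($pattern, $theData, $matches);

foreach ($matches[0] as $url) {
    echo "URL: $url\n";
}

// One URL per line in the output file.
file_put_contents('url.log', implode("\n", $matches[0]) . "\n");
```

The return value (or count($matches[0])) answers the question about the size of the array.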

Posted: Thu May 03, 2007 1:38 pm
by Lethality
Thanks again, I think most of this is solved now.

Posted: Fri May 04, 2007 9:51 am
by Lethality
The links are stored as an array like the one below (print_r output). I want each of them to open in a separate browser tab. Is that possible?

Code:

Array
(
    [0] => http://www.randomsite.domain
    [1] => www.randomsite.domain
    [2] => http://randomsite.domain
)

Posted: Fri May 04, 2007 10:47 am
by feyd
Tab control is not within PHP's domain of influence. The only thing you can do is ask the browser to open a new window (which is no longer a standards-compliant request).
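
What PHP can do is print the links as HTML and let the browser decide. The sketch below assumes an array shaped like the one in the question and assumes that entries without a scheme are meant to be http:// links. Whether target="_blank" ends up as a new tab or a new window depends on the browser and the user's settings.

```php
<?php
// Sample data in the shape shown in the question.
$urls = ['http://www.randomsite.domain', 'www.randomsite.domain', 'http://randomsite.domain'];

$html = '';
foreach ($urls as $url) {
    // Assumption: entries without a scheme are meant to be http:// links.
    $href = preg_match('~^https?://~i', $url) ? $url : 'http://' . $url;
    // Escape the URL before placing it in markup.
    $safe = htmlspecialchars($href, ENT_QUOTES);
    $html .= '<a href="' . $safe . '" target="_blank">' . $safe . "</a><br>\n";
}
echo $html;
```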