Reading from a file and preg_match()


Lethality
Forum Newbie
Posts: 16
Joined: Tue May 01, 2007 9:38 am

Post by Lethality »

Code: Select all

$file = 'somefile.log';
$fh = fopen($file, 'r');
$data = fread($fh, filesize($file));
I have a file with a lot of text, and I want to extract every string starting with "http://" and store those strings in another file. I've been trying to figure out how to do this and have been experimenting with explode(), but with no success so far.

Basically:
Open file.
Read content.
Gather all links.
Store them in a new file.

Does anyone know how?
Last edited by Lethality on Wed May 02, 2007 4:15 am, edited 1 time in total.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

file() or fopen()+loop+fread()/fgets() or file_get_contents(); preg_match_all()...

You'll need to find the RFC compliant pattern for URLs, which is floating around here somewhere...
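
The functions listed above can be combined roughly as follows. This is a minimal sketch, not a definitive solution: the helper name extractLinks() is hypothetical, and the pattern is a deliberately loose stand-in ("http://" followed by non-whitespace) rather than the RFC-compliant pattern mentioned above.

```php
<?php
// Hypothetical helper: return every http:// link found in the file at $path.
function extractLinks(string $path): array
{
    // Read the whole file into one string.
    $text = file_get_contents($path);
    if ($text === false) {
        return [];
    }

    // Assumption: a link is "http://" followed by any run of non-whitespace.
    // preg_match_all() collects every match; the full matches are in $matches[0].
    preg_match_all('~http://\S+~i', $text, $matches);

    return $matches[0];
}
```

A loose pattern like this also picks up trailing punctuation, so a stricter pattern is worth swapping in once the read-and-collect part works.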
Lethality
Forum Newbie
Posts: 16
Joined: Tue May 01, 2007 9:38 am

Post by Lethality »

Thanks, I suppose this is what I'm looking for.

I tried this

Code: Select all

$subject = $data;
$pattern = "/^(http:\/\/)?([^\/]+)/i";
preg_match($pattern, substr($subject,3), $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
I'm not sure if the pattern is correct, but print_r($matches) only prints "Array ( )" and no URL.
What's up with that?
arturm
Forum Commoner
Posts: 86
Joined: Fri Apr 13, 2007 8:29 am
Location: NY
Contact:

Post by arturm »

Your regular expression is totally wrong.

Go to: http://www.cuneytyilmaz.com/prog/jrx/
This is a very good tool for testing your regular expressions.
Lethality
Forum Newbie
Posts: 16
Joined: Tue May 01, 2007 9:38 am

Post by Lethality »

arturm wrote: Your regular expression is totally wrong.

Go to: http://www.cuneytyilmaz.com/prog/jrx/
This is a very good tool for testing your regular expressions.
Well, thanks, it's a nice tool. Still, I'd appreciate it if someone could post the right pattern. I've been looking (probably not hard enough), but I cannot find the one for URLs.
superdezign
DevNet Master
Posts: 4135
Joined: Sat Jan 20, 2007 11:06 pm

Post by superdezign »

Well, it's good practice for learning regex. Think about it. What do all URLs have in common?

They don't ALWAYS have http://, so that should be optional (but if it isn't there, maybe you should add it). They have more than one name with a period, and end in com, net, org, info, edu, etc. They CAN have a slash, but if they have anything after the .com (or whatever), a slash has to be there first.

You'll get it.
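
One way to turn that description into a pattern is sketched below. It is an illustration only: the ending list is the handful named above rather than a complete set, "https" is allowed alongside "http" as an extra assumption, and the names $urlPattern and findUrls() are hypothetical.

```php
<?php
// The "x" modifier lets the pattern be spread over several commented lines.
$urlPattern = '~
    (?:https?://)?                # the scheme is optional
    (?:[a-z0-9-]+\.)+             # one or more names, each followed by a period
    (?:com|net|org|info|edu)\b    # a known ending (illustrative list only)
    (?:/\S*)?                     # anything after the ending must begin with a slash
~ix';

// Hypothetical helper: return every substring of $text that matches $pattern.
function findUrls(string $text, string $pattern): array
{
    preg_match_all($pattern, $text, $m);
    return $m[0];
}

print_r(findUrls('see www.example.com/page and http://foo.org now', $urlPattern));
```

Because the ending list is hard-coded, any domain outside it is silently skipped, which is the main trade-off of this approach.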
Lethality
Forum Newbie
Posts: 16
Joined: Tue May 01, 2007 9:38 am

Post by Lethality »

OK, I figured out that this one works fine: "((www|http)(\W+\S+[^).,:;?\]\} \r\n$]+))"
I tested it with that tool, and it works.

However, I get this error when I try to run it in my script:

Code: Select all

Warning: preg_match() [function.preg-match]: Unknown modifier ')'
Lethality
Forum Newbie
Posts: 16
Joined: Tue May 01, 2007 9:38 am

Post by Lethality »

Does anyone know? It works with the tool, but the preg_match function gives me that Unknown modifier error.
stereofrog
Forum Contributor
Posts: 386
Joined: Mon Dec 04, 2006 6:10 am

Post by stereofrog »

Don't overcomplicate things. Does the following work for you?

Code: Select all

$lines = preg_grep(
   '~^http://~',
   file('somefile.log')
);
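
For reference, here is a small sketch of what that call keeps and what it drops. The sample lines stand in for file('somefile.log') and are made up for illustration. Because of the ^ anchor, preg_grep() only keeps lines that begin with http://, so a link in the middle of a line is not picked up.

```php
<?php
// Stand-in for file('somefile.log'): one element per line, newline included.
$lines = [
    "http://example.com/a\n",
    "see http://example.com/b\n",   // link is not at the start of the line
    "plain text\n",
    "http://example.org\n",
];

// Keep only the lines that begin with http://. Original array keys are preserved.
$links = preg_grep('~^http://~', $lines);

// Strip trailing newlines and reindex from 0 before further use.
$links = array_values(array_map('rtrim', $links));

print_r($links);
```

If the log can contain links in the middle of a line, preg_match_all() on the whole file contents is the better fit.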
arturm
Forum Commoner
Posts: 86
Joined: Fri Apr 13, 2007 8:29 am
Location: NY
Contact:

Post by arturm »

You have to escape some of the special characters like ( . ?

Code: Select all

preg_match("/((www|http)(\W+\S+[^\)\.,:;\?\]\} \r\n$]+))/i",$subject);
It should work like that.
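
A note on why this version stops the warning: PCRE patterns in PHP need delimiters. In the earlier pattern the leading "(" was taken as the delimiter, paired with a later ")", and the leftover ")" was then read as a modifier. Wrapping the pattern in "/" is what resolves that; inside a character class, characters such as ")", "." and "?" are already literal, so escaping them is harmless but optional. The sketch below uses a made-up sample string to show the difference.

```php
<?php
$text = 'visit http://www.example.com/page, ok';

// No delimiters: PHP treats the first "(" as the opening delimiter,
// so compilation fails. The @ only silences the warning for this demo.
$bare = '((www|http)(\W+\S+[^).,:;?\]\} \r\n$]+))';
$resultBare = @preg_match($bare, $text);

// Same expression wrapped in "/" delimiters, with the "i" modifier.
$delimited = '/((www|http)(\W+\S+[^).,:;?\]\} \r\n$]+))/i';
$resultDelimited = preg_match($delimited, $text, $m);

var_dump($resultBare, $resultDelimited, $m[0] ?? null);
```

Any delimiter that does not occur in the pattern works; "~" or "#" are common choices for URL patterns because they avoid having to escape slashes.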
Lethality
Forum Newbie
Posts: 16
Joined: Tue May 01, 2007 9:38 am

Post by Lethality »

Thanks for the help, it works now. I used the new pattern.

Code: Select all

$subject = $theData;
$pattern ="/((www|http)(\W+\S+[^\)\.,:;\?\]\} \r\n$]+))/i";
preg_match($pattern, $theData, $matches);
echo "URL: {$matches[0]}\n";

$file1 = "url.log";
$fh1 = fopen($file1, 'w') or die("Can't open file");
$urls = print_r ($matches[0], true);
fwrite($fh1, $urls);
fclose($fh1);
This prints and writes only one of the URLs. Is there a way to get the size of the array and print all the links that exist in the file?
So far it only gets one link.
arturm
Forum Commoner
Posts: 86
Joined: Fri Apr 13, 2007 8:29 am
Location: NY
Contact:

Post by arturm »

If you use preg_match_all() instead of preg_match(), $matches will be filled with all the URLs from the file.

If you want to print them, you can use print_r() or foreach.
See http://www.php.net for documentation on how to do it.
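
A minimal sketch of that change, writing one URL per line to the output file. The helper name saveAllUrls() is hypothetical, and the pattern passed in the usage note is the one posted earlier in the thread.

```php
<?php
// Hypothetical helper: find every match of $pattern in $text and
// write the matches to $outFile, one per line. Returns the match count.
function saveAllUrls(string $text, string $pattern, string $outFile): int
{
    // preg_match_all() returns the number of full matches (or false on error);
    // the matched strings themselves are collected in $matches[0].
    $count = preg_match_all($pattern, $text, $matches);
    if ($count === false) {
        return 0;
    }

    $fh = fopen($outFile, 'w') or die("Can't open file");
    foreach ($matches[0] as $url) {
        fwrite($fh, $url . "\n");
    }
    fclose($fh);

    return $count;
}
```

It could be called as saveAllUrls($theData, '/((www|http)(\W+\S+[^\)\.,:;\?\]\} \r\n$]+))/i', 'url.log'). The return value answers the "size of the array" question, and count($matches[0]) gives the same number.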
Lethality
Forum Newbie
Posts: 16
Joined: Tue May 01, 2007 9:38 am

Post by Lethality »

Thanks again, I think most of this is solved now.
Last edited by Lethality on Fri May 04, 2007 9:53 am, edited 1 time in total.
Lethality
Forum Newbie
Posts: 16
Joined: Tue May 01, 2007 9:38 am

Post by Lethality »

The links are stored in an array (print_r output below). I want each of them to open in a separate browser tab. Is that possible?

Code: Select all

Array
(
    [0] => http://www.randomsite.domain
    [1] => www.randomsite.domain
    [2] => http://randomsite.domain
)
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Tab control is not within PHP's domain of influence. The only thing you can do is ask the browser to open a new window (which is no longer a standards-compliant request).
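
What PHP can do is emit HTML that makes the request. The sketch below is one possible approach, assuming the links are shown on a page for the user to click: renderLinks() is a hypothetical helper, and treating entries without a scheme as http links is an assumption. Whether the browser opens a tab or a window is up to the browser and its settings.

```php
<?php
// Hypothetical helper: render each URL as a link that asks for a new window.
function renderLinks(array $urls): string
{
    $html = '';
    foreach ($urls as $url) {
        // Assumption: entries without a scheme (e.g. "www.randomsite.domain")
        // are meant as http links, so the scheme is added for the href.
        $href = preg_match('~^https?://~i', $url) ? $url : 'http://' . $url;

        // Escape both values before placing them in HTML.
        $html .= '<a href="' . htmlspecialchars($href, ENT_QUOTES) . '" target="_blank">'
               . htmlspecialchars($url, ENT_QUOTES) . "</a><br>\n";
    }
    return $html;
}

echo renderLinks([
    'http://www.randomsite.domain',
    'www.randomsite.domain',
    'http://randomsite.domain',
]);
```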