Need some help with a pattern match... Please help :-(
Posted: Fri Aug 06, 2004 9:04 pm
Hi,
I'm writing a program in Perl which opens documents whose filenames are provided as arguments on the command line.
The documents are HTML files. The program is supposed to go through the files and extract all the links in each file to make a summary.
I've got a pattern match which I can see no problems with (it allows whitespace where HTML would allow whitespace too) but it's not working as well as I hoped.
When I run it, the first two links seem to get strung together into one long string, but all the others work fine. I tried rewriting the HTML file, but it still does the same thing.
Here is the file it is reading:
<html>
<head>
<title>
This is a test page with links in it
</title>
</head>
<body>
<div style="border:2px solid darkred; margin:20px; padding:20px; width:500px; color:darkblue; font-family:verdana; font-size:12pt" align="left">
Just testing some <a href="links.php">links</a> on this page which will be totally crammed chocoblock with link like <a href="yippee/anotherlink.html">this</a>.
<p>
This is a <a href="new_link.php#line2">link</a> to a .php file that doesn't even exist so don't click the link!
<p>
This one <a href="http://www.healthpointuk.com">here</a> is a link to one of my websites at http://healthpointuk.com.
<p>
And this last <a href="random_link.htm"><b>Link</b></a> is just some random link to another none existent file.
</body>
</html>
and my Perl code (designed for the command line, but it may also run in a browser):
#!/perl/bin/perl -w
# Take the command line arguments and store in hash
%arg_filenames = ();
$i = 0;
foreach $command_args (@ARGV) {
    $argname = $command_args;
    $arg_filenames{"file $i"} = $argname;
    ++$i;
}
# Open each file in turn for reading
foreach my $filenames (values %arg_filenames) {
    if (-e $filenames) { # -e means "file exists". So only open if file exists
        open (DATA, "< $filenames"); # Open file
        while (<DATA>) { # Loop over all lines
            if (/<a\s\s*?href\s*?=\s*?"(.+)"\s*?>/gi) { # Problematic pattern match
                print ("$1 is a link on page \"$filenames\"\n");
            }
        }
    } else { # Just an error message so the user can fix the problem
        print ("\n:: \"$filenames\" does not exist in this directory ::\n");
    }
}
__END__
Can anyone see what's wrong with my pattern match? I am quite new to this.
This is the output:
links.php">links</a> on this page which will be totally crammed chocoblock with link like <a href="yippee/anotherlink.html is a link on page "index.html"
new_link.php#line2 is a link on page "index.html"
http://www.healthpointuk.com is a link on page "index.html"
random_link.htm is a link on page "index.html"
Thanks in advance
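A possible explanation, offered as a sketch rather than a definitive fix: the capture `(.+)` in the posted pattern is greedy, so on a line holding two links it runs on to the last `">` it can find, and an `if` tests the pattern only once per line. The example below swaps in a negated character class, `[^"]+`, which cannot cross the closing quote, and uses `while` with `/g` to collect every match on the line. The helper name `extract_links` and the sample line are made up for illustration and are not part of the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return every href
# value found in one line of HTML.
sub extract_links {
    my ($line) = @_;
    my @links;
    # [^"]+ stops at the first closing quote, so each capture stays inside
    # its own attribute. while + /g resumes after the previous match, so a
    # second link on the same line is picked up too.
    while ($line =~ /<a\s+href\s*=\s*"([^"]+)"\s*>/gi) {
        push @links, $1;
    }
    return @links;
}

# Sample line with two links, shortened from the test page above.
my $line = 'Just testing some <a href="links.php">links</a> and '
         . '<a href="yippee/anotherlink.html">this</a>.';
print "$_ is a link\n" for extract_links($line);
```

Like the original pattern, this still assumes `href` is the first attribute and that nothing follows it before the `>`. For real-world pages a parser module such as HTML::LinkExtor from CPAN is likely to be sturdier than any regex.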