
Need some help with a pattern match... Please help :-(

Posted: Fri Aug 06, 2004 9:04 pm
by Chris Corbyn
Hi,

I'm writing a program in Perl which opens documents whose filenames are provided as arguments on the command line.

The documents are html files. The program is supposed to go through the files and extract all the links in the file to make a summary.

I've got a pattern match which I can see no problems with (it allows whitespace wherever HTML would allow whitespace too) but it's not working as well as I hoped.

Here is the file it is reading:
<html>
<head>
<title>
This is a test page with links in it
</title>
</head>
<body>
<div style="border:2px solid darkred; margin:20px; padding:20px; width:500px; color:darkblue; font-family:verdana; font-size:12pt" align="left">
Just testing some <a href="links.php">links</a> on this page which will be totally crammed chocoblock with link like <a href="yippee/anotherlink.html">this</a>.
<p>
This is a <a href="new_link.php#line2">link</a> to a .php file that doesn't even exist so don't click the link!
<p>
This one <a href="http://www.healthpointuk.com">here</a> is a link to one of my websites at http://healthpointuk.com.
<p>
And this last <a href="random_link.htm"><b>Link</b></a> is just some random link to another none existent file.
</body>
</html>
and my perl code (designed for command line but may run in browser)....

Code:

#!/perl/bin/perl -w

# Take the command line arguments and store in hash
%arg_filenames = ();
$i = 0;
foreach $command_args (@ARGV) {
	$argname = $command_args;
	$arg_filenames{"file $i"} = $argname;
	++$i;
}

# Open each file in turn for reading
foreach my $filenames (values %arg_filenames) {
	if (-e $filenames) {  # -e means "file exists". So only open if file exists
		open (DATA, "< $filenames");  # Open file
		while (<DATA>) {  # Loop over all lines
			if (/<a\s\s*?href\s*?=\s*?"(.+)"\s*?>/gi) {  # Problematic pattern match
				print ("$1 is a link on page \"$filenames\"\n");
			}
		}
	} else {  # Just an error message so the user can fix the problem
		print ("\n:: \"$filenames\" does not exist in this directory ::\n");
	}
}
__END__
And this is the output. The first two links seem to get strung together into one long string, but all the others work fine. I tried rewriting the HTML file but it still does the same thing.
links.php">links</a> on this page which will be totally crammed chocoblock with link like <a href="yippee/anotherlink.html is a link on page "index.html"
new_link.php#line2 is a link on page "index.html"
http://www.healthpointuk.com is a link on page "index.html"
random_link.htm is a link on page "index.html"
Can anyone see what's wrong with my pattern match? I am quite new to this.

Thanks in advance :-)

Posted: Fri Aug 06, 2004 9:10 pm
by Joe
Personally I am not too good with perl. However, just to offer a helping hand I have this tutorial:

Regular Expressions In Perl

Posted: Fri Aug 06, 2004 9:15 pm
by Chris Corbyn
Thanks. I'm pretty confident I know how to do pattern matching, but I just can't spot my error, so I thought I'd seek help from the guys in here.

Thanks anyway

Posted: Fri Aug 06, 2004 9:19 pm
by Joe
No problem, and good luck ;)

Posted: Fri Aug 06, 2004 9:25 pm
by Chris Corbyn
Hmmm... I see what it's doing but I can't see why. It's reading from the first " (double quote) in the first link and SHOULD stop at that link's closing double quote, but it doesn't: it carries on until the last double quote in the second link. Why, and why doesn't it do that for the others? Odd...
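
The behaviour described here is consistent with greedy matching: .+ takes as many characters as it can and only gives back what it must, so on a line holding two links the capture runs to the last "> on that line. In the test page only the first two links share a line, which would explain why the others come out fine. Below is a minimal sketch, assuming a shortened copy of that line stored in a hypothetical $line variable, that compares the greedy capture with a negated character class. The character class is one common alternative, not necessarily the fix for every HTML input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: a shortened version of the line that holds two links.
my $line = 'Just testing some <a href="links.php">links</a> and <a href="yippee/anotherlink.html">this</a>.';

# Greedy: (.+) grabs as much as possible, then backtracks only far enough
# for the trailing "> to match, so it stops at the LAST "> on the line.
my ($greedy) = $line =~ /<a\s+href\s*=\s*"(.+)"\s*>/i;

# Negated character class: ([^"]+) cannot cross a double quote, so each
# capture ends at the closing quote of its own href. With /g in list
# context, every link on the line is returned.
my @links = $line =~ /<a\s+href\s*=\s*"([^"]+)"\s*>/gi;

print "greedy: $greedy\n";
print "link:   $_\n" for @links;
```

With this sample, the greedy capture spans both links, while the character-class version returns the two href values separately. A lazy quantifier, (.+?), would also stop at the first closing quote here.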

Posted: Sat Aug 07, 2004 2:16 am
by timvw
I don't think you should use a regular expression for that.

Have a look at HTML::Parse and HTML::FormatText etc...