html parsing and regular expressions

Posted: Fri Jun 14, 2002 5:50 pm
by Galahad
I'm trying to parse an html document to find the links. I don't want to get javascript (href="javascript:...") or style sheet (href="*.css") links, though. I'm using preg_match_all but having trouble with the regular expressions. I'm afraid I'm not much of a regular expression hacker, but here's what I have:

Code:

preg_match_all("/(?Ui)href[ ]*=[ ]*(\"|').*(?!javascript:).*(?!\.css)\2/", $line, $temp);
Then I do a foreach with the results. When I use the above code it always fails. I can get it to work if I simplify the expression to:

Code:

preg_match_all("/(?Ui)href[ ]*=[ ]*\".*\"/", $line, $temp);
I would like to have it work with either href="..." or href='...' and ignore javascripts and style sheets. Any ideas? Thanks for the help.

Posted: Fri Jun 14, 2002 7:07 pm
by volka
try

Code:

$pattern = '/href\s*=\s*[\'"](?!javascript:)(.*)(?<!css)[\'"]/';
as pattern for preg_match_all

Posted: Mon Jun 17, 2002 11:26 am
by Galahad
Volka, thanks for the reply. Here's my updated pattern:

Code:

$pattern = "/(?Ui)href\s*=\s*[\"'](?!javascript:)(.*)(?<!css)[\"']/";
It no longer matches javascript stuff. There are a few things which won't work quite right though. For instance, it still matches the .css file on my test pages.
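One way a .css link can still slip through, shown as a minimal sketch: with an ungreedy .* the negative lookbehind only rejects the first closing quote, so the match can carry on to a quote belonging to a later attribute. The two test lines are made up, and the $strict pattern is just one possible alternative (an assumption, not something suggested in this thread).

```php
<?php
// Pattern from the post above, with the character classes written out.
$pattern = "/(?Ui)href\s*=\s*[\"'](?!javascript:)(.*)(?<!css)[\"']/";

// Hypothetical test lines; the second has another attribute after href.
$alone    = '<link href="style.css">';
$withType = '<link href="style.css" type="text/css">';

// No later quote to fall back on, so the lookbehind blocks the match.
$hitAlone = preg_match($pattern, $alone, $m1);

// Here .* keeps growing past the first closing quote and stops at the
// quote that opens the type attribute, which is not preceded by "css".
$hitType = preg_match($pattern, $withType, $m2);

// Possible alternative (assumption): forbid quotes inside the captured
// value so the match cannot run past the closing quote of the href.
$strict = "/href\s*=\s*[\"'](?!javascript:)([^\"']*)(?<!css)[\"']/i";
$hitStrict = preg_match($strict, $withType);

var_dump($hitAlone, $hitType, $m2[0], $hitStrict);
```

If your test pages have anything quoted after the href on the same line, that would explain the leftover .css matches.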

Also, if you do the quote matching with two [\'"] classes, then it will match a ' with a ". If the html is something like href="javascript:popUp('Hello')" it will only match href="javascript:popUp('. I'm not sure this is a serious problem, because I can't really come up with a situation where javascript is not involved. However, I would like it to deal with it correctly just for completeness. I want it to stop at whichever quote style it started with. I know that you can refer to groups with something like \1, but I've never gotten it to work in PHP. How do you do it correctly?

I have the (?Ui) in there so that it does a case-insensitive, ungreedy search. Some people write href, while others write HREF. I need it to be ungreedy because if the line consists of: <a href="..."><img src="image.gif"></a> it will match all the way through the final " in the image tag. Thanks again for the help.
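To illustrate the greedy versus ungreedy difference described above, here is a small sketch. The sample line follows the one in the post, with a made-up URL (page.html) filled in for the "...".

```php
<?php
// Sample line modelled on the one in the post; page.html is a placeholder.
$line = '<a HREF="page.html"><img src="image.gif"></a>';

// Greedy and case-insensitive: .* runs to the last double quote on the line.
preg_match('/href\s*=\s*"(.*)"/i', $line, $greedy);

// (?Ui) turns on ungreedy and case-insensitive matching inside the pattern,
// so .* stops at the first closing quote.
preg_match('/(?Ui)href\s*=\s*"(.*)"/', $line, $ungreedy);

echo $greedy[1], "\n";
echo $ungreedy[1], "\n";
```

The first echo shows the capture spilling into the img tag; the second shows only the link target.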

Posted: Tue Jun 18, 2002 6:27 pm
by Galahad
Ok, here's my latest pattern:

Code:

$pattern = "/(?Ui)href\s*=\s*[\"'](?!javascript:)(.*)\.(php|html)[\"']/";
I had to switch from matching everything that wasn't a .css file because it was picking up executables and zipped files, too. Matching only php or html files works much better; I should have done that from the start.

Anyway, I still haven't been able to figure out the grouping thing. Here's how I've tried that:

Code:

$pattern = "/(?Ui)href\s*=\s*([\"'])(?!javascript:)(.*)\.(php|html)\1/";
When I try this with either \1 or \2, it fails to match anything. You can read about back references here. If anyone can make it work or tell me what I'm doing wrong let me know, I would greatly appreciate it. Thanks.

Posted: Wed Jun 19, 2002 1:09 pm
by Galahad
Ok, I posted to another forum and someone found my error. Because I was writing the pattern in a double-quoted string, I needed to use '\\1' not just '\1'. Inside double quotes PHP reads \1 as an octal character escape, so the back reference never reaches the regular expression engine. Here's the code that works:

Code:

$pattern = "/(?Ui)href\s*=\s*([\"'])(?!javascript:)(.*)\.(php|html)\\1/";
I just thought some of you might like to know what was going on. I hope this is helpful.
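For anyone finding this thread later, here is a minimal sketch of the escaping issue and of the final pattern in use, assuming the pattern is written in a double-quoted PHP string as above. The sample HTML and the variable names are made up for illustration.

```php
<?php
// In a double-quoted string "\1" is an octal escape: a single byte 0x01.
// "\\1" is a backslash followed by the digit 1, which PCRE reads as a
// back reference to group 1.
$isControlChar = ("\1" === chr(1));
$isBackRef     = ("\\1" === '\1');

// Final pattern from the post above.
$pattern = "/(?Ui)href\s*=\s*([\"'])(?!javascript:)(.*)\.(php|html)\\1/";

// Hypothetical test page.
$html = <<<'HTML'
<link rel="stylesheet" HREF="style.css">
<a href="index.php">Home</a>
<a href='docs/about.html'>About</a>
<a href="javascript:popUp('Hello')">Pop</a>
<a href="files/setup.zip">Download</a>
HTML;

preg_match_all($pattern, $html, $m);

// Group 2 holds the path without the extension, group 3 the extension.
$links = [];
foreach ($m[2] as $i => $base) {
    $links[] = $base . '.' . $m[3][$i];
}

print_r($links);
```

Only the .php and .html links come back, and each one ends at the same quote style it started with. Writing the pattern in single quotes would be another way to keep \1 intact, at the cost of escaping the ' inside the character class.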