html parsing and regular expressions

PHP programming forum. Ask questions or help people concerning PHP code. Don't understand a function? Need help implementing a class? Don't understand a class? Here is where to ask. Remember to do your homework!

Moderator: General Moderators

Galahad
Forum Contributor
Posts: 111
Joined: Fri Jun 14, 2002 5:50 pm


Post by Galahad »

I'm trying to parse an html document to find the links. I don't want to get javascript (href="javascript:...") or style sheet (href="*.css") links, though. I'm using preg_match_all but am having trouble with the regular expressions. I'm afraid I'm not much of a regular expression hacker, but here's what I have:

Code:

preg_match_all("/(?Ui)href[ ]*=[ ]*(\"|').*(?!javascript:).*(?!\.css)\2/", $line, $temp);
Then I do a foreach with the results. When I use the above code it always fails. I can get it to work if I simplify the expression to:

Code:

preg_match_all("/(?Ui)href[ ]*=[ ]*\".*\"/", $line, $temp);
I would like to have it work with either href="..." or href='...' and ignore javascripts and style sheets. Any ideas? Thanks for the help.
volka
DevNet Evangelist
Posts: 8391
Joined: Tue May 07, 2002 9:48 am
Location: Berlin, ger

Post by volka »

try

Code:

$pattern = '/href\s*=\s*[\'"](?!javascript:)(.*)(?<!css)[\'"]/';
as pattern for preg_match_all
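
To see what the negative lookahead contributes, here is a minimal sketch that runs a trimmed version of the pattern (the (?<!css) part is left out on purpose) over a few lines. The $lines array and the $links variable are made up for illustration and are not from the thread; each sample line holds a single attribute, since a greedy .* would otherwise run on to the last quote on the line.

```php
<?php
// Hypothetical sample input, one tag per line (not from the thread).
$lines = [
    '<a href="index.php">',
    '<a href="javascript:popUp()">',
    "<a href='about.html'>",
];

// Trimmed pattern: opening quote, then (?!javascript:) refuses to go on
// if the value starts with "javascript:", then the value, then a closing quote.
$pattern = '/href\s*=\s*[\'"](?!javascript:)(.*)[\'"]/';

$links = [];
foreach ($lines as $line) {
    if (preg_match_all($pattern, $line, $temp)) {
        // $temp[1] holds whatever the first (...) group captured.
        foreach ($temp[1] as $url) {
            $links[] = $url;
        }
    }
}

print_r($links); // index.php and about.html; the javascript: line is skipped
```

The lookahead consumes no characters. It only vetoes the match at that position, which is why the captured value still starts right after the quote.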
Galahad
Forum Contributor
Posts: 111
Joined: Fri Jun 14, 2002 5:50 pm

Post by Galahad »

Volka, thanks for the reply. Here's my updated pattern:

Code:

$pattern = "/(?Ui)href\s*=\s*[\"'](?!javascript:)(.*)(?<!css)[\"']/";
It no longer matches javascript stuff. There are a few things which won't work quite right though. For instance, it still matches the .css file on my test pages.

Also, if you do the quote matching with two ['"] classes, a ' can be paired with a ". If the html is something like href="javascript:popUp('Hello')", it will only match href="javascript:popUp('. I'm not sure this is a serious problem, because I can't really come up with a situation where javascript is not involved. However, I would like it to deal with it correctly just for completeness. I want it to stop at whichever quote style it started with. I know that you can refer to groups with something like \1, but I've never gotten it to work in PHP. How do you do it correctly?

I have the (?Ui) in there so that it does a case-insensitive, ungreedy search. Some people write href, while others write HREF. I need it to be ungreedy because if the line consists of: <a href="..."><img src="image.gif"></a> it will match all the way through the final " in the image tag. Thanks again for the help.
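
A likely reason the .css link still gets through: when the lookbehind rejects the quote that follows "css", the ungreedy .* simply keeps growing until it reaches some later quote that is not preceded by "css". The sketch below reproduces that on a made-up test line and shows one possible alternative, a negated character class that cannot cross a quote. The names $line, $lazy, $strict, $leaked and $count are hypothetical, and the $strict pattern is a suggestion, not something posted in the thread.

```php
<?php
// Hypothetical test line (not from the thread): a stylesheet link
// followed by a second quoted attribute.
$line = '<link href="style.css" rel="stylesheet">';

// The pattern from this post, written here in a single-quoted string.
$lazy = '/(?Ui)href\s*=\s*["\'](?!javascript:)(.*)(?<!css)["\']/';
preg_match($lazy, $line, $m);
// The quote after "css" is refused, so .* expands past it and the match
// ends at the next quote, which follows "rel=".
$leaked = $m[1];

// One possible alternative: [^"']* can never run past a quote, so a value
// ending in .css leaves the lookbehind with nowhere else to go.
$strict = '/href\s*=\s*["\'](?!javascript:)([^"\']*)(?<!\.css)["\']/i';
$count = preg_match($strict, $line, $m2);

var_dump($leaked); // the captured text spills into the rel attribute
var_dump($count);  // 0: the stylesheet link is rejected
```

With the character class in place the U flag is no longer needed, because the value cannot extend beyond its own closing quote either way.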
Galahad
Forum Contributor
Posts: 111
Joined: Fri Jun 14, 2002 5:50 pm

Post by Galahad »

Ok, here's my latest pattern:

Code:

$pattern = "/(?Ui)href\s*=\s*[\"'](?!javascript:)(.*)\.(php|html)[\"']/";
I had to switch from matching everything that wasn't a .css file because it was picking up executables and zipped files, too. Matching only php or html files is much better; I should have done that from the start.

Anyway, I still haven't been able to figure out the grouping thing. Here's how I've tried that:

Code:

$pattern = "/(?Ui)href\s*=\s*([\"'])(?!javascript:)(.*)\.(php|html)\1/";
When I try this with either \1 or \2, it fails to match anything. You can read about back references here. If anyone can make it work or tell me what I'm doing wrong let me know, I would greatly appreciate it. Thanks.
Galahad
Forum Contributor
Posts: 111
Joined: Fri Jun 14, 2002 5:50 pm

Post by Galahad »

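Ok, I posted to another forum and someone found my error. Because the pattern is written in a double-quoted string, PHP reads "\1" as an octal escape (the character with code 1) before the regex engine ever sees it, so I needed "\\1", not just "\1". The same applies whether the string is stored in a variable or passed to preg_match_all directly; only a single-quoted string would have left \1 alone. Here's the code that works: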

Code:

$pattern = "/(?Ui)href\s*=\s*([\"'])(?!javascript:)(.*)\.(php|html)\\1/";
I just thought some of you might like to know what was going on. I hope this is helpful.
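
For anyone who wants to check the escaping issue themselves, here is a small sketch. It first compares the three ways of writing the back reference in a PHP string, then runs the working pattern over a made-up line. The variables $wrong, $right, $single, $line and $urls are hypothetical names chosen for illustration.

```php
<?php
// How PHP itself reads the back reference before PCRE gets the string.
$wrong  = "\1";   // double quotes: octal escape, a single character with code 1
$right  = "\\1";  // double quotes: a backslash followed by the digit 1
$single = '\1';   // single quotes: the backslash is left alone

// The working pattern from the post above. Group 1 is the opening quote,
// and \\1 requires the same quote character to close the value.
$pattern = "/(?Ui)href\s*=\s*([\"'])(?!javascript:)(.*)\.(php|html)\\1/";

// Hypothetical test line (not from the thread) mixing case and quote styles.
$line = '<a HREF="docs/index.html"><img src="image.gif"></a> <a href=\'a.php\'>x</a>';

preg_match_all($pattern, $line, $temp);

// Group 2 is the path without its extension, group 3 is the extension.
$urls = [];
foreach ($temp[2] as $i => $path) {
    $urls[] = $path . '.' . $temp[3][$i];
}

print_r($urls); // docs/index.html and a.php
```

Writing the pattern in single quotes avoids the doubled backslash, at the cost of having to escape the single quote inside the character class instead.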