pages ending in htm and html

Posted: Sat Sep 03, 2005 8:34 pm
by jaymoore_299
I'm using this pattern

http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?

to find all webpages that end with htm or html. What kinds of valid URLs ending in htm or html would this pattern miss?

Posted: Sat Sep 03, 2005 8:49 pm
by feyd
You have errors in your regex pattern. It'll match these:

Code:

http::::asdf.html
http://html
http://google.com/htm/mark/bork            (incorrectly matches "http://google.com/htm")
http://google.com htm/mark/bork            (incorrectly matches "http://google.com htm")
along with all URLs that have query strings or hashes.
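
The matches listed above can be reproduced with a short script. This is only a sketch in Python (the thread does not say which regex engine is in use, so the choice of Python's re module is an assumption), and the example.com URL with a query string is a made-up sample, not one from the thread.

```python
import re

# The pattern exactly as posted, including the spaces inside the character class.
ORIGINAL = re.compile(r"http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?")


def matched_text(text):
    """Return the substring the posted pattern matches, or None if it does not match."""
    m = ORIGINAL.search(text)
    return m.group(0) if m else None


samples = [
    "http::::asdf.html",
    "http://html",
    "http://google.com/htm/mark/bork",
    "http://google.com htm/mark/bork",
    "http://example.com/page.html?id=1",  # hypothetical URL with a query string
]

for s in samples:
    print(repr(s), "->", repr(matched_text(s)))
```

For the two google.com strings the reported match stops right after "htm", and for the query-string sample it stops before the "?", because the pattern is not anchored to the end of the URL.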

Posted: Sat Sep 03, 2005 9:09 pm
by jaymoore_299
I'm using this to extract links from a page, so most of them will most likely be valid. But for the third one, if the user named one of his directories htm or html, "http://google.com/htm/mark/bork" would still be a valid URL.

How would I exclude a space or a / ?

I tried this
http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?^/

but this doesn't seem to match even http://www.yahoo.com/index.htm
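
A likely reason the attempt above never matches: outside of square brackets, ^ is a start-of-string anchor rather than "not", so "^/" placed after htm(l)? asks for the start of the input in the middle of the match. One possible way to say "not followed by a / or another URL character" is a negative lookahead. The sketch below is in Python (an assumption, since the thread does not name the engine), and the names ATTEMPT, ALLOWED, TIGHTER and tighter_match are hypothetical. It also requires a literal "://" and a dot before htm, which goes beyond what was asked.

```python
import re

# The second attempt as posted: "^" outside [...] anchors to the start of the string.
ATTEMPT = re.compile(r"http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?^/")

# Hypothetical tightened variant: no space in the class, a literal "://",
# a literal ".htm" or ".html", and a lookahead that rejects the match when
# another allowed URL character follows.
ALLOWED = r"A-Za-z0-9_./\\-"
TIGHTER = re.compile(r"http://[" + ALLOWED + r"]+\.html?(?![" + ALLOWED + r"])")


def tighter_match(text):
    """Return the substring matched by the tightened pattern, or None."""
    m = TIGHTER.search(text)
    return m.group(0) if m else None


print(ATTEMPT.search("http://www.yahoo.com/index.htm"))    # None
print(tighter_match("http://www.yahoo.com/index.htm"))     # the whole URL
print(tighter_match("http://google.com/htm/mark/bork"))    # None
print(tighter_match("http://google.com htm/mark/bork"))    # None
```

A negated class such as [^ /] would not do the same job, since it needs an actual character to follow and so fails when the URL is the last thing in the input. The lookahead consumes nothing and also succeeds at the end of the string.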

Posted: Sat Sep 03, 2005 9:32 pm
by feyd
You didn't understand what I said about the google ones. Only the part up to /htm is matched; the full URL is not.

Are you trying to extract all possible URLs, or only ones that are being used simply? I posted this a while ago, which may help you, although it may be overkill as well. :)