pages ending in htm and html
Posted: Sat Sep 03, 2005 8:34 pm
by jaymoore_299
I'm using this pattern
http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?
to find all web pages that end in htm or html. What kinds of valid URLs ending in htm or html would this pattern miss?
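One way to probe the question is to run the pattern against a handful of URLs and see which ones it fails on. The sketch below uses Python's `re` module; the candidate URLs are made-up examples, not taken from any real page, and the list is illustrative rather than exhaustive.

```python
import re

# The pattern exactly as posted; spaces inside [...] are literal characters.
PATTERN = re.compile(r"http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?")

# Hypothetical but valid-looking URLs that end in htm or html.
candidates = [
    "https://example.com/index.html",       # 's' after "http" is not ':' or '/'
    "http://example.com:8080/index.html",   # ':' of the port is not in the class
    "http://example.com/~user/index.html",  # '~' is not in the class
    "http://example.com/my%20page.html",    # '%' is not in the class
    "HTTP://EXAMPLE.COM/INDEX.HTML",        # the pattern is case-sensitive
]

# Collect every candidate in which the pattern finds nothing at all.
missed = [url for url in candidates if PATTERN.search(url) is None]

for url in missed:
    print("missed:", url)
```

Under these assumptions every candidate is missed, which suggests the character class is too narrow for real-world URLs and that https and uppercase links need separate handling.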
Posted: Sat Sep 03, 2005 8:49 pm
by feyd
You have errors in your regex pattern. It'll match these:
http::::asdf.html
http://html
http://google.com/htm/mark/bork (incorrectly matches "http://google.com/htm")
http://google.com htm/mark/bork (incorrectly matches "http://google.com htm")
along with all URLs that have query strings or hashes.
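These matches can be checked with a short script. This is a minimal sketch using Python's `re` module; the last sample, with a query string and hash, is an invented example added to illustrate that case, and `matched_text` is just a helper name chosen here.

```python
import re

# The pattern exactly as posted in the first message.
PATTERN = re.compile(r"http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?")

samples = [
    "http::::asdf.html",
    "http://html",
    "http://google.com/htm/mark/bork",
    "http://google.com htm/mark/bork",
    "http://google.com/page.html?id=1#top",  # hypothetical query-string case
]

def matched_text(text):
    """Return the substring the pattern matches, or None if there is no match."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

for s in samples:
    print(repr(s), "->", repr(matched_text(s)))
```

The two google samples show the real problem: the pattern stops as soon as it has seen "htm", so only the leading part of the URL is captured.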
Posted: Sat Sep 03, 2005 9:09 pm
by jaymoore_299
I'm using this to extract links from a page, so most of them will most likely be valid. But for the second one, if the user named one of his directories htm or html, "http://google.com/htm/mark/bork" would still be a valid URL.
How would I exclude a space or a /?
I tried this
http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?^/
but this doesn't seem to match even this
http://www.yahoo.com/index.htm
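A likely reason for the failure: outside a character class, `^` does not mean "not". It is an anchor for the start of the text, so it can never succeed after "http..." has already been consumed. One possible way to say "not followed by" is a negative lookahead. The sketch below is in Python's `re` syntax and is only one option under stated assumptions: the `sketch` pattern drops the space from the class and requires a literal "." before htm, both of which are choices made here, not part of the original pattern.

```python
import re

# The attempt from the post: '^' outside [...] is a start-of-text anchor,
# so this pattern cannot match anything.
attempt = re.compile(r"http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?^/")

# Hypothetical alternative: no space in the class, a literal "." before
# htm/html, and a negative lookahead rejecting any following URL character.
sketch = re.compile(r"http://[A-Za-z0-9_./\\-]+\.html?(?![A-Za-z0-9_./\\-])")

def find(pattern, text):
    """Return the matched substring, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None

tests = [
    "http://www.yahoo.com/index.htm",
    "http://google.com/htm/mark/bork",
    "http://google.com htm/mark/bork",
]

for t in tests:
    print(repr(t), "| attempt:", find(attempt, t), "| sketch:", find(sketch, t))
```

With this sketch the yahoo URL is matched in full, while the two google strings are not matched at all, since neither contains ".htm" at the end of a run of URL characters.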
Posted: Sat Sep 03, 2005 9:32 pm
by feyd
You didn't understand what I said about the google ones: only the part up to /htm is matched. The full URL is not matched.
Are you trying to extract all possible URLs, or only simple ones? I posted
this a while ago, which may help you, although it may be overkill as well.
