I'm using this pattern
http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?
to find all web pages that end in htm or html. What types of valid URLs ending in htm or html would this pattern miss?
pages ending in htm and html
-
jaymoore_299
- Forum Contributor
- Posts: 128
- Joined: Wed May 11, 2005 6:40 pm
- feyd
- Neighborhood Spidermoddy
- Posts: 31559
- Joined: Mon Mar 29, 2004 3:24 pm
- Location: Bothell, Washington, USA
You have errors in your regex pattern. It'll match these, along with all URLs with query strings or hashes:
http::::asdf.html
http://html
http://google.com/htm/mark/bork (incorrectly matches "http://google.com/htm")
http://google.com htm/mark/bork (incorrectly matches "http://google.com htm")
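A quick way to check the cases above is to run the pattern against each string and print what it actually captures. This is only a sketch in Python (the thread does not say which regex engine is in use, so that is an assumption), and `matched_text` is a hypothetical helper name, not something from the original post.

```python
import re

# The pattern from the first post, copied verbatim. The spaces inside the
# second character class are literal, so a space counts as an allowed character.
PATTERN = re.compile(r"http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?")

SAMPLES = [
    "http::::asdf.html",
    "http://html",
    "http://google.com/htm/mark/bork",
    "http://google.com htm/mark/bork",
]


def matched_text(text):
    """Return the substring the pattern matches, or None if nothing matches."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


for sample in SAMPLES:
    print(repr(sample), "->", repr(matched_text(sample)))
```

The two google.com lines are the interesting ones: the match stops right after "htm", so the rest of the URL is silently dropped.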
jaymoore_299
I'm using this to extract links from a page, so most of them will most likely be valid. But for the second one, if the user named one of his directories htm or html, as in "http://google.com/htm/mark/bork", this would still be a valid URL.
How would I exclude a space or a / ?
I tried this
http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?^/
but this doesn't seem to match even this http://www.yahoo.com/index.htm
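One likely reason the attempt above fails: outside a character class, `^` is a start-of-string anchor rather than a negation, so `htm(l)?^/` asks for the start of the string to appear after the URL, which cannot happen. A possible alternative is a negative lookahead that rejects a match when more path characters follow. The sketch below is in Python and is only one way to do it; `URL_RE` and `find_pages` are hypothetical names, the pattern additionally assumes a dot before htm/html and drops the space from the character class, and it does not try to handle query strings or hashes.

```python
import re

# The attempted pattern from the post: '^' here is an anchor, not "not".
BROKEN = re.compile(r"http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?^/")

# Assumed replacement: no space in the class, a literal dot before htm/html,
# and a negative lookahead so the match is rejected if a letter, digit,
# underscore, slash or hyphen follows immediately.
URL_RE = re.compile(r"https?://[A-Za-z0-9_.\-/]+\.html?(?![A-Za-z0-9_/\-])")


def find_pages(text):
    """Return every substring of text that looks like a URL ending in .htm or .html."""
    return URL_RE.findall(text)


print(BROKEN.search("http://www.yahoo.com/index.htm"))
print(find_pages("http://www.yahoo.com/index.htm"))
print(find_pages("http://google.com/htm/mark/bork"))
print(find_pages('<a href="http://example.com/a/page.html">'))
```

With the lookahead in place, a directory named htm in the middle of a path no longer produces a truncated match; the URL is simply skipped because it does not end in .htm or .html.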
- feyd
You didn't understand what I said about the google ones: only the part up to /htm is matched. The full URL is not matched.
Are you trying to extract all possible URLs, or only the ones that are used in simple forms? I posted this a while ago, which may help you, although it may be overkill as well.