pages ending in htm and html

Any questions involving matching text strings to patterns. The pattern is called a "regular expression."

jaymoore_299
Forum Contributor
Posts: 128
Joined: Wed May 11, 2005 6:40 pm

Post by jaymoore_299 »

I'm using this pattern

http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?

to find all web pages that end with htm or html. What types of valid URLs ending in htm or html would this pattern miss?
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

You have errors in your regex pattern. It will match these:

http::::asdf.html
http://html
http://google.com/htm/mark/bork            (incorrectly matches "http://google.com/htm")
http://google.com htm/mark/bork            (incorrectly matches "http://google.com htm")
along with all URLs that have query strings or hashes.
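
To see these matches for yourself, here is a minimal sketch that runs the pattern exactly as posted. It assumes Python's re module, since the thread does not say which regex engine is in use; the helper name matched_text and the yahoo.com sample with a query string are only illustrative.

```python
import re

# The pattern exactly as posted. Spaces inside [...] are literal members
# of the character class, and [://] is just the set {':', '/'}.
posted = re.compile(r"http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?")

samples = [
    "http::::asdf.html",
    "http://html",
    "http://google.com/htm/mark/bork",
    "http://google.com htm/mark/bork",
    "http://www.yahoo.com/index.html?page=2#top",  # illustrative sample
]


def matched_text(pattern, text):
    """Return the substring the pattern matches, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


for s in samples:
    print(repr(s), "->", repr(matched_text(posted, s)))
```

The two google.com lines only match up to "htm", and the query string and hash on the last sample are cut off, because nothing in the pattern says what may or may not follow the final htm(l).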

Post by jaymoore_299 »

I'm using this to extract links from a page, so most of them will likely already be valid. But for the google.com one, if the user named one of his directories htm or html, as in "http://google.com/htm/mark/bork", this would still be a valid URL.

How would I exclude a space or a /?

I tried this:
http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?^/

but it doesn't match even http://www.yahoo.com/index.htm
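
A likely reason the attempt above matches nothing: outside a character class, ^ is a start-of-string anchor, not a negation, so "^/" can never succeed after the rest of the pattern has already consumed text. The sketch below, again assuming Python's re module, shows the failure and one possible alternative that uses a negative lookahead to refuse a match when more URL characters follow. The name stricter and its exact character class are assumptions made for illustration, not a complete URL grammar.

```python
import re

# The attempt as posted: '^' here is an anchor, so this cannot match.
attempt = re.compile(r"http[://]+[A-Za-z0-9 _ . / \- \\]+htm(l)?^/")

# Hypothetical alternative: require a literal dot before htm/html, drop the
# space from the class, and use (?!...) to reject matches that are followed
# by another URL character such as '/' or a letter.
stricter = re.compile(r"https?://[A-Za-z0-9_./\-]+\.html?(?![A-Za-z0-9_./\-])")

for url in [
    "http://www.yahoo.com/index.htm",
    "http://google.com/htm/mark/bork",
    "http://google.com htm/mark/bork",
]:
    m = stricter.search(url)
    print(repr(url), "->", repr(m.group(0)) if m else None)
```

With this sketch the yahoo.com URL matches in full and both google.com strings are rejected. A URL followed by a sentence-ending period would also be rejected, which may or may not be what you want.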

Post by feyd »

You didn't understand what I said about the google ones: only the part up to /htm is matched. The full URL is not matched.

Are you trying to extract all possible URLs, or only ones that are used in a simple way? I posted this a while ago, which may help you, although it may be overkill as well. :)
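
If the goal is only to pull links out of a page, one alternative to a single large regex is to let a parser find the href values and then test each URL's path. This is a rough sketch using Python's standard html.parser and urllib.parse modules, and it is not the approach from the earlier post mentioned above. The names LinkCollector and html_page_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def html_page_links(page_source):
    """Return http(s) links whose path ends in .htm or .html."""
    collector = LinkCollector()
    collector.feed(page_source)
    result = []
    for link in collector.links:
        parts = urlsplit(link)
        # parts.path excludes the query string and fragment, so
        # "a.html?x=1#top" still counts as ending in .html.
        if parts.scheme in ("http", "https") and parts.path.lower().endswith((".htm", ".html")):
            result.append(link)
    return result


page = (
    '<a href="http://www.yahoo.com/index.htm">one</a>'
    '<a href="http://google.com/htm/mark/bork">two</a>'
    '<a href="http://example.com/a.html?x=1#top">three</a>'
)
print(html_page_links(page))
```

Here the directory named htm is not mistaken for a file extension, and query strings and hashes no longer affect the check.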