I'm writing a spider that crawls several thousand websites to find events (notably conferences) to populate a calendar. It's open source.
I can accurately identify whether a webpage refers to an event based on keywords: I fit a logistic regression on 100-150 terms (e.g. words), using 20,000 webpages.
Now my problem is that there are typically multiple webpages for each event. If there were only one event per domain, I could simply use the highest-scoring webpage per domain, but what if there are two or more events? How can I identify duplicates so that I can discard them?
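For the one-event-per-domain case, the selection itself is simple. A minimal sketch, in Python rather than my PHP just to keep it short; the function name and the (url, score) input format are made up for illustration, with the score standing in for the regression output:

```python
from urllib.parse import urlparse

def best_page_per_domain(pages):
    """pages: iterable of (url, score) pairs.
    Returns {domain: (url, score)} holding the top-scoring page per domain."""
    best = {}
    for url, score in pages:
        domain = urlparse(url).netloc.lower()
        # Keep this page only if it beats the best one seen so far for the domain.
        if domain not in best or score > best[domain][1]:
            best[domain] = (url, score)
    return best
```

This is exactly what breaks down when a domain hosts two or more events, since only one page per domain survives.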
I have some ideas, but I don't think they will work. I'm stumped!
Ideas
-similar_text() or levenshtein(): the problem is that a lot of the text is shared by all webpages on a site, due to the use of templates. Also, two webpages can be about the same event but have very different text (one webpage might trigger because it is about registration and another might trigger on a list of workshops).
-Finding the event name - hard, especially if it doesn't have the word "conference" in it.
-Shingling - http://en.wikipedia.org/wiki/W-shingling - might have the same problems as similar_text(). I'm also not sure how I'd find meaningful four-word terms. Google uses something like this to detect duplicate content.
-Event Date - hard to detect (especially day/month). There might be only a year or no date at all.
-Spidered Date - the date my spider found the webpage. If all of an event's pages were created around the same date, then I can assume that any webpages my spider finds later (say, several months on) are for a new event. This won't work for events that are listed on the same URL. So far this is my favorite idea.
-Subdirectories - sometimes the pages for an event share a subdirectory (e.g. domain.org/conference), but not reliably.
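To make the shingling idea concrete, here is a rough sketch (again Python rather than PHP, for brevity). It shows one way I could imagine working around the template problem: drop any shingle that appears on most pages of a domain before comparing pages with Jaccard similarity. The function names, the four-word window, and the 0.8 cutoff are guesses for illustration, not something I have tested on my data, and the cutoff only makes sense when a domain has more than a handful of pages.

```python
import re
from collections import Counter

def shingles(text, w=4):
    """Set of w-word shingles (tuples of consecutive lowercased words)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def strip_template_shingles(page_shingles, max_share=0.8):
    """page_shingles: {url: shingle set} for the pages of ONE domain.
    Drops shingles found on more than max_share of those pages,
    on the assumption that they come from the site template."""
    counts = Counter(s for sh in page_shingles.values() for s in sh)
    limit = max_share * len(page_shingles)
    return {url: {s for s in sh if counts[s] <= limit}
            for url, sh in page_shingles.items()}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Pairs of pages with a high Jaccard score after stripping would be candidates for the same event. This still does nothing for two pages about the same event that share almost no text, such as a registration page and a workshop list.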
Any suggestions?
Note: this is a general programming question; however, I am using PHP and didn't see anywhere else on this forum where I could post it.
Thanks!
Aaron
Identifying Similar Text - Similar Webpages