HTTP search class

alex.barylski
DevNet Evangelist
Posts: 6267
Joined: Tue Dec 21, 2004 5:00 pm
Location: Winnipeg

Post by alex.barylski »

Need a class which returns a list of URLs, alongside relevance, etc...

The catch is...I need it to work like a "real" search engine...starting at a specified directory or file and crawling the whole site, excluding files & directories specified in a config file...

And no, Google API won't work...

Also needs to be implemented in strictly PHP...and can't rely on MySQL FULLTEXT search...

PHPDig doesn't sound like a very good option...for one...cuz it returns a formatted list of results...as opposed to a generic array which I can then format myself....and it seems to rely on MySQL heavily...I dunno what they mean by flat file support....cuz I couldn't find anything in the docs about how to use that instead of SQL...

In any case...am I dreaming??? Or is something available??? :)

Cheers :)
josh
DevNet Master
Posts: 4872
Joined: Wed Feb 11, 2004 3:23 pm
Location: Palm beach, Florida

Post by josh »

You will need some sort of database with indexing if you don't want horrible performance. If ranking documents by relevance based on keyword occurrence is what you want, it shouldn't be too hard.

Just set up a table that has the id of the page, the keyword, and the number of occurrences. When you need to search, you just select the distinct pages along with the number of occurrences... You can then run another query to grab the content of the pages it fetched so you can show a one-paragraph excerpt of each page.
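
A rough sketch of that table idea, in case it helps. It's in Python with an in-memory SQLite database purely to keep it short and runnable; the same table and query should port to PHP with whatever storage you settle on. The names `keyword_counts`, `build_index` and `search` are made up for the example, and the tokenizer (lowercase runs of letters and digits) is an assumption, not something the approach requires.

```python
import re
import sqlite3
from collections import Counter


def build_index(pages):
    """Build the keyword table from a dict of page_id -> page text.

    Returns an in-memory SQLite connection (hypothetical storage choice).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE keyword_counts ("
        " page_id TEXT, keyword TEXT, occurrences INTEGER,"
        " PRIMARY KEY (page_id, keyword))"
    )
    # Index on keyword so lookups don't scan the whole table.
    conn.execute("CREATE INDEX idx_keyword ON keyword_counts (keyword)")
    for page_id, text in pages.items():
        # Assumed tokenizer: lowercase alphanumeric runs.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        conn.executemany(
            "INSERT INTO keyword_counts VALUES (?, ?, ?)",
            [(page_id, word, n) for word, n in counts.items()],
        )
    return conn


def search(conn, query):
    """Return (page_id, score) rows, highest total occurrences first."""
    words = sorted(set(re.findall(r"[a-z0-9]+", query.lower())))
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    rows = conn.execute(
        "SELECT page_id, SUM(occurrences) AS score FROM keyword_counts"
        f" WHERE keyword IN ({placeholders})"
        " GROUP BY page_id ORDER BY score DESC, page_id ASC",
        words,
    )
    return rows.fetchall()
```

The result is a plain list of (page, score) pairs rather than formatted output, so the caller can lay it out however they like. Fetching an excerpt would be a second lookup against wherever the page text is stored.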

Do you need help with the crawler, the script that would index it or the search script?


Also, can you use MySQL boolean searches? Out of curiosity, why is FULLTEXT ruled out?