How do search engines work?
Moderator: General Moderators
I wonder where search engines get their list of URLs to crawl. Do they start from a local database, or are they able to crawl the whole web by themselves? Is the source code of any search engine available to the public?
Thanks.
A.
- Kieran Huggins
- DevNet Master
- Posts: 3635
- Joined: Wed Dec 06, 2006 4:14 pm
- Location: Toronto, Canada
- Contact:
Google is a particularly fascinating system; they actually model their crawling algorithm on natural phenomena. They use a system not unlike "swarm" or "flock intelligence" to crawl and rank web pages. They did a write-up on it here: http://www.google.com/technology/pigeonrank.html
If you search around (i.e. "Google it"), Google has published lots of papers and "tech talks" about their technologies.
- deadoralive
- Forum Commoner
- Posts: 28
- Joined: Tue Nov 06, 2007 1:24 pm
Ha ha ha
Kieran Huggins wrote: Google is a particularly fascinating system; they actually model their crawling algorithm on natural phenomena. They use a system not unlike "swarm" or "flock intelligence" to crawl and rank web pages. They did a write-up on it here: http://www.google.com/technology/pigeonrank.html
If you search around (i.e. "Google it"), Google has published lots of papers and "tech talks" about their technologies.
Never read such a stupid paper.
Kieran Huggins wrote: http://www.google.com/technology/pigeonrank.html
Oh yeah, I have my pigeons code for me all the time... :-"
Kieran Huggins wrote: http://www.google.com/technology/pigeonrank.html
- Jonah Bron
- DevNet Master
- Posts: 2764
- Joined: Thu Mar 15, 2007 6:28 pm
- Location: Redding, California
Check out this wikipedia page for some crawler examples. http://en.wikipedia.org/wiki/Category:Free_web_crawlers
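To answer the original question concretely: crawlers like the ones on that page typically start from a configured seed list of URLs and discover everything else by following links. Here's a minimal sketch of that idea in Java; the class and method names are my own invention for illustration, not from any of the listed crawlers.

```java
import java.util.*;

// A minimal sketch of a crawl frontier: start from seed URLs,
// and as pages are fetched, enqueue any newly discovered links.
// A visited set keeps the crawler from fetching the same URL twice.
public class CrawlFrontier {
    private final Deque<String> frontier = new ArrayDeque<>();
    private final Set<String> visited = new HashSet<>();

    public CrawlFrontier(List<String> seeds) {
        for (String url : seeds) enqueue(url);
    }

    // Adds a URL only if it has never been seen before.
    public void enqueue(String url) {
        if (visited.add(url)) {
            frontier.addLast(url);
        }
    }

    // Returns the next URL to fetch, or null when the frontier is empty.
    public String next() {
        return frontier.pollFirst();
    }

    public boolean isEmpty() {
        return frontier.isEmpty();
    }
}
```

A real crawler would pull each URL from the frontier, fetch and parse the page, then enqueue the extracted links, so a handful of seeds can grow into a crawl of millions of pages.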
If you're comfortable with Java, then Nutch has quite a well-developed, PageRank-oriented crawler implementation. It's quite confusing to follow the code, though, as it uses Hadoop, Apache's implementation of Google's map-reduce distribution method. It can make your head spin.
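The map-reduce idea itself is simpler than Hadoop makes it look: a "map" step emits (key, value) pairs from each input record, and a "reduce" step combines all values sharing a key. Here's a toy, single-machine illustration of that pattern, counting inbound links per URL (the kind of aggregation a PageRank-style crawler needs). The "source -> target" input format and method names are assumptions for this sketch, not Hadoop's API.

```java
import java.util.*;
import java.util.stream.*;

// Toy map-reduce: count how many links point at each URL.
public class LinkCount {
    // "map" emits the target of each edge; "reduce" sums the
    // occurrences per target via groupingBy + counting.
    public static Map<String, Long> countInlinks(List<String> edges) {
        return edges.stream()
                .map(e -> e.split(" -> ")[1])            // map step
                .collect(Collectors.groupingBy(
                        t -> t, Collectors.counting()));  // reduce step
    }
}
```

Hadoop does the same thing, except the map and reduce steps each run in parallel across a cluster, with a shuffle phase in between that routes all pairs with the same key to the same reducer.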