How do search engines work?
Posted: Thu Nov 08, 2007 3:06 pm
by alxkn
I wonder where search engines get the list of URLs to crawl? From a local database, or do they have the ability to crawl the whole web by themselves? Is the source code of a search engine available to the public?
Thanks.
A.
Posted: Thu Nov 08, 2007 5:39 pm
by feyd
Both. Generally, no. The concepts behind them are, often, from scientific papers.
Posted: Thu Nov 08, 2007 7:06 pm
by alxkn
Scientific papers are not a problem for me. Do you know what kind of papers they are, and in what journals they are published?
Thanks in advance.
A.
Posted: Fri Nov 09, 2007 1:44 am
by Kieran Huggins
Google is a particularly fascinating system: they actually model their crawling algorithm on natural phenomena, using a system not unlike "swarm" or "flock" intelligence to crawl and rank web pages. They did a write-up on it here:
http://www.google.com/technology/pigeonrank.html
If you search around (i.e. "Google it"), Google has published lots of papers and "tech talks" about their technologies.
Posted: Fri Nov 09, 2007 4:07 am
by deadoralive
Kieran Huggins wrote: Google is a particularly fascinating system: they actually model their crawling algorithm on natural phenomena, using a system not unlike "swarm" or "flock" intelligence to crawl and rank web pages. They did a write-up on it here:
http://www.google.com/technology/pigeonrank.html
If you search around (i.e. "Google it"), Google has published lots of papers and "tech talks" about their technologies.
Ha ha ha

Gotta love april fools
Posted: Fri Nov 09, 2007 1:05 pm
by alxkn
I've never read such a stupid paper.

Posted: Fri Nov 09, 2007 5:29 pm
by JellyFish
Oh yeah, I have my pigeons code for me all the time... :-"
Posted: Fri Nov 09, 2007 6:03 pm
by Jonah Bron
Mine makes a great latte.

Posted: Mon Nov 19, 2007 4:15 am
by bubblenut
Check out this wikipedia page for some crawler examples.
http://en.wikipedia.org/wiki/Category:Free_web_crawlers
If you're comfortable with Java, Nutch has quite a well-developed, PageRank-oriented crawler implementation. The code is quite confusing to follow, though, as it uses Hadoop, Apache's implementation of Google's MapReduce distribution method. It makes your head spin.
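The core idea behind any of those crawlers is simpler than the Nutch/Hadoop code makes it look: start from a seed list of URLs, keep a "frontier" queue of pages still to fetch, and enqueue each newly discovered outlink unless it has been seen before. Here's a minimal sketch of that loop (not Nutch's actual code, just the concept, with a made-up in-memory link graph standing in for real HTTP fetches):

```java
import java.util.*;

// Minimal crawler-frontier sketch: breadth-first crawl over a fake "web".
// The `web` map is a hypothetical stand-in for fetching a page and parsing
// its outlinks; a real crawler would do HTTP requests and HTML parsing here.
public class CrawlerSketch {
    static Map<String, List<String>> web = Map.of(
        "http://a.example", List.of("http://b.example", "http://c.example"),
        "http://b.example", List.of("http://c.example"),
        "http://c.example", List.of("http://a.example"));

    // Crawl starting from a seed list: poll a URL from the frontier,
    // "fetch" it, and enqueue any outlinks we haven't seen yet.
    static List<String> crawl(List<String> seeds) {
        Deque<String> frontier = new ArrayDeque<>(seeds);
        Set<String> seen = new LinkedHashSet<>(seeds);
        List<String> order = new ArrayList<>();
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            order.add(url);  // a real crawler fetches, parses, and indexes here
            for (String out : web.getOrDefault(url, List.of())) {
                if (seen.add(out)) {      // true only if not seen before
                    frontier.add(out);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(crawl(List.of("http://a.example")));
        // visits a, then its outlinks b and c, and stops: c's link back
        // to a is already in the seen set
    }
}
```

This answers the original question, too: the seed list comes from a local database (or a hand-picked list), and everything beyond it is discovered by the crawl itself. Nutch layers politeness delays, re-fetch scheduling, and the MapReduce distribution on top of this same loop.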
