
How do search engines work?

Posted: Thu Nov 08, 2007 3:06 pm
by alxkn
I wonder where search engines get their list of URLs to crawl. From a local database, or do they have the ability to crawl the whole web by themselves? Is the source code of a search engine available to the public?

Thanks.
A.

Posted: Thu Nov 08, 2007 5:39 pm
by feyd
Both. Generally, no, the source code isn't public, but the concepts behind them often come from scientific papers.
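To illustrate the "both" part: a crawler typically starts from a seed list of URLs and keeps extending its frontier with links discovered on the pages it fetches. Here's a toy sketch in Python, with a hardcoded link graph standing in for the web (the URLs and the `crawl` function are made up for illustration, not from any real crawler):

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> outgoing links.
LINKS = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/", "http://d.example/"],
    "http://d.example/": [],
}

def crawl(seeds):
    """Breadth-first crawl: start from seed URLs, enqueue newly discovered links."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # avoid re-crawling the same URL
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in LINKS.get(url, []):  # links extracted from the fetched page
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl(["http://a.example/"]))
# Visits a, then the pages it links to, and so on until the frontier is empty.
```

A real crawler replaces the dictionary lookup with an HTTP fetch and HTML link extraction, and adds politeness (robots.txt, rate limits) and prioritisation, but the seed-plus-frontier loop is the core idea.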

Posted: Thu Nov 08, 2007 7:06 pm
by alxkn
Scientific papers are not a problem for me. Do you know what kinds of papers they are and in what journals they are published?

Thanks in advance.
A.

Posted: Fri Nov 09, 2007 1:44 am
by Kieran Huggins
Google is a particularly fascinating system; they actually model their crawling algorithm on natural phenomena. They use a system not unlike "swarm" or "flock intelligence" to crawl and rank web pages. They did a write-up on it here: http://www.google.com/technology/pigeonrank.html

If you search around (i.e. "Google it"), Google has published lots of papers and "tech talks" about their technologies.

Posted: Fri Nov 09, 2007 4:07 am
by deadoralive
Kieran Huggins wrote:Google is a particularly fascinating system; they actually model their crawling algorithm on natural phenomena. They use a system not unlike "swarm" or "flock intelligence" to crawl and rank web pages. They did a write-up on it here: http://www.google.com/technology/pigeonrank.html

If you search around (i.e. "Google it"), Google has published lots of papers and "tech talks" about their technologies.
Ha ha ha :-) Gotta love april fools

Posted: Fri Nov 09, 2007 1:05 pm
by alxkn
I've never read such a stupid paper. :lol:

Posted: Fri Nov 09, 2007 5:29 pm
by JellyFish
Oh yeah, I have my pigeons code for me all the time... :-"

Posted: Fri Nov 09, 2007 6:03 pm
by Jonah Bron
Mine makes a great latte. :wink:

Posted: Mon Nov 19, 2007 4:15 am
by bubblenut
Check out this Wikipedia page for some crawler examples: http://en.wikipedia.org/wiki/Category:Free_web_crawlers

If you're comfortable with Java, then Nutch has quite a well-developed, PageRank-oriented crawler implementation. The code is quite confusing to follow, though, as it uses Hadoop, Apache's implementation of Google's MapReduce distributed-computation model. It makes your head go 8O
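The MapReduce model itself is simpler than Hadoop's plumbing makes it look. Here's a minimal single-machine sketch of the three phases (map, shuffle, reduce) using word counting as the example; the corpus and function names are made up for illustration, and a real Hadoop job runs each phase across many machines:

```python
from collections import defaultdict
from itertools import chain

# Toy corpus; in Hadoop this would be files split across a cluster.
DOCS = ["the crawler fetches pages", "the indexer ranks pages"]

def map_phase(doc):
    # Like a Hadoop mapper: emit a (word, 1) pair for every word.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Like the framework's shuffle step: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Like a Hadoop reducer: combine the grouped values per key.
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, DOCS))))
print(counts["pages"])  # appears once in each document -> 2
```

Once you see that a job is just a mapper plus a reducer with a group-by in between, Nutch's crawl and indexing steps become easier to read as a pipeline of such jobs.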