How to avoid manual upload and grab contents automatically
Posted: Wed Dec 13, 2006 8:24 am
by crazytopu
Hi,
Just wondering if anyone has implemented this or knows the mechanism. I want to understand the technology behind how sites like Google automatically grab content from different sources (an excerpt of the content, e.g. news headlines, plus a link to its source).
Is there any manual upload involved, or is everything taken care of by code that automates the process?
Is it possible to use PHP to do that? Can someone point me to a tutorial or article on this?
Cheers,
Posted: Wed Dec 13, 2006 8:31 am
by feyd
Google's bots fetch the information from the news services under contract if memory serves.
Yes, PHP can do it, but it's not exactly engineered to be an internet search engine. If you want this sort of functionality, I would suggest seeking RSS feeds of the information you wish to load. Make sure you have their permission to use it. If you cannot find such feeds I would suggest using a compiled tool specifically engineered for this. There are many out there. One of the more basic is wget, which is overall just a really basic command-line spider.
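For illustration, here is a minimal sketch of the RSS route, written in Python rather than PHP (the idea carries over). It pulls each item's headline, link and excerpt out of an RSS 2.0 document. The sample feed, its example.com URLs and the function name parse_rss_items are made up for the example; a real feed would be downloaded from the publisher's feed URL, with their permission.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <description>Sample feed for illustration</description>
    <item>
      <title>First headline</title>
      <link>http://example.com/first</link>
      <description>Short excerpt of the first story.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/second</link>
      <description>Short excerpt of the second story.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of dicts (title, link, description), one per <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall("./channel/item"):
        items.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_FEED):
        print(entry["title"], "->", entry["link"])
```

Each returned dict has what a headline listing needs: a title to display, a link back to the source, and a short excerpt.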
Posted: Thu Dec 14, 2006 4:56 am
by ml01172
Search the Internet for web spiders, web bots, the robots.txt file, and probably socket functions.
To my knowledge, there is no such thing as "automatic upload" in what you're describing. Rather, a program cruises through sites, parses their contents, writes the data it finds to a database, and then recursively does the same for every link it found.
Of course, this is just the basic principle. Following links recursively can't be done as a simple loop, so some form of forking processes is needed, but that also tends to get out of hand, since the number of links grows exponentially. A great deal of mathematics and optimization is needed. If you're doing it as an exercise, then OK; however, if you're trying to make another search engine of your own that would be popular and useful, maybe you should give it a second thought?
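A rough sketch of that basic principle in Python, under some loud assumptions: it swaps in a plain queue and a "seen" set in place of recursion or forked processes, it caps the number of pages to keep the growth in check, and it "fetches" pages from a small made-up in-memory site (FAKE_WEB) instead of the network. The names crawl, LinkExtractor and FAKE_WEB are invented for the example, and a real crawler would also need to honour robots.txt and store results in a database.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen = {start_url}          # URLs already queued, so none is visited twice
    queue = deque([start_url])  # URLs waiting to be fetched
    store = {}                  # stands in for the database: url -> html
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:        # page missing or unreachable: skip it
            continue
        store[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return store


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.com/b": '<a href="/missing">gone</a>',
}

if __name__ == "__main__":
    pages = crawl("http://example.com/", FAKE_WEB.get)
    print(list(pages))
```

The seen set is what stops pages that link to each other from being fetched forever, and max_pages is a crude brake on the exponential growth mentioned above.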

Posted: Thu Dec 14, 2006 9:16 am
by crazytopu
Thanks for your input. Well, I was not thinking of doing it merely as an exercise. I am after making a search engine! My plan is to make a Bangla (Bengali) search engine, and the portal I am going to build for this language will also display the latest news the way Google does. But I will limit it to Bangla sources only.
I am very new to this whole thing, but I will give it my best shot. I have bought the domain and built a simple holding page, and I am now doing as much research as possible. But I am so busy with uni and my job that I may end up spending years building this site. Any suggestions or help along this long journey would be greatly appreciated.
Visit this if you are a little curious:
http://www.kikorben.com
Posted: Thu Dec 14, 2006 10:52 am
by onion2k
Posted: Thu Dec 14, 2006 11:27 am
by crazytopu
I have been to the link you gave and at the end of the article it says:
Note: This page was posted for April Fool's Day - 2002
Is this what they really use or ........?
Posted: Thu Dec 14, 2006 12:41 pm
by onion2k
crazytopu wrote: Is this what they really use or ........?
Of course they don't really use pigeons. That would be completely ridiculous.
They use puffins.
Posted: Thu Dec 14, 2006 1:41 pm
by John Cartwright
crazytopu wrote: I have been to the link you gave and at the end of the article it says:
Note: This page was posted for April Fool's Day - 2002
Is this what they really use or ........?
I hope you're joking.

RSS feed
Posted: Thu Dec 21, 2006 10:26 am
by crazytopu
I know this is not an RSS feed forum, but since it's relevant to my original post I thought I would discuss it here.
Someone here pointed me to RSS feeds, so I gave it a go to understand what RSS can do. Apologies if this sounds very stupid.
I have created this RSS file with just one item:
http://www.kikorben.com/rss2.htm
When you click the item's link it takes you here
http://www.kikorben.com/test.html
When I created the RSS file, the source content read:
nhspurchasing website is going to be changed soon.
After I changed the source file, the content now reads:
okay, it has already been changed.
But the change is not reflected here:
http://www.kikorben.com/rss2.htm
So, if RSS is not meant to reflect changes made to the source file, what do we use RSS for, then? Just for standard formatting? Isn't plain HTML enough for that? All it seems to do is publish some headlines, the corresponding links to the source, and a summary of the description.
1. I know I am misunderstanding something somewhere, but what is it?
2. How can I simply achieve what I tried to achieve?
Thx
Posted: Thu Dec 21, 2006 10:37 am
by onion2k
You've completely missed the point of RSS. It's for syndicating data (articles usually) between websites. It has absolutely nothing to do with pages being updated, or page formatting. It's designed to allow people to display links to the latest articles from a website on their own site, or in a reader.
http://en.wikipedia.org/wiki/RSS_%28file_format%29
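On question 2: a feed file is just another document, so it only changes when something rewrites it. One possible approach, sketched here in Python on the assumption that the feed is produced by a script rather than edited by hand, is to rebuild the feed from the current content whenever the source changes. The function name build_rss and the channel title and description are invented for the example; the item text and link are the ones quoted above.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, channel_description, items):
    """Build an RSS 2.0 document (as a string) from a list of item dicts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_description
    for entry in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry["title"]
        ET.SubElement(item, "link").text = entry["link"]
        ET.SubElement(item, "description").text = entry["description"]
    return ET.tostring(rss, encoding="unicode")


def make_feed(current_text):
    """Regenerate the one-item feed from whatever the source says right now."""
    return build_rss(
        "Test feed",                   # hypothetical channel title
        "http://www.kikorben.com/",
        "One-item demonstration feed",  # hypothetical channel description
        [{
            "title": "Test item",      # hypothetical item title
            "link": "http://www.kikorben.com/test.html",
            "description": current_text,
        }],
    )


if __name__ == "__main__":
    before = make_feed("nhspurchasing website is going to be changed soon.")
    after = make_feed("okay, it has already been changed.")
    print(after)
```

Run again after the source changes, the script produces a feed carrying the new text; a hand-written feed file stays as it was until someone edits it.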