How to avoid manual upload and grab contents automatically

crazytopu
Forum Contributor
Posts: 259
Joined: Fri Nov 07, 2003 12:43 pm
Location: London, UK

Post by crazytopu »

Hi,

Just wondering if anyone has implemented this or knows the mechanism. I want to understand the technology behind how sites like Google automatically grab content from different sources (an excerpt of the content, e.g. news headlines, plus a link to the source).

Is there any manual upload involved, or is everything taken care of by code that automates the process?

Is it possible to use PHP to do that? Can someone point me to a tutorial or article on this?

Cheers,
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

If memory serves, Google's bots fetch the information from news services under contract.

Yes, PHP can do it, but it's not exactly engineered to be an internet search engine. If you want this sort of functionality, I would suggest seeking RSS feeds of the information you wish to load. Make sure you have the publishers' permission to use it. If you cannot find such feeds, I would suggest using a compiled tool specifically engineered for this. There are many out there. One of the more basic ones is wget, which is essentially a simple command-line spider.
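
To make the RSS suggestion concrete, here is a minimal sketch of pulling headlines, links and excerpts out of an RSS 2.0 feed. It is written in Python rather than PHP purely for illustration, and the feed contents and the name extract_headlines are made up for the example. A real script would first download the feed text from the publisher's URL (with permission) instead of using a hard-coded string.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed contents, invented for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <description>Sample feed</description>
    <item>
      <title>First headline</title>
      <link>http://example.com/news/1</link>
      <description>Short excerpt of the first story.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/news/2</link>
      <description>Short excerpt of the second story.</description>
    </item>
  </channel>
</rss>"""


def extract_headlines(rss_xml):
    """Return one (title, link, excerpt) tuple per <item> in the feed."""
    root = ET.fromstring(rss_xml)
    results = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        excerpt = item.findtext("description", default="").strip()
        results.append((title, link, excerpt))
    return results


if __name__ == "__main__":
    for title, link, excerpt in extract_headlines(SAMPLE_FEED):
        print(title, "|", link, "|", excerpt)
```

The point is that no manual upload is involved: the publisher exposes a structured list of items, and the aggregating site reads it on a schedule.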
ml01172
Forum Newbie
Posts: 4
Joined: Mon Dec 11, 2006 3:55 am

Post by ml01172 »

Search the Internet for web spiders, web bots, the robots.txt file, and probably socket functions.

To my knowledge, there is no such thing as "automatic upload" in what you're describing. Instead, a program cruises through sites, parses their contents, and does the following: it writes the data it finds to a database, then recursively does the same for every link found in the page. Of course, this is just the basic principle. Following links one at a time is too slow at any real scale, so some form of forked processes is needed, and even that gets out of hand quickly because the number of links grows exponentially. A great deal of mathematics and optimization is needed. If you're doing it as an exercise, then OK; however, if you're trying to make a search engine of your own that would be popular and useful, maybe you should give it a second thought? :)
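
The basic principle above (fetch a page, store it, follow its links) can be sketched roughly as follows. This is an illustration in Python, not a production crawler: the names crawl, LinkExtractor and FAKE_WEB are invented for the example, the "database" is just a dictionary, and the fetch function is passed in so the sketch runs against a tiny fake web instead of the real Internet. It uses a queue and a set of already-seen URLs rather than literal recursion, which is one common way to keep the crawl bounded.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: html} for the pages fetched.

    fetch(url) must return the page HTML, or None if it is unavailable.
    """
    seen = {start_url}           # URLs already queued, to avoid loops
    queue = deque([start_url])   # URLs waiting to be fetched
    pages = {}                   # stand-in for the database
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# A three-page fake web, invented for illustration only.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "http://example.com/b": '<a href="/missing">gone</a>',
}

if __name__ == "__main__":
    for url in crawl("http://example.com/", FAKE_WEB.get):
        print(url)
```

A crawler pointed at real sites would also need to honor each site's robots.txt, limit its request rate, and handle network errors, none of which is shown here.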
crazytopu
Forum Contributor
Posts: 259
Joined: Fri Nov 07, 2003 12:43 pm
Location: London, UK

Post by crazytopu »

Thanks for your input. Well, I was not thinking of doing it merely as an exercise. I am after making a search engine! My plan is to make a Bangla (Bengali) search engine, and the portal I am going to build for this language will also display the latest news the way Google does. But I will limit it to Bangla sources only.

I am very new to this whole thing, but I will give it my best shot. I have bought the domain and built a simple holding page, and I am now doing as much research as possible. But I am so busy with uni and my job that I may end up taking years to build this site. Any suggestions or help along this long journey would be greatly appreciated.

Visit this if you are a little curious: http://www.kikorben.com
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Post by onion2k »

crazytopu
Forum Contributor
Posts: 259
Joined: Fri Nov 07, 2003 12:43 pm
Location: London, UK

Post by crazytopu »

I have been to the link you gave, and at the end of the article it says:

Note: This page was posted for April Fool's Day - 2002
Is this what they really use or ........?
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Post by onion2k »

crazytopu wrote: Is this what they really use or ........?
Of course they don't really use pigeons. That would be completely ridiculous.

They use puffins.
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

crazytopu wrote: I have been to the link you gave, and at the end of the article it says:

Note: This page was posted for April Fool's Day - 2002
Is this what they really use or ........?
I hope you're joking :roll:
crazytopu
Forum Contributor
Posts: 259
Joined: Fri Nov 07, 2003 12:43 pm
Location: London, UK

RSS feed

Post by crazytopu »

I know this is not an RSS feed forum, but since it's relevant to my original post I thought I would discuss it here.

Someone here pointed me to RSS feeds, so I gave it a go to understand what RSS can do. Apologies if this sounds very stupid.


I have created this RSS file with just one item.

http://www.kikorben.com/rss2.htm

When you click the item's link, it takes you here:

http://www.kikorben.com/test.html

When I created the RSS file, the original content read "nhspurchasing website is going to be changed soon." Since I changed the source file, the content now reads "okay, it has already been changed." The change is not reflected here: http://www.kikorben.com/rss2.htm

So, if RSS is not meant to reflect changes when the source file changes, what do we use RSS for? Just for standard formatting? Isn't plain HTML enough for that? All it seems to do is publish some headlines, the corresponding links to the source, and a summary description.

1. I know I am misunderstanding something somewhere, but what is it?

2. How can I achieve what I was trying to achieve?


Thx
onion2k
Jedi Mod
Posts: 5263
Joined: Tue Dec 21, 2004 5:03 pm
Location: usrlab.com

Post by onion2k »

You've completely missed the point of RSS. It's for syndicating data (articles usually) between websites. It has absolutely nothing to do with pages being updated, or page formatting. It's designed to allow people to display links to the latest articles from a website on their own site, or in a reader.

http://en.wikipedia.org/wiki/RSS_%28file_format%29
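
As a rough illustration of why a hand-written feed file does not follow changes to the page it links to: the feed is a separate document, and each item carries its own copy of the title, link and summary. Sites usually regenerate the feed from their list of articles whenever something is published. The sketch below is in Python for illustration only; build_rss and the channel details are invented for the example, while the item link and summaries are the ones quoted earlier in this thread.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, channel_description, articles):
    """Return an RSS 2.0 document (as a string) with one <item> per article."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_description
    for article in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = article["title"]
        ET.SubElement(item, "link").text = article["link"]
        # The summary is copied into the feed at generation time.
        ET.SubElement(item, "description").text = article["summary"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical article list; in practice this would come from a database.
articles = [
    {
        "title": "Test item",
        "link": "http://www.kikorben.com/test.html",
        "summary": "nhspurchasing website is going to be changed soon.",
    }
]

if __name__ == "__main__":
    first_feed = build_rss("Test feed", "http://www.kikorben.com/", "Demo", articles)
    # The source changes, but the feed generated earlier still has the old text.
    articles[0]["summary"] = "okay, it has already been changed."
    second_feed = build_rss("Test feed", "http://www.kikorben.com/", "Demo", articles)
    print("changed soon" in first_feed, "already been changed" in second_feed)
```

The old feed string keeps the old summary until the feed is generated again, which matches what happened with the static rss2.htm file.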