
CRON JOBS: Synchronizing a thousand feeds

Posted: Wed Nov 30, 2005 5:18 am
by mdiazrub
Hi All,

I'm trying to implement a blog reader here in Spain (like bloglines.com). My problem is the process I use to fetch new posts from the different blog feeds (currently 1,400). The process runs as a cron job on the server.

The process takes more than twenty minutes to finish, and that time grows as the number of blogs grows.

The script does something like this:

- Go through the BLOGS table and, for each entry, fetch the feed XML from the original site.
- Parse each XML file to extract the news items inside.
- For each item, run a SELECT against MySQL to check whether the post is new or already stored.
- If it is a new post, INSERT it into MySQL.
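The SELECT-then-INSERT pair in the last two steps can be collapsed into a single batched statement if the news table has a UNIQUE key on the post URL (or GUID): the database then rejects duplicates itself, saving one round trip per item. In MySQL this is `INSERT IGNORE` (or `INSERT ... ON DUPLICATE KEY UPDATE`). A minimal sketch of the idea, in Python with sqlite3 standing in for MySQL (table and column names are made up for illustration):

```python
import sqlite3

# In-memory stand-in for the MySQL news table. The key point is the
# UNIQUE/PRIMARY KEY constraint on the post URL, which lets the database
# skip duplicates itself instead of needing a SELECT per item.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (url TEXT PRIMARY KEY, title TEXT)")

def store_items(items):
    # One batched statement per feed; duplicates are silently skipped.
    # MySQL equivalent: INSERT IGNORE INTO news (url, title) VALUES (...), (...)
    conn.executemany("INSERT OR IGNORE INTO news (url, title) VALUES (?, ?)", items)
    conn.commit()

store_items([("http://a/1", "first"), ("http://a/2", "second")])
store_items([("http://a/2", "second"), ("http://a/3", "third")])  # one duplicate

count = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
print(count)  # 3
```

This removes one query per news item, though as the thread below works out, the database is probably not the main cost here.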

My question is: can somebody explain how to do this faster or more efficiently?

Should I write the INSERTs into a text file and run them all at the end of the script?
Should I use threads? How?
Would Perl or some other technology be better suited to implement this process?

Thank you very much.

Posted: Wed Nov 30, 2005 6:03 am
by AGISB
I would first rewrite this in C or in Perl to speed up the processing.

The process is highly dependent on your internet connection and on the response times of the blog servers.

You could split the blog list across several CGI processes, since I suspect server load is not what causes the delay.


Memory management might also help: result sets from repeated MySQL queries accumulate in the same script if you never release them. Try freeing each result (e.g. with mysql_free_result) after processing a blog.

Posted: Wed Nov 30, 2005 6:06 am
by mdiazrub
I think the problem is that I retrieve 1,400 XML files from the internet one by one; that is what costs the 20 minutes, because the queries themselves are very simple.

How can I separate the blogs into different processes?
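If the 20 minutes really is 1,400 serial downloads, overlapping the fetches is the biggest win: total wall time drops from the sum of all fetch times to roughly the slowest fetch times the number of batches. PHP 4/5 has no threads, hence the multi-process discussion below, but the idea itself looks like this sketch in Python (the `fetch_feed` body is a hypothetical stand-in for the real HTTP GET):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_feed(url):
    # Stand-in for the real download (e.g. urllib.request.urlopen(url).read()),
    # so the sketch is self-contained and runnable offline.
    return (url, "<rss>...</rss>")

urls = ["http://blog%d.example/feed.xml" % i for i in range(20)]

# Download many feeds in parallel instead of one by one; with 8 workers,
# 20 slow feeds cost about 3 fetch-times of wall clock, not 20.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch_feed, urls))

print(len(results))  # 20
```

`pool.map` keeps the results in input order, so the parsing and database steps afterward need no changes.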

Posted: Wed Nov 30, 2005 6:17 am
by AGISB
mdiazrub wrote: How can I separate the blogs into different processes?
Simply split the blog list into several lists, then call your script again for each list.
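"Call your script again for each list" means launching the worker copies without waiting for each to finish, the same way `php sync.php 1 &` would from a shell. A minimal sketch in Python (the worker here is a hypothetical one-liner that just reports its chunk number; in the real setup it would be the sync script with a chunk argument):

```python
import subprocess
import sys

# Launch one worker per chunk without blocking between launches.
# Each worker is a tiny python -c stand-in for "php sync.php <chunk>".
procs = [
    subprocess.Popen(
        [sys.executable, "-c", "import sys; print('chunk', sys.argv[1])", str(n)]
    )
    for n in range(1, 4)
]

# The launcher can wait for all workers, or simply exit and let them run.
codes = [p.wait() for p in procs]
print(codes)  # [0, 0, 0]
```

All three workers run concurrently, so three chunks of the blog list are fetched in roughly the time of one.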

Posted: Wed Nov 30, 2005 6:21 am
by mdiazrub
Maybe I can run a query against the database to get all the blogs, then put them in an array.

Then split the array into N portions.

But can I launch different threads of the sync script from the PHP script I run as a cron job? How?

Thanks a lot

Posted: Wed Nov 30, 2005 6:44 am
by AGISB
Simply make different versions of it and call each one.

Posted: Wed Nov 30, 2005 6:48 am
by mdiazrub
I don't understand you.

If I create, for example, 4 different cron jobs (one for each segment), I can't tell which portion of the blog table each one should handle.

Please explain it a little more.

Thanks

Posted: Wed Nov 30, 2005 7:28 am
by AGISB
Sure you can.

Let's say you have a list of blog URLs, either as a text file or in a database.

Then the 1st instance runs blogs 1 to 200, the 2nd runs 201 to 400, and so on.

You could even record in the database which block each instance is currently running, so each one takes the next free block when it finishes, etc.
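The fixed-block scheme above can be sketched as a small helper: each cron entry passes its instance number as an argument, and the worker computes its slice of the BLOGS table from that (e.g. via `LIMIT 200 OFFSET ...` in the query). A sketch, with the helper name and block size chosen for illustration:

```python
def block_range(instance, block_size=200):
    # Hypothetical helper: instance 1 handles rows 1-200, instance 2
    # handles 201-400, and so on (1-based, inclusive on both ends).
    start = (instance - 1) * block_size + 1
    return start, start + block_size - 1

print(block_range(1))  # (1, 200)
print(block_range(2))  # (201, 400)
```

With 1,400 blogs and a block size of 200, seven cron entries (instances 1 through 7) cover the whole table, and each one knows its portion from its argument alone.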