
EXTRACTING INFORMATION

Posted: Sun May 07, 2006 11:21 am
by Nashtrump
Hi there,

I'm currently building a program which scrapes/extracts odds from different online bookmakers and stores the information in a MySQL database.

Does anyone know of a fast way of doing this?
Is anyone an expert on web scraping or web mining?

Kind regards

nash

Posted: Sun May 07, 2006 11:23 am
by feyd
Regular expressions are often used to scrape information.
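A minimal sketch of that approach in PHP. The HTML snippet and the team/odds markup are made up for illustration; a real bookmaker page will need its own pattern:

```php
<?php
// Sketch: pulling decimal odds out of a page with a regular expression.
// The markup below is invented -- adapt the pattern to the actual site.
$html = '<tr><td class="team">Arsenal</td><td class="odds">1.85</td></tr>'
      . '<tr><td class="team">Chelsea</td><td class="odds">2.10</td></tr>';

// Capture the team name and the odds value from each row.
preg_match_all(
    '#<td class="team">([^<]+)</td><td class="odds">([\d.]+)</td>#',
    $html,
    $matches,
    PREG_SET_ORDER
);

foreach ($matches as $m) {
    // $m[1] = team, $m[2] = odds; in the real program this is where
    // you would INSERT into the MySQL table.
    echo $m[1] . ' => ' . $m[2] . "\n";
}
```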

Posted: Sun May 07, 2006 11:28 am
by Nashtrump
Hi there,

Yes, I know a bit about regular expressions.

I need a quick way to extract information from sites.
I have around 500 URLs to scrape.

Do you have any idea how to do this quickly?

Posted: Sun May 07, 2006 2:04 pm
by Buddha443556
Nashtrump wrote: Hi there,

Yes, I know a bit about regular expressions.

I need a quick way to extract information from sites.
I have around 500 URLs to scrape.

Do you have any idea how to do this quickly?
With that many URLs you probably need a way to grab the data from multiple sites at once; one site at a time is just too slow.

Posted: Mon May 08, 2006 1:38 am
by Nashtrump
Hi Buddha,


Yes!! That's exactly the sort of thing I need!!

Any idea how to extract from multiple sites?? I've currently been extracting one site at a time using file_get_contents($link).

I would like to extract from multiple sites. Please can you let me know how you would do it? (OR IF ANYONE ELSE HAS BETTER IDEAS!!)

Thanks

Nash
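For reference, the one-site-at-a-time approach described above can be wrapped up like this. A sketch only: `fetch_all` is an illustrative name, and the timeout is there so one dead site can't stall the whole run:

```php
<?php
// Sequential fetching with file_get_contents(), one URL after another.
// This is the slow baseline the thread is trying to improve on.
function fetch_all(array $urls): array
{
    $context = stream_context_create([
        'http' => ['timeout' => 10], // seconds; ignored for local paths
    ]);

    $pages = [];
    foreach ($urls as $url) {
        // @ suppresses the warning on failure; false marks a failed fetch.
        $pages[$url] = @file_get_contents($url, false, $context);
    }
    return $pages;
}
```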

Posted: Mon May 08, 2006 10:16 am
by Buddha443556
Nashtrump wrote: I would like to extract from multiple sites. Please can you let me know how you would do it.
I prefer Perl for this type of problem; PHP would be at the bottom of my list, just before Assembly.

Process Control Functions - *nix required
Program Execution Functions - exec()

I normally separate the data gathering from the text processing. The data is stored in files and the text processing is done on the stored data. If you preserve the stored data then you can process it as many times as you need to without hammering the sites.

The multi-threaded data gathering can be separated into two parts: supervisor and worker. The supervisor gets a list of URLs, creates workers to gather the data, and keeps track of how many workers are active at one time. The workers get the data from the URL, store it in a file, and die (hopefully). Normally I use between 4 and 6 workers, but you may be able to use more depending on your OS (the Server Edition of Windows and *nix have better TCP stacks) and resources.

You really need to take care not to hammer a server with this type of program. You're creating a robot, so make sure you obey the robots.txt files.
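A rough sketch of that supervisor/worker split using PHP's process control functions (pcntl; *nix CLI only). `gather` and the output file layout are assumptions for illustration; a polite real version would also check robots.txt and rate-limit per host:

```php
<?php
// Supervisor/worker data gathering with pcntl_fork() (*nix, CLI PHP).
// The supervisor forks workers and caps how many run at once; each
// worker fetches one URL, stores it in a file, and exits.
function gather(array $urls, string $outDir, int $maxWorkers = 4): void
{
    $active = 0;
    foreach ($urls as $i => $url) {
        // Supervisor: wait for a free slot before forking another worker.
        if ($active >= $maxWorkers) {
            pcntl_wait($status);
            $active--;
        }
        $pid = pcntl_fork();
        if ($pid === 0) {
            // Worker: fetch one URL, store it in a file, and die.
            $data = @file_get_contents($url);
            if ($data !== false) {
                file_put_contents("$outDir/page_$i.html", $data);
            }
            exit(0);
        }
        $active++;
    }
    // Reap the remaining workers.
    while ($active-- > 0) {
        pcntl_wait($status);
    }
}
```

The stored files can then be run through the regex/text-processing step as many times as needed without re-fetching anything, which is the gathering/processing separation described above.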

Posted: Mon May 08, 2006 12:47 pm
by Nashtrump
Ah, so I'm completely using the wrong language!! That's why it's taking so bl00dy long then!!

Thanks for your advice!!

Posted: Mon May 08, 2006 12:49 pm
by Nashtrump
Don't suppose you can recommend any tutorial sites to get me started on this?

Thanks

Nash

Posted: Mon May 08, 2006 2:35 pm
by timvw
Certainly on a uniprocessor system you're better off doing asynchronous I/O and select over the sockets... It will perform much better and keeps you away from the deadlock problems that are inevitably attached to multithreading..

Anyway, if I'm not mistaken, curl has (or should get) a feature to perform multiple requests at the same time (multi_client??). And these days PHP streams can do it too...
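The curl feature being half-remembered here is the curl_multi_* family (`curl_multi_exec` and friends), which drives several HTTP transfers concurrently from one process, no threads needed. A sketch assuming the curl extension is loaded; `fetch_concurrent` is an illustrative name, not a library function:

```php
<?php
// Concurrent fetching with curl_multi_*: register one easy handle per
// URL, then drive them all from a single loop.
function fetch_concurrent(array $urls): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every one has finished.
    do {
        curl_multi_exec($multi, $running);
        if ($running) {
            curl_multi_select($multi); // block until there is activity
        }
    } while ($running > 0);

    $pages = [];
    foreach ($handles as $url => $ch) {
        $pages[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $pages;
}
```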

Posted: Mon May 08, 2006 2:48 pm
by Nashtrump
Hi Timvw...

Don't laugh, but I have absolutely no idea what you just said!!

Uniprocessor system??

Asynchronous I/O and select over the sockets?

curl has (or should get) a feature to perform multiple requests at the same time (multi_client??). ??

PHP streams??

Don't suppose you can explain these terms, and do you have any tutorials for them?

Sorry to sound dumb, but I've only been learning PHP for about a month!

Regards,

Nash

Posted: Mon May 08, 2006 2:54 pm
by timvw
Nashtrump wrote: I have absolutely no idea what you just said!!
Happy reading ;)

uniprocessor
async io
curl_multi_exec
php streams

Posted: Mon May 08, 2006 3:27 pm
by Buddha443556
Nashtrump wrote:Dont suppose you can recommend any tutorial sites to get me started on this?
http://www.devx.com/webdev/Article/21909/1954
http://perldoc.perl.org/perlthrtut.html
Certainly on a uniprocessor system you're better off doing asynchronous I/O and select over the sockets... It will perform much better and keeps you away from the deadlock problems that are inevitably attached to multithreading..

Anyway, if I'm not mistaken, curl has (or should get) a feature to perform multiple requests at the same time (multi_client??). And these days PHP streams can do it too...
I think you're right: multi-threaded programming should be avoided if possible, if for no other reason than to avoid the steep learning curve, which I tend to forget about. :oops:
timvw wrote:
Nashtrump wrote: I have absolutely no idea what you just said!!
Happy reading ;)

uniprocessor
async io
curl_multi_exec
php streams
Don't forget ...
Deadlock

Posted: Tue May 09, 2006 1:44 am
by Nashtrump
Hi Guys,

To be perfectly honest, I would prefer to stick to PHP, as I think jumping into Perl now would confuse the hell outta me.

Will using PHP streams give me as fast a result as Perl?

If so, can you tell me what PHP streams are? I've searched on Google and haven't found a decent answer.

Regards

Nash

Posted: Tue May 09, 2006 1:51 am
by timvw
If you had read the slides at the last link I posted, you could have answered the question about PHP streams yourself..

Btw, I don't think the performance difference between a PHP and a Perl client would be significant. Whether the implementation handles many requests concurrently instead of sequentially will make the far bigger difference.
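To make the "select over the sockets" idea concrete: `stream_select()` is the PHP streams face of select(), letting one process watch many open connections and service whichever is ready. In this sketch a local socket pair stands in for real HTTP connections (*nix only, because of STREAM_PF_UNIX):

```php
<?php
// One process, non-blocking streams, stream_select() to find out which
// stream has data ready -- no threads, no fork, no deadlocks.
[$a, $b] = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
stream_set_blocking($a, false);
stream_set_blocking($b, false);

// Pretend $b is a remote server sending us a response.
fwrite($b, "some response data");

// Wait (up to 1 second) for any watched stream to become readable.
// With 500 real connections, $read would hold all of them at once.
$read = [$a];
$write = null;
$except = null;
$ready = stream_select($read, $write, $except, 1);

$received = '';
if ($ready > 0) {
    $received = fread($a, 8192);
}
fclose($a);
fclose($b);
echo $received;
```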