EXTRACTING INFORMATION
Hi there,
I'm currently building a program that scrapes/extracts odds from different online bookmakers and stores the information in a MySQL database.
Does anyone know of a fast way of doing this?
Is anyone an expert on web scraping or web mining?
Kind regards
nash
- Buddha443556
- Forum Regular
- Posts: 873
- Joined: Fri Mar 19, 2004 1:51 pm
Nashtrump wrote:
Hi there,
Yes, I know a little about regular expressions.
I need a quick way to extract information from sites.
I have around 500 URLs to scrape.
Do you have any idea how to do this quickly?

With that many URLs you probably need a way to grab the data from multiple sites at once; one site at a time is just too slow.
Hi Buddha..
Yes!! That's exactly the sort of thing I need!!
Any idea how to extract from multiple sites? I've currently been extracting one site at a time using file_get_contents($link).
I would like to extract multiple sites. Please can you let me know how you would do it? (Or if anyone else has better ideas!)
Thanks
Nash
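For reference, the one-site-at-a-time approach Nash describes looks roughly like this — a minimal sketch in which the URL list and output filenames are made up for illustration:

```php
<?php
// Sequential scraping: fetch each URL in turn and save the raw
// response to a file for later processing. Slow, because each
// request must finish before the next one can start.
$urls = [
    'http://bookmaker-one.example/odds',   // hypothetical URLs
    'http://bookmaker-two.example/odds',
];

foreach ($urls as $i => $url) {
    $html = file_get_contents($url);       // blocks until this site responds
    if ($html === false) {
        fwrite(STDERR, "failed: $url\n");
        continue;
    }
    file_put_contents("page_$i.html", $html);
}
```

With 500 URLs, the total runtime is the sum of every site's response time, which is exactly the problem the rest of the thread is about.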
Nashtrump wrote:
I would like to extract multiple sites. Please can you let me know how you would do it?

I prefer Perl for this type of problem; PHP would be at the bottom of my list, just before Assembly.
Process Control Functions - *nix required
Program Execution Functions - exec()

I normally separate the data gathering from the text processing. The data is stored in files, and the text processing is done on the stored data. If you preserve the stored data, you can process it as many times as you need to without hammering the sites.

The multi-threaded data gathering can be separated into two parts: a supervisor and workers. The supervisor gets a list of URLs, creates workers to gather the data, and keeps track of how many workers are active at one time. Each worker gets the data from its URL, stores it in a file, and dies (hopefully). Normally I use between 4 and 6 workers, but you may be able to use more depending on your OS (the Server Edition of Windows and *nix have better TCP stacks) and resources.

You really need to take care not to hammer a server with this type of program. You're creating a robot, so make sure you obey the robots.txt files.
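A minimal sketch of the supervisor/worker pattern described above, using PHP's pcntl extension (so *nix CLI only, matching the "Process Control Functions - *nix required" pointer). The URL list and worker cap are placeholder assumptions:

```php
<?php
// Supervisor/worker fetcher: fork up to $maxWorkers child processes;
// each child fetches one URL, writes it to a file, and exits.
// Requires the pcntl extension (CLI on *nix).
$urls = ['http://site-a.example/', 'http://site-b.example/']; // hypothetical
$maxWorkers = 4;
$active = 0;

foreach ($urls as $i => $url) {
    if ($active >= $maxWorkers) {
        pcntl_wait($status);      // supervisor blocks until one worker dies
        $active--;
    }
    $pid = pcntl_fork();
    if ($pid === 0) {             // worker: fetch, store, die
        $html = file_get_contents($url);
        if ($html !== false) {
            file_put_contents("page_$i.html", $html);
        }
        exit(0);
    }
    $active++;                    // supervisor keeps count of live workers
}
while ($active-- > 0) {           // reap any workers still running
    pcntl_wait($status);
}
```

Because each worker is a separate process that writes to its own file and exits, there is no shared state to lock — which sidesteps most of the multithreading hazards raised later in the thread.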
Certainly with a uniprocessor system you're better off doing asynchronous I/O and select over the sockets... It will perform much better and keeps you away from the deadlock problems that are inevitably attached to multithreading..

Anyway, if I'm not mistaken, curl has (or should get) a feature to perform multiple requests at the same time (multi_client??). And these days PHP streams can do it too...
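The curl feature timvw is thinking of exists in PHP as the curl_multi_* functions: several transfers progress concurrently inside one process, with no forking or threads. A rough sketch, again with hypothetical URLs:

```php
<?php
// Concurrent fetching with curl_multi: register several easy handles
// on one multi handle, then drive them all until every transfer is done.
$urls = ['http://site-a.example/', 'http://site-b.example/']; // hypothetical
$mh = curl_multi_init();
$handles = [];
$pages = [];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body, don't echo
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Pump all transfers; curl_multi_select waits for socket activity
// instead of busy-looping.
do {
    curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh);
    }
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $pages[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```

This is the single-process, select-over-sockets style timvw recommends, just with libcurl doing the socket juggling for you.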
Hi Timvw...
Don't laugh, but I have absolutely no idea what you just said!!
Uniprocessor system??
Asynchronous I/O and select over the sockets?
curl has (or should get) a feature to perform multiple requests at the same time (multi_client??)??
PHP streams??
Don't suppose you can explain these terms, and do you have any tutorials for them?
Sorry to sound dumb, but I've only been learning PHP for about a month!
Regards,
Nash
Nashtrump wrote:
I have absolutely no idea what you just said!!

Happy reading:
uniprocessor
async io
curl_multi_exec
php streams
Nashtrump wrote:
Don't suppose you can recommend any tutorial sites to get me started on this?

http://www.devx.com/webdev/Article/21909/1954
http://perldoc.perl.org/perlthrtut.html

timvw wrote:
Certainly with a uniprocessor system you're better off doing asynchronous I/O and select over the sockets... It will perform much better and keeps you away from the deadlock problems that are inevitably attached to multithreading.. Anyway, if I'm not mistaken, curl has (or should get) a feature to perform multiple requests at the same time (multi_client??). And these days PHP streams can do it too...

I think you're right: multi-threaded programming should be avoided if possible, if for no other reason than to avoid the steep learning curve, which I tend to forget about.
timvw wrote:
Happy reading:
uniprocessor
async io
curl_multi_exec
php streams

Don't forget ...
Deadlock
Hi guys,
To be perfectly honest, I would prefer to stick to PHP, as I think jumping into Perl now would confuse the hell outta me.
Will using PHP streams give me as fast a result as Perl?
If so, can you tell me what PHP streams are? I've searched on Google and I've not received a decent answer.
Regards
Nash
If you had read the slides on the last link I posted, you could have answered the question about PHP streams yourself..

Btw, I don't think the performance differences between a PHP and a Perl client would be significant. But an implementation that handles many requests concurrently instead of doing them sequentially will make the most significant difference.
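To give Nash a concrete picture: "select over the sockets" means asking the OS which of several open streams are ready, instead of blocking on one at a time, and PHP exposes this as stream_select(). The sketch below demonstrates the primitive on a local socket pair standing in for real HTTP connections (the message text is made up):

```php
<?php
// Demonstrates stream_select(), the primitive behind asynchronous I/O:
// it blocks until at least one of the watched streams is readable.
// A local socket pair stands in for real HTTP connections here.
list($a, $b) = stream_socket_pair(
    STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP
);

fwrite($a, "response from one bookmaker\n"); // pretend a server replied

$read = [$b];            // streams we want to read from
$write = $except = null; // not watching for writability or errors
// Wait up to 5 seconds for any watched stream to become readable.
$ready = stream_select($read, $write, $except, 5);

if ($ready > 0) {
    echo fgets($b);      // $b is now readable without blocking
}
```

In a real scraper you would open each site's connection with stream_socket_client() using STREAM_CLIENT_ASYNC_CONNECT, pass all the handles to one stream_select() call, and read from whichever sockets become ready — which is how a single PHP process can service many sites at once.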