EXTRACTING INFORMATION

This forum is not for 'how-to' coding questions but for PHP theory instead; it is here for those of us who wish to learn about the design aspects of programming with PHP.

Moderator: General Moderators

Nashtrump
Forum Newbie
Posts: 13
Joined: Wed May 03, 2006 1:56 pm

EXTRACTING INFORMATION

Post by Nashtrump »

Hi there,

I'm currently building a program which scrapes/extracts odds from different online bookmakers and stores the information in a MySQL database.

Does anyone know of a fast way of doing this?
Is anyone an expert on web scraping or web mining?

Kind regards

nash
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Regular expressions are often used to scrape information.
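To make that concrete, here is a minimal sketch of regex-based extraction. The markup and class names are made up for the example, not taken from any real bookmaker:

```php
<?php
// Hypothetical bookmaker markup: each runner is assumed to look like
//   <td class="runner">Team A</td><td class="odds">5/2</td>
$html = '<td class="runner">Team A</td><td class="odds">5/2</td>'
      . '<td class="runner">Team B</td><td class="odds">11/4</td>';

// Capture runner/odds pairs from the page.
preg_match_all(
    '#<td class="runner">([^<]+)</td><td class="odds">(\d+/\d+)</td>#',
    $html,
    $matches,
    PREG_SET_ORDER
);

foreach ($matches as $m) {
    echo $m[1] . ' => ' . $m[2] . "\n";
}
// Prints:
// Team A => 5/2
// Team B => 11/4
```

Each `$matches` entry then maps cleanly onto a row for the MySQL table.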
Nashtrump
Forum Newbie
Posts: 13
Joined: Wed May 03, 2006 1:56 pm

Post by Nashtrump »

Hi there,

Yes, I know a little about regular expressions.

I need a quick way to extract information from sites.
I have around 500 URLs to scrape.

Do you have any idea how to do this quickly?
Buddha443556
Forum Regular
Posts: 873
Joined: Fri Mar 19, 2004 1:51 pm

Post by Buddha443556 »

Nashtrump wrote:Hi there,

Yes, I know a little about regular expressions.

I need a quick way to extract information from sites.
I have around 500 URLs to scrape.

Do you have any idea how to do this quickly?
With that many URLs you probably need a way to grab the data from multiple sites at once; one site at a time is just too slow.
Nashtrump
Forum Newbie
Posts: 13
Joined: Wed May 03, 2006 1:56 pm

Post by Nashtrump »

Hi Buddha,


Yes!! That's exactly the sort of thing I need!!

Any idea how to extract from multiple sites at once?? I've currently been extracting one site at a time using file_get_contents($link).

I would like to extract multiple sites. Please can you let me know how you would do it? (OR IF ANYONE ELSE HAS BETTER IDEAS!!)
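For context, the one-at-a-time approach described here looks something like the following sketch (the URLs are placeholders, and the cache-file naming is just one possible scheme). Every request blocks until it finishes, which is why ~500 URLs take so long this way:

```php
<?php
// Sequential fetching with file_get_contents(): each page blocks in turn.
function cacheNameFor($url) {
    return 'cache_' . md5($url) . '.html';   // one file per fetched page
}

$urls = array(
    'http://example.com/odds/page1',   // placeholder URLs, not real bookmakers
    'http://example.com/odds/page2',
);

// A timeout stops one dead site from stalling the whole run indefinitely.
$ctx = stream_context_create(array('http' => array('timeout' => 5)));

foreach ($urls as $url) {
    $html = @file_get_contents($url, false, $ctx);  // blocks for each site
    if ($html === false) {
        continue;                                   // skip pages that failed
    }
    file_put_contents(cacheNameFor($url), $html);   // store the raw page
}
```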

Thanks

Nash
Buddha443556
Forum Regular
Posts: 873
Joined: Fri Mar 19, 2004 1:51 pm

Post by Buddha443556 »

Nashtrump wrote:I would like to extract multiple sites. Please can you let me know how you would do it.
I prefer Perl for this type of problem; PHP would be at the bottom of my list, just ahead of Assembly.

Process Control Functions - *nix required
Program Execution Functions - exec()

I normally separate the data gathering from the text processing. The data is stored in files, and the text processing is done on the stored data. If you preserve the stored data, you can process it as many times as you need to without hammering the sites.

The multi-threaded data gathering can be separated into two parts: a supervisor and workers. The supervisor gets a list of URLs, creates workers to gather the data, and keeps track of how many workers are active at one time. Each worker fetches the data from its URL, stores it in a file and dies (hopefully). Normally I use between 4 and 6 workers, but you may be able to use more depending on your OS (the server editions of Windows and *nix have better TCP stacks) and your resources.

You really need to take care not to hammer a server with this type of program. You're creating a robot, so make sure you obey the robots.txt files.
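The supervisor/worker pattern described above can be sketched with PHP's pcntl process-control functions (so *nix only, and it needs the pcntl extension). The URL list and cache-file naming are placeholders, not anything prescribed by the thread:

```php
<?php
// Supervisor/worker sketch using pcntl_fork(). The supervisor caps the
// number of concurrent workers; each worker fetches one URL, stores the
// page in a file, and exits (dies).
$urls = array(/* ...your ~500 URLs... */);
$maxWorkers = 5;   // 4-6 concurrent workers, as suggested above
$active = 0;

foreach ($urls as $url) {
    if ($active >= $maxWorkers) {
        pcntl_wait($status);   // supervisor blocks until one worker exits
        $active--;
    }
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {
        // Worker process: fetch the page, store it, then die.
        $html = @file_get_contents($url);
        if ($html !== false) {
            file_put_contents('cache_' . md5($url) . '.html', $html);
        }
        exit(0);
    }
    $active++;                 // supervisor: count the new worker
}
while ($active-- > 0) {
    pcntl_wait($status);       // reap the remaining workers
}
```

Because the stored pages are preserved, the text-processing pass can be rerun on the cache as often as needed without touching the sites again.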
Nashtrump
Forum Newbie
Posts: 13
Joined: Wed May 03, 2006 1:56 pm

Post by Nashtrump »

Ah, so I'm completely using the wrong program!! Ah, that's why it's taking so bl00dy long then!!

Thanks for your advice!!
Nashtrump
Forum Newbie
Posts: 13
Joined: Wed May 03, 2006 1:56 pm

Post by Nashtrump »

Don't suppose you can recommend any tutorial sites to get me started on this?

Thanks

Nash
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

Certainly on a uniprocessor system you're better off doing asynchronous I/O and select()ing over the sockets... It will perform much better and keeps you away from the deadlock problems that are inevitably attached to multithreading.

Anyway, if I'm not mistaken, cURL has (or should get) a feature to perform multiple requests at the same time (curl_multi?). And these days PHP streams can do it too...
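The cURL feature being referred to is the curl_multi family of functions (curl_multi_exec and friends, which need the curl extension). A minimal sketch, with placeholder URLs:

```php
<?php
// Fetch several URLs concurrently with curl_multi. All transfers run at
// once; results are returned as strings keyed by URL.
function fetchAll(array $urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // body as string
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // cap each transfer
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);   // wait for activity instead of spinning
    } while ($running > 0);

    $results = array();
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

Usage would be something like `$pages = fetchAll($myUrlList);`, after which each page can be stored or fed to the regex pass.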
Nashtrump
Forum Newbie
Posts: 13
Joined: Wed May 03, 2006 1:56 pm

Post by Nashtrump »

Hi Timvw...

Don't laugh, but I have absolutely no idea what you just said!!

Uniprocessor system??

Asynchronous I/O and select over the sockets??

cURL has (or should get) a feature to perform multiple requests at the same time (curl_multi?)??

PHP streams??

Don't suppose you can explain these terms, and do you have any tutorials for them?

Sorry to sound dumb, but I've only been learning PHP for about a month!

Regards,

Nash
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

Nashtrump wrote: I have absolutely no idea what you just said!!
Happy reading ;)

uniprocessor
async io
curl_multi_exec
php streams
Buddha443556
Forum Regular
Posts: 873
Joined: Fri Mar 19, 2004 1:51 pm

Post by Buddha443556 »

Nashtrump wrote:Don't suppose you can recommend any tutorial sites to get me started on this?
http://www.devx.com/webdev/Article/21909/1954
http://perldoc.perl.org/perlthrtut.html
timvw wrote:Certainly on a uniprocessor system you're better off doing asynchronous I/O and select()ing over the sockets... It will perform much better and keeps you away from the deadlock problems that are inevitably attached to multithreading.

Anyway, if I'm not mistaken, cURL has (or should get) a feature to perform multiple requests at the same time (curl_multi?). And these days PHP streams can do it too...
I think you're right: multi-threaded programming should be avoided if possible, if for no other reason than to avoid the steep learning curve, which I tend to forget about. :oops:
timvw wrote:
Nashtrump wrote: I have absolutely no idea what you just said!!
Happy reading ;)

uniprocessor
async io
curl_multi_exec
php streams
Don't forget ...
Deadlock
Nashtrump
Forum Newbie
Posts: 13
Joined: Wed May 03, 2006 1:56 pm

Post by Nashtrump »

Hi Guys,

To be perfectly honest, I would prefer to stick to PHP, as I think jumping into Perl now would confuse the hell outta me.

Will using PHP streams give me as fast a result as Perl?

If so, can you tell me what PHP streams are? I've searched on Google and I've not received a decent answer.

Regards

Nash
timvw
DevNet Master
Posts: 4897
Joined: Mon Jan 19, 2004 11:11 pm
Location: Leuven, Belgium

Post by timvw »

If you had read the slides in the last link I posted, you could have answered the question about PHP streams yourself..

By the way, I don't think the performance difference between a PHP client and a Perl client would be significant. The implementation that handles many requests concurrently, instead of doing them sequentially, is where the significant difference lies.
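To illustrate the select-over-sockets idea in PHP itself, here is a minimal sketch using streams: every socket is opened and set non-blocking, then stream_select() reports which ones are ready to read. The hosts and paths are placeholders, and real code would also need to strip the HTTP headers from each response:

```php
<?php
// Concurrent fetching with PHP streams and stream_select().
function fetchConcurrently(array $targets) {
    $streams = array();
    $bodies  = array();
    foreach ($targets as $key => $t) {
        $fp = @stream_socket_client(
            'tcp://' . $t['host'] . ':80', $errno, $errstr, 10
        );
        if ($fp === false) {
            continue;                 // could not connect; skip this target
        }
        stream_set_blocking($fp, false);
        fwrite($fp, "GET {$t['path']} HTTP/1.0\r\nHost: {$t['host']}\r\n\r\n");
        $streams[$key] = $fp;
        $bodies[$key]  = '';
    }

    while ($streams) {
        $read = $streams; $write = null; $except = null;
        if (@stream_select($read, $write, $except, 10) === false) {
            break;                    // select failed; give up
        }
        foreach ($read as $fp) {
            $key   = array_search($fp, $streams, true);
            $chunk = fread($fp, 8192);
            if ($chunk === '' || $chunk === false) {
                if (feof($fp)) {      // server closed: this page is done
                    fclose($fp);
                    unset($streams[$key]);
                }
            } else {
                $bodies[$key] .= $chunk;
            }
        }
    }
    return $bodies;   // raw HTTP responses (headers + body), keyed by target
}
```

A single process drives all the connections, so there are no threads and no deadlocks to worry about, which is the point being made above.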