
DATA MINING - BOY THIS IS DIFFICULT

Posted: Tue Mar 11, 2003 10:00 am
by romeo
I need some help with data mining... there are a series of pages (hundreds) that each contain a table with data I need to strip out...


I think (this is where opinions come in)
so I need to fopen each page dynamically (have it read a text file of all the page URLs), ereg out the <table cellspacing 2... /table> (I think they all have that in common), then ereg each of its rows and insert them into a database...
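That plan can be sketched in PHP roughly as below. Everything specific here is an assumption: the `urls.txt` file name, the `cellspacing` table pattern, and the database/table/column names are all placeholders, and the `preg_*` (PCRE) functions are used where the post says `ereg`, since they do the same job with Perl-style patterns.

```php
<?php
// Sketch of the described loop: read a URL list, fetch each page,
// regex out the first matching table, regex each row, insert into a DB.
// File names, the table pattern, and DB details are placeholders.

// Pull every row of the first <table ... cellspacing ...>...</table>
// out of $html, returning an array of arrays of cell text.
function extract_rows($html) {
    $out = array();
    if (!preg_match('#<table[^>]*cellspacing[^>]*>(.*?)</table>#si', $html, $m)) {
        return $out;
    }
    preg_match_all('#<tr[^>]*>(.*?)</tr>#si', $m[1], $rows);
    foreach ($rows[1] as $row) {
        preg_match_all('#<t[dh][^>]*>(.*?)</t[dh]>#si', $row, $cells);
        $out[] = array_map('strip_tags', $cells[1]);
    }
    return $out;
}

// Main loop, guarded so the sketch still runs when urls.txt is absent.
if (file_exists('urls.txt')) {
    $urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $db   = new PDO('mysql:host=localhost;dbname=scrape', 'user', 'pass');
    $ins  = $db->prepare('INSERT INTO scraped (col1, col2) VALUES (?, ?)');

    foreach ($urls as $url) {
        $html = @file_get_contents($url);   // fopen()+fread() works too
        if ($html === false) continue;
        foreach (extract_rows($html) as $cells) {
            if (count($cells) >= 2) {
                $ins->execute(array($cells[0], $cells[1]));
            }
        }
    }
}
```

The non-greedy `(.*?)` in each pattern matters: a greedy `.*` would swallow everything up to the *last* closing tag on the page.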

DOES ANYONE HAVE EXPERIENCE DOING THIS? Any scripts out there that can help...

I think I understand the logic but I definitely don't understand how :)

MANY THANKS AND A BEER TO WHOEVER HELPS :)

Posted: Tue Mar 11, 2003 11:26 am
by Stoker
To capture data from a bunch of webpages I would definitely use Perl instead of PHP, and in most cases, if it is a one-time thing, regular expressions are the easiest solution, although if there are a lot of oddities in the data it may require quite a complex expression..
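Whichever language you pick, the pattern syntax is essentially the same, since PHP's `preg_*` functions use Perl-compatible regexes. A toy example (the cell values are invented) of how those "oddities" inflate a pattern: cells that sometimes carry extra markup, whitespace, or thousands separators defeat the naive pattern, so the tolerant one has to allow for all of them.

```php
<?php
// Toy demonstration of the 'oddities' problem: the naive pattern only
// matches the clean case; the tolerant one must allow optional <b> tags,
// stray whitespace/newlines, and comma thousands separators.
$cells = array(
    '<td>1234</td>',            // the clean case the naive pattern expects
    '<td> <b>56</b> </td>',     // bold tags and stray spaces
    "<td>\n789\n</td>",         // newlines inside the cell
    '<td>1,234</td>',           // thousands separator
);

$naive    = '#<td>(\d+)</td>#';
$tolerant = '#<td>\s*(?:<b>)?\s*([\d,]+)\s*(?:</b>)?\s*</td>#s';

$values = array();
foreach ($cells as $c) {
    if (preg_match($tolerant, $c, $m)) {
        $values[] = (int) str_replace(',', '', $m[1]);
    }
}
print_r($values);
```

The naive pattern matches only the first cell; the tolerant one extracts a number from all four. Real pages tend to need several rounds of this kind of loosening.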

Posted: Tue Mar 11, 2003 8:12 pm
by Sky
I've done this before, not in exactly your circumstances though... Could you give an example of what you want to extract and what you want to do with it? :?: (BTW, I'm erratic in visiting here, so this may benefit others more than me :p)

Posted: Wed Mar 12, 2003 6:45 am
by will
I've done several scripts like this...
one copied the entire KJV bible from bible.gospelcom.net
one grabbed the cover art of every game on amazon.com
one got information about indie bands from a site


While they all use the same basic logic, which you've already described, I wrote each one pretty much from scratch b/c this type of thing is so specific to the type of data you're retrieving and the site you're getting it from. I'm not really sure what to suggest, since you already seem to have the basic idea of how to do it. (I'll still look and see if I can find my old code, although most of it is quite messy as it was edited throughout the project as needed.)


As a side note to everyone else... what do you think of the ethical / legal issues of doing this? Of course, there are very legitimate uses of this type of technique, but not all are. One can argue that "if you put it on the web, it's free for the taking", while another could bring up issues of intellectual property. Just curious what you all think, especially those who have been in IT for a while.

Posted: Wed Mar 12, 2003 8:23 am
by Stoker
Any document, HTML page, book, picture, or whatever is copyrighted and restricted by law unless otherwise stated; the copyright is held until 100 years after the author's death, if I remember correctly..

In my mind, any web content is free for personal download/use etc., but you cannot repost it without the owner's permission. I don't think it is unethical to quote a phrase from some page and refer to it, but to make a page saying "Here are all the games you can get from Amazon", including images and descriptions, would be to cross the line without Amazon's permission..

Posted: Wed Mar 12, 2003 11:47 am
by judge.DK
I need help with exactly the same thing.

I need to get data from a page like http://www.battle.net/war3/ladder/war3- ... =Lordaeron

and pull stuff like player name, solo game stats (win, loss, exp), last game, etc.

Any ideas on how I can do this? I've seen a similar thing in action on http://www.replayers.com/?action=ladder-tracker and was wondering how he did that.
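Pulling named fields like that is the same row-matching idea with capture groups. The HTML layout in this sketch is invented purely for illustration; the real battle.net markup would need its own patterns, worked out by viewing the page source.

```php
<?php
// Sketch of pulling name + wins/losses out of ladder rows. The row
// layout (<td>name</td><td>wins</td><td>losses</td>) is a made-up
// stand-in for whatever the real page actually uses.
function parse_ladder($html) {
    $players = array();
    preg_match_all(
        '#<tr[^>]*>\s*<td>([^<]+)</td>\s*<td>(\d+)</td>\s*<td>(\d+)</td>\s*</tr>#si',
        $html, $m, PREG_SET_ORDER);
    foreach ($m as $row) {
        $players[] = array(
            'name'   => trim($row[1]),
            'wins'   => (int) $row[2],
            'losses' => (int) $row[3],
        );
    }
    return $players;
}

$html = '<tr><td>SomePlayer</td><td>42</td><td>7</td></tr>';
print_r(parse_ladder($html));
```

From there it's the same insert-into-a-database step as above, or just formatting the array straight into your own page.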

Posted: Wed Mar 19, 2003 6:04 pm
by romeo
Any news, will?

Posted: Wed Mar 19, 2003 6:15 pm
by Stoker
The latest Linux Magazine has an article on how to use a Perl module for exactly this; it was focused more on how to make it fill out forms and get content from the results, but in general it should be pretty usable, I think. It will probably be on their website in a month or two.

Posted: Wed Mar 19, 2003 6:24 pm
by romeo
sweet
thanks for the news

Posted: Fri Mar 21, 2003 7:08 am
by judge.DK
I've gotten a script working which pulls data from multiple pages at the same time. The number of pages queried depends on the number of members currently subscribed to a certain user group on my board.

http://www.dk-clan.net/tracker.php

I've got problems with speed though, since it apparently checks each page one at a time (each queried page is ~70k) until it's done looping the number of times set above.

It takes about 4 seconds per queried page before it sends the output to the browser, so at around 15 queries it takes close to a minute.

Any suggestions on running the fopen() and the rest of the loop in parallel?
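One way to avoid fetching one page at a time is the `curl_multi_*` family, which drives many transfers concurrently and hands back all the bodies at once; parsing then happens after the fetches instead of between them. This needs the curl extension (and a newer PHP than the one in this thread), and the URLs below are placeholders; a minimal sketch:

```php
<?php
// Fetch a batch of URLs in parallel with curl_multi instead of looping
// over fopen() one page at a time. Returns the response bodies in the
// same order as the input URLs (false entries for failed fetches).
function fetch_parallel($urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return body, don't echo
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // don't hang on slow pages
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }

    // Drive all transfers together until every one has finished.
    do {
        curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);   // block until activity, don't busy-spin
        }
    } while ($running > 0);

    $results = array();
    foreach ($handles as $i => $ch) {
        $results[$i] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

With ~15 pages at ~4 seconds each, total wall time should drop toward the time of the single slowest fetch rather than the sum of all of them.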