Legal scraping, not sure where to start!
Hey,
I was just given permission to pull rankings information from this page:
http://gunz.ijji.com/ranking/individual.nhn
I did some reading on regex but I'm not sure where to start!
Basically, I need to pull the individual ranking information, save it to a database, and update it once per day so I don't query their server too much.
Where do I start?
- kingconnections
- Forum Contributor
- Posts: 137
- Joined: Thu Jul 14, 2005 4:28 pm
So you would start with something like this:
Code: Select all
$content = file_get_contents('http://gunz.ijji.com/ranking/individual.nhn');
preg_match('#<title>(.*?)</title>#s', $content, $matches);

This code looks for the title of the page. You would have to adjust the expression to match what you need.
The $content variable holds the entire content of the page. The $matches array holds anything that matched the title search.
Edited because I didn't explain that very well.
The $matches array will contain multiple elements, i.e.:
$matches[0] will contain the text that matched the full pattern
$matches[1] will contain the text that matched the first captured parenthesized subpattern
http://us2.php.net/manual/en/function.preg-match.php
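To see what ends up in $matches without hitting the remote server, here is a minimal sketch that runs the same pattern against an inline string. The sample markup and the title text in it are made up for illustration; the real page's title is not shown in this thread.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$content = '<html><head><title>Individual Ranking</title></head><body></body></html>';

// preg_match() returns 1 on a match, 0 on no match, and false on error.
if (preg_match('#<title>(.*?)</title>#s', $content, $matches) === 1) {
    echo $matches[0], "\n"; // full match, including the <title> tags
    echo $matches[1], "\n"; // first capture group: only the text between the tags
} else {
    echo "No title found\n";
}
```

Checking the return value before reading $matches avoids undefined-index notices when the page layout changes or the download fails.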
- feyd
- Neighborhood Spidermoddy
- Posts: 31559
- Joined: Mon Mar 29, 2004 3:24 pm
- Location: Bothell, Washington, USA
Nope.
Zero will be the entire piece matched, one will be the contents of the container.
Standards dictate that only one <title> tag may occur within the document (specifically as a child of the <head> container), if memory serves. If you insist on finding all possible tags within a given string, use preg_match_all(), which produces a slightly different result array.
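As a rough illustration of how the preg_match_all() result array differs, the sketch below runs one pattern over a two-row sample in both the default order and with the PREG_SET_ORDER flag. The markup, class names, and values are invented for the example and are not taken from the rankings page.

```php
<?php
// Hypothetical two-row sample; names and numbers are invented for illustration.
$html = '<tr><td class="name">Alpha</td><td class="level">20</td></tr>'
      . '<tr><td class="name">Beta</td><td class="level">18</td></tr>';
$pattern = '#<td class="name">(.*?)</td><td class="level">(\d+)</td>#s';

// Default (PREG_PATTERN_ORDER): one array per capture group.
// $byGroup[0] holds the full matches, $byGroup[1] all names, $byGroup[2] all levels.
$count = preg_match_all($pattern, $html, $byGroup);
echo $count, ' matches: ', implode(', ', $byGroup[1]), "\n";

// PREG_SET_ORDER: one array per match, so each row's fields stay together.
preg_match_all($pattern, $html, $byRow, PREG_SET_ORDER);
foreach ($byRow as $row) {
    // $row[0] is the full match, $row[1] the name, $row[2] the level.
    echo $row[1], ' is level ', $row[2], "\n";
}
```

Capturing every field of a row in a single pattern and reading it with PREG_SET_ORDER means the values cannot drift out of step the way they can when each column is matched by a separate call.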
- Ollie Saunders
- DevNet Master
- Posts: 3179
- Joined: Tue May 24, 2005 6:01 pm
- Location: UK
This is a sample of the HTML I get back from the page that I'm trying to pull information from with regex:
Code: Select all
<td class="list_data_col02_c"><span class="guild_name">MØNST€®</span></td>
<td class="list_data_col03_c"><span>20</span></td>
<td class="list_data_col04_c"><span>497,903</span></td>
<td class="list_data_col05_c"><span>869/1020
(46%)
</span></td>

My original code was:
Code: Select all
$content = file_get_contents('http://gunz.ijji.com/ranking/individual.nhn');
preg_match_all('#<td class="list_data_col02_c"><span class="guild_name">(.*?)</span></td>#s', $content, $matches);
preg_match_all('#<td class="list_data_col03_c"><span>(.*?)</span></td>#s', $content, $level);
preg_match_all('#<td class="list_data_col04_c"><span>(.*?)</span></td>#s', $content, $exp);
preg_match_all('#<td class="list_data_col05_c"><span>(.*?)/#s', $content, $kills);
preg_match_all('#/(.*?)</span></td>#s', $content, $deaths);
preg_match_all('#\((.*?)%\)
</span></td>#s', $content, $pc);
for ($i = 0; $i < 20; $i++) {
    $match = $matches[1][$i];
    $level1 = $level[1][$i];
    $exp1 = $exp[1][$i];
    $kills1 = $kills[1][$i];
    $deaths1 = $deaths[1][$i];
    $pc1 = $pc[1][$i];
    echo "<P>$i $match<br/>L: $level1<br />E: $exp1<br />K: $kills1<br />D: $deaths1<br />PC: $pc1";
    //save to db
}

Then I realized that the variables didn't always match up, because my results were sometimes coming from different places and out of order... so then I tried:
Code: Select all
preg_match_all('#<td class="list_data_col02_c"><span class="guild_name">(.*?)</span></td>
<td class="list_data_col03_c"><span>(.*?)</span></td>
<td class="list_data_col04_c"><span>(.*?)</span></td>
<td class="list_data_col05_c"><span>(.*?)/(.*?)
\((.*?)%\)
</span></td>#s', $content, $matches);
for ($i = 0; $i < 20; $i++) {
    $match = $matches[1][$i];
    $level1 = $matches[2][$i];
    $exp1 = $matches[3][$i];
    $kills1 = $matches[4][$i];
    $deaths1 = $matches[5][$i];
    $pc1 = $matches[6][$i];
    echo "<P>$i $match<br/>L: $level1<br />E: $exp1<br />K: $kills1<br />D: $deaths1<br />PC: $pc1";
}

Which had 0 results. What am I doing wrong?
- RobertGonzalez
- Site Administrator
- Posts: 14293
- Joined: Tue Sep 09, 2003 6:04 pm
- Location: Fremont, CA, USA