Legal scraping, not sure where to start!

Posted: Thu Dec 28, 2006 8:13 am
by Citizen
Hey,

I was just given permission to pull rankings information from this page:

http://gunz.ijji.com/ranking/individual.nhn

I did some reading on regex but I'm not sure where to start!

Basically, I need to pull the individual ranking information, save it to a database, and update it once per day so I don't query their server too much.

Where do I start?

Posted: Thu Dec 28, 2006 8:29 am
by kingconnections
so you would start with something like this:

Code: Select all

$content = file_get_contents('http://gunz.ijji.com/ranking/individual.nhn'); 
preg_match('#<title>(.*?)</title>#s', $content, $matches);

This code looks for the title of the page. You would have to adjust the expression to what you need.

The $content variable holds the entire content of the page. The $matches array holds whatever matched the title pattern.

Edited because I didn't explain that very well.

The $matches array will contain multiple elements, e.g.
$matches[0] will contain the text that matched the full pattern
$matches[1] will contain the text that matched the first parenthesized subpattern

http://us2.php.net/manual/en/function.preg-match.php
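
To make the two elements concrete, here is a minimal sketch that runs the same pattern against a hard-coded string instead of the live page. The sample markup is made up for illustration; only the pattern comes from the post above.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$content = "<html><head><title>GunZ Rankings</title></head><body></body></html>";

// preg_match() returns 1 on a match, 0 on no match, and false on error.
if (preg_match('#<title>(.*?)</title>#s', $content, $matches) === 1) {
    echo $matches[0], "\n"; // full match, tags included
    echo $matches[1], "\n"; // first captured group: the text between the tags
}
```

Checking the return value before reading $matches avoids undefined-index notices when the page layout changes and nothing matches.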

Posted: Thu Dec 28, 2006 9:36 am
by Citizen
Thanks! I'll start with that and begin testing and read more!

-Cit

Edit:

Question: do $matches[0], $matches[1], and so on hold the numbered results when there are multiple matches?

So if I had a page like

<title>Hello 1</title>
<title>Hello 2</title>

the [0] result would be the 1
and [1] would be the 2?

Posted: Thu Dec 28, 2006 10:57 am
by feyd
Nope.

Zero will be the entire piece matched, one will be the contents of the container.

Standards dictate that only one <title> tag may occur within the document (specifically as a child of the <head> container) if memory serves. If you insist on finding all possible tags within a given string use preg_match_all() which will result in a slightly different result array.
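
As a rough illustration of that "slightly different result array", here is a small sketch using the two-title input from the question. The input string is the made-up example from above, not real page content.

```php
<?php
// Made-up input with two matching tags, as in the question above.
$content = "<title>Hello 1</title>\n<title>Hello 2</title>";

// With the default flag (PREG_PATTERN_ORDER) results are grouped by
// capture group, not by match.
$count = preg_match_all('#<title>(.*?)</title>#s', $content, $matches);

print_r($matches[0]); // every full match, tags included
print_r($matches[1]); // every first-group capture: "Hello 1", "Hello 2"
```

So with preg_match_all() the second match is at $matches[1][1], not $matches[1].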

Posted: Thu Dec 28, 2006 11:30 am
by Citizen
Thanks guys, I did some reading on preg_match_all and got it working.

The only problem I've run into is how to find a number between parentheses.

(56%)

->

((.*?)%)

How can I escape that first (?

Posted: Thu Dec 28, 2006 11:42 am
by feyd
Escapes are done with a backslash immediately followed by the metacharacter, for example \(. It's best to escape each of the metacharacters, whether currently required or not, as good practice and for future-proofing.
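
If the text to match is a fixed literal, PHP's preg_quote() can add those backslashes for you. It isn't mentioned in this thread, so treat the following as a sketch of one alternative to escaping by hand; the sample strings are made up.

```php
<?php
$literal = '(56%)';

// preg_quote() backslash-escapes regex metacharacters; the second argument
// additionally escapes the chosen pattern delimiter (# here).
$escaped = preg_quote($literal, '#');
echo $escaped, "\n";

// The escaped text now matches itself literally inside a larger string.
$found = preg_match('#' . $escaped . '#', 'Win rate: (56%)');
```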

Posted: Thu Dec 28, 2006 11:43 am
by Ollie Saunders
This thread should probably be filed under regex.

\(\d+%\)
will match (56%)

Posted: Thu Dec 28, 2006 11:52 am
by Citizen
ole wrote:
This thread should probably be filed under regex.

\(\d+%\)
will match (56%)

I meant to get a result of "56" without the parentheses.

Posted: Thu Dec 28, 2006 12:03 pm
by Ollie Saunders
oh right.

\((\d+)%\)

edit: I might have got the wrong end of the stick again in which case you're after this:
(\d+)%
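
A short sketch of the first pattern in use, assuming # as the delimiter and the "(56%)" sample from earlier in the thread:

```php
<?php
$text = '(56%)'; // sample taken from the question above

// The escaped parentheses match literally; the inner, unescaped pair captures.
preg_match('#\((\d+)%\)#', $text, $m);

echo $m[0], "\n"; // the whole match, parentheses included
echo $m[1], "\n"; // just the digits
```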

Posted: Thu Dec 28, 2006 12:14 pm
by Citizen
Thanks!

Another question:

How do I skip a section of code?

for instance,

num/den

If num and den are both dynamic numbers, and I want to find only den, how do I skip over num?

Posted: Thu Dec 28, 2006 12:28 pm
by feyd
Try some stuff. Post what you try and your results.

Posted: Thu Dec 28, 2006 9:42 pm
by Citizen
This is a sample of the html result I get back from the page that I'm trying to regex information from:

Code: Select all

<td class="list_data_col02_c"><span class="guild_name">MØNST€®</span></td>

								<td class="list_data_col03_c"><span>20</span></td>
								<td class="list_data_col04_c"><span>497,903</span></td>
								<td class="list_data_col05_c"><span>869/1020 
								
								(46%)
								
								</span></td>
My original code was:

Code: Select all

$content = file_get_contents('http://gunz.ijji.com/ranking/individual.nhn');

preg_match_all('#<td class="list_data_col02_c"><span class="guild_name">(.*?)</span></td>#s', $content, $matches); 
preg_match_all('#<td class="list_data_col03_c"><span>(.*?)</span></td>#s', $content, $level); 
preg_match_all('#<td class="list_data_col04_c"><span>(.*?)</span></td>#s', $content, $exp); 
preg_match_all('#<td class="list_data_col05_c"><span>(.*?)/#s', $content, $kills); 
preg_match_all('#/(.*?)</span></td>#s', $content, $deaths); 
preg_match_all('#\((.*?)%\)
								
								</span></td>#s', $content, $pc); 
for($i=0;$i<20;$i++){

	$match = $matches[1][$i];
	$level1 = $level[1][$i];
	$exp1 = $exp[1][$i];
	$kills1 = $kills[1][$i];
	$deaths1 = $deaths[1][$i];
	$pc1 = $pc[1][$i];

	echo"<P>$i $match<br/>L: $level1<br />E: $exp1<br />K: $kills1<br />D: $deaths1<br />PC: $pc1";

        //save to db
}
Then I realized that the variables didn't always match because my results were sometimes coming from different places and out of order... so then I tried...

Code: Select all

preg_match_all('#<td class="list_data_col02_c"><span class="guild_name">(.*?)</span></td>

								<td class="list_data_col03_c"><span>(.*?)</span></td>
								<td class="list_data_col04_c"><span>(.*?)</span></td>
								<td class="list_data_col05_c"><span>(.*?)/(.*?) 
								
								\((.*?)%\)
								
								</span></td>#s', $content, $matches); 

for($i=0;$i<20;$i++){

	$match = $matches[1][$i];
	$level1 = $matches[2][$i];
	$exp1 = $matches[3][$i];
	$kills1 = $matches[4][$i];
	$deaths1 = $matches[5][$i];
	$pc1 = $matches[6][$i];

	echo"<P>$i $match<br/>L: $level1<br />E: $exp1<br />K: $kills1<br />D: $deaths1<br />PC: $pc1";
}
Which had 0 results. What am I doing wrong?
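
One possible cause, offered as an assumption since nothing on this page confirms it: the tabs, spaces, and line breaks typed into the pattern have to match the page's whitespace exactly, and a single mismatch (for example \r\n line endings) makes the whole pattern fail. A sketch that replaces the literal whitespace with \s* and uses PREG_SET_ORDER so each row's fields stay together; the row is copied from the snippet above and its whitespace is only approximated.

```php
<?php
// One row from the sample HTML above; the real page's whitespace may differ.
$content = <<<'HTML'
<td class="list_data_col02_c"><span class="guild_name">MØNST€®</span></td>

    <td class="list_data_col03_c"><span>20</span></td>
    <td class="list_data_col04_c"><span>497,903</span></td>
    <td class="list_data_col05_c"><span>869/1020 

    (46%)

    </span></td>
HTML;

// \s* tolerates any run of whitespace between cells and around the numbers.
$pattern = '#<td class="list_data_col02_c"><span class="guild_name">(.*?)</span></td>\s*'
         . '<td class="list_data_col03_c"><span>(.*?)</span></td>\s*'
         . '<td class="list_data_col04_c"><span>(.*?)</span></td>\s*'
         . '<td class="list_data_col05_c"><span>\s*(\d+)\s*/\s*(\d+)\s*\((\d+)%\)\s*</span></td>#s';

// PREG_SET_ORDER yields one array per row: [full match, name, level, ...].
preg_match_all($pattern, $content, $rows, PREG_SET_ORDER);

foreach ($rows as $row) {
    list(, $name, $level, $exp, $kills, $deaths, $pc) = $row;
    echo "$name L: $level E: $exp K: $kills D: $deaths PC: $pc\n";
}
```

The (\d+)\s*/\s*(\d+) part also shows one way to handle the earlier num/den question: capture both numbers and simply ignore the group you don't need. Looping over $rows instead of a fixed count of 20 avoids reading entries that don't exist.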

Posted: Fri Dec 29, 2006 2:43 pm
by Citizen
Does anyone post in the Regex section? :)

Posted: Fri Dec 29, 2006 3:48 pm
by wildwobby
I'm a complete n00b with RegEx, but could it have to do with the quotes in the HTML inside the preg_match_all() call? Try escaping them.

Posted: Fri Dec 29, 2006 4:16 pm
by RobertGonzalez
A lot of us post here. :wink: