Legal scraping, not sure where to start!

Citizen
Forum Contributor
Posts: 300
Joined: Wed Jul 20, 2005 10:23 am

Post by Citizen »

Hey,

I was just given permission to pull rankings information from this page:

http://gunz.ijji.com/ranking/individual.nhn

I did some reading on regex but I'm not sure where to start!

Basically, I need to pull the individual ranking information, save it to a database, and update it once per day so I don't query their server too much.

Where do I start?
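
The once-per-day update mentioned above could be handled with a small file cache, so the remote page is requested only when the local copy is stale. This is a minimal sketch, not a definitive implementation: `fetch_cached`, the cache path, and the `$fetch` callable are hypothetical names introduced for illustration.

```php
<?php
// Hypothetical helper: return the cached page body, refetching at most
// once every $maxAge seconds. $fetch is any callable returning the body.
function fetch_cached(string $cacheFile, int $maxAge, callable $fetch): string
{
    // Make sure filemtime() reports the current state of the file.
    clearstatcache(true, $cacheFile);

    // Reuse the cached copy while it is still fresh.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        return file_get_contents($cacheFile);
    }

    // Otherwise fetch a new copy and store it for later calls.
    $body = $fetch();
    file_put_contents($cacheFile, $body);
    return $body;
}
```

With this in place, something like `fetch_cached('/tmp/rankings.html', 86400, function () { return file_get_contents('http://gunz.ijji.com/ranking/individual.nhn'); })` would contact the server at most once per day (the `/tmp` path is an assumption).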
kingconnections
Forum Contributor
Posts: 137
Joined: Thu Jul 14, 2005 4:28 pm

Post by kingconnections »

You would start with something like this:

Code: Select all

$content = file_get_contents('http://gunz.ijji.com/ranking/individual.nhn'); 
preg_match('#<title>(.*?)</title>#s', $content, $matches);

This code looks for the title of the page. You would have to adjust the expression to match what you need.

The $content variable holds the entire contents of the page. The $matches array holds anything that matched the title search.

Edited because I didn't explain that very well.

The $matches array will contain multiple elements, i.e.:
$matches[0] will contain the text that matched the full pattern
$matches[1] will contain the text that matched the first captured parenthesized subpattern

http://us2.php.net/manual/en/function.preg-match.php
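
To make the layout of $matches concrete, here is a minimal sketch that runs the same pattern against an inline string instead of the live page. The sample markup is an assumption made up for illustration.

```php
<?php
// Sample markup standing in for the downloaded page (assumption).
$content = "<html><head><title>Individual Ranking</title></head><body></body></html>";

if (preg_match('#<title>(.*?)</title>#s', $content, $matches)) {
    echo $matches[0], "\n"; // full match, tags included
    echo $matches[1], "\n"; // first capture group: the text between the tags
}
```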

Post by Citizen »

Thanks! I'll start with that and begin testing and read more!

-Cit

Edit:

Question: do $matches[0], $matches[1], and so on hold the numbered results when there are multiple matches?

So if I had a page like

<title>Hello 1</title>
<title>Hello 2</title>

the [0] result would be the 1
and [1] would be the 2?
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Nope.

Zero will be the entire piece matched; one will be the contents of the container.

Standards dictate that only one <title> tag may occur within the document (specifically as a child of the <head> container), if memory serves. If you insist on finding all possible tags within a given string, use preg_match_all(), which produces a slightly different result array.
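
As a rough illustration of that "slightly different result array", the sketch below runs preg_match_all() in its default mode and again with PREG_SET_ORDER. The input string is made up for the example.

```php
<?php
$html = "<span>20</span><span>497,903</span>";

// Default (PREG_PATTERN_ORDER): $byGroup[0] holds every full match,
// $byGroup[1] holds every first capture group.
preg_match_all('#<span>(.*?)</span>#s', $html, $byGroup);

// PREG_SET_ORDER: one sub-array per match, each laid out like a
// preg_match() result ([0] full match, [1] first capture).
preg_match_all('#<span>(.*?)</span>#s', $html, $byMatch, PREG_SET_ORDER);

print_r($byGroup[1]); // captures from all matches
print_r($byMatch[1]); // everything belonging to the second match
```

PREG_SET_ORDER tends to be easier to work with when each match represents one record.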

Post by Citizen »

Thanks guys, I did some reading on preg_match_all and got it working.

The only problem I've run into is how to find a number between parentheses.

(56%)

->

((.*?)%)

How can I escape that first (?

Post by feyd »

Escapes are done with a backslash immediately followed by the metacharacter. For example, \(. It's best to escape each of the metacharacters, whether currently required or not, just for best practice and possible future-proofing.
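
A short sketch of both ways to escape: by hand with backslashes, or with PHP's preg_quote(), which escapes the metacharacters in a literal string for you. The sample text is taken from the thread; the variable names are illustrative.

```php
<?php
$text = "869/1020 (46%)";

// Hand-escaped: \( and \) match literal parentheses.
$found = preg_match('#\(.*?%\)#', $text, $m);

// preg_quote() escapes regex metacharacters in a literal string;
// the second argument also escapes the chosen delimiter (# here).
$literal = preg_quote('(46%)', '#');
$alsoFound = preg_match('#' . $literal . '#', $text);

echo $m[0], "\n";
echo $literal, "\n";
```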
Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

This thread should probably be filed under regex.

\(\d+%\)
will match (56%)

Post by Citizen »

ole wrote:
This thread should probably be filed under regex.

\(\d+%\)
will match (56%)

I meant to get a result of "56" without the parentheses.

Post by Ollie Saunders »

Oh, right.

\((\d+)%\)

Edit: I might have got the wrong end of the stick again, in which case you're after this:
(\d+)%
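
The two patterns differ in what they require around the number. A small sketch, using a made-up input string, shows the difference: the first only accepts a percentage wrapped in parentheses, the second accepts any digits followed by a percent sign.

```php
<?php
// Made-up input containing one bare percentage and one in parentheses.
$text = "50% off (46%)";

// Requires literal parentheses around the number; captures only the digits.
preg_match('#\((\d+)%\)#', $text, $a);

// Matches the first run of digits followed by %, parentheses or not.
preg_match('#(\d+)%#', $text, $b);

$inParens = (int) $a[1];
$firstPercent = (int) $b[1];
echo $inParens, " ", $firstPercent, "\n";
```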

Post by Citizen »

Thanks!

Another question:

How do I skip a section of code?

For instance,

num/den

If num and den are both dynamic numbers, and I want to find only den, how do I skip over num?

Post by feyd »

Try some stuff. Post what you try and your results.
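
For reference, one possible way to skip num is to match it without wrapping it in a capture group, so only den is captured. This is a sketch under the assumption that both values are plain runs of digits; the sample cell is abbreviated from the markup posted later in the thread.

```php
<?php
$cell = '<td class="list_data_col05_c"><span>869/1020';

// \d+/ consumes the numerator and the slash without capturing them;
// only the denominator sits inside a capture group.
preg_match('#\d+/(\d+)#', $cell, $m);

$den = $m[1];
echo $den, "\n";
```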

Post by Citizen »

This is a sample of the HTML I get back from the page that I'm trying to extract information from with regex:

Code: Select all

<td class="list_data_col02_c"><span class="guild_name">MØNST€®</span></td>

								<td class="list_data_col03_c"><span>20</span></td>
								<td class="list_data_col04_c"><span>497,903</span></td>
								<td class="list_data_col05_c"><span>869/1020 
								
								(46%)
								
								</span></td>
My original code was:

Code: Select all

$content = file_get_contents('http://gunz.ijji.com/ranking/individual.nhn');

preg_match_all('#<td class="list_data_col02_c"><span class="guild_name">(.*?)</span></td>#s', $content, $matches); 
preg_match_all('#<td class="list_data_col03_c"><span>(.*?)</span></td>#s', $content, $level); 
preg_match_all('#<td class="list_data_col04_c"><span>(.*?)</span></td>#s', $content, $exp); 
preg_match_all('#<td class="list_data_col05_c"><span>(.*?)/#s', $content, $kills); 
preg_match_all('#/(.*?)</span></td>#s', $content, $deaths); 
preg_match_all('#\((.*?)%\)
								
								</span></td>#s', $content, $pc); 
for($i=0;$i<20;$i++){

	$match = $matches[1][$i];
	$level1 = $level[1][$i];
	$exp1 = $exp[1][$i];
	$kills1 = $kills[1][$i];
	$deaths1 = $deaths[1][$i];
	$pc1 = $pc[1][$i];

	echo"<P>$i $match<br/>L: $level1<br />E: $exp1<br />K: $kills1<br />D: $deaths1<br />PC: $pc1";

        //save to db
}
Then I realized that the variables didn't always match because my results were sometimes coming from different places and out of order... so then I tried...

Code: Select all

preg_match_all('#<td class="list_data_col02_c"><span class="guild_name">(.*?)</span></td>

								<td class="list_data_col03_c"><span>(.*?)</span></td>
								<td class="list_data_col04_c"><span>(.*?)</span></td>
								<td class="list_data_col05_c"><span>(.*?)/(.*?) 
								
								\((.*?)%\)
								
								</span></td>#s', $content, $matches); 

for($i=0;$i<20;$i++){

	$match = $matches[1][$i];
	$level1 = $matches[2][$i];
	$exp1 = $matches[3][$i];
	$kills1 = $matches[4][$i];
	$deaths1 = $matches[5][$i];
	$pc1 = $matches[6][$i];

	echo"<P>$i $match<br/>L: $level1<br />E: $exp1<br />K: $kills1<br />D: $deaths1<br />PC: $pc1";
}
Which had 0 results. What am I doing wrong?
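
One likely cause (an assumption, since the live page is not available here) is that the literal whitespace pasted into the pattern does not match the page byte for byte, for example tabs versus spaces or \r\n line endings. A sketch that replaces the literal whitespace with \s* and reads each row with PREG_SET_ORDER follows; the sample row is adapted from the post above with a simplified name.

```php
<?php
// Sample row adapted from the thread; the real page may differ (assumption).
$content = <<<HTML
<td class="list_data_col02_c"><span class="guild_name">MONSTER</span></td>

								<td class="list_data_col03_c"><span>20</span></td>
								<td class="list_data_col04_c"><span>497,903</span></td>
								<td class="list_data_col05_c"><span>869/1020 
								
								(46%)
								
								</span></td>
HTML;

// \s* tolerates any amount of whitespace (spaces, tabs, line breaks)
// between the cells and around the kills/deaths/percentage values.
$pattern = '#<td class="list_data_col02_c"><span class="guild_name">(.*?)</span></td>\s*'
         . '<td class="list_data_col03_c"><span>(.*?)</span></td>\s*'
         . '<td class="list_data_col04_c"><span>(.*?)</span></td>\s*'
         . '<td class="list_data_col05_c"><span>\s*(\d+)/(\d+)\s*\((\d+)%\)\s*</span></td>#s';

// One sub-array per table row, so the fields of a row always stay together.
preg_match_all($pattern, $content, $rows, PREG_SET_ORDER);

foreach ($rows as $row) {
    list(, $name, $level, $exp, $kills, $deaths, $pc) = $row;
    echo "$name L: $level E: $exp K: $kills D: $deaths PC: $pc\n";
}
```

Looping over $rows instead of a fixed count of 20 also avoids undefined-index notices when fewer rows match.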

Post by Citizen »

Anyone post in the Regex section? :)
wildwobby
Forum Commoner
Posts: 66
Joined: Sat Jul 01, 2006 8:35 pm

Post by wildwobby »

I'm a complete n00b with RegEx, but could it have something to do with the quotes in the HTML inside the preg_match_all() call? Try escaping them.
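
A quick check of the quoting suggestion: inside a single-quoted PHP string a double quote is an ordinary character, and it is not a regex metacharacter either, so it should not need escaping. The sample markup below is made up for illustration.

```php
<?php
// Double quotes inside a single-quoted PHP string need no backslash.
$pattern = '#<span class="guild_name">(.*?)</span>#s';
$html = '<span class="guild_name">Example</span>';

$hits = preg_match($pattern, $html, $m);
echo $hits, " ", $m[1], "\n";
```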
RobertGonzalez
Site Administrator
Posts: 14293
Joined: Tue Sep 09, 2003 6:04 pm
Location: Fremont, CA, USA

Post by RobertGonzalez »

A lot of us post here. ;)