Parsing an HTML site and using some content?

Posted: Sun Nov 21, 2004 12:34 am
by optik
Hello. I've been looking for a sample for a couple of hours now (not to mention a few months back as well), but I never found what I needed. I always find more advanced things but can't find the basics, and I'm not too advanced in PHP yet, so hopefully you guys can help.
I'm trying to parse a server status page to tell if the server is up or down. So I need something that gets the HTML code, finds some string in it, and uses that line and a few lines after it, or, if possible, selects the <tr> range. I've done similar things in batch scripts for HTML, so that's where the idea comes from. Maybe in PHP it's much different; if so, give me an idea at least.
Thank you, all ideas/suggestions welcome.

Posted: Sun Nov 21, 2004 1:49 am
by rehfeld

Code: Select all

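$doc = file_get_contents('http://foo.org');

// sscanf() matches from the start of the string, so skip ahead to "Status:" first
sscanf((string) strstr($doc, 'Status:'), "Status: %s", $status);

echo $status;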
You'll need to learn how to use sscanf; that's likely going to be the easiest way to get the status. As long as you always know what text will be in front of the actual status, sscanf will work well.

if the document was:

<html>
<table>
<td> blah foo bar Status: online blah blah fooo

</html>

the above code will work, and echo $status would output "online"

Posted: Sun Nov 21, 2004 10:42 am
by optik
nice thanks a lot i'm on it hehe

Posted: Sun Nov 21, 2004 10:50 am
by John Cartwright
You can also use preg_match and preg_replace, but they aren't recommended for beginners :P
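As a rough illustration of the preg_match route, here is a minimal sketch. It assumes the same made-up "Status: online" sample text from the earlier post, not a real status page, and the 'unknown' fallback is only a placeholder.

```php
<?php
// Sample text assumed from the earlier post in this thread; not a real status page.
$doc = "<html>\n<table>\n<td> blah foo bar Status: online blah blah fooo\n\n</html>";

// \s* allows optional whitespace after the colon; (\w+) captures the status word.
if (preg_match('/Status:\s*(\w+)/i', $doc, $matches)) {
    $status = $matches[1];   // $matches[0] is the whole match, [1] is the first group
} else {
    $status = 'unknown';     // hypothetical fallback when the label is missing
}

echo $status, "\n";
```

Unlike sscanf, the pattern may match anywhere in the string, so the text in front of "Status:" does not matter.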

Posted: Sun Nov 21, 2004 11:44 am
by optik
Actually, that helps me a little more. I've been playing with that sscanf for like 30 minutes and can't get any results, and with preg_match I already got one of the things I'm looking for, thanks. I've worked a bit with MySQL, Tcl and all the parsing in that area, so those characters aren't scary, heh.

Posted: Sun Nov 21, 2004 11:48 am
by John Cartwright
Maybe you can be a bit more specific about what you are trying to match, so I can help you with the regex.

Also, you should look into preg_match_all if you are searching for multiple things.
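A small sketch of what preg_match_all gives you when there are several things to find. The two server links below are invented for illustration and only loosely modelled on soldat:// links; the real lobby page may be laid out differently.

```php
<?php
// Hypothetical snippet with two server links; addresses and names are made up.
$html = '<a href="soldat://10.0.0.1:23073/"><b>Alpha</b></a>'
      . '<a href="soldat://10.0.0.2:23074/"><b>Bravo</b></a>';

// "#" is used as the delimiter so the slashes need no escaping.
// PREG_SET_ORDER groups the results per match: $m[1] is the ip:port, $m[2] the name.
preg_match_all('#soldat://([\d.]+:\d+)/"><b>(.*?)</b>#i', $html, $matches, PREG_SET_ORDER);

$servers = array();
foreach ($matches as $m) {
    $servers[$m[1]] = $m[2];
}

print_r($servers);
```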

Posted: Sun Nov 21, 2004 11:59 am
by optik
I am looking to parse http://lobby.soldat.pl:13073/index.html
and get all the details in that table where there's 66.17.183.250:65000: the player count, everything like that. It's also just something I want to learn some PHP with, so if you've got time to parse one thing in that format, that would be a great start for me. I've never really parsed anything with PHP so far.

Posted: Sun Nov 21, 2004 12:15 pm
by timvw
now, if you remove all the \r and \n and \s+ the matching should go quite easily.

to help you find the correct regular expression, you can use:
http://www.samuelfullman.com/team/php/t ... ster_p.php

Posted: Sun Nov 21, 2004 1:58 pm
by rehfeld
I can't find "66.17.183.250:65000" anywhere on that page.

is it only going to appear sometimes?


could you pick the name of something thats actually on the page, and then give us examples of what you want to parse out of it?

Posted: Sun Nov 21, 2004 2:01 pm
by optik
Sorry, but one more question (btw, that expression tester is good). I have:

/<a href=\"soldat:\/\/66.17.183.250:65000\/\"><font color=\"#79E958\"><b>\|Optik's Server\|<\/b><\/font><\/a><\/td>/i

and I want to match \|Optik's Server\| without writing in the name, so it can be dynamic. I've been trying to find a list of specifiers or something I could use, and also tried [a-z''\|], but no luck. I'm kind of lost; is there a list somewhere of what I could be using?

Also, in your earlier post you said to remove \r, \n and \s+. Would I be using preg_replace for that?

Posted: Sun Nov 21, 2004 2:03 pm
by optik
rehfeld wrote:I can't find "66.17.183.250:65000" anywhere on that page.

is it only going to appear sometimes?


could you pick the name of something thats actually on the page, and then give us examples of what you want to parse out of it?
That's in one of the links on the page, so it's only in the source code, not on the visible page. I need to pick out the info by my server's ip:port and then get the details about it. That way, if I run a few servers it would also work, and I could change the name etc. and it would still go OK. I'm kind of trying to make one for other users also, so it would be universal and you'd only need the ip:port of your server.
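To make the "only need the ip:port" idea concrete, here is one possible sketch. The helper name serverPattern is hypothetical, and the sample markup is assumed from the pattern posted above rather than copied from the live page.

```php
<?php
// Hypothetical helper: builds a pattern for one server from its ip:port.
function serverPattern($ipPort)
{
    // preg_quote() escapes the dots (and the "/" delimiter) so they match literally.
    $quoted = preg_quote($ipPort, '/');
    return '/<a href="soldat:\/\/' . $quoted . '\/">.*?<b>(.*?)<\/b>/is';
}

// Sample markup assumed from the pattern posted earlier in this thread.
$html = '<td><a href="soldat://66.17.183.250:65000/"><font color="#79E958">'
      . '<b>|Optik\'s Server|</b></font></a></td>';

$name = null;
if (preg_match(serverPattern('66.17.183.250:65000'), $html, $matches)) {
    $name = $matches[1];   // whatever sits between <b> and </b>, so the name can change
}

echo $name, "\n";
```

Because the server name is captured rather than written into the pattern, renaming the server does not break the match; only the ip:port has to stay the same.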

Posted: Sun Nov 21, 2004 2:13 pm
by John Cartwright
Untested (plus I don't know much about regex):

Code: Select all

/<a href="soldat:\/\/66.17.183.250:65000\/"><font color="#79E958"><b>([A-Za-z'| ]+)<\/b><\/font><\/a><\/td>/i

Posted: Sun Nov 21, 2004 2:33 pm
by timvw

Code: Select all

/<a href="soldat:\/\/66.17.183.250:65000\/"><font color="#79E958"><b>(.*?)<\/b><\/font><\/a><\/td>/i
With preg_match, this will return the matched stuff (the things between ( and )) in $matches.
optik wrote:Also, in your earlier post you said to remove \r, \n and \s+. Would I be using preg_replace for that?
I said that because it would allow you to match something like

<tr><td>(.*?)</td><td>(.*?)</td>.....</tr>
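A minimal sketch of that idea with preg_replace. Note that it only drops whitespace between tags rather than all whitespace, which is a slight variation on the suggestion above: it keeps spaces inside cell values such as server names. The table fragment is made up for illustration.

```php
<?php
// Hypothetical table fragment spread over several lines, as a real page might be.
$html = "<tr>\r\n  <td>CTF</td>\r\n  <td>ctf_Dropdown</td>\r\n</tr>";

// Drop whitespace between tags only; text inside the cells is left untouched.
$flat = preg_replace('/>\s+</', '><', $html);

// With the row on one line, a simple pattern can pick out both cells.
preg_match('/<tr><td>(.*?)<\/td><td>(.*?)<\/td><\/tr>/i', $flat, $matches);

echo $matches[1], ' / ', $matches[2], "\n";
```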

Posted: Sun Nov 21, 2004 2:39 pm
by optik
Yeah, I knew what you meant, I just wasn't sure which command I would use, hehe. Will try those others in a sec.

Posted: Sun Nov 21, 2004 3:26 pm
by optik
Need some more help, hehe. I figured out how to get the string that I wanted, which contains all the info I need. I thought this would be easier than parsing everything separately: just get the table row I want and then parse that part (less CPU usage too, I guess). Anyway, I've got the string, and now I need to find out how to parse it. I'm lost once more, so if you could help, that would be good.

Code: Select all

/<ahref="soldat:\/\/66.17.183.250:65000\/"><fontcolor="#79E958"><b>(.*?)<\/b><\/font><\/a><\/td><tdwidth="37\%">(.*?)<\/td><tdwidth="8\%">(.*?)<\/td><tdwidth="14\%">(.*?)<\/td><tdwidth="12\%">(.*?)\/(.*?)<\/td><tdwidth="8\%">(.*?)<\/td>/i
returns something like

Code: Select all

<ahref="soldat://66.17.183.250:65000/"><fontcolor="#79E958"><b>|Optik'sServer|</b></font></a></td><tdwidth="37%"></td><tdwidth="8%">CTF</td><tdwidth="14%">ctf_Dropdown</td><tdwidth="12%">0/12</td><tdwidth="8%">1.2.1*</td>
and I'm wondering how I could take data out of it, like '|Optik'sServer|' or any of the other fields. When I tried it the other way before, it got me the whole string, so when I echo it, it just prints the whole string, which isn't really good. I'd like to parse out the exact data and then format it as wanted. I don't really know how to specify the place where each piece should be, or how to use it.
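One way this could work, as a sketch: preg_match fills its third argument with the whole match at index 0 and each (.*?) group at indexes 1, 2, 3 and so on, in the order the groups appear. The pattern and row below are the ones posted above (with "~" as the delimiter and the dots escaped); the field names are hypothetical labels guessed from the sample values.

```php
<?php
// The whitespace-stripped row posted above.
$row = '<ahref="soldat://66.17.183.250:65000/"><fontcolor="#79E958"><b>|Optik\'sServer|</b></font></a></td>'
     . '<tdwidth="37%"></td><tdwidth="8%">CTF</td><tdwidth="14%">ctf_Dropdown</td>'
     . '<tdwidth="12%">0/12</td><tdwidth="8%">1.2.1*</td>';

// Same pattern as above, with "~" as delimiter so "/" and "#" need no escaping.
$pattern = '~<ahref="soldat://66\.17\.183\.250:65000/"><fontcolor="#79E958"><b>(.*?)</b></font></a></td>'
         . '<tdwidth="37%">(.*?)</td><tdwidth="8%">(.*?)</td><tdwidth="14%">(.*?)</td>'
         . '<tdwidth="12%">(.*?)/(.*?)</td><tdwidth="8%">(.*?)</td>~i';

$server = array();
if (preg_match($pattern, $row, $m)) {
    // $m[0] is the whole matched string; $m[1]..$m[7] are the groups in order.
    // The key names are hypothetical labels, guessed from the sample values.
    $server = array(
        'name'        => $m[1],
        'info'        => $m[2],
        'mode'        => $m[3],
        'map'         => $m[4],
        'players'     => $m[5],
        'max_players' => $m[6],
        'version'     => $m[7],
    );

    // Format the pieces however you like once they are separate values.
    echo $server['name'], ': ', $server['map'], ' (', $server['mode'], '), ',
         $server['players'], '/', $server['max_players'], " players\n";
}
```

Echoing $m[0] gives the whole string back, which is probably what happened before; the individual fields live in the numbered entries after it.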