Parsing Table with PHP Simple HTML DOM Parser

kaje
Forum Newbie
Posts: 11
Joined: Sat Feb 28, 2009 12:01 pm

Post by kaje »

I'll start off by saying that I don't have much experience with PHP, but I am creating an iPhone app and need a way to parse data out of a table on a website and then format it into a mobile/iPhone-optimized site for easier viewing. In researching a way to do this, I found that PHP Simple HTML DOM Parser looks to be the easiest option. However, being so new to PHP, I am having a problem setting up the PHP file to read the contents of the table.

Here is the website with the table I am trying to parse:

http://tinyurl.com/bafm8

What I want to do is reproduce that on my own website as a simple table, without the graphics and extra information/links. Following some examples on the SourceForge page linked above, this is what I've tried so far, just to see if I can pull the data from the table, but it doesn't appear to work at all (blank page).

Code:

<?php
include_once('simple_html_dom.php');
 
$html = file_get_html('http://www.okstate.com/SportSelect.dbml?DB_OEM_ID=200&KEY=&SPID=143&SPSID=1420');
$es = $html->find('table.odd td')->plaintext;
 
?>
Any help would be appreciated!
php_east
Forum Contributor
Posts: 453
Joined: Sun Feb 22, 2009 1:31 pm
Location: Far Far East.

Post by php_east »

some tips...

1. try and dump the $html

var_dump($html);
before you use it in
$es = $html->find('table.odd td')->plaintext;

and see what unfolds...

2. the API doesn't say if the URL has to be encoded, but it's worth a try just in case..

Code:

 
$url = urlencode('http://www.okstate.com/SportSelect.dbml?DB_OEM_ID=200&KEY=&SPID=143&SPSID=1420');
$html = file_get_html($url);
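
A note on tip 2, as a minimal sketch: urlencode() is meant for individual query-string values, so running it over a whole URL also escapes the scheme, the slashes and the separators, and the result is no longer a fetchable address. The snippet below only illustrates that effect. Rebuilding the query with http_build_query() is one common alternative; the variable names are illustrative, not from the thread.

```php
<?php
$url = 'http://www.okstate.com/SportSelect.dbml?DB_OEM_ID=200&KEY=&SPID=143&SPSID=1420';

// Encoding the whole URL escapes ':', '/', '?', '=' and '&' as well.
$encoded = urlencode($url);
echo $encoded, "\n";
// http%3A%2F%2Fwww.okstate.com%2FSportSelect.dbml%3FDB_OEM_ID%3D200%26KEY%3D%26SPID%3D143%26SPSID%3D1420

// Encoding only the parameter values keeps the URL structure intact.
// The separator is passed explicitly so php.ini settings cannot change it.
$query = http_build_query(array(
    'DB_OEM_ID' => 200,
    'KEY'       => '',
    'SPID'      => 143,
    'SPSID'     => 1420,
), '', '&');
$rebuilt = 'http://www.okstate.com/SportSelect.dbml?' . $query;
echo $rebuilt, "\n";
```

Since none of the values here contain special characters, the rebuilt URL is identical to the original one, which suggests the original URL can be passed to the fetch as it is.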
 

Post by kaje »

I tried the var_dump($html); and got a very long page that looked like this:
object(simple_html_dom)#1 (18) { ["root"]=> object(simple_html_dom_node)#2 (8) { ["nodetype"]=> int(5) ["tag"]=> string(4) "root" ["attr"]=> array(0) { } ["children"]=> array(2) { [0]=> object(simple_html_dom_node)#4 (8) { ["nodetype"]=> int(6) ["tag"]=> string(7) "unknown" ["attr"]=> array(0) { } ["children"]=> array(0) { } ["nodes"]=> array(0) { } ["parent"]=> object(simple_html_dom_node)#2 (8) { ["nodetype"]=> int(5) ["tag"]=> string(4) "root" ["attr"]=> array(0) { } ["children"]=> array(2) { [0]=> object(simple_html_dom_node)#4 (8) { ["nodetype"]=> int(6) ["tag"]=> string(7) "unknown" ["attr"]=> array(0) { } ["children"]=> array(0) { } ["nodes"]=> array(0) { } ["parent"]=> *RECURSION* ["_"]=> array(2) { [0]=> int(2) [4]=> string(102) "" } ["dom:private"]=> object(simple_html_dom)#1 (18) { ["root"]=> *RECURSION* ["nodes"]=> array(1234) { [0]=> *RECURSION* [1]=> object(simple_html_dom_node)#3 (8) { ["nodetype"]=> int(3) ["tag"]=> string(4) "text" ["attr"]=> array(0) { } ["children"]=> array(0) { } ["nodes"]=> array(0) { } ["parent"]=> *RECURSION* ["_"]=> array(1) { [4]=> string(1) " " } ["dom:private"]=> *RECURSION* } [2]=> *RECURSION* [3]=> object(simple_html_dom_node)#5 (8) { ["nodetype"]=> int(3) ["tag"]=> string(4) "text" ["attr"]=> array(0) { } ["children"]=> array(0) { } ["nodes"]=> array(0) { } ["parent"]=> *RECURSION* ["_"]=> array(1) { [4]=> string(1) " " } ["dom:private"]=> *RECURSION* } [4]=> object(simple_html_dom_node)#6 (8) { ["nodetype"]=> int(1) ["tag"]=> string(4) "html" ["attr"]=> array(0) { } ["children"]=> array(2) { [0]=> object(simple_html_dom_node)#8 (8) { ["nodetype"]=> int(1) ["tag"]=> string(4) "head" ["attr"]=> array(0) { } ["children"]=> array(31) { [0]=> object(simple_html_dom_node)#10 (8) { ["nodetype"]=> int(1) ["tag"]=> string(4) "meta" ["attr"]=> array(2) { ["http-equiv"]=> string(13) "Cache-Control" ["content"]=> string(8) "no-cache" } ["children"]=> array(0) { } ["nodes"]=> array(0) { } ["parent"]=> 
object(simple_html_dom_node)#8 (8) { ["nodetype"]=> int(1) ["tag"]=> string(4) "head" ["attr"]=> array(0) { } ["children"]=> array(31) { [0]=> object(simple_html_dom_node)#10 (8) { ["nodetype"]=> int(1) ["tag"]=> string(4) "meta" ["attr"]=> array(2) { ["http-equiv"]=> string(13) "Cache-Control" ["content"]=> string(8) "no-cache" } ["children"]=> array(0) { } ["nodes"]=> array(0) { } ["parent"]=> *RECURSION* ["_"]=> array(4) { [0]=> int(8) [2]=> array(2) { [0]=> int(0) [1]=> int(0) } [3]=> array(2) { [0]=> array(3) { [0]=> string(1) " " [1]=> string(0) "" [2]=> string(0) "" } [1]=> array(3) { [0]=> string(1) " " [1]=> string(0) "" [2]=> string(0) "" } } [7]=> string(0) "" } ["dom:private"]=> *RECURSION* } [1]=> object(simple_html_dom_node)#12 (8) { ["nodetype"]=> int(1) ["tag"]=> string(4) "meta" ["attr"]=> array(2) { ["http-equiv"]=> string(6) "Pragma" ["content"]=> string(8) "no-cache" } ["children"]=> array(0) { } ["nodes"]=> array(0) { } ["parent"]=> *RECURSION* ["_"]=> array(4) { [0]=> int(10) [2]=> array(2) { [0]=> int(0) [1]=> int(0) } [3]=> array(2) { [0]=> array(3) { [0]=> string(1) " " [1]=> string(0) "" [2]=> string(0) "" } [1]=> array(3) { [0]=> string(1) " " [1]=> string(0)
I added the encoding instructions in #2 and got a similar page that was much much shorter.

Post by php_east »

certainly you have some data from the fetch, but it seems the body is missing. what i can see somewhere in the mumbo jumbo is a head tag, but no body, hence your find returns nothing.

i would say take a step back before using the dom parser, and try to get the pages fetched correctly first. the html dom parser can be used in two ways: with its internal fetch (file_get_html), or by sending html to its parser via a new parser object.

i would think getting the html on your own, then passing it to the parser, would be a lot easier, as you would be able to diagnose each step should anything be wrong.

step 1 - fetch the html
-----------------------
there are many ways to fetch text via PHP: file_get_contents is one, using the cURL library is another. i would recommend the cURL library method.

there is a simple example at http://www.jonasjohn.de/snippets/php/curl-example.htm
with it, you just go

$contents = DownloadUrl($Url);

( or you can use any other curl example, of which there are quite many available online )

at this stage, you can check and ensure that you *are* getting html from the remote site
and that all is fine with html.
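
That check can be sketched as a small helper. This is only an illustration: the function name is made up here, and looking for a "<table" substring is a rough heuristic, not a validation of the HTML.

```php
<?php
// Hypothetical helper: reports whether a fetched string looks usable
// before it is handed to any parser.
function summarize_fetch($contents)
{
    if ($contents === false || $contents === null) {
        return 'fetch failed';
    }
    if (trim($contents) === '') {
        return 'empty response';
    }
    // stripos() is case-insensitive, so <TABLE> and <table> both count.
    $hasTable = stripos($contents, '<table') !== false;
    return strlen($contents) . ' bytes, ' . ($hasTable ? 'contains' : 'no') . ' table markup';
}

echo summarize_fetch('<html><body><TABLE><TR><TD>x</TD></TR></TABLE></body></html>'), "\n";
echo summarize_fetch(''), "\n";
```

If the summary reports no table markup although the browser shows a table, the table is probably not in the raw HTML that was fetched.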

then send the html to the dom parser, using its object method.


step 2 - parse the html using the dom parser
-------------------------------------------
the API gives an example, and from that, it would be as follows...

( assuming you have included the parser somewhere before )

$html = new simple_html_dom();

// Load HTML from a string
$html->load($contents);

and from then on do the finds and whatever else with the dom.


the idea is that if you already have valid text from the beginning, and your 'finds' in the dom reveal nothing, then it is easy to suspect there may be bugs in the dom parser itself, especially so if your initial $contents is clearly filled with the data/tables you wanted to fetch.

using the dom's internal fetch mechanism complicates matters, as you would not know if the complete html had been fetched correctly, and from your dump above, the dom parser does not appear to have done so correctly.

an alternative is to use php file_get_contents, which does the same job but you have only
limited control over sessions. the advantage of cURL is that you can also do posts and gets
and this would come in handy for site which requires post data ( search engines for example ).

anyways, good luck, hope you get some happy results. i have never used the dom parser, i am relying on its documentation. they seem to have quite good complete documentation, so that helps a lot.

i would go over to the dom parser's forum if:
1. you have data in your fetch,
2. the parser cannot find it for some reason.

otherwise, the problem is still at your desk i think.

hope this helps.

Post by php_east »

i wish to add that if cURL and file_get_contents do not yield any tables as you would get using a browser, it likely means the site's table is loaded via JavaScript (ajax), and you will not get anything using any fetch mechanism that does not run JavaScript. browsers do run it, and hence are able to pull in the data for the tables.

Post by kaje »

Thanks a lot for your time and help, php_east.

I have been trying to follow along with what you were saying but not getting anything working.

Do I put the function at the beginning of the php tag like below?

Code:

<?php
include_once('simple_html_dom.php');
 
function DownloadUrl($Url){
 
    // is curl installed?
    if (!function_exists('curl_init')){ 
        die('CURL is not installed!');
    }
 
    // create a new curl resource
    $ch = curl_init();
 
    /*
    Here you find more options for curl:
    http://www.php.net/curl_setopt
    */
 
    // set URL to download
    curl_setopt($ch, CURLOPT_URL, $Url);
 
    // set referer:
    curl_setopt($ch, CURLOPT_REFERER, "http://www.okstate.com/SportSelect.dbml?DB_OEM_ID=200&KEY=&SPID=143&SPSID=1420");
 
    // user agent:
    curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
 
    // remove header? 0 = yes, 1 = no
    curl_setopt($ch, CURLOPT_HEADER, 0);
 
    // should curl return or print the data? true = return, false = print
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
 
    // timeout in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
 
    // download the given URL, and return output
    $output = curl_exec($ch);
 
    // close the curl resource, and free system resources
    curl_close($ch);
 
    // print output
    return $output;
}
 
$contents = DownloadUrl($Url);
Also, you mentioned that after doing this I should be able to check that I'm getting HTML from the site? How do I do that? I've tried echo $contents; but the page is still blank. I've also added the following to try and see if I get any results after the $contents = DownloadUrl($Url);

Code:

$html = new simple_html_dom();
$html->load($contents);
$es = $html->find('table.odd td');
echo $es;
I also want to note that I am using MAMP as a local server to run the code, so it may be possible that cURL isn't set up on it?
Last edited by kaje on Mon Mar 02, 2009 8:40 am, edited 1 time in total.
semlar
Forum Commoner
Posts: 61
Joined: Fri Feb 20, 2009 10:45 pm

Post by semlar »

Protip: echo <pre> tags around your var_dump/print_r/array outputs.
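
As a small sketch of that tip (the helper name is made up for illustration): wrapping the dump in <pre> keeps its line breaks and indentation, and escaping it keeps any HTML inside the dump from being rendered by the browser.

```php
<?php
// Hypothetical helper: returns a readable, browser-safe dump of any value.
function pre_dump($value)
{
    // print_r() with true as its second argument returns the text
    // instead of printing it.
    return '<pre>' . htmlspecialchars(print_r($value, true)) . '</pre>';
}

echo pre_dump(array('tag' => '<td>', 'text' => 'Georgia'));
```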

Post by php_east »

i will put notes where needed.
kaje wrote: Do I put the function at the beginning of the php tag like below?
yes, that looks fine, though i would put the function well below so that the code is easier to read at the top. less clutter.
kaje wrote:     curl_setopt($ch, CURLOPT_REFERER, "http://www.okstate.com/SportSelect.dbml ... SPSID=1420");
the referrer here refers to where you came from, as in a normal browsing context.
if you click a link on site A, and this leads to site B, your referrer will contain info on site A.
in our case however, there isn't such a site, so you could put anything in the CURLOPT_REFERER option. the one you put is fine, only that it looks a bit odd.
you could put "my desk" there, it does not matter. this part is optional, and is not critical to the workings of the net in general.
kaje wrote: Also, you mentioned that after doing this I should be able to check that I'm getting HTML from the site? How do I do that? I've tried echo $contents; but the page is still blank.
in the last part, where you put $contents = DownloadUrl($Url);
the $Url has to be the site you want the html from, in this case probably
"http://www.okstate.com/SportSelect.dbml ... SPSID=1420".

so it would be

Code:

 
$Url="http://www.okstate.com/SportSelect.dbml?DB_OEM_ID=200&KEY=&SPID=143&SPSID=1420";
$contents = DownloadUrl($Url);
 
OR

Code:

 
$contents = DownloadUrl("http://www.okstate.com/SportSelect.dbml?DB_OEM_ID=200&KEY=&SPID=143&SPSID=1420");
 
you can test the function first using $contents = DownloadUrl("http://www.google.com");
or something similar, just to gain confidence that this function is doing its job pulling in contents for you.

echo $contents should then display the fetched html contents.
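
If the echo still shows nothing, it can help to make a failed request visible. The sketch below is an assumption-laden variant, not part of the DownloadUrl function from the thread: curl_exec() returns false on failure and curl_error() gives the reason, so the caller can tell "request failed" apart from "empty page". fetch_or_explain is a made-up name.

```php
<?php
// Hypothetical wrapper: fetches a URL and returns array(body, error message).
function fetch_or_explain($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // timeout in seconds
    $body  = curl_exec($ch);
    // curl_exec() gives false on failure; curl_error() then holds the reason.
    $error = ($body === false) ? curl_error($ch) : '';
    return array($body, $error);
}
```

A call could look like list($contents, $error) = fetch_or_explain($Url); followed by printing $error when $contents is false.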
kaje wrote: I've also added the following to try and see if I get any results after the $contents = DownloadUrl($Url);

Code:

$html = new simple_html_dom();
$html->load($contents);
$es = $html->find('table.odd td');
echo $es;
ah, no, if you have no contents, you are not going to get anything from this part at all. so let's get through the first hurdle. leave it alone for now; it is fine as it is, waiting for $contents. what we are aiming for is to get $contents filled first. the thing is, once you have $contents from the football site or wherever, you can forget about all the previous part of the script, knowing it does its job, and focus on these few lines of html dom code, where you were getting blanks. if you have contents, and the dom parser output is blank, you know you need to focus only on this part, which means you are already about 50% down the road to success.

the advantage of this method is that, if worst comes to worst and the dom parser is problematic, you could dump it in favor of some other solution or some other parser, with the $contents already in hand. that is the asset.

for example, you could actually write a few lines of regex to get just what you wanted from the $contents, without even needing a dom parser or any additional class.
kaje wrote: I also want to make another note that I am using MAMP as a local server to run the code on and it may be possible that cURL isn't setup on it?
no, you would be informed well beforehand if so, by this part of the script

Code:

 
  // is curl installed?
     if (!function_exists('curl_init')){
         die('CURL is not installed!');
     }
 
that you do not get any such message means everything is fine on your local server.

Post by kaje »

Ah, OK. I added the $Url = ""; and echo'd it onto the page and it showed it so that was the trick!

Now that I know I'm getting something in $contents, I've been playing around with the $html->find() calls in particular and can't get anything to parse out. Here is an example of part of the HTML and the code I'm using:


HTML:

Code:

    <TR>
 
        <TD NOWRAP CLASS="odd" scope="row">
            &nbsp;<FONT CLASS="highlight">
            Sat, Sep 05</FONT>
        </TD>
 
        <TD CLASS="odd"  >
            &nbsp;<FONT CLASS="highlight">Georgia</FONT>
        </TD>
 

PHP:

Code:

$html = new simple_html_dom();
$html->load($contents);
echo $html->find('td.odd', 0)->innertext . "<br>";
Is this not pulling the data because it's in the <font> tag? I was worried about this so I changed it to

Code:

echo $html->find('font.highlight', 0)->innertext . "<br>";
But still a blank page. Hmmm.


Regarding regex, when I was originally researching about parsing, I saw a lot of pages saying not to use regex to parse.

Again, thank you, Thank You, THANK YOU! for all of the help you are offering me.

Post by php_east »

you're on your own with regards to dom parser as this is the first time i've heard about its existence. so i have not been down that road much.

that said, i can't help noticing that your find and your input are of a different case.
i'm not certain if the dom parser is case sensitive but if so, it would not find "td" given "TD".
slim chance of it being case sensitive, however, it's perhaps worth a simple test check.
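
One way to run such a test check is sketched below. Note the swap: it uses PHP's built-in DOM extension (DOMDocument), not simple_html_dom, so it only shows how the uppercase sample behaves in that parser and says nothing definite about simple_html_dom itself. The sample markup is trimmed from the HTML posted above; the wrapping TABLE tag is added for illustration.

```php
<?php
// Assumption: DOMDocument stands in for simple_html_dom in this check.
$snippet = '<TABLE><TR>'
         . '<TD NOWRAP CLASS="odd" scope="row">&nbsp;<FONT CLASS="highlight">Sat, Sep 05</FONT></TD>'
         . '<TD CLASS="odd">&nbsp;<FONT CLASS="highlight">Georgia</FONT></TD>'
         . '</TR></TABLE>';

libxml_use_internal_errors(true); // keep parser warnings about old markup quiet
$doc = new DOMDocument();
$doc->loadHTML($snippet);
libxml_clear_errors();

$cells = array();
// Lowercase names are used for the lookup although the source is uppercase.
foreach ($doc->getElementsByTagName('td') as $td) {
    if ($td->getAttribute('class') === 'odd') {
        // &nbsp; is decoded to U+00A0 (bytes C2 A0 in UTF-8); turn it into a
        // plain space so trim() can remove it.
        $cells[] = trim(str_replace("\xC2\xA0", ' ', $td->textContent));
    }
}
print_r($cells);
```

If both cells come back here, uppercase markup alone is not what stops a DOM parser from finding them, and the same kind of two-cell sample can be fed to simple_html_dom to compare.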
kaje wrote:Regarding regex, when I was originally researching about parsing, I saw a lot of pages saying not to use regex to parse.
nah, don't listen to them :mrgreen:

i should thank you too, i will be taking the same road for a different purpose later on at my leisure time, and your experience with the html dom parser would help guide me which way to take. we're even. :drunk:

Post by kaje »

Thank you very much! I got it working with:

Code:

echo ("<table class='solid' width='320'>");
echo $html->find('table', 7)->innertext;
echo ("</table>");
There's one more thing I need to do before I should be able to take it from here: this table has some images and links I'd like to remove. Is there a way to exclude images and links from $contents or $html prior to sending the data to the parser?

Post by php_east »

ah, there you go, see what i mean when i say don't listen to them :mrgreen:

a regex to remove all images and links from $contents should do it.
i may be able to find some old regexes i used to remove tags, if so i will post.
later mate.

Post by php_east »

ok , here is a function i called tag replace, which is my old code i dug up. i have adapted it slightly for our use, the function will now remove any tags given. ( in theory )

it's something i put together quickly, untested in the adapted form, but the theory is that you give it a tag and the contents and it will remove all of them from the contents.

to use it, go

$contents = Generic_Tag_Replace($contents,'img'); // to remove images
$contents = Generic_Tag_Replace($contents,'a'); // to remove links

the regex itself uses the standard form <tag >..................</tag>
there are other forms for html, which is <tag ....................../>
which is not catered for in this original regex. ( $regex = "#(<".$tag."\b[^>]*>)(.*)(</".$tag.">)#is"; )
so i would suggest you play around with it.

what i can think of now is
$regex = "#(<".$tag."\b[^>]*>)(.*)(/>)#is";

but it may be better to have a regex that can do an OR so that both forms are catered for
i will experiment a bit, but some regex experts may be able to plug in the correct 'ors' by just looking at this post, hopefully.

c u later.

Code:

 
function Generic_Tag_Replace( $contents, $tag ) 
{
$tag        = trim($tag);
$regex      = "#(<".$tag."\b[^>]*>)(.*)(</".$tag.">)#is";
 
$new_tag    = '';
 
preg_match_all($regex, $contents, $matches, PREG_SET_ORDER);
foreach ($matches as $val) 
        {
        /* note : 
        * full tag is           val[0]
        * tag itself is         val[1] 
        * contents of tag       val[2] 
        * tag closure is        val[3] 
        */
 
        // find and replace         
        $find       = $val[0];
        $contents   = str_replace($find, $new_tag, $contents);
        }
 
return ($contents);
} // end Generic_Tag_Replace
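
One possible way to cover both tag forms discussed above is sketched here. It is not php_east's function: remove_tag is a hypothetical name, and two choices differ from the code above. The quantifier is lazy (.*?), because a greedy .* with the s flag runs from the first opening tag to the last closing tag and takes everything in between with it. A second pass then removes tags that have no closing partner, such as <img ...> or <br />.

```php
<?php
// Hypothetical helper: removes a paired tag together with its contents,
// and also the unpaired / self-closing form of the same tag.
function remove_tag($contents, $tag)
{
    $t = preg_quote(trim($tag), '#');
    // Paired form <tag ...> ... </tag>; lazy, so each pair is matched on its own.
    $contents = preg_replace('#<' . $t . '\b[^>]*>.*?</' . $t . '\s*>#is', '', $contents);
    // Unpaired form <tag ...> or <tag ... />.
    return preg_replace('#<' . $t . '\b[^>]*/?>#i', '', $contents);
}

$row = '<td><img src="x.gif"><a href="/t">tickets</a>Georgia<a href="/m">map</a></td>';
$row = remove_tag($row, 'img');
$row = remove_tag($row, 'a');
echo $row, "\n";
// <td>Georgia</td>
```

Note that this drops the link text along with the link. Regexes like this also assume reasonably regular markup; nested tags of the same name are not handled.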
 

Post by php_east »

kaje wrote:Thank you very much! I got it working with:

Code:

echo ("<table class='solid' width='320'>");
echo $html->find('table', 7)->innertext;
echo ("</table>");
is the dom parser case sensitive ?

Post by semlar »

strip_tags($string,$tags_to_keep)
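
As a hedged sketch of that suggestion: strip_tags() drops every tag not listed in its second argument but keeps the text inside, so the link text survives while the <a> and <img> markup goes away. The allowed-tag list, the helper name and the sample row below are assumptions for illustration.

```php
<?php
// Hypothetical helper: keeps only table markup, dropping links, images
// and any other tags while leaving their text in place.
function keep_table_markup($html)
{
    return strip_tags($html, '<table><tr><th><td>');
}

$row = '<tr><td class="odd"><a href="/x"><img src="logo.gif">Georgia</a></td></tr>';
echo keep_table_markup($row), "\n";
// <tr><td class="odd">Georgia</td></tr>
```

Attributes on the tags that are kept (class="odd" here) pass through unchanged.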