
Re: Parsing Table with PHP Simple HTML DOM Parser

Posted: Tue Mar 03, 2009 4:46 pm
by php_east
semlar wrote:strip_tags($string,$tags_to_keep)
not so sure about this. that would be *a lot* of tags in $tags_to_keep, basically all but img's and a's. if PHP had a "tags to eliminate" option, it would be great.
maybe we should propose it to php. also, it strips the tags but leaves the text inside, which may not be what kaje wants.
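
A quick illustration of the behaviour described above (a minimal sketch; the sample markup is made up for the example): strip_tags() only takes a list of tags to keep, and for everything else it drops the tags but leaves the text that was between them.

```php
<?php
// Sample markup, invented for this illustration.
$string = '<div><h3>Title</h3><p>Some <b>bold</b> text <a href="page.html">a link</a> <img src="pic.png"></p></div>';

// The second argument lists the tags to KEEP; there is no "tags to remove" form.
$kept = strip_tags($string, '<a><img>');

// The <div>, <h3>, <p> and <b> tags are gone, but their text is still there.
echo $kept, "\n";
```

So to remove only img and a tags this way, every other tag on the page would have to be listed in the second argument.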

Re: Parsing Table with PHP Simple HTML DOM Parser

Posted: Tue Mar 03, 2009 5:00 pm
by semlar
You're right, and that is a good idea. I assumed there would be a limited number of tags they wanted to keep.

Re: Parsing Table with PHP Simple HTML DOM Parser

Posted: Tue Mar 03, 2009 5:27 pm
by kaje
php_east wrote:ok , here is a function i called tag replace, which is my old code i dug up. i have adapted it slightly for our use, the function will now remove any tags given. ( in theory )

it's something i put together quickly, untested in the adapted form, but the theory is that you give it a tag and the contents and it will remove all of them from the contents.

to use it, go

$contents = Generic_Tag_Replace($contents,'img'); // to remove images
$contents = Generic_Tag_Replace($contents,'a'); // to remove links

the regex itself uses the standard form <tag >..................</tag>
there is another form in html, which is <tag ....................../>
and which is not catered for in this original regex. ( $regex = "#(<".$tag."\b[^>]*>)(.*)(</".$tag.">)#is"; )
so i would suggest you play around with it.

what i can think of now is
$regex = "#(<".$tag."\b[^>]*>)(.*)(/>)#is";

but it may be better to have a regex that can do an OR so that both forms are catered for
i will experiment a bit, but some regex experts may be able to plug in the correct 'ors' by just looking at this post, hopefully.

c u later.

Code:

 
function Generic_Tag_Replace( $contents, $tag )
{
    $tag     = trim($tag);
    $regex   = "#(<".$tag."\b[^>]*>)(.*)(</".$tag.">)#is";

    $new_tag = '';

    preg_match_all($regex, $contents, $matches, PREG_SET_ORDER);
    foreach ($matches as $val)
    {
        /* note :
         * full tag is           val[0]
         * tag itself is         val[1]
         * contents of tag       val[2]
         * tag closure is        val[3]
         */

        // find and replace
        $find     = $val[0];
        $contents = str_replace($find, $new_tag, $contents);
    }

    return $contents;
} // end Generic_Tag_Replace
 
Thanks, I'll play around with it!

Re: Parsing Table with PHP Simple HTML DOM Parser

Posted: Tue Mar 03, 2009 5:28 pm
by kaje
php_east wrote:
kaje wrote:Thank you very much! I got it working with:

Code:

echo ("<table class='solid' width='320'>");
echo $html->find('table', 7)->innertext;
echo ("</table>");
is the dom parser case sensitive ?

Nope. I originally got it to work with the individual cells but then figured it'd be easier if I just pulled out the entire table then used some HTML/CSS to get it to fit how I wanted.

Re: Parsing Table with PHP Simple HTML DOM Parser

Posted: Wed Mar 04, 2009 12:40 am
by php_east
the function Generic_Tag_Replace is not suitable as is. you probably found that out already.
i will post a more suitable one later.
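
One likely reason the first version misbehaves (an assumption; the post does not say which problem was meant) is the greedy (.*) combined with the s modifier: when a page has more than one pair of the same tag, the match runs from the first opening tag to the last closing tag and takes everything in between with it. A minimal sketch on made-up markup, using preg_replace directly instead of the posted preg_match_all plus str_replace loop:

```php
<?php
// Sample markup, invented for this illustration.
$contents = "<p>one <a href='#1'>first</a> keep me <a href='#2'>second</a> two</p>";

// Greedy (.*) as in the posted regex: matches from the first <a ...> to the LAST </a>.
$greedy = preg_replace('#(<a\b[^>]*>)(.*)(</a>)#is', '', $contents);

// Lazy (.*?) stops at the first </a> that follows each opening tag.
$lazy = preg_replace('#(<a\b[^>]*>)(.*?)(</a>)#is', '', $contents);

echo $greedy, "\n"; // "keep me" has been removed along with both links
echo $lazy, "\n";   // only the two links have been removed
```

The lazy quantifier is a swapped-in change, not something from the original post.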

Re: Parsing Table with PHP Simple HTML DOM Parser

Posted: Wed Mar 04, 2009 1:23 am
by php_east
ok, got it...

Code:

 
$regex      = "#(<".$tag."\b[^>]*>)(.*)(</".$tag.">|/>)#i";
 
this is the regex that will find tags. i have tried this on the 'a' tag and the h3 tag, by extension of logic it should also work for img tags.

just replace the regex. the changes are subtle but important.

Code:

 
1. the original regex ends with #is. the s modifier lets . match newlines, so a match can run across several lines.
this is changed to #i only: still case insensitive, but each match now stays within a single line.
 
2. an OR for tags that end with the short-cut /> closure is added in, so both forms are catered for now
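
To make point 1 concrete, here is a minimal sketch (the sample string is made up) of what dropping the s modifier does: without it, . stops at a newline, so a tag whose contents are spread over two lines is no longer matched at all.

```php
<?php
// An <h3> whose contents span two lines; invented for this illustration.
$contents = "<h3>Title\ncontinued</h3>";

// #i only: "." does not match the newline, so no match is found.
$without_s = preg_match('#(<h3\b[^>]*>)(.*)(</h3>)#i', $contents);

// #is: "." also matches the newline, so the whole element is matched.
$with_s = preg_match('#(<h3\b[^>]*>)(.*)(</h3>)#is', $contents);

var_dump($without_s, $with_s);
```

So the #i version is the safer one on pages with many tags per page, but it will miss any element that is broken across lines in the source.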
 
i have tested this on a sample html page, the result is no links after two passes.
i.e.
$contents = Generic_Tag_Replace( $contents, 'a' );
$contents = Generic_Tag_Replace( $contents, 'h3' );

the complete new set is further down below. so with this, you can strip out entire stretches of tags you don't want in the way.

another way, and perhaps an easier way, is to use a similar method to pick up only tables or table tags. the same preg_match can be used. experiment a bit. what you get after feeding the function is then html stripped of all tags except tables.

that would be nice and clean for you to work on the dom parser.
so my guess is you'd be on your way to reconstructing the original html into a form suitable for use in your case. i think this will come in handy for many such pages with contents where cosmetics is of lesser concern and the data is of more interest.
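
The "pick up only the tables" idea could be sketched roughly as below. The function name Extract_Tables is hypothetical (it is not from the thread), the sample markup is made up, and the lazy .*? is an assumption on my part; nested tables would be cut off at the first inner closing tag, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: returns only the <table>...</table> blocks of a page,
// joined by newlines. Nested tables are NOT handled correctly.
function Extract_Tables($contents)
{
    // i = case insensitive, s = "." matches newlines (tables span many lines).
    $regex = '#<table\b[^>]*>.*?</table>#is';

    preg_match_all($regex, $contents, $matches);

    // $matches[0] holds the full text of every match.
    return implode("\n", $matches[0]);
}

// Sample page, invented for this illustration.
$page   = "<h1>t</h1><table><tr><td>1</td></tr></table><p>x</p><TABLE class='b'><tr><td>2</td></tr></TABLE>";
$tables = Extract_Tables($page);

echo $tables, "\n";
```

The result can then be handed to the dom parser, which only has the tables left to deal with.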

what i'd like to know is if this html parser is also able to do an xml parse, because newsfeeds are also a source of good info, and perhaps need re-assembly and re-feeding with the kind of method you are using.

all the best in your project then.

ciao.

Code:

 
function Generic_Tag_Replace( $contents, $tag )
{
    $tag     = trim($tag);
    $regex   = "#(<".$tag."\b[^>]*>)(.*)(</".$tag.">|/>)#i";

    $new_tag = '';

    preg_match_all($regex, $contents, $matches, PREG_SET_ORDER);
    foreach ($matches as $val)
    {
        /* note :
         * full tag is           val[0]
         * tag itself is         val[1]
         * contents of tag       val[2]
         * tag closure is        val[3]
         */

        // find and replace
        $find     = $val[0];
        $contents = str_replace($find, $new_tag, $contents);
    }

    return $contents;
} // end Generic_Tag_Replace
 

Re: Parsing Table with PHP Simple HTML DOM Parser

Posted: Wed Mar 04, 2009 1:39 am
by php_east
i thought i was done, but at the last moment, i found another 'trap'.
sometimes, people do not close tags, especially image tags. this can cause the preg match to not match.

so the modified regex below caters for such a case, where at the very least, they must close the tag with > else the html will fail entirely. browsers are known to be extremely lenient when it comes to html parsing, so the tags still show up correctly. so when we code for html processing, we should do the same, i.e. have lots of mercy for the html programmer.

Code:

$regex      = "#(<".$tag."\b[^>]*>)(.*)(</".$tag.">|/>|[^>])#i";
incidentally, the html i am using to test these functions is quite interesting in itself. it consists of an explanation of html, and examples of html, which means there is html embedded inside html, so the regex must work hard to achieve accuracy. this is the page
http://sheldonbrown.com/web_sample1.html
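
A word of caution on the [^>] alternative: it accepts any single character that is not >, so on a line with an unclosed img the match can run on to nearly the end of that line. One possible alternative, sketched below under my own assumptions (the name Remove_Tag, the lazy .*?, the optional closing part and the use of preg_quote are not from the thread), is to make the contents-plus-closure part optional, so a tag with no closing tag loses only the tag itself.

```php
<?php
// Hypothetical helper: removes <tag ...>...</tag> pairs, and removes just the
// opening tag when no closing tag follows (e.g. <img ...> or <img ... />).
function Remove_Tag($contents, $tag)
{
    $tag = preg_quote(trim($tag), '#');

    // (?: ... )? makes "contents + closing tag" optional; .*? is lazy so the
    // match ends at the first closing tag rather than the last one.
    $regex = '#<' . $tag . '\b[^>]*>(?:.*?</' . $tag . '>)?#is';

    return preg_replace($regex, '', $contents);
}

// Sample strings, invented for this illustration.
$unclosed = Remove_Tag('<p>before <img src="a.png"> after</p>', 'img');
$short    = Remove_Tag('x <img src="a.png" /> y', 'img');
$links    = Remove_Tag('<p><a href="#1">one</a> mid <A href="#2">two</A></p>', 'a');

echo $unclosed, "\n", $short, "\n", $links, "\n";
```

This still has a weak spot: an unclosed tag that is followed later by a properly closed one of the same name will be matched up to that later closing tag.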

Re: Parsing Table with PHP Simple HTML DOM Parser

Posted: Tue Mar 24, 2009 10:37 am
by PM2008
Good work you guys. Good scripts.

In addition, I would use biterscripting. It seems to have some functionality that makes certain tasks easier, such as mass updates to a large number of web pages. Do take a look at a script that comes as part of it at http://www.biterscripting.com/SS_RemoveTags.html. It removes the specified tags. If you want to try it, download it from http://www.biterscripting.com, then follow the installation instructions at http://www.biterscripting.com/install.html. It installs in minutes.

Patrick

Re: Parsing Table with PHP Simple HTML DOM Parser

Posted: Tue Dec 07, 2010 2:17 pm
by lin
hello dear friends,

i am new to programming - i want to parse with php simple html dom parser.
can you tell me how to apply this great technique to the - probably easy pages here:

http://schulnetz.nibis.de/db/schulen/sc ... 481&lschb=
http://www.schulministerium.nrw.de/BP/S ... pDO=116439

i get easy scripts up and running - but i think it is a bit complex to run a table-parser

I need some starting points...

Love to hear from you....

regards
lin