ok, got it...
Code:
$regex = "#(<".$tag."\b[^>]*>)(.*)(</".$tag.">|/>)#i";
this is the regex that will find tags. i have tried this on the 'a' tag and the h3 tag. by extension of logic it should also work for img tags, though i have not tested that.
just replace the regex. the changes are subtle but important.
1. the original regex ends with #is. the s modifier lets . match newlines, so it is intended for multiline operations.
this is changed to #i only, catering for single lines but still case insensitive.
2. an OR (|) for tags with a shortcut /> end closure is added in and catered for now.
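to make the difference between #i and #is visible, here is a minimal sketch. the sample string in $html is made up for illustration and is not from the page being discussed.

```php
<?php
// hypothetical sample input: the first h3 spans two lines, the second sits on one line
$html = "<h3>one\nline two</h3> <h3>same line</h3>";

$single = "#(<h3\b[^>]*>)(.*)(</h3>|/>)#i";   // . stops at a newline
$multi  = "#(<h3\b[^>]*>)(.*)(</h3>|/>)#is";  // . also matches a newline

preg_match_all($single, $html, $m1, PREG_SET_ORDER);
preg_match_all($multi,  $html, $m2, PREG_SET_ORDER);

// with #i only the h3 that sits on one line is found
echo count($m1), " match, contents: ", $m1[0][2], "\n";

// with #is the greedy (.*) runs from the first <h3> to the last </h3>
echo count($m2), " match, covers whole string: ", var_export($m2[0][0] === $html, true), "\n";
```

so with #i a tag that is split over several lines is skipped, and with #is the greedy (.*) can swallow everything between the first opening tag and the last closing tag.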
i have tested this on a sample html page, the result is no links after two passes.
i.e.
$contents = Generic_Tag_Replace( $contents, 'a' );
$contents = Generic_Tag_Replace( $contents, 'h3' );
the complete new function is further down below. so with this, you can strip out entire stretches of tags you don't want in the way.
another way, and perhaps an easier one, is to use a similar method to pick up only the tables or table tags. the same preg_match_all call can be used. experiment a bit. what you get back from the function is then html stripped of all tags except tables.
that would be nice and clean for you to work on the dom parser.
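a rough sketch of that "keep only the tables" idea, under a few assumptions: the helper name Keep_Only_Tag and the sample html are made up here, tables are not nested inside each other, and the lazy (.*?) with the s modifier is my choice so that a table may span several lines.

```php
<?php
// hypothetical helper, not part of the original function set
function Keep_Only_Tag( $contents, $tag )
{
    // escape the tag name so it is safe inside the # delimiters
    $tag = preg_quote(trim($tag), '#');
    // lazy body + s modifier: shortest stretch from <tag ...> to </tag>, across lines
    $regex = "#<".$tag."\b[^>]*>.*?</".$tag.">#is";
    preg_match_all($regex, $contents, $matches);
    // $matches[0] holds every full match; glue them back together
    return implode("\n", $matches[0]);
}

// made-up sample page
$html = "<p>intro</p>\n<table>\n<tr><td>1</td></tr>\n</table>\n<a href='x'>link</a>";
$tables_only = Keep_Only_Tag($html, 'table');
echo $tables_only, "\n";
```

nested tables would be cut short at the first closing tag, so treat this as a starting point to experiment with.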
so my guess is you'd be on your way to reconstructing the original html into a form suitable for use in your case. i think this will come in handy for many such pages where cosmetics are of lesser concern and the data is of more interest.
what i'd like to know is whether this html parser is also able to parse xml, because newsfeeds are also a good source of info, and may need re-assembly and re-feeding with the method you are using.
all the best in your project then.
ciao.
Code:
function Generic_Tag_Replace( $contents, $tag )
{
    $tag = trim($tag);
    $regex = "#(<".$tag."\b[^>]*>)(.*)(</".$tag.">|/>)#i";
    $new_tag = '';
    preg_match_all($regex, $contents, $matches, PREG_SET_ORDER);
    foreach ($matches as $val)
    {
        /* note :
         * full tag is $val[0]
         * tag itself is $val[1]
         * contents of tag $val[2]
         * tag closure is $val[3]
         */
        // find and replace
        $find = $val[0];
        $contents = str_replace($find, $new_tag, $contents);
    }
    return ($contents);
} // end Generic_Tag_Replace
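one thing to watch: the (.*) in the regex is greedy, so if a single line holds two of the same tag, the match runs from the first opening tag to the last closing tag and takes the text in between with it. below is a possible variant, only a sketch. the name Generic_Tag_Replace_Lazy, the lazy (.*?), the separate alternative for self-closed tags and the preg_quote call are my assumptions, not part of the function above.

```php
<?php
// hypothetical variant of the tag stripper, for illustration only
function Generic_Tag_Replace_Lazy( $contents, $tag )
{
    // escape the tag name so it is safe inside the # delimiters
    $tag = preg_quote(trim($tag), '#');
    // first alternative : a self-closed tag such as <img ... />
    // second alternative: opening tag, shortest possible contents, closing tag
    $regex = "#<".$tag."\b[^>]*/>|<".$tag."\b[^>]*>.*?</".$tag.">#i";
    // every match is replaced with an empty string
    return preg_replace($regex, '', $contents);
}

// made-up sample line with two links and one image
$line = "<a href='1'>one</a>keep<a href='2'>two</a><img src='a.png' />";
$line = Generic_Tag_Replace_Lazy( $line, 'a' );
echo $line, "\n";
$line = Generic_Tag_Replace_Lazy( $line, 'img' );
echo $line, "\n";
```

like the original it has no s modifier, so it still works one line at a time. the text "keep" between the two links survives both passes.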