Need help in parsing htm documents

christianbale
Forum Newbie
Posts: 1
Joined: Sun Feb 19, 2012 4:51 pm

Post by christianbale »

Hi all,
I need help parsing HTM documents. Please find my code below.

<?php

function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@<object[^>]*?>.*?</object>@siu',
            '@<embed[^>]*?>.*?</embed>@siu',
            '@<applet[^>]*?>.*?</applet>@siu',
            '@<noframes[^>]*?>.*?</noframes>@siu',
            '@<noscript[^>]*?>.*?</noscript>@siu',
            '@<noembed[^>]*?>.*?</noembed>@siu',
            // Add line breaks before and after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            // One space for each of the 9 "invisible content" patterns
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            // A newline in front of each of the 7 block-level tag patterns
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        ),
        $text
    );
    // strip_tags removes the remaining HTML tags
    return strip_tags( $text );
}

function strip_random_characters( $text )
{
    // This function removes all the rest of the special characters.
    // Both patterns need delimiters; without them PHP treats [ and ] as the
    // delimiters and the character class is lost. The u modifier makes PCRE
    // treat the input as UTF-8 (so the input must be valid UTF-8); with it,
    // \w also matches accented letters.
    $special_characters = preg_replace(
        array(
            "/(?![.=\$'€%-])\p{P}/u",
            "/[^-\w\s.=\$'€%]/u",
        ),
        " ",
        $text
    );

    // Replace the remaining punctuation, digits, and stray symbols with spaces
    $data = str_replace(
        array(
            ".", ",", "/", "^", "(", ")", "'", "-",
            "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
            "×", "¢", "‘", "¨", "™", "ª", "à", "¤", "®", "¥", "€", "Ð", "Ñ",
            "Š", "»", "°", "Ä", "Œ", "Ã", "±", "§", "•", "¿", "¡", "‡",
        ),
        " ",
        $special_characters
    );

    return $data;
}


function upper_to_lower( $string )
{
    // This function converts everything to lower case.
    // strtolower() covers A-Z; mb_strtolower() would be needed for
    // non-ASCII letters.
    return strtolower( $string );
}


This code removes the HTML tags completely, but I am still getting special characters in the processed document. Some of the characters that remain are: " ⠲ â ³n  ⠲ â ³wï ï  n  wï lpg=pa amp dq=% only+fjord+on+the+east+coast% v=onepage amp q=% only% fjord% on% ". I want to remove these special characters.

Any help will be appreciated! Thanks
xtiano77
Forum Commoner
Posts: 72
Joined: Tue Sep 22, 2009 10:53 am
Location: Texas

Re: Need help in parsing htm documents

Post by xtiano77 »

Chris,

I don't know what your level of expertise with PHP is, and I am certainly not the world's expert; however, have you tried the strip_tags() function? I hope I didn't miss your point, but based on your post, it seems that all you want is to remove the HTML tags from a parsed document, which is what that function does. It removes any HTML, XML, and PHP tags from the input passed to it, which makes it easier to look for "<%", "<%=", or any of the few left afterwards. Just my two cents.
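
For what it's worth, here is a rough sketch of that idea, not a definitive fix. It assumes the input is UTF-8 and that dropping every non-ASCII character is acceptable for your documents. The function name clean_text() and the sample string are made up for illustration. It strips the tags first, then decodes entities such as "&amp;" (which is probably where the stray "amp" in your output comes from), and finally replaces anything outside printable ASCII with a space.

```php
<?php

// Hypothetical helper: strip tags, decode entities, drop non-ASCII bytes.
function clean_text($html)
{
    // Remove tags first, so that decoded entities such as &lt;b&gt;
    // are not mistaken for real tags afterwards.
    $text = strip_tags($html);

    // Turn entities like &amp; back into the characters they stand for.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Replace every run of bytes outside printable ASCII with one space.
    // No u modifier here: the pattern works on raw bytes, so it also
    // copes with input that is not valid UTF-8.
    $text = preg_replace('/[^\x20-\x7E]+/', ' ', $text);

    // Collapse repeated whitespace and trim both ends.
    $text = preg_replace('/\s+/', ' ', $text);

    return trim($text);
}

// Sample input; "\xE2\x84\xA2" is the UTF-8 encoding of the trademark sign.
echo clean_text("<p>Hello <b>world</b> &amp; more\xE2\x84\xA2</p>");
// prints: Hello world & more
```

If you need to keep accented letters, the byte-range pattern above is too aggressive and you would have to whitelist more than printable ASCII.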