
Need help in parsing htm documents

Posted: Sun Feb 19, 2012 4:53 pm
by christianbale
Hi all,
I need help parsing HTML (.htm) documents. Please find my code below.

<?php

function strip_html_tags( $text )
{
$text = preg_replace(
array(
// Remove invisible content
'@<head[^>]*?>.*?</head>@siu',
'@<style[^>]*?>.*?</style>@siu',
'@<script[^>]*?>.*?</script>@siu',
'@<object[^>]*?>.*?</object>@siu',
'@<embed[^>]*?>.*?</embed>@siu',
'@<applet[^>]*?>.*?</applet>@siu',
'@<noframes[^>]*?>.*?</noframes>@siu',
'@<noscript[^>]*?>.*?</noscript>@siu',
'@<noembed[^>]*?>.*?</noembed>@siu',
// Add line breaks before and after blocks
'@</?((address)|(blockquote)|(center)|(del))@iu',
'@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
'@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
'@</?((table)|(th)|(td)|(caption))@iu',
'@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
'@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
'@</?((frameset)|(frame)|(iframe))@iu',
),
array(
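// One replacement per pattern: 9 blanks, then 7 block tags prefixed with a line break
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",), $text );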
// strip_tags removes the remaining html tags
return strip_tags( $text);
}

function strip_random_characters( $text )
{
//This function removes all the rest of the special characters

// Both patterns need delimiters and the /u modifier so that \p{P} and € are treated as UTF-8
$special_characters = preg_replace(array('/(?![.=$\'€%-])\p{P}/u', '/[^-\w\s.=$\'€%]/u'), ' ', $text);

//$special_characters = preg_replace(array("[^-\w\d\s\.=$'€%]", ), array(" ",), $text);


// Blank out the remaining punctuation, digits and stray symbols; one replacement string serves every entry
$data = str_replace(array(".", ",", "/", "^", "(", ")", "'", "-", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "×", "¢", "‘", "¨", "™", "ª", "à", "¤", "®", "¥", "€", "Ð", "Ñ", "Š", "»", "°", "Ä", "Œ", "Ã", "±", "§", "•", "¿", "¡", "‡"), " ", $special_characters);

return $data;
}


function upper_to_lower($string) {

//This function converts everything to lower case...
$doc = strtolower($string);

return $doc;
}


This code removes the HTML tags completely, but I'm still getting special characters in the processed document. " ⠲ â ³n  ⠲ â ³wï ï  n  wï lpg=pa amp dq=% only+fjord+on+the+east+coast% v=onepage amp q=% only% fjord% on% " are some of the characters left in the processed document. I want to remove these special characters.

Any help will be appreciated! Thanks
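
Some of the fragments quoted above (`amp`, `dq=%`) look like the remains of HTML entities such as `&amp;` and of URL text whose punctuation was blanked out. One possible approach is to decode entities first and then keep only letters and whitespace. This is a minimal sketch, assuming the input is valid UTF-8; `keep_letters_only` is a hypothetical name, not part of the posted code. It does not try to repair a character-encoding mismatch or remove URL text.

```php
<?php

// Hypothetical helper: decode entities, then keep only letters and spaces.
// Assumes $text is valid UTF-8; preg_replace() with /u returns null otherwise.
function keep_letters_only($text)
{
    // Turn entities such as &amp; or &eacute; back into real characters
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Replace every run of non-letter, non-whitespace characters with a space
    $text = preg_replace('/[^\p{L}\s]+/u', ' ', $text);

    // Collapse repeated whitespace and trim both ends
    return trim(preg_replace('/\s+/u', ' ', $text));
}

echo keep_letters_only('Fjord &amp; coast: 100% caf&eacute;!'); // prints "Fjord coast café"
```

Because `\p{L}` matches any Unicode letter, accented words survive while digits, punctuation and symbols are dropped in one pass, which avoids maintaining a long list of individual characters.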

Re: Need help in parsing htm documents

Posted: Sun Feb 19, 2012 6:40 pm
by xtiano77
Chris,

I don't know what your level of expertise with PHP is, and I am certainly not the world's expert; however, have you tried the "strip_tags( )" function? I hope I didn't miss your point, but based on your post, it seems that all you want is to remove HTML tags from a parsed document, which is what that function does. It removes any HTML, XML and PHP tags from the input passed to it, which makes it easier to look for "<%", "<%=" or any of the few left afterwards. Just my two cents.
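
For illustration, here is a minimal sketch of what `strip_tags()` does on its own; the sample markup is made up. It removes the tags but keeps the text between them, including the body of a `<script>` element, which is presumably why the posted function removes those blocks with `preg_replace()` before calling it.

```php
<?php

// Sample markup invented for illustration
$html = '<p>Hello <b>world</b></p><script>var x = 1;</script>';

// strip_tags() drops the tags but keeps the text between them,
// so the script body is still present in the result
echo strip_tags($html); // prints "Hello worldvar x = 1;"
```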