Hi all,
I need help parsing .htm documents. Please find my code below:
<?php
function strip_html_tags( $text )
{
$text = preg_replace(
array(
// Remove invisible content
'@<head[^>]*?>.*?</head>@siu',
'@<style[^>]*?>.*?</style>@siu',
'@<script[^>]*?.*?</script>@siu',
'@<object[^>]*?.*?</object>@siu',
'@<embed[^>]*?.*?</embed>@siu',
'@<applet[^>]*?.*?</applet>@siu',
'@<noframes[^>]*?.*?</noframes>@siu',
'@<noscript[^>]*?.*?</noscript>@siu',
'@<noembed[^>]*?.*?</noembed>@siu',
// Add line breaks before and after blocks
'@</?((address)|(blockquote)|(center)|(del))@iu',
'@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
'@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
'@</?((table)|(th)|(td)|(caption))@iu',
'@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
'@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
'@</?((frameset)|(frame)|(iframe))@iu',
),
array(
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
"\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
), $text );
// strip_tags removes the remaining html tags
return strip_tags( $text);
}
function strip_random_characters( $text )
{
//This function removes all the rest of the special characters
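// Both patterns need delimiters and the /u modifier so that PCRE treats $text as UTF-8
// instead of single bytes (this assumes $text is valid UTF-8).
$special_characters = preg_replace(array("/(?![.=$'€%-])\p{P}/u", "/[^-\w\s.=$'€%]/u"), " ", $text);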
//$special_characters = preg_replace(array("[^-\w\d\s\.=$'€%]", ), array(" ",), $text);
$data = str_replace(array(".", ",", "/", "^", "(", ")", "'", "-", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "×", "¢", "‘", "¨", "™", "ª", "à", "¤", "®", "¥", "€", "Ð", "Ñ", "Š", "»", "°", "Ä", "Œ", "Ã", "±", "§", "•", "¿", "¡", "‡"), " ", $special_characters);
return $data;
}
function upper_to_lower($string) {
//This function converts everything to lower case...
$doc = strtolower($string);
return $doc;
}
This code removes the HTML tags completely, but I am still getting special characters in the processed document. For example, " ⠲ â ³n  ⠲ â ³wï ï  n  wï lpg=pa amp dq=% only+fjord+on+the+east+coast% v=onepage amp q=% only% fjord% on% " are some of the characters left in the output. I want to remove these special characters.
Any help will be appreciated! Thanks
Need help in parsing htm documents
- christianbale
- Forum Newbie
- Posts: 1
- Joined: Sun Feb 19, 2012 4:51 pm
Re: Need help in parsing htm documents
Chris,
I don't know what your level of expertise with PHP is, and I am certainly not the world's expert, but have you tried the strip_tags() function? I hope I didn't miss your point, but based on your post it seems that all you want is to remove the HTML tags from a parsed document, which is exactly what that function does. It removes any HTML, XML and PHP tags from the input passed to it, which makes it easier to look for "<%", "<%=" or any of the few left afterwards. Just my two cents.
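For what it's worth, the suggestion above could look roughly like the sketch below. The sample markup and the variable names are made up for illustration and are not taken from the original document. It also assumes the text is UTF-8 and adds html_entity_decode(), which is not mentioned above, because strip_tags() leaves entities such as &amp; in place and those can show up later as stray "amp" fragments once punctuation is removed.

```php
<?php
// Hypothetical sample input, made up for this example.
$html = '<p>Only <b>fjord</b> on the east coast &amp; more</p>';

// strip_tags() drops the tags but leaves entities such as &amp; alone.
$text = strip_tags($html);

// html_entity_decode() turns the remaining entities back into characters.
// The 'UTF-8' charset is an assumption about the input document.
$decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

echo $text, PHP_EOL;    // Only fjord on the east coast &amp; more
echo $decoded, PHP_EOL; // Only fjord on the east coast & more
```

Decoding the entities before stripping punctuation means an "&" is handled as a single character rather than as "&", "amp" and ";".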