Convert HTML to text - keeping Links intact

Posted: Sat Jun 18, 2005 1:19 pm
by anjanesh
I'm using the search and replace patterns from the preg_replace example in the php.net manual to convert HTML to text.

But I want all the links intact.

So I changed the second line of

Code: Select all

$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                 '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                 '@([\r\n])[\s]+@',                // Strip out white space
                 '@&(quot|#34);@i',                // Replace HTML entities
                 '@&(amp|#38);@i',
                 '@&(lt|#60);@i',
                 '@&(gt|#62);@i',
                 '@&(nbsp|#160);@i',
                 '@&(iexcl|#161);@i',
                 '@&(cent|#162);@i',
                 '@&(pound|#163);@i',
                 '@&(copy|#169);@i',
                 '@&#(\d+);@e');                    // evaluate as php
to

Code: Select all

'@<[^a][^\/a][\/\!]*?[^<>]*?>@si', // Strip out HTML tags EXCEPT <a></a>
but it isn't working well.
I even tried [^a\/a], but this is treated as
a,\/,a
and not as
a,\/a

Is there any way to get \/a treated as one unit?

Thanks

Posted: Sat Jun 18, 2005 1:37 pm
by Skara

Code: Select all

'@<[^a(?:\/a)][\/\!]*?[^<>]*@si', // Strip out HTML tags EXCEPT <a></a>
Edit: heh, oops. fixed mistake.
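
For what it's worth, a character class only ever matches a single character, so [^a\/a] is just the set "anything except a or /", and a group such as (?:\/a) written inside the brackets is read as a handful of literal characters. One way to get "a or /a" treated as a unit is a negative lookahead placed right after the opening <. The following is a minimal sketch, assuming the goal is to keep only <a ...> and </a>; the $html sample and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample input, not taken from the original page.
$html = '<td><b>Go to</b> <a href="x.html">this page</a></td>';

// Inside [...] every character stands alone, so [^a\/a] and [^a\/]
// describe the same set: both reject "a" and "/" and accept "b".
foreach (['a', '/', 'b'] as $ch) {
    printf("%s: %d %d\n", $ch,
        preg_match('@^[^a\/a]$@', $ch),
        preg_match('@^[^a\/]$@', $ch));
}

// (?!\/?a\b) looks ahead and refuses any tag whose name is exactly "a",
// with or without a leading slash. Every other tag is stripped.
$pattern = '@<(?!\/?a\b)[^<>]*>@si';
$text = preg_replace($pattern, '', $html);

echo $text, "\n"; // Go to <a href="x.html">this page</a>
```

The \b after the a is what keeps tags such as <abbr> or <address> from being mistaken for links.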

Posted: Sat Jun 18, 2005 1:44 pm
by anjanesh
Skara - all the tags are still there: <td>, <nobr>, </b>, etc.

Posted: Sat Jun 18, 2005 10:48 pm
by Todd_Z

Code: Select all

$contents = function_to_get_source( $url );

// Skara's Regex

// Strip the unwanted closing tags
$regex = "#</[^a][^>]*>#i";
$contents = preg_replace( $regex, '', $contents );
	
// Strip the hyperlinks with no content
$regex = "#<a[^>]*></a>#i";
$contents = preg_replace( $regex, '', $contents );
	
// Strip unwanted whitespace
$regex = "#\s{2,}#i";
$contents = preg_replace( $regex, " ", $contents );

// Tabs are annoying if you are going to be parsing this info, easier for all whitespace to be spaces.
$contents = str_replace( "\t", " ", $contents );

// I thought that this would add a space when code shows </a>next sentence, but it turns it into </a> ext sentence [corrections are welcome!]
// Add space after </a> tags
$regex = "#</a>[^\s]#i";
$contents = preg_replace( $regex, "</a> ", $contents );
That should help you out a tad bit.

Posted: Sun Jun 19, 2005 12:23 pm
by Skara
Well, I didn't look at any of the regex except the question he had. ^^;

Same as Todd_Z's regex above, but the / is optional.

Code: Select all

$string = preg_replace('#</?[^a][^>]*>#si','',$string);
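
Two caveats with the pattern above are worth checking before relying on it. Because the / is optional, the engine can backtrack and let [^a] match the slash itself, so closing tags such as </a> are stripped as well. And [^a] spares every opening tag whose name merely starts with a, such as <abbr>. If a built-in is acceptable, strip_tags() accepts a list of tags to keep. A small sketch comparing the two; the sample string is hypothetical.

```php
<?php
// Hypothetical sample input for illustration only.
$string = '<p><abbr title="x">HTML</abbr> and <a href="x.html">a link</a></p>';

// The pattern from the post above. <abbr ...> survives because its name
// starts with "a", while </abbr> and </a> are removed: "/?" backs off to
// empty and "[^a]" then matches the "/".
$byRegex = preg_replace('#</?[^a][^>]*>#si', '', $string);

// strip_tags() with an allow-list keeps only <a ...> and </a>.
$byStripTags = strip_tags($string, '<a>');

echo $byRegex, "\n";     // <abbr title="x">HTML and <a href="x.html">a link
echo $byStripTags, "\n"; // HTML and <a href="x.html">a link</a>
```

Note that strip_tags() leaves the attributes of the kept tags untouched, so it is not a sanitizer for untrusted HTML.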
as for this:

Code: Select all

$regex = "#</a>[^\s]#i";
$contents = preg_replace( $regex, "</a> ", $contents );

Code: Select all

$contents = preg_replace('#</a>([^\s])#si', '</a> \\1', $contents);
you just needed to capture the [^\s] character in parentheses and put it back with \\1. ;)
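
To see why the first version eats a letter: [^\s] is part of the match, so whatever character follows </a> gets replaced along with the tag. Capturing it and re-inserting it is one fix; a lookahead that only checks the next character without consuming it is another. A short sketch of all three side by side, with a made-up sample string:

```php
<?php
// Hypothetical sample input.
$contents = '<a href="x.html">link</a>next sentence';

// Original attempt: the "n" is matched and then thrown away.
$lossy = preg_replace('#</a>[^\s]#i', '</a> ', $contents);

// Capture group: the matched character is re-inserted via $1.
$captured = preg_replace('#</a>([^\s])#i', '</a> $1', $contents);

// Lookahead: (?=\S) only asserts that a non-space follows, consuming nothing.
$lookahead = preg_replace('#</a>(?=\S)#i', '</a> ', $contents);

echo $lossy, "\n";     // <a href="x.html">link</a> ext sentence
echo $captured, "\n";  // <a href="x.html">link</a> next sentence
echo $lookahead, "\n"; // <a href="x.html">link</a> next sentence
```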

Posted: Sun Jun 19, 2005 5:16 pm
by Todd_Z
Ah, thanks for the correction (I'm still just a regex n00b)