Convert HTML to text - keeping Links intact

Post by anjanesh »

I'm using the search and replace patterns from the preg_replace example in the php.net manual to convert HTML to text.

But I want all the links intact.

So I changed the second line of

Code: Select all

$search = array ('@<script[^>]*?>.*?</script>@si', // Strip out javascript
                 '@<[\/\!]*?[^<>]*?>@si',          // Strip out HTML tags
                 '@([\r\n])[\s]+@',                // Strip out white space
                 '@&(quot|#34);@i',                // Replace HTML entities
                 '@&(amp|#38);@i',
                 '@&(lt|#60);@i',
                 '@&(gt|#62);@i',
                 '@&(nbsp|#160);@i',
                 '@&(iexcl|#161);@i',
                 '@&(cent|#162);@i',
                 '@&(pound|#163);@i',
                 '@&(copy|#169);@i',
                 '@&#(\d+);@e');                    // evaluate as php
to

Code: Select all

'@<[^a][^\/a][\/\!]*?[^<>]*?>@si', // Strip out HTML tags EXCEPT <a></a>
but it isn't working well.
I even tried [^a\/a], but this is treated as the three separate characters
a,\/,a
and not as the two items
a,\/a

Is there any way to get \/a treated as a single unit?

Thanks
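
For illustration, a minimal sketch of why the character class behaves this way. Inside [...] every character is a separate member of a set, so the class can only ever test one character of the input. The sample characters below are made up for the demonstration.

```php
<?php
// Inside [...] every character stands alone: [^a\/a] lists "a", "/", "a",
// which is the same set as [^a/]. It excludes one character, never "/a".
$pattern = '@^[^a\/a]$@';

foreach (['a', '/', 'b'] as $char) {
    // preg_match returns 1 on a match and 0 otherwise
    printf("%s => %d\n", $char, preg_match($pattern, $char));
}
// prints:
// a => 0
// / => 0
// b => 1
```

Treating a sequence such as /a as one unit needs a group or a lookahead outside the brackets, not a character class.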

Post by Skara »

Code: Select all

'@<[^a(?:\/a)][\/\!]*?[^<>]*@si', // Strip out HTML tags EXCEPT <a></a>
Edit: heh, oops. fixed mistake.

Post by anjanesh »

Skara - all the tags are still there: <td>, <nobr>, </b>, etc.

Post by Todd_Z »

Code: Select all

$contents = function_to_get_source( $url );

// Skara's Regex

// Strip the unwanted closing tags
$regex = "#</[^a][^>]*>#i";
$contents = preg_replace( $regex, '', $contents );

// Strip the hyperlinks with no content
$regex = "#<a[^>]*></a>#i";
$contents = preg_replace( $regex, '', $contents );
	
// Strip unwanted whitespace
$regex = "#\s{2,}#i";
$contents = preg_replace( $regex, " ", $contents );

// Tabs are annoying if you are going to be parsing this info, easier for all whitespace to be spaces.
$contents = str_replace( "\t", " ", $contents );

// I thought that this would add a space when code shows </a>next sentence, but it turns it into </a> ext sentence [corrections are welcome!]
// Add space after </a> tags
$regex = "#</a>[^\s]#i";
$contents = preg_replace( $regex, "</a> ", $contents );
That should help you out a tad bit.
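
As a side note, the two whitespace steps (collapsing runs and replacing tabs) could probably be folded into one. The sketch below assumes it is acceptable to turn every run of whitespace, including a single tab or newline, into one space. The function name normalize_whitespace is hypothetical.

```php
<?php
// Hypothetical helper: collapses every run of whitespace (spaces, tabs,
// newlines) into one space, then trims the ends.
function normalize_whitespace(string $text): string
{
    return trim(preg_replace('#\s+#', ' ', $text));
}

echo normalize_whitespace("one\t\ttwo \n three\tfour"), "\n";
// prints: one two three four
```

Using \s+ rather than \s{2,} means a lone tab is converted as well, so the separate str_replace call is not needed in this variant.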

Post by Skara »

Well, I didn't look at any of the regex except the question he had. ^^;

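Same idea as the regex in the post above, but the / is optional. A negative lookahead is used instead of [^a], because </?[^a] can backtrack (the / matches [^a]) and still strip </a>.

Code: Select all

$string = preg_replace('#<(?!/?a\b)[^>]*>#si','',$string);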
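As for this:

Code: Select all

$regex = "#</a>[^\s]#i";
$contents = preg_replace( $regex, "</a> ", $contents );

the match consumes the character after </a>, so it has to be captured and put back:

Code: Select all

$contents = preg_replace('#</a>([^\s])#si', '</a> \\1', $contents);
You just needed to put the [^\s] in a capture group and reuse it in the replacement. ;)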
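
To try both ideas together, here is a minimal sketch under a few assumptions. A negated class such as [^a] only looks at one character, so it cannot tell <a> from <abbr>. The version below uses a negative lookahead with a word boundary instead, and a lookahead for the space after </a> so nothing is consumed. The function names and the sample string are hypothetical, not from the manual.

```php
<?php
// Hypothetical helper: strips every tag except <a ...> and </a>.
function strip_tags_keep_links(string $html): string
{
    // (?!/?a\b) rejects "a" or "/a" right after "<"; \b keeps <abbr> strippable.
    return preg_replace('#<(?!/?a\b)[^>]*>#i', '', $html);
}

// Hypothetical helper: adds a space after </a> when text follows directly.
function space_after_links(string $html): string
{
    // (?=\S) checks the next character without consuming it.
    return preg_replace('#</a>(?=\S)#i', '</a> ', $html);
}

$html = '<td><b>See</b> <a href="x.html">this <i>page</i></a>next <abbr>etc</abbr></td>';
echo space_after_links(strip_tags_keep_links($html)), "\n";
// prints: See <a href="x.html">this page</a> next etc
```

PHP's built-in strip_tags() also accepts a list of allowed tags, for example strip_tags($html, '<a>'), which may be simpler than a hand-written pattern when only tag removal is needed.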

Post by Todd_Z »

Ah, thanks for the correction (I'm still just a regex n00b)