What's left after strip_tags ?

Posted: Tue Dec 18, 2007 2:44 pm
by impulse()
I'm writing a script that connects to a forum thread, grabs all the output and then strips the tags to leave me with just the text from the website. I'm hoping to make an RSS feed out of this once everything is complete. The only problem is that after strip_tags() has been used I have masses of empty lines. I'm not entirely sure how to remove just those lines. I've tried removing every '\n' but that also removes the legitimate new lines.

If somebody has tried something similar to this before, could you point me in the right direction with string functions? The best idea I have at the moment is to put the text into an array, loop through every element and, for the first line with '\n' that is found, create a variable; then if the next line is also '\n' and the variable is set, remove that line.

Any other suggestions very welcome.
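
A minimal sketch of the array-and-flag idea described above, in case it helps to see it written out. The function name collapse_blank_lines() is hypothetical, and treating whitespace-only lines as blank is an assumption that goes slightly beyond the description. It assumes PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper (not from this thread): keeps the first blank line
// of each run of blank lines and drops the rest.
function collapse_blank_lines(string $text): string
{
    // Split on any common line ending so each array element is one line.
    $lines = preg_split('/\r\n|\r|\n/', $text);
    $out = [];
    $previousWasBlank = false; // the "variable" from the description

    foreach ($lines as $line) {
        // Assumption: a line holding only spaces or tabs counts as blank.
        $isBlank = (trim($line) === '');
        if ($isBlank && $previousWasBlank) {
            continue; // second or later blank line in a run: skip it
        }
        $out[] = $isBlank ? '' : $line;
        $previousWasBlank = $isBlank;
    }

    return implode("\n", $out);
}
```

Calling collapse_blank_lines($text) on the stripped output would leave at most one empty line between blocks of text, so paragraph breaks survive while the long gaps go away.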

Posted: Tue Dec 18, 2007 2:52 pm
by impulse()
strlen() has just revealed what I wasn't expecting...

Posted: Wed Dec 19, 2007 6:55 pm
by Ambush Commander
Use preg_replace(). Find every run of two or more consecutive newlines and replace it with a single newline.

Re: What's left after strip_tags ?

Posted: Thu Dec 20, 2007 12:32 am
by s.dot
impulse() wrote: I'm writing a script that connects to a forum thread, grabs all the output and then strips the tags to leave me with just the text from the website. I'm hoping to make an RSS feed out of this once everything is complete.
Do you have permission to do this from the web site owner? :P
impulse() wrote: The only problem is that after strip_tags() has been used I have masses of empty lines. I'm not entirely sure how to remove just those lines. I've tried removing any '\n' but that also removes the legitimate new lines.
strip_tags() may also remove legitimate data if the HTML is malformed. I think feyd has posted a "smart" strip_tags() function.
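
A small illustration of that point, using made-up input in which '<' is meant as "less than" rather than as the start of a tag. This only shows the built-in strip_tags(); it does not reproduce feyd's "smart" version.

```php
<?php
// Made-up input: "<5" looks like the start of a tag to strip_tags(),
// and no closing '>' ever follows it.
$input = 'Score: 3<5 is true, and this sentence disappears too.';

// strip_tags() does not validate HTML, so everything from the stray '<'
// to the end of the string is discarded as part of an unfinished tag.
$plain = strip_tags($input);

echo $plain, PHP_EOL;
```

A '<' followed directly by whitespace is left alone, so the loss only happens when the next character is not a space.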
impulse() wrote: If somebody has tried something similar to this before, could you point me in the right direction with string functions? The best idea I have at the moment is to put the text into an array, loop through every element and, for the first line with '\n' that is found, create a variable; then if the next line is also '\n' and the variable is set, remove that line.

Any other suggestions very welcome.
The guy above me hit the nail right on the head. ;)

Code:

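// The group makes {2,} apply to the whole PHP_EOL sequence (e.g. "\r\n"), not just its last character.
$text = preg_replace('/(?:' . preg_quote(PHP_EOL, '/') . '){2,}/m', PHP_EOL, $text);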
Something like that. :) I'm not sure you want to use PHP_EOL, but I've never dealt with \n in a pattern before.
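
On the question of \n in a pattern: PCRE understands the \r and \n escapes itself, so they can be written directly inside a single-quoted pattern string. A minimal sketch that avoids PHP_EOL altogether, on the assumption that the fetched page may use either Windows or Unix line endings (the sample text is made up):

```php
<?php
// Made-up sample that mixes Windows (\r\n) and Unix (\n) line endings.
$text = "First line\r\n\r\n\r\nSecond line\n\n\n\nThird line";

// (?:\r\n|\r|\n) matches one line break of any common style;
// {2,} requires two or more of them in a row.
// The whole run is replaced by a single "\n".
$collapsed = preg_replace('/(?:\r\n|\r|\n){2,}/', "\n", $text);

echo $collapsed, PHP_EOL;
```

This differs from the PHP_EOL version in that it does not depend on the platform the script runs on, only on the line endings actually present in the text.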

Posted: Wed Dec 26, 2007 5:08 pm
by impulse()
s.dot wrote: Do you have permission to do this from the web site owner?
I didn't realize it could be a problem, with the contents being publicly available.

I've tried your suggestion and also included the "smart" strip_tags() from the 'Useful Posts' forum, but it only removes the HTML tags; the empty lines are still there.
The text goes through these two preg_replace() calls before I echo the output:

Code:

$text = preg_replace('#</?.*?>#','',$text);
$text = preg_replace('/(?:' . preg_quote(PHP_EOL, '/') . '){2,}/m', PHP_EOL, $text);
I was hoping you could explain the parameters passed to preg_replace().
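
For reference, a commented sketch of the three parameters, which are pattern, replacement and subject in that order. The second call is a guess at why the empty lines survive: it assumes the "empty" lines actually contain spaces or tabs, which the thread does not confirm. The sample text is made up.

```php
<?php
// Made-up sample resembling fetched forum markup.
$text = "<p>Hello</p>\n   \n\t\n<div>\n</div>\nWorld";

// preg_replace($pattern, $replacement, $subject)
// Pattern: '#' is the delimiter, so '/' needs no escaping inside.
//   </?   a '<' optionally followed by '/'
//   .*?   as few characters as possible (non-greedy), not crossing newlines
//   >     up to the next '>'
// Replacement: '' deletes every match.
// Subject: the string to search; the modified copy is returned.
$text = preg_replace('#</?.*?>#', '', $text);

// Assumption: blank-looking lines may hold spaces or tabs.
//   [ \t]*            optional trailing or whole-line spaces and tabs
//   (?:\r\n|\r|\n)    one line break of any common style
//   {2,}              two or more such units in a row
$text = preg_replace('/(?:[ \t]*(?:\r\n|\r|\n)){2,}/', "\n", $text);

echo $text, PHP_EOL;
```

If the lines turn out to contain something else, such as non-breaking spaces or leftover entities like &nbsp;, the character class in the second pattern would need to be widened accordingly.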