I'm writing a script that connects to a forum thread, grabs all the output and then strips the tags to leave me with just the text from the website. I'm hoping to make an RSS feed out of this once everything is complete. The only problem is that after strip_tags() has been used I have masses of empty lines. I'm not entirely sure how to remove just those lines. I've tried removing every '\n' but that also removes the legitimate new lines.
If somebody has tried something similar to this before, could you point me in the right direction with string functions? The best idea I have at the moment is to put the text into an array, loop through every element and, for the first line with '\n' that is found, set a variable; then if the next line is also '\n' and the variable is set, remove that line.
Any other suggestions are very welcome.
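For reference, the array-and-loop idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name dropRepeatedBlankLines is hypothetical, the text is assumed to be split on "\n", and a line counts as empty when it contains nothing but whitespace. It keeps the first empty line of each run and drops the rest.

```php
<?php
// Hypothetical helper (not from the thread): keep the first blank line
// of every run of blank lines and drop the ones that follow it.
function dropRepeatedBlankLines(string $text): string
{
    $lines = explode("\n", $text);
    $kept = [];
    $previousWasBlank = false; // the "variable" from the description above

    foreach ($lines as $line) {
        // Assumption: whitespace-only lines (spaces, tabs, "\r") count as blank.
        $isBlank = (trim($line) === '');

        // Skip this line if it is blank and the previous line was blank too.
        if ($isBlank && $previousWasBlank) {
            continue;
        }

        $kept[] = $line;
        $previousWasBlank = $isBlank;
    }

    return implode("\n", $kept);
}

echo dropRepeatedBlankLines("First\n\n\n\nSecond\nThird");
```

To remove every empty line instead of keeping one per run, the same loop could skip a line whenever $isBlank is true.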
What's left after strip_tags ?
- Ambush Commander
- DevNet Master
- Posts: 3698
- Joined: Mon Oct 25, 2004 9:29 pm
- Location: New Jersey, US
Re: What's left after strip_tags ?
impulse() wrote: I'm writing a script that connects to a forum thread, grabs all the output and then strips the tags to leave me with just the text from the website. I'm hoping to make an RSS feed out of this once everything is complete.
Do you have permission to do this from the web site owner?
impulse() wrote: The only problem is that after strip_tags() has been used I have masses of empty lines. I'm not entirely sure how to remove just those lines. I've tried removing every '\n' but that also removes the legitimate new lines.
strip_tags() may also remove legitimate data if the HTML is malformed. I think feyd has posted a "smart" strip_tags() function.
impulse() wrote: If somebody has tried something similar to this before, could you point me in the right direction with string functions? The best idea I have at the moment is to put the text into an array, loop through every element and, for the first line with '\n' that is found, set a variable; then if the next line is also '\n' and the variable is set, remove that line. Any other suggestions are very welcome.
The guy above me hit the nail right on the head.
Code: Select all
$text = preg_replace('/' . preg_quote(PHP_EOL) . '{2,}/m', PHP_EOL, $text);
- impulse()
- Forum Regular
- Posts: 748
- Joined: Wed Aug 09, 2006 8:36 am
- Location: Staffordshire, UK
Quote: Do you have permission to do this from the web site owner?
I didn't realize it could be a problem, with the contents being publicly available.
I've tried your suggestion and also included the "smart" strip_tags() from the 'Useful Posts' forum, but it only removes the HTML tags.
The text goes through these two calls before I echo the output:
Code: Select all
$text = preg_replace('#</?.*?>#','',$text);
$text = preg_replace('/' . preg_quote(PHP_EOL) . '{2,}/m', PHP_EOL, $text);
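One possible reason the second call appears to do nothing: after the tags are stripped, the "empty" lines often still contain spaces, tabs or a carriage return, so the text never contains two line endings directly in a row. Also, where PHP_EOL is "\r\n", the {2,} quantifier applies only to the final "\n" of the pattern. A minimal sketch, assuming that is what is happening here (the helper name collapseBlankLines is hypothetical, not something from the thread):

```php
<?php
// Hypothetical helper: collapse runs of empty or whitespace-only lines
// into a single line break, whatever line-ending style the page used.
function collapseBlankLines(string $text): string
{
    // Normalise Windows (\r\n) and old Mac (\r) line endings to \n.
    $text = preg_replace('/\r\n?/', "\n", $text);

    // Remove trailing spaces and tabs from every line, so that
    // whitespace-only lines become truly empty.
    $text = preg_replace('/[ \t]+$/m', '', $text);

    // Replace two or more consecutive newlines with a single one.
    return preg_replace('/\n{2,}/', "\n", $text);
}

$sample = "First line\r\n  \r\n\t\r\n\r\nSecond line\n\n\nThird line";
echo collapseBlankLines($sample);
```

This keeps the single line breaks between lines of text and removes only the blank lines between them. Entities such as &nbsp; left in the text would not be matched by [ \t] and would need decoding first.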