Strip Tags from a Website

lenton
Forum Commoner
Posts: 49
Joined: Sun Jun 20, 2010 6:45 am


Post by lenton »

I'm currently creating a crawler that needs to process the sentences of the many websites it visits. To do this I first need to remove the tags from each page, which is the part I'm stuck on getting right.

This is an example website I have to deal with:
http://pastie.org/3122620

I have used strip_tags(), but it only removes the tags themselves, so JavaScript, CSS and other non-text content are sometimes left behind.

If you can remove all HTML, CSS and JavaScript from that webpage and show me how to do it, I would be very grateful. Thanks!
twinedev
Forum Regular
Posts: 984
Joined: Tue Sep 28, 2010 11:41 am
Location: Columbus, Ohio

Re: Strip Tags from a Website

Post by twinedev »


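// Remove <script> and <style> blocks along with their contents first.
$strCode = preg_replace('%<script\b[^>]*>.*?</script\s*>%si', '', $strCode);
$strCode = preg_replace('%<style\b[^>]*>.*?</style\s*>%si', '', $strCode);
// Then remove every remaining tag.
$strCode = preg_replace('%<[^>]+>%', '', $strCode);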
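
For a crawler that needs clean sentences, the same idea could be wrapped in a small helper, roughly like the sketch below. The function name htmlToText() and the extra steps (dropping comments, decoding entities, collapsing whitespace) are assumptions added for illustration and were not part of the snippet above. It replaces tags with a space rather than an empty string so that words from adjacent elements do not run together.

```php
<?php
// Hypothetical helper: the name and the clean-up steps beyond the three
// regex replacements are illustrative assumptions.
function htmlToText(string $html): string
{
    // Drop <script> and <style> elements together with their contents.
    $text = preg_replace('%<(script|style)\b[^>]*>.*?</\1\s*>%si', ' ', $html);
    // Drop HTML comments, which can hide markup or conditional blocks.
    $text = preg_replace('/<!--.*?-->/s', ' ', $text);
    // Replace each remaining tag with a space so adjacent words do not merge.
    $text = preg_replace('/<[^>]+>/', ' ', $text);
    // Decode entities such as &amp; and &nbsp; into real characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Turn UTF-8 non-breaking spaces into ordinary spaces.
    $text = str_replace("\xC2\xA0", ' ', $text);
    // Collapse runs of whitespace into a single space.
    $text = preg_replace('/\s+/', ' ', $text);
    return trim($text);
}

$html = '<html><head><style>p { color: red; }</style>'
      . '<script type="text/javascript">alert("x < y");</script></head>'
      . '<body><p>Hello &amp; welcome.</p><p>Second sentence.</p></body></html>';

echo htmlToText($html), "\n"; // prints: Hello & welcome. Second sentence.
```

Regex-based stripping like this is a best-effort approach: malformed markup, or a `>` inside an attribute value, can still confuse it, and preg_replace() returns null on very large inputs that hit the backtrack limit. If that becomes a problem, parsing the page with DOMDocument and reading the text nodes may be more robust.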