Strip Tags from a Website

Posted: Tue Jan 03, 2012 10:31 pm
by lenton
I'm currently creating a crawler that needs to process the sentences of the many websites it visits. To do this, I first need to remove the tags from each page, which is the part I'm stuck on perfecting.

This is an example website I have to deal with:
http://pastie.org/3122620

I have used strip_tags(), but that sometimes doesn't get rid of JavaScript and other things.

If you can remove all HTML, CSS and JavaScript from that webpage and show me how to do it, I would be very grateful. Thanks!

Re: Strip Tags from a Website

Posted: Tue Jan 03, 2012 11:19 pm
by twinedev

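// Remove <script> blocks, including the JavaScript inside them.
$strCode = preg_replace('%<script\b.*?</script\s*>%si', '', $strCode);
// Remove <style> blocks, including the CSS inside them.
$strCode = preg_replace('%<style\b.*?</style\s*>%si', '', $strCode);
// Remove every remaining tag, leaving only the text.
$strCode = preg_replace('%<[^>]+>%', '', $strCode);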
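
For anyone wanting to try this on a sample page, here is a rough sketch that wraps the same idea in a function and compares it with plain strip_tags(). The function name html_to_text() and the sample markup are made up for illustration and are not from the posts above. It also goes a little beyond the three replacements: it drops HTML comments, swaps tags for spaces so neighbouring words don't run together, decodes entities and collapses whitespace. Those extras are assumptions about what a sentence-processing crawler would want, not something the thread asked for.

```php
<?php
// Hypothetical helper (not from the thread): regex-based tag stripping
// followed by some tidying so the text is easier to split into sentences.
function html_to_text(string $html): string
{
    // Drop <script> and <style> blocks together with their contents.
    $text = preg_replace('%<script\b.*?</script\s*>%si', '', $html);
    $text = preg_replace('%<style\b.*?</style\s*>%si', '', $text);
    // Drop HTML comments, which can contain ">" and confuse the tag pattern.
    $text = preg_replace('%<!--.*?-->%s', '', $text);
    // Replace the remaining tags with a space so adjacent words don't merge.
    $text = preg_replace('%<[^>]+>%', ' ', $text);
    // Decode entities such as &amp; and collapse runs of whitespace.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Made-up sample page for the comparison.
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><p>Hello &amp; welcome.</p>'
      . '<script>alert("hi");</script><p>Second sentence.</p></body></html>';

// strip_tags() removes the tags but keeps the CSS and JavaScript text:
// p { color: red; }Hello &amp; welcome.alert("hi");Second sentence.
echo strip_tags($html), "\n";

// The helper removes the script and style contents as well:
// Hello & welcome. Second sentence.
echo html_to_text($html), "\n";
```

Keep in mind that regular expressions only approximate HTML parsing. Malformed markup, or a "</script>" string inside the JavaScript itself, can still trip this up, and on very large pages preg_replace() can fail and return null, so real crawler code should check for that.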