Problem: (feel free to skip down to the bottom line if you're willing to accept that I've tried a bunch of stuff and made a real swell effort)
I'm installing an on-site spider/search engine (Sphider, to be exact). I want this search to cover product pages only. I thought this would be fairly simple, since the product pages end in .html and all the category pages end with a slash (/).
- I messed around trying to find a way to do it with robots.txt, but there doesn't seem to be one.
- Then I tried the exclusion comment tags...I forget the exact syntax...something like <!--spidername_noindex--> stuff not to be indexed <!--/spidername_noindex-->. But the site is a Perl script, and those comments get eaten up before it spits out the HTML.
- Then there's option 3, robots meta tags, but you can't specify a user agent there, and I want Google to spider the categories.
Bottom Line (you skipped all my whining, didn't you?)
There is a form in Sphider that accepts Perl-style regex expressions:
URLs must include:
and
URLs must not include:
So is there a regex that would match all URLs ending with a slash? I'm guessing there is, and I'll pop it in the "must not include" box. Between hitting the wall on this and my ISP being flaky all day, I'm starting to go a little bonkers...
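For what it's worth, in Perl-style regex the pattern would just be `/$` (a literal slash anchored to the end of the string). Here's a quick sketch using Python's `re` module to check the behavior; the example URLs are made up for illustration, and whether Sphider wants the bare pattern or needs delimiters around it is something you'd have to confirm against its docs:

```python
import re

# "/$" means: a literal slash, anchored at the end of the URL.
pattern = re.compile(r"/$")

urls = [
    "http://example.com/category/",       # category page -> should match (excluded)
    "http://example.com/product.html",    # product page  -> no match (indexed)
    "http://example.com/",                # site root     -> should match (excluded)
]

for url in urls:
    verdict = "excluded" if pattern.search(url) else "indexed"
    print(url, "->", verdict)
```

If that works, putting `/$` in Sphider's "URLs must not include" box should skip every URL that ends in a slash while leaving the .html product pages indexable. Alternatively, `\.html$` in the "must include" box would approach it from the other direction.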