
My spider needs your help

Posted: Fri Jan 18, 2008 9:36 pm
by jazzylee77
Hi. Congrats, the first thread in this forum ranks number one in the Google SERPs for "regex forum". That means you have to help me. "It's the law" (said with deadpan seriousness)

Problem: (feel free to skip down to the bottom line if you are willing to accept I have tried a bunch of stuff and made a real swell effort)
I'm installing an on-site spider/search engine (Sphider, to be exact). I want to use this search for product pages only. I thought this would be fairly simple, since product pages end in .html and all the category pages end with a slash /
  • I messed around trying to learn ways to do it with robots.txt but there doesn't seem to be a way there.
  • Then I tried using the script's tags... I forget... something like this: <!--spidername_noindex--> stuff not to be indexed <!--/spidername_noindex-->. But the site is a Perl script and those tags get eaten up before it spits out the HTML.
  • Then there is option 3, robot meta tags, but you can't specify a user agent and I want Google to spider categories.
So after spending half a day learning how to do things that won't work I'm now a few hours into trying to learn how to do this with regex. I've read about 10 different complete tutorials without seeing a real answer, though I tried a few ideas anyway just to see what would happen.

Bottom Line (you skipped all my whining didn't ya? :) )
There is a form in Sphider that takes Perl-style regexes:
URLs must include:
and
URLs must not include:

So is there a regex that would match all URLs ending with a slash? I'm guessing there is, and I'll pop that in the "URLs must not include" box. Between hitting the wall on this and my ISP being flaky all day, I'm starting to go a little bonkers... :crazy:
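
For anyone landing here with the same question: the usual pattern for "ends with a slash" is a literal slash anchored to the end of the string, /$. Below is a minimal sketch in Python (not Sphider itself) that assumes the "must not include" box simply searches each URL for the pattern. The URLs and the is_category helper are made up for illustration.

```python
import re

# A literal "/" followed by the end-of-string anchor "$".
ENDS_WITH_SLASH = re.compile(r"/$")

# Hypothetical URLs, for illustration only.
urls = [
    "http://example.com/widgets/",
    "http://example.com/widgets/blue-widget.html",
]


def is_category(url):
    """Return True if the URL ends with a slash (a category page here)."""
    return ENDS_WITH_SLASH.search(url) is not None


# Keep only the URLs that do NOT match, as a "must not include" rule would.
product_pages = [u for u in urls if not is_category(u)]
```

Whether Sphider applies the pattern exactly this way is an assumption, so it is worth testing against a handful of real URLs first.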

Re: My spider needs your help

Posted: Fri Jan 18, 2008 9:51 pm
by jazzylee77
Never mind :)

I found another answer after giving it a fresh look. (this happens to me a lot)

I simply had to put .html in the "URLs must include" box. Too freaking obvious for me, I guess!
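
One caveat, in case the box really does treat its contents as a regex: an unescaped dot matches any character, so .html would also match a URL containing something like "xhtml". A stricter pattern would be \.html$ (literal dot, anchored to the end). Here is a rough Python sketch of the difference. The URLs and the matches helper are invented for the example, and Sphider's exact matching behavior is assumed, not confirmed.

```python
import re

LOOSE = re.compile(r".html")      # "." matches any single character
STRICT = re.compile(r"\.html$")   # literal "." and anchored to the end


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


# Hypothetical URLs, for illustration only.
product = "http://example.com/widgets/blue-widget.html"
category = "http://example.com/xhtml-guides/"

# Both patterns accept the product page.
both_accept_product = matches(LOOSE, product) and matches(STRICT, product)

# Only the loose pattern is fooled by "xhtml" in the category URL.
loose_accepts_category = matches(LOOSE, category)
strict_accepts_category = matches(STRICT, category)
```

If none of the category URLs happen to contain "html", the plain .html version works just as well in practice.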

Ah... anyway now I can return to my normal life...
...where did I put that?