My spider needs your help

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

jazzylee77
Forum Newbie
Posts: 2
Joined: Fri Jan 18, 2008 8:56 pm

My spider needs your help

Post by jazzylee77 »

Hi. Congrats: the first thread in this forum ranks number one in the Google SERPs for "regex forum". That means you have to help me. "It's the law." (Said with deadpan seriousness.)

Problem: (feel free to skip down to the bottom line if you are willing to accept that I have tried a bunch of stuff and made a real swell effort)
I'm installing an on-site spider/search engine (Sphider, to be exact). I want to use this search for product pages only. I thought this would be fairly simple, since product pages end in .html and all the category pages end with a slash (/).
  • I messed around trying to learn ways to do it with robots.txt, but there doesn't seem to be a way to do it there.
  • Then I tried using the script's own tags... I forget exactly, something like <!--spidername_noindex--> stuff not to be indexed <!--/spidername_noindex-->. But the site is generated by a Perl script, and those tags get eaten up before it spits out the HTML.
  • Then there is option 3, robots meta tags, but you can't specify a user agent, and I want Google to spider the categories.
So after spending half a day learning how to do things that won't work, I'm now a few hours into trying to learn how to do this with regex. I've read about 10 different complete tutorials without seeing a real answer, though I tried a few ideas anyway just to see what would happen.

Bottom Line (you skipped all my whining, didn't ya? :) )
There is a form in Sphider that takes Perl-style regular expressions, with two boxes:
URLs must include:
and
URLs must not include:

So is there a regex that would match all URLs ending with a slash? I'm guessing there is, and I'll pop that in the "must not include" box. Between hitting the wall on this and my ISP being flaky all day, I'm starting to go a little bonkers... :crazy:

Re: My spider needs your help

Post by jazzylee77 »

Never mind :)

I found another answer after giving it a fresh look. (This happens to me a lot.)

I simply had to put .html in the "must include" box. Too freaking obvious for me, I guess!
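For anyone who lands here with the original question, here is a rough sketch of both filters, written in Python rather than inside Sphider. The sample URLs are made up, and exactly how Sphider applies what you type in its boxes is something to check in its own docs. The point is only that /$ matches URLs ending in a slash, and that \.html$ is stricter than a bare .html, where the unescaped dot matches any character.

```python
import re

# Hypothetical sample URLs; the real site's paths are not shown in this thread.
urls = [
    "http://example.com/widgets/",
    "http://example.com/widgets/blue-widget.html",
    "http://example.com/",
    "http://example.com/xhtml-guide/",
]

ENDS_WITH_SLASH = re.compile(r"/$")      # category pages: URL ends with a slash
ENDS_WITH_HTML = re.compile(r"\.html$")  # product pages: URL ends with a literal ".html"
LOOSE_HTML = re.compile(r".html")        # unescaped dot: any character followed by "html"


def matching(pattern, candidates):
    """Return the candidates that the pattern matches anywhere in the string."""
    return [url for url in candidates if pattern.search(url)]


category_pages = matching(ENDS_WITH_SLASH, urls)
product_pages = matching(ENDS_WITH_HTML, urls)
loose_matches = matching(LOOSE_HTML, urls)

print("ends with slash:", category_pages)
print("ends with .html:", product_pages)
print("loose .html:    ", loose_matches)
```

With these made-up URLs the loose pattern also picks up the /xhtml-guide/ category page, which is the kind of surprise the escaped and anchored version avoids.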

Ah... anyway now I can return to my normal life...
...where did I put that?