404 / blocked URL robots question


simonmlewis
DevNet Master
Posts: 4435
Joined: Wed Oct 08, 2008 3:39 pm
Location: United Kingdom
Contact:

404 / blocked URL robots question

Post by simonmlewis »

In Google Webmaster Tools you can "Fetch as Google" to check whether a URL is reachable, or whether it is being correctly blocked, for example URLs with & in them.
Why would the 404 Not Found list contain a bunch of URLs that have & in them?

Does Google show you which URLs are being requested, and when, and whether they return a 404, even if its crawlers are blocked from them?

So you can see (a) what is being blocked and (b) where visitors are being taken.

Or is there something fundamentally wrong if the crawlers are blocked from a URL with & in it, but it still appears in the 404 list?
Love PHP. Love CSS. Love learning new tricks too.
All the best from the United Kingdom.
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: 404 / blocked URL robots question

Post by requinix »

What are you talking about with this "&" thing? You mean ampersands in a query string? There's nothing wrong with URLs like that.

Re: 404 / blocked URL robots question

Post by simonmlewis »

We block them for our own reasons, partly to avoid duplicate URLs.
But that isn't the point of the question.

Re: 404 / blocked URL robots question

Post by requinix »

You have a history of asking odd questions.

For example, you just said that you're blocking them, but one of your original questions was why the 404 list would have them. The answer should be obvious: because you're blocking them. That's such an obvious answer I have to wonder if I am correctly understanding your question in the first place.

Then you ask about whether Google is showing you requested URLs. Saying "if they are going to 404" sounds like you think Google can predict what will happen. They can't. They'll crawl your site and log what happens, and you can see parts of that log.

My conclusion is that you don't know what your blocks are doing. Ask the people who set up those blocks how they are working.

Re: 404 / blocked URL robots question

Post by simonmlewis »

I know exactly what I am asking.

We are using robots.txt to block certain URLs from Google's crawlers. If I fetch a certain URL as Google, it's confirmed as 'Blocked'.
In which case, why is Google logging these URLs at all? Surely robots.txt stops it from even tracking them, from even seeing them.

Is that a clearer way of asking this question?
requinix wrote:.... but one of your original questions was why the 404 list would have them. The answer should be obvious: because you're blocking them.
Why would it have a list, if we are blocking them?

If we have URLs that ARE real, but that we don't want Google to cache, they won't appear in the 404s, as they are in fact live.

I thought putting blocks in robots.txt was a way of telling Google to outright ignore a URL, or any URL with a particular character or pattern in it. In which case, why does it show up in the 404s at all?
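For reference, the kind of rule being described (the exact pattern here is an assumption on my part) relies on Google's wildcard support in robots.txt:

```
User-agent: *
# Google treats * as "any run of characters", so this disallows
# any URL whose path or query string contains an ampersand:
Disallow: /*&
```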

Re: 404 / blocked URL robots question

Post by requinix »

simonmlewis wrote:I know exactly what I am asking.
I'm sure you do. But I don't.
simonmlewis wrote:We are using robots.txt to block certain URLs from Google's crawlers. If I fetch a certain URL as Google, it's confirmed as 'Blocked'.
In which case, why is Google logging these URLs at all? Surely robots.txt stops it from even tracking them, from even seeing them.
Well, you did just tell it to. And it said "blocked". So there's your answer.
simonmlewis wrote:Why would it have a list, if we are blocking them?
Blocking them doesn't prevent Google from knowing about the URL's existence. It just tells the bot not to crawl or index the page.
simonmlewis wrote:If we have URLs that ARE real, but we don't want Google from Caching them, they won't appear in the 404s, as they are in fact live.
Okay... So use robots.txt or a <meta> to prevent Google from caching them.
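A sketch of the <meta> option (which directives you actually want depends on what "caching" means here): noindex keeps the page out of the index, and noarchive stops Google serving a cached copy. One caveat: Googlebot has to be able to crawl the page to see this tag, so the page must not also be disallowed in robots.txt.

```html
<!-- In the page's <head>: keep the page out of the index and
     prevent a cached copy from being served -->
<meta name="robots" content="noindex, noarchive">
```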
simonmlewis wrote:I thought putting blocks in robots.txt was a way of telling Google to outright ignore a URL, or any URL with a particular character or pattern in it. In which case, why does it show up in the 404s at all?
Alright, that sounds a little different than what I thought.

What you're saying is that you have particular pages which are currently 404ing (pages which will "go live" sometime in the future), and you are seeing in your server access logs that Googlebot tries to crawl them even though they're listed in robots.txt? Can you post the domain, or at least the contents of the robots.txt and the problematic URLs?

Or. Are you saying that you're using Google's Webmaster Tools/Search Console and it is reporting that when Google tries to crawl/index your domain it is getting 404s?
Or. Are you saying that you're using GWT/SC to inspect or analyze or do something to these particular URLs and the tool reports they are returning 404s?

You keep saying "Google" this and "Google" that, but there are many "Google" things out there that could be relevant. Be specific.

Re: 404 / blocked URL robots question

Post by simonmlewis »

Pardon me for being blunt, but I'm not sure why this is such a difficult thing to grasp.

I have URLs that I don't want to be cached or seen by Google. We were told that if a URL has, for example, an & in it, then adding that pattern to robots.txt will stop Google crawling those URLs.

Perfect. Exactly what we want.

Yet it still crawls those pages, as they are showing up as 404s now that those URLs are dead. At some point the robots.txt file was ignored (or damaged), and so rather a lot of pages were opened up to Google, and now there is a massive 404 list.

We figured that with robots.txt correctly in place, Google would not cache them any more. And yet, on another site, it has cached a TON of them, even with the robots file correctly in place. Those URLs, if I visit them, correctly go to 404s, but why are they showing up in Webmaster Tools if we are telling Google not to cache them?

Hence my first question: is the 404 list a generic set of 404 pages that Google has found but NOT CACHED because of the robots file?

Re: 404 / blocked URL robots question

Post by requinix »

simonmlewis wrote:Pardon me for being blunt, but I'm not sure why this is such a difficult thing to grasp.
My main difficulty is that you keep saying "blocked", but that word has no real technical meaning. You also keep referring to Google, and I'm still not 100% sure that you mean Googlebot.
simonmlewis wrote:I have URLs that I don't want to be cached or seen by Google.
You cannot prevent Google - or anybody - from seeing a URL. Referring to my statement above, you mean you don't want Googlebot to crawl or index those pages?
simonmlewis wrote:We were told that if a URL has, for example, an & in it, then adding that pattern to robots.txt will stop Google crawling those URLs.
Potentially, yes.
simonmlewis wrote:Yet it still crawls those pages, as they are showing up as 404s now that those URLs are dead. At some point the robots.txt file was ignored (or damaged), and so rather a lot of pages were opened up to Google, and now there is a massive 404 list.

We figured that with robots.txt correctly in place, Google would not cache them any more. And yet, on another site, it has cached a TON of them, even with the robots file correctly in place. Those URLs, if I visit them, correctly go to 404s, but why are they showing up in Webmaster Tools if we are telling Google not to cache them?
How long ago did you fix the robots.txt? Has Google recrawled your site since then?

And it would be very helpful, so I'll ask again: what is an example URL, and what are the contents of your robots.txt / what is the live domain?
simonmlewis wrote:Hence my first question: is the 404 list a generic set of 404 pages that Google has found but NOT CACHED because of the robots file?
The Crawl Errors page? It should be a list of URLs where the bot got a 404 when visiting them, so URLs disallowed in robots.txt should not show up in there.
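Incidentally, you can sanity-check simple Disallow rules locally with Python's urllib.robotparser (a sketch with made-up rules; note the stdlib parser does plain prefix matching and does not implement Google's * and $ wildcard extensions, so a rule like /*? will not behave the same way here):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; parse() takes the file's lines directly,
# so no network request is needed.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) -> True if the agent may crawl the URL
print(parser.can_fetch("Googlebot", "https://example.com/private/page"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/public/page"))   # True
```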

Re: 404 / blocked URL robots question

Post by simonmlewis »

OK, for one site the robots.txt was messed up, so that explains why a ton of them came back. On a side note, I don't know why, now that robots.txt is set up again, we cannot clear the thousands of 404s to confirm they are resolved. Now that the site is blocking Googlebot from seeing them, why does the list only go down by 1,000 a day?

The main reason I write here is that one of our other sites has had DOUBLE the amount of the previous site, and its robots.txt file is correct and always has been. We block /*?, so that a URL that starts /index.php?page=selector.... (it's a long URL) doesn't get crawled.

Yet all of a sudden this site has tens of thousands of these things appearing as 404s, even though robots.txt tells Googlebot not to crawl them.
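To illustrate how a rule like /*? is documented to match (this matcher is only a sketch of the documented wildcard behaviour, not Google's actual code):

```python
import re

def disallow_matches(pattern: str, path: str) -> bool:
    """Sketch of Google-style robots.txt matching: rules anchor at the
    start of the path, '*' matches any run of characters, and '$'
    anchors the end of the URL."""
    regex = "".join(
        ".*" if ch == "*" else "$" if ch == "$" else re.escape(ch)
        for ch in pattern
    )
    return re.match(regex, path) is not None

print(disallow_matches("/*?", "/index.php?page=selector"))  # True
print(disallow_matches("/*?", "/index.php"))                # False
```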

Re: 404 / blocked URL robots question

Post by requinix »

simonmlewis wrote:On a side note, I don't know why, now that robots.txt is set up again, we cannot clear the thousands of 404s to confirm they are resolved. Now that the site is blocking Googlebot from seeing them, why does the list only go down by 1,000 a day?
Because they don't want the bot to swamp a site with tons of traffic. You may be able to manually trigger a more massive recrawl, but otherwise just wait for it to catch up.
simonmlewis wrote:The main reason I write here is that one of our other sites has had DOUBLE the amount of the previous site, and its robots.txt file is correct and always has been. We block /*?, so that a URL that starts /index.php?page=selector.... (it's a long URL) doesn't get crawled.

Yet all of a sudden this site has tens of thousands of these things appearing as 404s, even though robots.txt tells Googlebot not to crawl them.
You should first verify with your own server's access logs whether Googlebot is actually crawling those pages; most of what I've seen suggests it isn't, but it could be that the list does not correctly reflect the bot's actual behaviour.

Otherwise I don't know what to tell you. I'm pretty sure Googlebot is not at fault here, but if you're sure everything is right on your end then maybe you need to try to find more official support from them.
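The access-log check can be sketched like this (the format assumed is Apache/Nginx "combined", and the sample lines are made up; adjust to your server):

```python
from collections import Counter

def googlebot_paths(log_lines):
    """Count requests per path from user agents claiming to be Googlebot,
    given lines in Apache/Nginx combined log format."""
    hits = Counter()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        parts = line.split('"')
        if len(parts) < 2:
            continue
        request = parts[1].split()  # e.g. ['GET', '/path', 'HTTP/1.1']
        if len(request) >= 2:
            hits[request[1]] += 1
    return hits

# Made-up sample lines; in practice read them from your access log file.
sample = [
    '66.249.66.1 - - [01/Jan/2016:00:00:01 +0000] "GET /index.php?page=selector HTTP/1.1" 404 209 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '203.0.113.9 - - [01/Jan/2016:00:00:02 +0000] "GET /about HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(dict(googlebot_paths(sample)))  # {'/index.php?page=selector': 1}
```

Bear in mind anyone can put "Googlebot" in their user agent, so for anything serious you'd also want to verify the requesting IPs really belong to Google.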

Re: 404 / blocked URL robots question

Post by simonmlewis »

So they do not allow you to just "clear" all of the thousands of 404s, even though I know full well those URLs are now blocked again? They only allow up to 1,000 clearances a day?

The ones on the other site are 100% URLs that have not existed for a good 3-4 years. And for some reason, in the past week, tens of thousands of them have just appeared.

My question is: if someone trying to cause us problems has generated tens of thousands of these URLs, the robots.txt file should block Googlebot from crawling them, so they should not appear in the 404 list at all?!

Re: 404 / blocked URL robots question

Post by requinix »

I don't know what I can tell you that I haven't already said a few times.

Re: 404 / blocked URL robots question

Post by simonmlewis »

It's a problem, then.
If a URL is blocked via robots.txt, it shouldn't appear in the 404s.