404 / blocked URL robots question
Moderator: General Moderators
simonmlewis
- DevNet Master
- Posts: 4435
- Joined: Wed Oct 08, 2008 3:39 pm
- Location: United Kingdom
In Google Webmaster Tools you can "Fetch as Google" to check whether a URL is fine, or whether it is being correctly blocked, for example URLs with & in them.
Why would the 404 Not Found list have a bunch of URLs with & in them?
Does Google show you which URLs are being requested, and when, and whether they will 404, even if its crawlers are blocked from them?
That way you could see a) what is being blocked and b) where visitors are being taken.
Or is there something fundamentally wrong if the crawlers are blocked from a & URL but it still appears in the 404 list?
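For reference, a pattern-based block like the one described would look something like this in robots.txt. Google supports the * wildcard as an extension to the original robots.txt spec, and this hypothetical rule disallows any URL containing &:

```
User-agent: Googlebot
Disallow: /*&
```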
Love PHP. Love CSS. Love learning new tricks too.
All the best from the United Kingdom.
Re: 404 / blocked URL robots question
What are you talking about with this "&" thing? You mean ampersands in a query string? There's nothing wrong with URLs like that.
simonmlewis
Re: 404 / blocked URL robots question
We block them for our own reasons, partly to avoid duplicate URLs.
But that isn't the point of the question.
Love PHP. Love CSS. Love learning new tricks too.
All the best from the United Kingdom.
Re: 404 / blocked URL robots question
You have a history of asking odd questions.
For example, you just said that you're blocking them, but one of your original questions was why the 404 list would have them. The answer should be obvious: because you're blocking them. That's such an obvious answer I have to wonder if I am correctly understanding your question in the first place.
Then you ask about whether Google is showing you requested URLs. Saying "if they are going to 404" sounds like you think Google can predict what will happen. They can't. They'll crawl your site and log what happens, and you can see parts of that log.
My conclusion is that you don't know what your blocks are doing. Ask the people who set up those blocks how they are working.
simonmlewis
Re: 404 / blocked URL robots question
I know exactly what I am asking.
We are using robots.txt to block certain URLs from Google's crawlers. If I "Fetch as Google" one of those URLs, it's confirmed as 'Blocked'.
In which case, why is Google logging these URLs... *at all*...?? Surely robots.txt is stopping it from even tracking them, from even seeing them.
Is that a clearer way of asking the question?
Why would it have a list at all, if we are blocking them?
If we have URLs that ARE real, but we don't want Google caching them, they won't appear in the 404s, as they are in fact live.
I thought putting blocks in robots.txt was a way of telling Google to outright ignore a URL, or a URL with a particular character/pattern in it. In which case, why does it show up in the 404s... AT ALL?
Love PHP. Love CSS. Love learning new tricks too.
All the best from the United Kingdom.
Re: 404 / blocked URL robots question
simonmlewis wrote: I know exactly what I am asking.
I'm sure you do. But I don't.

simonmlewis wrote: We are using robots.txt to block certain URLs from Google's crawlers. If I "Fetch as Google" one of those URLs, it's confirmed as 'Blocked'. In which case, why is Google logging these URLs... *at all*...?? Surely robots.txt is stopping it from even tracking them, from even seeing them.
Well, you did just tell it to. And it said "blocked". So there's your answer.

simonmlewis wrote: Why would it have a list at all, if we are blocking them?
Blocking them doesn't prevent them from knowing about the URLs' existence. It just tells them not to crawl or index the page.

simonmlewis wrote: If we have URLs that ARE real, but we don't want Google caching them, they won't appear in the 404s, as they are in fact live.
Okay... so use robots.txt or a <meta> tag to prevent Google from caching them.

simonmlewis wrote: I thought putting blocks in robots.txt was a way of telling Google to outright ignore a URL, or a URL with a particular character/pattern in it. In which case, why does it show up in the 404s... AT ALL?
Alright, that sounds a little different than what I thought.
What you're saying is that you have particular pages which are currently 404ing (pages which will "go live" sometime in the future), and are seeing in your server access logs Googlebot try to crawl them even though they're listed in a robots.txt? Can you post the domain, or at least the contents of the robots.txt and the problematic URLs?
Or. Are you saying that you're using Google's Webmaster Tools/Search Console and it is reporting that when Google tries to crawl/index your domain it is getting 404s?
Or. Are you saying that you're using GWT/SC to inspect or analyze or do something to these particular URLs and the tool reports they are returning 404s?
You keep saying "Google" this and "Google" that, but there are many "Google" things out there that could be relevant. Be specific.
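For what it's worth, the <meta> route mentioned above would look something like this in the page's <head>. The noarchive directive tells Google not to keep a cached copy, and noindex keeps the page out of the index entirely:

```html
<meta name="robots" content="noindex, noarchive">
```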
simonmlewis
Re: 404 / blocked URL robots question
Pardon me for being blunt, but I'm not sure why this is such a difficult thing to grasp.
I have URLs that I don't want to be cached or seen by Google. We were told that if the URL has, for example, & in it, then adding that to robots.txt would stop it crawling those URLs.
Perfect. Exactly what we want.
Yet it still crawls those pages, as they are showing up as 404s now that those URLs are dead. At some point the robots file was ignored (or damaged), and so rather a lot of pages were opened up to Google, and now there is a massive 404 list.
We figured that with the robots rules in place correctly, it would not cache them any more. And yet on another site it's cached a TON of them, even with the robots file correctly in place. Those URLs, if I visit them, correctly go to 404s, but why are they showing up in Webmaster Tools if we are telling Google not to cache them?
Hence my first question: is the 404 list a generic set of 404 pages that Google has found, but NOT CACHED because of the robots file?
Love PHP. Love CSS. Love learning new tricks too.
All the best from the United Kingdom.
Re: 404 / blocked URL robots question
simonmlewis wrote: Pardon me for being blunt, but I'm not sure why this is such a difficult thing to grasp.
My main difficulty is that you keep saying "blocked", but that word has no real technical meaning. You also keep referring to Google, and I'm still not 100% sure you mean Googlebot.

simonmlewis wrote: I have URLs that I don't want to be cached or seen by Google.
You cannot prevent Google, or anybody, from seeing a URL. Referring to my statement above: you mean you don't want Googlebot to crawl or index those pages?

simonmlewis wrote: We were told that if the URL has, for example, & in it, then adding that to robots.txt would stop it crawling those URLs.
Potentially, yes.

simonmlewis wrote: Yet it still crawls those pages, as they are showing up as 404s now that those URLs are dead. At some point the robots file was ignored (or damaged), and so rather a lot of pages were opened up to Google, and now there is a massive 404 list.
How long ago did you fix the robots.txt? Has Google recrawled your site since then?

simonmlewis wrote: We figured that with the robots rules in place correctly, it would not cache them any more. And yet on another site it's cached a TON of them, even with the robots file correctly in place. Those URLs, if I visit them, correctly go to 404s, but why are they showing up in Webmaster Tools if we are telling Google not to cache them?
And it would be very helpful, so I'll ask again: what is an example URL, and what are the contents of your robots.txt / what is the live domain?

simonmlewis wrote: Hence my first question: is the 404 list a generic set of 404 pages that Google has found, but NOT CACHED because of the robots file?
The Crawl Errors page? It should be a list of URLs where the bot encountered 404s while visiting, so URLs disallowed in robots.txt should not show up in there.
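If it helps, a disallow rule can be sanity-checked offline with Python's standard-library urllib.robotparser. Note that it implements the original prefix-matching robots.txt spec only, not Google's * wildcard extension, and the rule and URLs below are invented for illustration:

```python
from urllib import robotparser

# Feed hypothetical rules directly instead of fetching a live file.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A URL under the disallowed prefix is blocked; others are allowed.
print(rp.can_fetch("Googlebot", "http://example.com/private/page"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))    # True
```

This only tells you what a spec-following crawler *should* do; it says nothing about whether a given bot actually honors the file.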
simonmlewis
Re: 404 / blocked URL robots question
OK, for one site the robots file was messed up, so that explains why a ton of them came back. On a side note, I don't know why, now that robots.txt is set up again, we cannot clear the thousands of 404s to confirm they are resolved. As the site is now blocking 'googlebot' from seeing them, why does the list only go down by 1000 a day??
The main reason I write here is because one of our other sites has had DOUBLE the amount of the previous site, and its robots file is correct and has always been so. We block /*?, so that a URL that starts /index.php?page=selector.... (it's a long URL) doesn't get crawled.
Yet all of a sudden this site has tens of thousands of these things appearing as 404s, even though we tell robots.txt not to crawl them.
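To see why a /*? rule should cover those URLs, here is a minimal sketch of Google-style pattern matching (the * wildcard and $ end-anchor are extensions to the original robots.txt spec; the helper name and example paths are made up):

```python
import re

def robots_pattern_matches(pattern: str, url_path: str) -> bool:
    """Translate a Google-style robots.txt pattern (supporting * and $)
    into a regex anchored at the start of the URL path + query string."""
    regex = "".join(
        ".*" if ch == "*" else re.escape(ch)
        for ch in pattern.rstrip("$")
    )
    if pattern.endswith("$"):
        regex += "$"
    return re.match(regex, url_path) is not None

# A Disallow of /*? should match any URL with a query string:
print(robots_pattern_matches("/*?", "/index.php?page=selector"))  # True
print(robots_pattern_matches("/*?", "/index.php"))                # False
```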
Love PHP. Love CSS. Love learning new tricks too.
All the best from the United Kingdom.
Re: 404 / blocked URL robots question
simonmlewis wrote: On a side note, I don't know why, now that robots.txt is set up again, we cannot clear the thousands of 404s to confirm they are resolved. As the site is now blocking 'googlebot' from seeing them, why does the list only go down by 1000 a day??
Because they don't want the bot to swamp a site with tons of traffic. You may be able to manually trigger a more massive recrawl, but otherwise just wait for it to catch up.

simonmlewis wrote: The main reason I write here is because one of our other sites has had DOUBLE the amount of the previous site, and its robots file is correct and has always been so. We block /*?, so that a URL that starts /index.php?page=selector.... (it's a long URL) doesn't get crawled. Yet all of a sudden this site has tens of thousands of these things appearing as 404s, even though we tell robots.txt not to crawl them.
You should first verify with your own server's access logs whether Googlebot is truly crawling the pages; while most of what I've seen suggests the opposite, it could be that the list does not correctly reflect the bot's actual behavior.
Otherwise I don't know what to tell you. I'm pretty sure Googlebot is not at fault here, but if you're sure everything is right on your end then maybe you need to find more official support from them.
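A quick way to do that access-log check, assuming a combined-format log (the log lines, IP addresses, and timestamps below are made up for illustration; in practice read them from your server's log file):

```python
import re

# Hypothetical combined-format access log lines.
log_lines = [
    '66.249.66.1 - - [10/May/2016:10:00:00 +0000] '
    '"GET /index.php?page=selector HTTP/1.1" 404 209 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.5 - - [10/May/2016:10:01:00 +0000] '
    '"GET /about HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

# Keep only requests claiming to be Googlebot that hit a URL with a
# query string. A thorough check should also verify the client IP via
# reverse DNS, since the User-Agent header can be spoofed.
googlebot_query_hits = [
    line for line in log_lines
    if "Googlebot" in line and re.search(r'"GET [^ ]*\?', line)
]
for line in googlebot_query_hits:
    print(line)
```

If disallowed URLs never appear in this filtered list, the crawler is honoring robots.txt and the Search Console report is drawing on something else (e.g. previously known URLs).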
simonmlewis
Re: 404 / blocked URL robots question
So do they not allow you to just "clear" all the thousands of 404s, even though I know full well those URLs are now blocked again? They only allow up to 1000 clearances a day?
The ones on the other site are 100% URLs that have not been there for a good 3-4 years. And for some reason, in the past week these tens of thousands have just appeared.
My question is: if someone is trying to cause us problems and has generated tens of thousands of these URLs, the robots file should block Googlebot from crawling them, so they should not appear in the 404 list at all?!
Love PHP. Love CSS. Love learning new tricks too.
All the best from the United Kingdom.
Re: 404 / blocked URL robots question
I don't know what I can tell you that I haven't already said a few times.
simonmlewis
Re: 404 / blocked URL robots question
It's a problem then.
If it's blocked via robots, it shouldn't therefore appear in the 404s.
Love PHP. Love CSS. Love learning new tricks too.
All the best from the United Kingdom.