Preserve and utilise your web server logs
Kaspar says: “A best practice that has proven phenomenally valuable for large websites time and again is saving, preserving, and utilising raw web server logs. The vast majority of websites are not taking full advantage of this opportunity, even though they could greatly benefit from it.”
Why is this important?
“The reason for utilising server logs is so that you can know rather than guess.
Typically, for a large website, one volume of landing pages is included in the sitemap, and there will be another volume of landing pages that are desirable - the cash cows and pages that we want to have indexed and crawled on a regular basis. Those two do not necessarily overlap 100%. I’ve been doing this for a really long time, and in my experience, there is rarely a large overlap.
You can use server logs to look at an extended period of time, and determine which pages you are telling search engines that you care about, which pages you actually care about being crawled and indexed, and which pages are being prioritised by search engines. In an ideal world, there will be a 100% overlap between those three volumes of landing pages. Frequently, however, there will be very little overlap at all.
Only by using server logs can you put yourself in a position to improve this. Crawl budget management comes in here (which is especially important for large websites) but it doesn’t stop there. By having server logs that cover an extended period of time (we’re talking six months to a year), you can actually evaluate your server responses. Among the most important of these are your HTTP status codes. Are you getting 200 OKs? Do you have soft 404s and error pages? These are things that you can only truly understand by running a server log analysis.
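To illustrate the kind of status-code analysis described above, the sketch below tallies response codes from raw access-log lines. It assumes the common Apache/Nginx combined log format; the regular expression and sample lines are illustrative, not taken from any particular server setup.

```python
import re
from collections import Counter

# Assumed combined log format: IP, identd, user, [timestamp], "METHOD path HTTP/x", status, bytes
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3})'
)

def status_counts(lines):
    """Tally HTTP status codes seen in an iterable of raw log lines."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            counts[m.group("status")] += 1
    return counts

# Illustrative sample lines (not real traffic):
sample = [
    '66.249.66.1 - - [10/Jan/2023:10:00:00 +0000] "GET /shoes/red HTTP/1.1" 200 5120',
    '66.249.66.1 - - [10/Jan/2023:10:00:05 +0000] "GET /old-page HTTP/1.1" 404 320',
    '66.249.66.2 - - [10/Jan/2023:10:01:00 +0000] "GET /shoes/red HTTP/1.1" 200 5120',
]
print(status_counts(sample))  # Counter({'200': 2, '404': 1})
```

The same parsed records can then be grouped by path or by date to surface error pages that search engines keep requesting over the extended period Kaspar describes.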
Unfortunately, most websites do not take advantage of that - and that’s a loss. SEO is becoming more and more technical all the time. This is something that large websites can benefit greatly from if you start saving and preserving server logs today.”
Is it possible to get information on things like 200 OKs and soft 404s from online crawl tools or is that specific information only available by looking at log files?
“You can gain some insights through tools like Bing Webmaster Tools and Google Search Console. Search Console, for example, will pick up on soft 404s (which can be a sore point for large retail websites) but it’s just a sample. The server logs are the only way that you can actually tell how much of the crawl budget goes towards landing pages that cannot generate revenue - because whatever used to be sold on those landing pages is sold out or unavailable.
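As a rough sketch of the crawl-budget measurement Kaspar mentions, the snippet below computes what share of bot requests went to pages that cannot generate revenue. The path lists are hypothetical; in practice the bot paths would come from verified bot entries in your logs and the sold-out list from your inventory system.

```python
def crawl_share(bot_paths, non_revenue_paths):
    """Fraction of bot requests spent on pages that cannot convert (e.g. sold out)."""
    non_revenue = set(non_revenue_paths)
    hits = sum(1 for p in bot_paths if p in non_revenue)
    return hits / len(bot_paths) if bot_paths else 0.0

# Hypothetical data: paths a crawler requested vs. pages known to be sold out.
bot_paths = ["/shoes/red", "/shoes/blue", "/tickets/2021-tour", "/shoes/red"]
sold_out = ["/tickets/2021-tour"]
print(f"{crawl_share(bot_paths, sold_out):.0%} of crawl budget spent on sold-out pages")
```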
We can gain some insights from third-party tools, and there are great tools out there, but server logs are a critical element. It is possible to have an SEO audit without server logs but, with them, the insights are so much more precise.
It is also important to identify why a website’s visibility fluctuates, and server logs let you understand a lot more here. You might be able to see, for instance, that there is an increase in bot activity before a page drops in ranking because Googlebot is trying to figure out what the page is about. These kinds of correlations are highly relevant from an SEO perspective, but they are only visible if you have your server logs at hand.”
Are website fluctuations a common issue?
“One of the most common questions that we encounter is: ‘I’ve got a substantial website and a substantial brand, but it goes up and down in Google search visibility. Why is this happening?’ Preferably, you want your site’s position to be improving but, at the very least, you want it to be relatively stable. Most of the time, this is something that can be corroborated when you look into the data. This is a very important and very common question, even for large websites.
Server logs are incredibly handy to have so that you can address these kinds of questions very specifically and precisely. Many large organisations do not record these logs, and if they have not been recorded then they can never be recovered. Either you record them, or you don’t. Some organisations will only partially record server logs or retain them for a very short period of time. That’s also problematic because it doesn’t allow for a holistic picture.”
For soft 404s, why does Google typically think that a page should be a 404 even though the server responds with a 200 OK, and how can SEOs fix that?
“I like the fact that you said ‘Google thinks it’s a 404’, because it’s often not accurate. If you happen to have a landing page that expired - perhaps a commercial item that’s sold out or tickets for a concert that has ended - then the product is unavailable. In Google’s mind, that should be a 404. If it returns a 200 OK instead, users can still find that landing page in the SERPs and click through to it; the server keeps responding 200 OK, yet the page says the item is unavailable. That’s a typical soft 404.
Google can recognise this. They are able to recognise the server response and the on-page content, but their on-page content recognition isn’t flawless. There are many instances where Google says that you have soft 404s on your website, but they are just picking up on negative wording within perfectly relevant 200 OK landing pages.
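A simple heuristic of the kind described here - flagging 200 OK pages whose copy says the item is gone - might look like the sketch below. The phrase list is purely illustrative and, just as Kaspar notes about Google’s own detection, keyword matching can misfire, so any flags need human review against the server logs.

```python
# Illustrative phrases only; a real list would be tuned per site and language.
UNAVAILABLE_PHRASES = ("sold out", "no longer available", "event has ended")

def looks_like_soft_404(status_code, page_text):
    """Heuristic: a 200 OK whose visible copy says the item is gone
    may be a soft 404. Keyword matching can misfire on pages that
    merely mention such wording, so treat results as candidates."""
    if status_code != 200:
        return False
    text = page_text.lower()
    return any(phrase in text for phrase in UNAVAILABLE_PHRASES)

print(looks_like_soft_404(200, "Sorry, this concert has sold out."))  # True
print(looks_like_soft_404(404, "Page not found"))                     # False
```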
Only through server logs can you tell whether those are real soft 404s. Then, you can start to determine what you are going to do about it if they are. You might want to make them into real 404s or custom 404s (which are my favourite). This is where a page still says ‘404’, but it provides some added value. It provides an alternative for the user, to keep that lead and keep the user on the website in a meaningful, user-relevant way. If not, you might want to noindex those pages. These alternatives are not really available without data analysis, which is always better when you have server logs at hand.”
Is it possible to serve a bespoke 404 based on something like the category of the website?
“You have to stay consistent in the mind of the search engines. If you were to provide a different response to bots and users, that’s something that Google might frown on. It might be considered user agent cloaking. From Google’s perspective, 404s are something that should be utilised if the content is gone and it’s not coming back anytime soon.
Of course, this is an ideal scenario that doesn’t always happen. Publishers are sometimes of the opinion that they can retain some of their PageRank equity - which is very debatable when we are talking about sales pages that typically attract very little PageRank equity to begin with. They try to retain that equity by 301 redirecting those ‘404’ pages to the root, or to the category, and end up creating yet more soft 404s.
This is a huge topic, but the bottom line is that 404s are there for a reason. They help the user understand that something is not available, and they help search engines to understand the same. They help us to make sure that the user experience is a good one. If Google understands (based on their data) that the user experience isn’t great, then that’s something that causes websites to drop in organic search. You want to prevent that from happening.”
How do you preserve your log files? Where is it best to store them and what software is best for accessing them?
“That’s actually something that needs to be considered on an individual basis because every website is different, and every architecture and every technological setup is different. Often, larger organisations will merge a number of websites and they will have a variety of solutions in place.
Saving and preserving should, of course, be done in a manner that ensures security and data integrity. It depends on the setup and the facilities of the organisation, but it can be done on separate physical hard drives - which are pretty cheap nowadays.
I am often asked whether preserving and analysing server logs is going to cost a lot, and typically it is not expensive at all. If you’re looking at generating data for SEO it’s a relatively minor cost, especially for a large organisation.
The other objection that is often raised is the legal concern. I’m not providing any legal advice (I have no legal background; my background is purely focused on SEO), but there is no legal limitation on utilising and preserving raw web server logs if they’re anonymised. If you’re analysing raw files, you’re only interested in bots. Whenever a human user has been accessing the website, that’s of no consequence and can be completely dropped. The most important part of that data - the data treasure trove that can be built up over time - is verified bot entries. We’re mainly talking about Googlebot and Bingbot. Once they have been isolated, that data does not pertain to any human user. That’s the data that’s really important.
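Isolating verified bot entries is commonly done the way Google documents it: reverse-DNS the requesting IP, check the hostname’s domain, then forward-DNS the hostname and confirm it resolves back to the same IP. The sketch below follows that pattern; the lookup functions are injectable so it can be exercised without network access (the defaults use the standard `socket` calls).

```python
import socket

GOOGLEBOT_DOMAINS = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Verify a claimed Googlebot IP: reverse-DNS the IP, check the
    hostname's domain, then forward-DNS the hostname and confirm it
    resolves back to the same IP. Spoofed user agents fail this check."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or socket.gethostbyname
    try:
        host = reverse_lookup(ip)
        if not host.endswith(GOOGLEBOT_DOMAINS):
            return False
        return forward_lookup(host) == ip
    except OSError:
        return False
```

Google also publishes its crawler IP ranges, which some pipelines use instead of per-request DNS lookups; either way, only requests passing verification go into the long-term bot dataset, so no human-user data needs to be retained.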
Yes, it does touch on GDPR (or CCPA in the US), but it’s not user data we’re looking at. Neither legal obligations nor cost are roadblocks to saving and preserving your server logs. However, in terms of how to do this in the most efficient way, it comes down to the individual organisation and setup.”
Is it useful and worthwhile to combine log file data with Search Console data?
“Absolutely. As a Google fanboy for a really long time, I also want to say that Bing Webmaster Tools is equally relevant and a great way to verify data. There are alternatives as well; Majestic is a fantastic tool, for example. The more the merrier, from my perspective.
You want to have as much relevant and fresh data as possible, but you also want to utilise a number of tools - both as data points and for analysis. Best case scenario: you’re going to be able to verify the findings. If they contradict each other, you can dive a little bit deeper and drill down into the data. It absolutely makes sense to combine your log data with other tools.”
What shouldn’t SEOs be doing in 2023? What’s seductive in terms of time, but ultimately counterproductive?
“You shouldn’t be building links for PageRank purposes. Yes, it can work - otherwise Google wouldn’t be penalising the practice - but it’s very much a double-edged sword.
You never know whether the effort and budget that you are putting in is a complete waste of your resources and money - and it can always trigger Google’s wrath. Google still stands by its position (set out in the Google Webmaster Guidelines) that building links is a violation and that links should be merit-based. Hence, websites that choose to build links or do not clean up their legacy backlink issues run the risk of being penalised. Every penalty can be fixed, but it’s much better not to take the risk, as it is a business risk at the end of the day.”
Kaspar Szymanski is a Director at SearchBrothers and you can find him over at searchbrothers.com.