Last Wednesday we had Julien Deneuville, together with Tom Pool and Faisal Anderson, live on Tea Time SEO sharing their tips and recommendations on log file analysis. If you missed the talk, you can read a summary of Julien’s and Tom’s tips below, and why not read the presentation uploaded to SlideShare.
What is log file analysis?
Log file analysis is essential for technical SEO, as it’s the only way to really know what Googlebot is up to on your website. It’s also a great tool to see which parts of your website are efficient or not, and to manage your crawl budget.
What do you need for log file analysis?
1. Reliable Data Before You Analyse the Log files
The first step for any log file analysis is to make sure the data is reliable. I like to start by taking a sample of logs, just to check that all the required fields are there. Proper log file analysis can’t be done without the request URI, the response code, the user agent and the timestamp of the request. Additional data such as the client IP (to filter out fake Googlebots), response size or response time can be very useful too. Unfortunately, if some mandatory data is missing, you’ll need to fix the logging configuration and then wait at least a month to collect enough data. A tip for IIS users: by default, this server logs the filename and the query string for each request, but not the rewritten URL, which can be quite misleading.
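As a rough illustration of that first check, here is a minimal Python sketch that parses one sample line and reports which mandatory fields are missing. It assumes the Apache/Nginx "combined" log format; the pattern, the field names and the sample line are made up for the example, so adapt them to whatever your server actually writes.

```python
import re

# Assumption: Apache/Nginx "combined" format. Other servers (IIS, CDNs) need another pattern.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# The four fields the analysis cannot do without.
REQUIRED = ("uri", "status", "user_agent", "timestamp")


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the assumed format."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


def missing_fields(record):
    """List required fields that are absent or empty ("-" is the usual placeholder)."""
    return [name for name in REQUIRED if record.get(name) in (None, "", "-")]


# Illustrative line, not taken from a real log.
sample = (
    '66.249.66.1 - - [10/Oct/2019:13:55:36 +0000] '
    '"GET /category/shoes?page=2 HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
record = parse_line(sample)
print(record["uri"], record["status"], missing_fields(record))
```

Running this over a few thousand sampled lines and counting how often `parse_line` returns `None`, or `missing_fields` returns something, gives a quick idea of whether the logs are usable as they are.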
Secondly, you need to make sure you’ve got the complete logs. With CDNs, load balancers and other complex hosting setups, it’s very easy to forget about a server hidden somewhere.
Compare the number of Googlebot hits you see in the logs with what Google Search Console’s crawl stats report shows (only in the old Webmaster Tools for now, but let’s hope it won’t disappear!). There should be small differences, but nothing huge. You can also compare the volume of referral traffic with your analytics tool.
2. Team work – be nice to your team
First of all, you probably won’t get your hands on any log file without talking to your IT team. Be nice to them 😉 Remember to give them feedback and explain what you found in the log data; it’s always appreciated. Log files contain your website’s entire traffic data, not just SEO traffic. From CPC and other acquisition channels to Business Intelligence or development teams, many people can benefit from the insights you’ll be able to bring them. And once you’ve shown the value of log file analysis to other teams, it will be much easier to share with them the cost of setting up a proper monitoring tool.
3. Set custom KPI(s)
Log files are a goldmine for SEO insights, but it’s easy to drown in the data. Some KPIs can give you a quick overview of what’s going on, and let you focus on what matters. For example, calculating the ratio between Googlebot hits and SEO visits for each page type on a given website will instantly show you which page types need your attention. Focus on what’s below or above average, as the “good” ratio can vary between websites. With experience, you’ll find other shortcuts to save time and concentrate on the essentials.
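To make that ratio concrete, here is a small Python sketch. It assumes you have already split your log lines into two lists of URIs, those requested by Googlebot and those that were landing pages for visits from organic search. The `page_type` classifier (first path segment) and the sample data are hypothetical; a real site will need its own rules.

```python
from collections import Counter


def page_type(uri):
    """Hypothetical classifier: use the first path segment as the page type."""
    path = uri.split("?", 1)[0]
    segment = path.strip("/").split("/", 1)[0]
    return segment or "home"


def crawl_to_visit_ratio(bot_uris, visit_uris):
    """Googlebot hits divided by SEO visits, per page type (None when there are no visits)."""
    bot_hits = Counter(page_type(u) for u in bot_uris)
    visits = Counter(page_type(u) for u in visit_uris)
    ratios = {}
    for ptype in sorted(set(bot_hits) | set(visits)):
        ratios[ptype] = bot_hits[ptype] / visits[ptype] if visits[ptype] else None
    return ratios


# Made-up inputs for illustration only.
bot_uris = ["/product/1", "/product/2", "/product/3", "/blog/a", "/search?q=x", "/search?q=y"]
visit_uris = ["/product/1", "/blog/a", "/blog/a", "/blog/b", "/blog/c"]

ratios = crawl_to_visit_ratio(bot_uris, visit_uris)
print(ratios)  # {'blog': 0.25, 'product': 3.0, 'search': None}
```

In this toy data, the search pages are crawled but bring no visits at all, and the blog gets far more visits than crawl, which are exactly the kinds of outliers worth a closer look.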
When should you do log file analysis?
One time only and also for monitoring your website
Log file analysis can be a one-time-only thing, when auditing a website. It will give you valuable insight and get you a step further. But logs are also a great monitoring tool, especially on larger websites. Premium tools like OnCrawl will fetch your log files daily or in real time and provide an ongoing analysis. You’ll be able to detect crawl errors, ensure your new pages are correctly crawled, quickly detect any problem when handling a migration, and more broadly discover any change in Googlebot’s behavior. For instance, log files will tell you whether you’re in the mobile-first index or not.
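If you don’t have a dedicated tool yet, a very basic version of this monitoring can be sketched in Python. The example below assumes log lines have already been parsed into `(day, status, user_agent)` tuples, and it classifies Googlebot by user agent string alone (a simplification: fake Googlebots are not filtered out here). The records are made up. A mostly-smartphone crawl is commonly read as a sign of mobile-first indexing, but treat it as a hint rather than proof.

```python
from collections import defaultdict

DESKTOP_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SMARTPHONE_UA = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def googlebot_kind(user_agent):
    """Return 'smartphone', 'desktop', or None when the user agent is not Googlebot."""
    if "Googlebot" not in user_agent:
        return None
    return "smartphone" if "Mobile" in user_agent else "desktop"


def daily_summary(records):
    """Per day: Googlebot hits, hits answered with 4xx/5xx, and smartphone share."""
    days = defaultdict(lambda: {"hits": 0, "errors": 0, "smartphone": 0})
    for day, status, user_agent in records:
        kind = googlebot_kind(user_agent)
        if kind is None:
            continue  # not Googlebot, ignore for this report
        stats = days[day]
        stats["hits"] += 1
        if status >= 400:
            stats["errors"] += 1
        if kind == "smartphone":
            stats["smartphone"] += 1
    return {
        day: {**stats, "smartphone_share": stats["smartphone"] / stats["hits"]}
        for day, stats in sorted(days.items())
    }


# Made-up records for illustration only.
records = [
    ("2019-10-10", 200, SMARTPHONE_UA),
    ("2019-10-10", 404, SMARTPHONE_UA),
    ("2019-10-10", 200, DESKTOP_UA),
    ("2019-10-10", 200, "Mozilla/5.0 (Windows NT 10.0) Chrome/77.0"),
    ("2019-10-11", 500, DESKTOP_UA),
]
summary = daily_summary(records)
for day, stats in summary.items():
    print(day, stats)
```

Comparing each day against the previous ones (a jump in errors, a drop in hits, a shift in smartphone share) is what turns this from a one-off audit into monitoring.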
Thank you Julien for your tips. We’ll now pass over to Tom to share his on:
Why carry out log file analysis?
1. Reduce Crawl Budget Waste
There are a whole lot of potential crawl budget wastes that can be identified with log file analysis. Duplication of content can be a big problem, perhaps where Google (or Bing) is crawling many ‘versions’ of pages, due to odd parameters, forms, calendars or other weird things. Spotting these patterns in the logs helps you identify what Google ‘sees’ as a link, and can help reduce the number of unnecessary pages that Google crawls.
You’ll want to look into the most and least crawled URLs, and work out why. Is there maybe some page that you thought was really helpful, but Google isn’t crawling it – or vice versa?
Consider adding more links from the most crawled pages to those that see little or no activity. Do make sure that the linking makes sense from a user perspective, otherwise Google might not see any value in the link.
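Finding the most and least crawled URLs is mostly a counting exercise. Here is a minimal Python sketch, assuming you have a list of URIs requested by Googlebot and a list of URLs you care about (from a sitemap, for example). Both lists below are invented for the example.

```python
from collections import Counter


def crawl_frequency(bot_uris, known_urls):
    """Return (url, hits) pairs, most crawled first; known URLs with no hits get 0."""
    counts = Counter(bot_uris)
    all_urls = set(known_urls) | set(counts)
    return sorted(((url, counts[url]) for url in all_urls), key=lambda pair: (-pair[1], pair[0]))


# Made-up data for illustration only.
bot_uris = ["/", "/", "/", "/category/shoes", "/category/shoes", "/calendar?month=13"]
known_urls = ["/", "/category/shoes", "/guide/sizing"]

ranked = crawl_frequency(bot_uris, known_urls)
most_crawled = ranked[:2]
never_crawled = [url for url, hits in ranked if hits == 0]

print(most_crawled)   # [('/', 3), ('/category/shoes', 2)]
print(never_crawled)  # ['/guide/sizing']
```

Here the never-crawled guide page would be a candidate for a link from one of the heavily crawled pages, and the odd calendar URL is the kind of thing to investigate as possible crawl budget waste.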
2. Gather insights for your company/client
Combining log files with crawl data can also provide a lot of valuable information. Are there URLs that are found in one dataset that aren’t found in the other? Is Google crawling URLs that are not linked within the crawl data that you have? Are there URLs found in the crawl that Google doesn’t know or care about?
These are all areas that you’ll want to further explore, and make recommendations to ensure all pages that you care about are being crawled, and ones that you don’t, aren’t.
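The comparison between the two datasets boils down to set operations. A minimal Python sketch, assuming both lists use the same URL form (same host, same trailing-slash and parameter conventions); the URLs are made up:

```python
def compare_crawl_and_logs(crawled_urls, logged_urls):
    """Split URLs into those seen in both datasets and those seen in only one."""
    crawled, logged = set(crawled_urls), set(logged_urls)
    return {
        "in_both": sorted(crawled & logged),
        # Requested by Googlebot but not found by your crawler (orphans, old URLs...).
        "only_in_logs": sorted(logged - crawled),
        # Linked on the site but not requested by Googlebot in this period.
        "only_in_crawl": sorted(crawled - logged),
    }


# Made-up data for illustration only.
crawled_urls = ["/", "/category/shoes", "/guide/sizing"]
logged_urls = ["/", "/category/shoes", "/old-campaign-2017"]

report = compare_crawl_and_logs(crawled_urls, logged_urls)
print(report["only_in_logs"])   # ['/old-campaign-2017']
print(report["only_in_crawl"])  # ['/guide/sizing']
```

Normalising the URLs before comparing matters more than the comparison itself: a mismatch in protocol or trailing slash will make every URL look like it is missing from the other dataset.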
It’s also worth looking to see if internal linking matches up with data shown within the log files. I’ve personally seen cases where the IA of a site has almost exactly reflected the most popular URLs seen by Googlebot. This can be a powerful motivator to stakeholders if you’re struggling to get a new page or section added to the overall IA of a site.
3. Referrer data can be a goldmine – if logs are set up to capture this!
Referrer data is absolutely awesome; however, a lot of logging solutions don’t have this set up by default. If possible, set logs up to capture this data! Then you can see where requests have come from, identify popular entry and exit pages, and see which site or page sends the most referral traffic. You can also capture user data, and identify the user funnel better. Match this up with analytics data to get the most insight.
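Once the referrer is in your logs, ranking the sites that send you traffic takes only a few lines. A minimal Python sketch, assuming you have extracted the referrer field from each request; `own_host` and the sample values are hypothetical:

```python
from collections import Counter
from urllib.parse import urlsplit


def top_referrers(referrers, own_host):
    """Count external referring hosts, most frequent first."""
    hosts = Counter()
    for referrer in referrers:
        host = urlsplit(referrer).hostname  # None for "-" or an empty referrer
        if host and host != own_host:
            hosts[host] += 1
    return hosts.most_common()


# Made-up referrer values for illustration only.
referrers = [
    "https://www.google.com/",
    "https://www.google.com/search?q=shoes",
    "-",
    "https://www.example.com/category/shoes",  # internal navigation, excluded
    "https://news.example.org/article",
]

print(top_referrers(referrers, own_host="www.example.com"))
# [('www.google.com', 2), ('news.example.org', 1)]
```

Keeping the internal referrers instead of discarding them is a simple variation that shows which of your own pages lead users to which others.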
How can you get started with log file analysis?
With the amount of potential data that you could be working with, you’ll want to ensure that it is real! Use a reverse DNS lookup on the client IP address to verify that any Googlebot traffic you have data for is the real deal, and not a fake. I’ve personally crawled a number of sites using a Googlebot user agent, so your logs may well contain this kind of fake Googlebot traffic too.
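That verification can be scripted. The Python sketch below follows the commonly documented approach: reverse DNS lookup on the IP, check that the hostname belongs to a Google domain, then a forward lookup to confirm the hostname resolves back to the same IP. The accepted domain suffixes are an assumption to check against Google’s current documentation, and the stub resolvers exist only so the example runs without network access.

```python
import socket

# Assumption: hostnames of genuine Googlebot end with one of these suffixes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Reverse DNS, domain check, then forward DNS back to the same IP (IPv4 by default)."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in addresses


# Offline stand-ins for DNS, for demonstration only (hypothetical records).
PTR = {"66.249.66.1": "crawl-66-249-66-1.googlebot.com", "203.0.113.9": "host.example.net"}
A = {"crawl-66-249-66-1.googlebot.com": ["66.249.66.1"]}


def stub_reverse(ip):
    if ip not in PTR:
        raise OSError("no PTR record")
    return (PTR[ip], [], [ip])


def stub_forward(hostname):
    if hostname not in A:
        raise OSError("no A record")
    return (hostname, [], A[hostname])


print(is_real_googlebot("66.249.66.1", stub_reverse, stub_forward))  # True
print(is_real_googlebot("203.0.113.9", stub_reverse, stub_forward))  # False
```

Called without the stub arguments, the function performs real DNS lookups, so on large logs it is worth running it once per distinct IP and caching the result.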
Following on from something that Faisal mentioned – use all the (relevant) tools that are at your disposal to be able to get the best insights. A personal favourite of mine for large scale data manipulation and analysis is Python, using the Pandas library. This enables you to manipulate massive amounts of data super easily, and can really help speed up the log file analysis process.
Thank you Tom for your tips. Watch the full episode below with Tom, Faisal and Julien.
If you’re crawling a particularly large website, consider enabling a crawl limit. You can use this to check for duplicate content issues which may not be picked up otherwise. Then, when you’re making recommendations and implementing them, you are making full use of your developer or resource, as you’ve got a comprehensive view on all the issues. Once this is complete, it should be easier to re-crawl the website and repeat the process to see if there are any more duplicate issues remaining.