Fili Wiese: Why I Adore Sitemaps, An Ex-Google Engineer's Love Story

Fili Wiese is an SEO expert at SearchBrothers.com. In this talk he covers everything you need to know about sitemaps, from best practices, strategy and measuring success to avoiding common issues and more. He provides unique insights with tips, ideas, strategy considerations and tactics on how to take maximum advantage of sitemaps for your website. Check out his live recording at BrightonSEO April 2019, followed by the transcript, below.

Fili Wiese's Transcript

Hello everyone, it's an absolute pleasure to be back here. Let's talk about a topic I am extremely passionate about. Yes, that does happen in SEO; I'm a bit of a geek, so I do like the technical details. So, let's talk about sitemaps. I know, you're thinking that sounds like a boring topic, but no, it is actually quite cool and I'll tell you why. Sitemaps, for anyone that doesn't know, are simple files on your server that basically contain all the URLs of indexable content on your website. Now, these URLs we can find by looking at your database, or we can crawl your website to extract them. A lot of people think this manages crawl budget. Unfortunately, it doesn't. What it does do, however, is communicate to search engines what our priorities are: which URLs we want to have crawled more frequently, which ones are important to us, in a nutshell. It also strengthens our canonical signals, which is actually quite important, so we can avoid messages like these. Sitemaps help us communicate a consistent preferred URL to search engines like Google.

Now, I did talk last year about optimizing for search bots, so if you want to check out that talk, this is the URL. I also do, as Kelvin said, the BrightonSEO Advanced Technical SEO on-page training, so feel free to sign up for September 12 if you want to join me for that one.

Coming back to sitemaps, there are a couple of different formats that we can use. One old-school one that you may have heard of is the HTML sitemap. Now, I personally do not like the HTML sitemap that much, and I'll explain a little bit why later. The other, and most used, format is the XML sitemap, which also allows us to use sitemap indexes, which is pretty cool. In addition, it allows us to add extra information: image, news or video sitemaps, as well as, and this is actually pretty cool and I'll show you an example of why, hreflang annotations. Another format that a lot of people are not aware of is feeds, and most CMS systems support feeds. So they already have a feed available, which means you already have a sitemap. There is a downside with a feed like RSS or Atom, though: it mostly contains the latest URLs and not everything, so it's not ideal. Another alternative is the plain text sitemap. Now, I do run into situations where developer teams are refusing, for whatever reason, to program an XML sitemap. But you can convince them to set up a plain text file with one URL per line, because that is basically a valid sitemap. So by all means do that if you can't do XML. It's also quite interesting that in the industry right now Bing as well as Google are working on indexing APIs, and some people say these will replace the XML sitemap. Bing launched a similar thing a few months ago. I don't think, though, that the indexing API will replace sitemaps, because although we can have the indexing API communicate which URLs we want to have crawled now, we still can't add the additional metadata like hreflang, news, video etc. So I don't think XML sitemaps are going away anytime soon, and sincerely, I really hope they don't, because they are way too cool.

Now, we do want to measure success, and what we can do is structure our sitemaps in a way that lets us track them separately. For example, if you have an e-commerce shop you can have all your product pages in one sitemap and all the category pages in another sitemap, and measure it that way. Add those sitemaps to Google Search Console and see how Google is crawling those particular URLs, get the stats, or get the errors that Google is running into.
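As a minimal sketch of that per-section approach, the snippet below writes one XML sitemap per page type so each file can be submitted and tracked on its own in Google Search Console; the URLs and file names are hypothetical placeholders.

```python
# Minimal sketch: write one XML sitemap per page type so each can be
# submitted and tracked separately in Google Search Console.
# The URLs and file names are hypothetical placeholders.
from xml.sax.saxutils import escape

sections = {
    "sitemap-products.xml": [
        "https://www.example.com/products/blue-widget",
        "https://www.example.com/products/red-widget",
    ],
    "sitemap-categories.xml": [
        "https://www.example.com/categories/widgets",
    ],
}

for filename, urls in sections.items():
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url in urls:
        # Only the canonical URL of an indexable page goes in; <loc> is
        # the only field most search engines pay attention to.
        lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")
    with open(filename, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```

Submitting sitemap-products.xml and sitemap-categories.xml separately then gives separate coverage stats for products and categories.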

When you then go to Google Search Console, and this is something I really like about the new Google Search Console, you can actually filter on all known URLs, on all submitted URLs (which is the combination of all your sitemap URLs), or per individual sitemap. If you haven't checked that out yet, check it out. We can see, per sitemap or aggregated across all sitemaps, what the issues are and which wrong URLs are in our sitemaps. But we can also work the other way around and see what is not in our sitemaps. These are URLs that Google has identified as ready to serve to users while we have not added them to our sitemaps, so we may not even be aware that these indexable patterns are being served to users and impacting our rankings, user signals and everything else. We can get that sample from Google but unfortunately it is just a sample, so we still need to look at our log files, find all the other ones and crawl them. We can also use the URL inspection tool to check whether an individual URL is indexed or not, and submitted or not. That is how we measure the success part.

Now, there are some misconceptions within the industry about sitemaps as well. For one, people say that putting a URL in a sitemap guarantees it being crawled. No, it doesn't. There are no guarantees; it's a suggestion to search engines, not a mandatory rule. The same goes for rankings: sitemaps do not impact rankings at all, so forget that part, we don't want to use them as such. There are also limits that we have to keep in mind. Sitemaps can only be so big, and at that point we have to split them up into multiple sitemaps.
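The sitemaps.org protocol caps each file at 50,000 URLs and 50 MB uncompressed, which is where the sitemap index comes in. The sketch below splits a large URL list into chunked sitemap files plus an index; the base URL and file names are placeholders.

```python
# Sketch of splitting a large URL list into multiple sitemap files plus a
# sitemap index, staying under the protocol limit of 50,000 URLs per file.
# File names and the base URL are hypothetical placeholders.
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000
BASE = "https://www.example.com"

def write_sitemaps(urls, prefix="sitemap"):
    filenames = []
    for i in range(0, len(urls), MAX_URLS_PER_SITEMAP):
        chunk = urls[i:i + MAX_URLS_PER_SITEMAP]
        filename = f"{prefix}-{i // MAX_URLS_PER_SITEMAP + 1}.xml"
        with open(filename, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url in chunk:
                f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
            f.write("</urlset>\n")
        filenames.append(filename)

    # One sitemap index referencing every chunk; indexes must not be nested.
    with open(f"{prefix}-index.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for filename in filenames:
            f.write(f"  <sitemap><loc>{BASE}/{filename}</loc></sitemap>\n")
        f.write("</sitemapindex>\n")

# Illustrative run: 120,000 URLs end up in three sitemap files plus an index.
write_sitemaps([f"{BASE}/product/{i}" for i in range(120_000)], prefix="sitemap-products")
```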

For news, the limit is actually slightly different. As you can see, time is a major factor in those limits: any news article that is more than two days old should not be in your news sitemap. We can also add additional fields to the XML but, as I mentioned earlier, I do run into a lot of resistance to XML sitemaps with developer teams, because they obsess about all the fields they need to fill out. The good thing is that most search engines don't care about any of those fields except for the location, which means you can pretty much ignore the other three, and that makes your job a lot easier. Now, a common mistake I also see is that people include all the URLs of their website, indexable and non-indexable. Ideally, a sitemap should only have the canonicals of the indexable URLs.
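A small sketch of that filtering step, assuming a hypothetical crawl export with url, canonical, noindex and status fields:

```python
# Sketch: keep only the canonical URLs of indexable pages before writing a
# sitemap. The crawl-export record fields used here are assumptions.
crawled_pages = [
    {"url": "https://www.example.com/", "canonical": "https://www.example.com/", "noindex": False, "status": 200},
    {"url": "https://www.example.com/page?session=123", "canonical": "https://www.example.com/page", "noindex": False, "status": 200},
    {"url": "https://www.example.com/private", "canonical": "https://www.example.com/private", "noindex": True, "status": 200},
]

sitemap_urls = sorted({
    page["canonical"]
    for page in crawled_pages
    if page["status"] == 200              # only pages that resolve
    and not page["noindex"]               # only indexable pages
    and page["url"] == page["canonical"]  # only the canonical version itself
})
print("\n".join(sitemap_urls))  # one URL per line is already a valid plain text sitemap
```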

Now, the only exception to that is when you're trying to de-index something and want to track whether that is happening. In that case, create a separate sitemap with just those patterns that you want to get de-indexed, separating them from your normal sitemaps, which should just have the canonicals of the indexable URLs. You also need to keep your sitemaps up to date. Very often I run into situations where you ask a team when they last updated their sitemap and the answer is six months ago. Seriously? A website changes in six months, so you need to keep it up to date, ideally dynamically, whenever it is needed.

Be mindful: don't nest indexes. The specification on the sitemaps.org website says not to do that. Now, some search engines are smart enough to figure it out but, why leave things to chance? Don't nest indexes. Location matters as well: where you put your sitemap matters. If you put your sitemap in a subfolder, then all the URLs in that sitemap should be for that subfolder. The same goes when it comes to different host names. We can get around these restrictions by referencing the sitemap in the robots.txt; this is according to the official documentation on sitemaps.org. Here's an example of where it goes wrong. This is a well known Linux website, gnu.org, where we see a robots.txt on HTTP, and you can see that the sitemap reference is HTTP. When we then check that, we get redirected to the HTTPS sitemap, and there we again see a reference to HTTP. You follow that, and again we are redirected to HTTPS, where all the patterns, including all the hreflang annotations, are on HTTP. So this is a faulty way of making your sitemap; these patterns need to be updated.
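A quick consistency check can catch exactly this kind of mismatch. The sketch below assumes a placeholder sitemap URL and flags entries whose scheme, host name or folder does not match the sitemap's own location:

```python
# Sketch of a consistency check: every <loc> in a sitemap should share the
# sitemap's scheme, host and path prefix (unless the sitemap is referenced
# from robots.txt). The sitemap URL below is a hypothetical placeholder.
from urllib.parse import urlparse
from urllib.request import urlopen
from xml.etree import ElementTree

SITEMAP_URL = "https://www.example.com/blog/sitemap.xml"
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

sitemap = urlparse(SITEMAP_URL)
scope = sitemap.path.rsplit("/", 1)[0] + "/"  # e.g. "/blog/"

tree = ElementTree.parse(urlopen(SITEMAP_URL))
for loc in tree.iter(f"{NS}loc"):
    url = urlparse(loc.text.strip())
    if url.scheme != sitemap.scheme:
        print(f"Scheme mismatch (HTTP vs HTTPS): {loc.text}")
    elif url.netloc != sitemap.netloc:
        print(f"Different host name: {loc.text}")
    elif not url.path.startswith(scope):
        print(f"Outside the sitemap's folder: {loc.text}")
```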

Now, I want to briefly talk about internal linking. Christoph did a great job talking about internal linking already, so I just want to say one thing, and I'm coming back to my issue with HTML sitemaps. If you have no other way of improving your internal linking, an HTML sitemap can be a tool in your toolset to fix some of that. However, it's by far not the best way to go about it, because an HTML sitemap should not support your internal linking; your internal linking should be your HTML sitemap. Every link you put on your website contributes to that. Think about that. Every other sitemap format, on the other hand, does not replace actual internal linking, so you still need to do internal linking. Just because you have a URL in your sitemap does not guarantee it actually gets crawled, and it doesn't get any link juice etc. You still need to actually place internal links.

Now, let's talk very briefly about a couple of situations where sitemaps actually solved other SEO issues. Here we had a website that had no canonicals and was internally using URLs with tracking IDs, and in the Google SERPs it looked horrible; this is not what we wanted. Unfortunately, the CMS was not flexible: we couldn't change the markup, we couldn't add canonicals, which would have solved a lot of the problems. With a custom CMS that is not always possible. Luckily, we were able to change the internal linking, remove the tracking ID and replace it with event tracking in Google Analytics. With that we also added a plain text sitemap and, as a result, we let Google crawl, some time passed for Google to process everything, and yes, all the URLs in the SERPs were now without that tracking ID. So it worked, great, this is what we wanted. We were not able to add canonicals, but we were still able to change the URLs in the Google SERPs. Another situation where sitemaps really helped us out was improving page speed specifically for Googlebot. We had this website which is based here in the UK, with servers in the UK to be close to its users.
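As a rough sketch of that clean-up, assuming a hypothetical trackingid parameter and placeholder URLs, the tracking parameter can be stripped from the known internal URLs and the clean canonicals written out one per line:

```python
# Sketch of the plain-text-sitemap workaround: strip a tracking parameter
# from the known URLs and publish the clean versions, one per line.
# The parameter name "trackingid" is a hypothetical placeholder.
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

internal_urls = [
    "https://www.example.com/page-a?trackingid=nav-menu",
    "https://www.example.com/page-b?trackingid=footer&color=blue",
]

def strip_tracking(url, param="trackingid"):
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunparse(parts._replace(query=urlencode(query)))

clean_urls = sorted({strip_tracking(u) for u in internal_urls})
with open("sitemap.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(clean_urls) + "\n")
```

The resulting sitemap.txt is already a valid plain text sitemap.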

Unfortunately, within Google Search Console we had horrible stats when it came to page speed. So we implemented browser caching to make sure that our responses could be cached for at least six to twelve hours. We implemented a custom edge server and put it in the Google Cloud in the US. We also set up a crawler with a cron job; in this case we used Screaming Frog on the command line so that we could run it as a cron job and, on a day-to-day basis, two or three times a day, crawl the entire sitemap, which we did. We did all this within a Google data centre in the US. By crawling, we cached all the important pages that were in our sitemap, the pages we considered important for the website, and we cached them locally in the US on our edge server. As a result, our page speed for Googlebot significantly improved, while we still had our main server in the UK. So again, this solution was made possible by sitemaps.
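In the talk the warm-up crawl was done with Screaming Frog on a cron job; purely as an illustration, a stripped-down stand-in with a placeholder sitemap URL could fetch every sitemap URL so the edge server keeps a fresh cached copy:

```python
# Minimal cache-warming sketch: fetch every URL in the sitemap so that the
# edge server has a fresh cached copy. In the talk this was done with
# Screaming Frog on a cron job; this Python version is a hypothetical
# stand-in. The sitemap URL is a placeholder.
from urllib.request import urlopen
from xml.etree import ElementTree

SITEMAP_URL = "https://www.example.com/sitemap.xml"
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

tree = ElementTree.parse(urlopen(SITEMAP_URL))
for loc in tree.iter(f"{NS}loc"):
    url = loc.text.strip()
    try:
        with urlopen(url, timeout=30) as response:
            response.read()  # fetching the page lets the edge cache store it
            print(f"warmed {url} ({response.status})")
    except OSError as error:
        print(f"failed  {url}: {error}")
```

Scheduled two or three times a day via cron, this keeps the edge cache primed with exactly the pages listed in the sitemap.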

Another situation where sitemaps solved some issues was with multiple sites. Now, this is something that I run into, unfortunately, way too often: multiple websites on multiple domains, with servers in multiple countries, and that is a challenge. We wanted to add hreflang. The problem was too many stakeholders; they all had different priorities. This was a huge organization with different IT teams in each country, with different servers, and they wouldn't let us edit their websites. We could not add the hreflang tags to their content, to the HTML. What we could do, luckily, was access their databases. They did have databases available, and we could extract all the canonicals from those databases and even filter them by indexable state. We built a sitemap based on that, added the hreflang annotations and put it in a cloud bucket, a Google Cloud Storage bucket, somewhere other than any of those servers. We also added that bucket to Google Search Console, so we could let Google Search Console know "hey, we are related to that particular bucket and our site", and we referenced every single host name's sitemap in its robots.txt, because we generated multiple sitemaps, one for each host name. So in the end, the only thing we changed on all those domains was one line in the robots.txt; everything else we kept fully under our own control. We let Google crawl and yes, we did see positive results when it came to hreflang in the Google search results. So again, sitemaps really can be useful. Just one very last quick thing, to hopefully open your mind to how we can use sitemaps: we can also use the command line, for example, to download all the sitemaps and store all the URLs, so we have all the URLs in one go. We can do this ourselves, we don't need crawlers for this, and we can do it with indexes as well.
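As a rough illustration of the hreflang sitemap described above, with hypothetical host names, locales and database output, a generator could look like this:

```python
# Rough sketch of an hreflang-annotated sitemap built from database records.
# Host names, locales and the record structure are hypothetical placeholders.
from xml.sax.saxutils import escape

# One group per page, so all language versions of that page sit together.
page_groups = [
    {"en-gb": "https://www.example.co.uk/widgets",
     "de-de": "https://www.example.de/widgets",
     "fr-fr": "https://www.example.fr/widgets"},
]

lines = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"',
    '        xmlns:xhtml="http://www.w3.org/1999/xhtml">',
]
for group in page_groups:
    for lang, url in group.items():
        lines.append("  <url>")
        lines.append(f"    <loc>{escape(url)}</loc>")
        # Every language version lists all alternates, including itself.
        for alt_lang, alt_url in group.items():
            lines.append(
                f'    <xhtml:link rel="alternate" hreflang="{alt_lang}" href="{escape(alt_url)}"/>'
            )
        lines.append("  </url>")
lines.append("</urlset>")

with open("sitemap-hreflang.xml", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))
```

In the setup from the talk there would be one such file per host name, hosted in the storage bucket and referenced from each domain's robots.txt.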

There's a nice article that explains that whole command-line approach. Unfortunately, I didn't write it myself, although I would have loved to, so I have to give credit: check out the article. One of the other things that we can do with this, for example, is find mismatches. I have seen situations, going back to that Linux website for example, where we have an estimated 61,000 patterns in the index. Now, keep in mind this is just a sample, an indicator, not a fixed number, but it gives us a bit of an idea of how many URLs there are within the Google index. Then we counted all the URLs in the sitemap and we only got to 4,500. So, 4,500 on the one hand and 61,000 on the other.
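A small sketch of that comparison, assuming a placeholder sitemap URL and treating the estimated index count as the rough indicator it is:

```python
# Sketch: pull every URL out of a sitemap (following a sitemap index if one
# is found) and compare the count against a rough index-size estimate.
# The sitemap URL is a placeholder; the 61,000 figure echoes the talk.
from urllib.request import urlopen
from xml.etree import ElementTree

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url):
    """Yield all page URLs, recursing into a sitemap index if one is found."""
    root = ElementTree.parse(urlopen(sitemap_url)).getroot()
    if root.tag == f"{NS}sitemapindex":
        for loc in root.iter(f"{NS}loc"):
            yield from sitemap_urls(loc.text.strip())
    else:
        for loc in root.iter(f"{NS}loc"):
            yield loc.text.strip()

urls = set(sitemap_urls("https://www.example.com/sitemap.xml"))
estimated_indexed = 61_000  # rough figure from the index estimate, just an indicator
print(f"{len(urls)} URLs in the sitemaps vs roughly {estimated_indexed} indexed")
```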

My guess is they're going to have internal SEO issues. They don't even know all the URLs they have. 61,000, even if it's only an indicator, is way out of the ballpark, so that tells us something. So again, sitemaps can be used in different ways, and I'm hoping that I opened your eyes to that today. Just a couple of quick takeaways: there are multiple formats, so choose the one that works best for you. Ideally you have just the canonicals of indexable URLs in your sitemap. Keep in mind that sitemaps really strengthen your canonical signals and allow you to add additional information like hreflang, image, news, video etc., and that sitemaps can creatively solve other SEO problems. They can assist in solving them, but you need to use them. This is why I love sitemaps. With that, I want to thank you for your attention. My name is Fili Wiese. I used to work for Google Search; nowadays I'm an SEO expert at SearchBrothers. You can find me on Fili.com as well as SearchBrothers.com and SearchBrothers.uk. If you want a copy of the slides, send me an email through the form on either website. Thank you very much.

 

Fili Wiese

Fili is a renowned technical SEO expert, ex-Google engineer, frequent speaker and former senior technical lead in the Google Search Quality team. At SearchBrothers he offers SEO consulting services with SEO audits and SEO workshops, and successfully recovers websites from Google penalties.