Initial Research Study of backlink data for 100 sites
Back in 2008, in the very early days of Analytics SEO, we built our own web crawler and set about trying to crawl the web.
As we crawled more and more weird and wonderful websites, our data grew almost exponentially and we soon realised that in no time at all we would have more servers than chairs in our office!
Given the ever-increasing scale of the task (trillions of URLs and counting), the volume of data to be crawled, processed and analysed meant that crawling the web and licensing the resulting database was going to become even more of a specialist function.
In 2010, we signed a licensing deal with Majestic SEO to incorporate their link data into Analytics SEO for all our customers. We still crawl the web today, and our crawler is much more advanced than it was back then (we’ve seen lots more peculiar sites to help us refine it), but today we focus on spidering client sites only and re-crawling backlink data.
In 2013, there are now a few more specialist link data providers, of which Majestic SEO, Ahrefs and Moz are the major ones. There have been a number of blog posts and articles comparing each provider for coverage, depth and quality; some of them have provoked a lot of debate. However, we felt that most of these comparisons only looked at data for a handful of sites and that, to get a clearer picture of the relative strengths of each provider, a comparison across a much larger sample was necessary. So that’s what we have set out to do.
Today, we’re publishing a summary of our initial findings across 100 sites. But we are already in the process of collecting data for 1,000 randomly selected sites and will follow up with an even more in-depth study in a matter of weeks.
The purpose of publishing the initial study now is to ensure we have thoroughly considered all angles to make sure that our extensive comparison will be as fair and objective as possible. We know how hard it is to successfully crawl the web (after all we gave up doing it) and we completely respect what all three of these companies have achieved. Let’s not forget that without these guys, reporting on our link efforts would be like going into a gun fight with a water pistol!
We wanted to undertake more thorough research across more sites than had been attempted before. But there are obvious limits in terms of what’s practical and feasible to achieve in a reasonable period of time – after all, some sites have millions of links. We decided to exclude these large sites for many practical reasons, one of which was that we wanted to re-crawl all these links and we felt that we might make ourselves unpopular if we made too many simultaneous requests to each site.
In the end, we decided to compare sites that were reported as having fewer than 50,000 backlinks by all data providers; this nicely matched up with the data providers’ API constraints.
What we are doing here is an ‘end-user’ analysis of the total number of links available for selected sites from the three backlink providers and not an analysis of how well their crawlers and indexation work (this has been done before, many times over).
At this stage we’d like to thank the data providers for providing free access to their APIs for the purpose of this research.
We selected 1,000 sites at random from our database. We chose this sample as it enabled us to sense check the data that came back; we also have anonymized Google Analytics data for these sites, so we could start to look at how well each provider actually picked up links that are referring traffic to a given site. From this base we have selected 100 sites at random for this initial study.
We used each provider’s API to get up to 50,000 backlinks for each of the 100 sites.
We re-crawled all the backlinks using our own crawler, followed any redirects in place, and then checked every source URL to see whether it was indexed in Google (this amounted to ~6m checks for just 100 sites!).
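To make the re-crawl step concrete, here is a minimal sketch of the "does this page still link to us?" check. It is not our production crawler: the names `LinkCollector` and `link_is_live` are made up for illustration, fetching the page is left out, and the match is a plain exact-string comparison on the resolved URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute href target of every <a> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative hrefs against the URL of the page itself.
            self.links.append(urljoin(self.base_url, href))


def link_is_live(source_url, html, target_url):
    """Return True if the HTML fetched from source_url links to target_url."""
    collector = LinkCollector(source_url)
    collector.feed(html)
    return target_url in collector.links


# Hypothetical page content, standing in for a fetched source URL.
page = '<p><a href="/about">About</a> <a href="http://example.com/page">x</a></p>'
print(link_is_live("http://blog.example.org/post", page, "http://example.com/page"))
```

In a real run, `source_url` should be the final URL after redirects, so that relative links resolve against the page that was actually served.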
The first piece of this research is based on source URLs only (a source URL can have multiple outbound links), so other link factors such as hostnames, unique domains, IPs, Unique Class Cs or referrals, whilst accounted for in the research, are ignored at this stage. We are interested in the absolute total numbers of source URLs found by all 3 providers. We then get into the total number of links thereafter.
Please remember this is just the initial study; further in-depth research and analysis is already underway on the full data set – so if you have any suggestions or critique then do please comment below.
OK, let’s get stuck into some data!
For the purposes of a quick preview into the full set of data (1,000 sites) we analysed the backlinks of 100 sites (maximum of 50K links per site). Please note, when we talk about Majestic SEO we were only analysing its Fresh Index.
Total number of sites found by data provider (Base n=100).
It seems that all providers have data for the given sites. Could this mean they have 100% coverage? Evidently, this sample size is too small to imply this. But it will be interesting to see how this coverage changes across 1,000 sites (which is admittedly still not a huge sample size given the size of the web).
If each provider has 100% coverage then will this ratio also apply to the total number of links they can serve per site? i.e. 100% of all links available to any given site or page? To find out we then compared the total number of Root Domains and Source URLs (note – not links yet) for each data provider.
Interestingly, Majestic SEO has the highest total number of detected source URLs. Its 1.4 million is about 100,520 (about 8%) more than Ahrefs, a significant gap that works out at about 1,005 per site. So where does the difference come from? And what about Moz? Moz have found the same sites but clearly have a different approach to crawling the sites they find (quality over quantity?).
In terms of root domain numbers it’s a similar story, with Majestic SEO having the most domains (76,033) and Ahrefs again the second highest number (65,798). Moz has fared much better this time round, with a total of 52,174 domains.
Let’s throw something else in the mix: Number of links!
One metric we are all concerned about as SEOs is that golden number of links. Moving on from the number of source URLs we can start comparing some link data.
A pattern is beginning to emerge. Majestic SEO has returned a whopping 2 million links in total (19% more than Ahrefs). What would this number be if we looked at 1,000 sites? There is a difference of 351,578 links between Majestic SEO and Ahrefs and that’s an amount that could hurt your link reports if they were to be missing. Whilst those two battle for the top spot, Moz comes in at a modest 247,876 links. That’s a very small number compared to the others and you might not unreasonably jump to the conclusion that this quickly rules out Moz as an effective source for link data. But let’s not be so hasty.
When you look at the ratio of links found per source URL then Moz does just fine.
It is evident that, of the URLs it chooses to crawl, Moz finds as many links as Ahrefs. Majestic SEO finds slightly more links per source URL, but this could be explained by the differences between how the providers handle link parsing (more on this later).
So let’s see what happens when you re-crawl these links to make sure they still exist.
Interesting. Around 40% of the links reported by both Majestic SEO and Ahrefs no longer exist. Most alarming is the number from Moz, where only 46% of their reported links actually exist and in total numbers that’s only 114K links from a possible 2.3 million live links. That’s a significant gap in numbers. Please remember, this is a random sample of 100 sites and it’s worth waiting for the full 1,000 site study to determine if this is a consistent pattern.
To double check this analysis, we randomly selected 100 backlinks from each data set that were marked as ‘not live’ by our software and then manually checked the URLs to see whether we had missed something in our automated checks or whether there might be other anomalies caused by malformed HTML for example or differences in how links are parsed from the HTML.
To be honest, when we first ran this analysis the numbers looked too low and we found a few instances where we were not finding an exact matching link on the page. For example, Moz were stripping out hyphens in anchor text (unless the anchor text was a URL), whereas neither we nor the other providers were. This meant that we were not finding an exact matching string and were therefore declaring the link as ‘not live’.
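As an illustration of the kind of fix involved, here is a rough sketch of comparing anchor text on a normalised key rather than on the raw string. The helper names `anchor_key` and `anchors_match` are hypothetical, and the rule itself (drop hyphens and whitespace, ignore case, leave URL anchors alone) is our assumption about a reasonable tolerance, not a description of how any provider actually parses anchors.

```python
import re


def anchor_key(text):
    """Build a comparison key for anchor text.

    Assumption: hyphens and whitespace are not significant, except when
    the anchor text is itself a URL, in which case it is left untouched.
    """
    text = text.strip()
    if re.match(r"^(https?://|www\.)", text, re.IGNORECASE):
        return text
    return re.sub(r"[-\s]+", "", text).casefold()


def anchors_match(reported, found):
    """True if two anchor texts are equal once hyphens and spacing are ignored."""
    return anchor_key(reported) == anchor_key(found)


print(anchors_match("cheap car insurance", "cheap-car-insurance"))
print(anchors_match("http://my-site.com", "http://mysite.com"))
```

A looser key like this reduces false ‘not live’ results, at the cost of occasionally treating two genuinely different anchors as the same.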
Whilst we have gone through quite an extensive process to try and make this analysis as accurate as possible, we recognise that there could still be instances where we do not find a matching link because we parse a URL differently to a particular provider. For this reason, we’ll go back to each provider to see whether we can clarify the different ways they parse the data.
We have also re-run the analysis to show URLs which have a live link to the same hostname. This shows you that even though we have been unable to find an exact matching URL, we could find a link on that page to the same hostname. This indicates that either: a) the link has changed; b) our matching rules did not find an exact match and it is actually the same link; or c) there was more than one link on the page and the link we are looking for has been deleted.
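The two-tier check described above can be sketched roughly as follows. The function name `classify_link` and the three labels are invented for this example; the input is assumed to be the list of absolute link targets already extracted from the re-crawled page.

```python
from urllib.parse import urlsplit


def classify_link(found_links, expected_url):
    """Classify a reported link against the links found on the re-crawled page.

    Returns "exact" if the reported URL is present, "same_hostname" if some
    link points at the same hostname, and "not_found" otherwise.
    """
    if expected_url in found_links:
        return "exact"
    # urlsplit().hostname is already lower-cased; it is None for odd inputs.
    expected_host = urlsplit(expected_url).hostname
    if expected_host:
        for link in found_links:
            if urlsplit(link).hostname == expected_host:
                return "same_hostname"
    return "not_found"


# Hypothetical links extracted from one source page.
links_on_page = ["http://example.com/new-page", "http://other.org/"]
print(classify_link(links_on_page, "http://example.com/old-page"))
```

A "same_hostname" result does not say which of the three explanations applies; it only flags the page for a closer look.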
Links change. They get amended, archived, paginated onto different URLs and removed. That’s why, even though all 3 providers do an amazing job of trying to maintain the freshness and quality of the link data, we re-crawl it.
Now that we know how many live links we have, why don’t we check how many are indexed by Google? Why? Because that’s our true indicator as to whether Google accepts these source URLs as valid web pages whose links may well be worth their juice. As a general rule of thumb, if Google has crawled a page, and provided the page isn’t blocked by robots.txt or meta tags, it should include it in its index. If not, then Google has probably deemed the page to be spammy, low quality content, duplicate content or from an untrusted domain. The list goes on… but you get the picture. If it’s not in the index, it’s much less likely that the links are worth anything.
I know what you are all thinking. Look at Moz! As a ratio, Moz detects better quality links than Majestic SEO and Ahrefs. Or does it? Does being indexed in Google mean the link from that page is then acceptable? Probably. Does it make it a good link? No. Not really. It just depends on what we call a good link.
However, given the index rate from each provider I think it’s safe to assume that the more links you crawl, and the deeper you dig, the more spam you find as a result. So having a lower index rate might not necessarily be a bad thing, especially as these are all the links that have been found pointing towards a given site; and not just links that have met Google’s quality guidelines and point towards a given site. As SEOs, this is exactly the sort of thing we need to know when analysing links (at any level), especially if you want to find (and disavow) bad links.
In this small data sample of 100 sites, Ahrefs and Majestic SEO have been very, very close in numbers. Why? Are they picking up the same links?
What happens when you look at Source URLs and filter these by uniqueness? In other words, how many unique URLs has each provider found that the others haven’t? This should be interesting…
My initial reaction was to say, “Wow! I need to be using all 3 data providers!” … That’s until I started adding up the numbers. Instinctively speaking, it just doesn’t feel right that 66% of Majestic SEO’s URLs are unique to it, 64% of Ahrefs’ URLs are unique to it and 36% of Moz’s URLs are unique to it. I mean… it just doesn’t seem realistic, the number of common URLs should be much higher.
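For anyone wanting to reproduce this kind of figure on their own data, the "unique to one provider" share comes down to simple set arithmetic. The sketch below uses made-up provider names and URLs, and `unique_share` is a hypothetical helper, not part of any provider’s API.

```python
def unique_share(urls_by_provider):
    """For each provider, return the fraction of its URLs that no other provider has."""
    shares = {}
    for name, urls in urls_by_provider.items():
        # Union of every other provider's URLs.
        others = set().union(
            *(u for other, u in urls_by_provider.items() if other != name)
        )
        unique = urls - others
        shares[name] = len(unique) / len(urls) if urls else 0.0
    return shares


# Toy data: three hypothetical providers with partly overlapping source URLs.
data = {
    "provider_a": {"u1", "u2", "u3", "u4"},
    "provider_b": {"u3", "u4", "u5"},
    "provider_c": {"u4", "u6"},
}
print(unique_share(data))
```

Note that the result is only as good as the string comparison underneath it: two spellings of the same URL count as two different URLs, which inflates every provider’s unique share.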
This is a key focus area in the forthcoming in-depth research study. So if you are interested in seeing the detailed analysis, register for our newsletter so you can get notified as soon as it is released.
In the next stage of the research, we will expand on the number of sites we are looking at to the full 1,000 sites. This should allow us to firm up our initial thoughts and draw better conclusions on the data presented so far.
Secondly, we will be verifying these raw link numbers at a much more granular level and trying to account for all the different reasons there might be duplicates and other anomalies, both at a URL level and link level. We’ve already seen how different providers handle parsing HTML slightly differently and we’ll be digging further into issues such as:
- URL encoding
- Case (in)sensitivity
- Canonicalisation issues
- The definition and composition of a link, e.g. is a picture and text link in the same anchor tag one link or two? It appears the providers have different views on this!
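To give a flavour of the first three items, here is a small sketch of URL normalisation before comparison. The function `normalise_url` and the specific rules are assumptions chosen for illustration (lower-case the scheme and hostname, drop default ports and fragments, unify percent-encoding in the path); they are not the rules used by any of the providers.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def normalise_url(url):
    """Return a comparison-friendly form of a URL (illustrative rules only)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # hostname is already lower-cased
    port = parts.port
    default_port = (scheme == "http" and port == 80) or (
        scheme == "https" and port == 443
    )
    if port and not default_port:
        host = f"{host}:{port}"
    # Decode then re-encode so "%7Euser" and "~user", or "a b" and "a%20b",
    # compare equal. Caveat: this also turns an encoded "%2F" into "/".
    path = quote(unquote(parts.path), safe="/~") or "/"
    # Path case is kept as-is, because many servers treat it as significant.
    # The fragment is dropped; the query string is left untouched.
    return urlunsplit((scheme, host, path, parts.query, ""))


print(normalise_url("HTTP://Example.COM:80/a%20b#section"))
```

Whether the path should be case-folded, or parameters such as tracking codes stripped, is exactly the kind of judgement call where providers can reasonably differ.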
I believe this will all have a material effect on this analysis. Sorry, but you’ll just have to wait a few more weeks until it’s complete. So in the meantime, before it’s too late, please do comment below with any other suggestions you have or analysis you would like to see.
Just to re-iterate – crawling the web is hard and expensive and the task can only get bigger!
It is extremely difficult for any single provider to excel at all aspects of coverage, depth, freshness and accuracy. For example, if you build as comprehensive an index of as many unique domains as possible, then the depth of your crawl or the frequency of crawling is likely to be compromised.
One area this research doesn’t touch on is the quality and importance of each provider’s authority metrics. For example, it appears that Moz are more concerned with link quality than total number of links. From speaking to the team at Moz, it appears they are using their metrics to help them with that evaluation and to assist in the filtering of spammy links.
In a future research study we might tackle this subject by correlating each provider’s data against PageRank, rankings or total number of organic keywords (or something else?). We’d welcome some suggestions.
In the meantime, all that remains to be said is a big thank-you to Moz, Ahrefs and Majestic SEO for providing us with the API usage for this research; and hats off to them all for rising to the considerable challenges of crawling the web to provide an independent, authoritative source of link data. Until Google gives us access to all our links via an enhanced Google Webmaster Tools API, look no further for your link data!
Majestic SEO API
Stay tuned for the next piece in a few weeks’ time and do add your comments and insights below…
Update: Full review post