It’s been a few months since we carried out a smaller piece of research, based on a group of 100 sites. Just to recap, here is a quick rundown of the first study:
- 100 sample sites, selected at random
- Details on 50,000 backlinks per site imported and analyzed from each provider
- ahrefs vs Majestic SEO (Fresh Index) vs MOZ
- In total, we analyzed 4,010,432 links
- Conclusion: Majestic had the most links, whilst MOZ had the highest ratio of actual live links in its index
That was a very small sample of this research and gave us some insights into what we might expect when we analyzed bigger volumes of links. Whereas in the previous research we looked at 100 sites, with this second study we’re going to be looking at 1,000 sites and potentially tens of millions of links. I’ll give you all an idea of the kind of data processing that this might involve:
- Taking 1,000 random sites
- Importing up to 50,000 backlinks max. per site
- From 3 link data providers
- That’s a total of 150,000,000 links to analyze (potentially)!
- Recrawling every link (live, dead, redirect, etc)
- Checking each link against Google’s index
- At least 450,000,000 different queries!!!
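As a quick sanity check on those figures, here is the arithmetic as a few lines of Python. The factor of three operations per link (provider import, recrawl, Google index check) is our reading of the list above, not a documented breakdown.

```python
SITES = 1_000
MAX_LINKS_PER_SITE = 50_000
PROVIDERS = 3
# Assumption: each link costs three operations
# (provider import, recrawl, Google index check).
OPERATIONS_PER_LINK = 3

max_links = SITES * MAX_LINKS_PER_SITE * PROVIDERS
max_queries = max_links * OPERATIONS_PER_LINK

print(f"{max_links:,} links, {max_queries:,} queries")
```

These are upper bounds: they assume every site hits the 50,000-link cap with every provider.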
Overall it has taken us around 6 weeks to complete this research, far more time than we anticipated but given the huge numbers in question, it’s acceptable. We love “Big SEO Data”! Fetching those links, checking for their status (individually), checking their Google indexation, cleaning it all up (de-duping etc), comparing them to identify links in common, unique links and overall health has been a huge task and a massive technical achievement!
Anyway, with all this in mind let’s look at the results and try to get a better understanding as to what is really going on with our links.
We will start off by looking at the total number of links, and drill down into Source URLs and then Root Domains.
The graph below shows the total amount of links claimed by each provider:
Majestic SEO just manages to topple Ahrefs with a total of 13.5 million backlinks detected, about 1 million more than Ahrefs. Moz has detected a significantly smaller number of links, just over 1 million. From this we can safely conclude that in terms of total maximum link coverage it’s a straight fight between Majestic SEO and Ahrefs. Once we start to filter the data by uniqueness and live ratio we expect to see massive changes in numbers.
There are many factors as to why these numbers are so high – and they can perhaps explain why the difference between Ahrefs and Majestic SEO is so small, as well as why the step down to Moz is so big!
Total Links VS Live Links
We checked the number of live links from the total number of links to identify how many of these claimed links were actually “live” and providing any value to their target site. See graph below:
It seems that at least 40% of links reported by the link providers no longer exist:
Bear in mind this number can include pages that have been relocated (paginated) with no redirects, or sites that were temporarily down when they were checked.
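To make the live/dead split concrete, here is a minimal sketch of how recrawl results could be bucketed. It assumes each link has already been fetched and reduced to an HTTP status code (or `None` when there was no response); the function names and the sample data are hypothetical, and a fuller check would also confirm the link is still present on the page.

```python
from collections import Counter


def classify_status(status_code):
    """Bucket an HTTP status code (None = no response) into live/redirect/dead."""
    if status_code is None:
        return "dead"  # timeout, DNS failure, connection refused
    if 200 <= status_code < 300:
        return "live"
    if 300 <= status_code < 400:
        return "redirect"
    return "dead"  # 4xx and 5xx, including sites that were only temporarily down


def live_ratio(status_codes):
    """Share of checked links whose source page came back live."""
    counts = Counter(classify_status(code) for code in status_codes)
    total = sum(counts.values())
    return counts["live"] / total if total else 0.0


# Hypothetical recrawl results for ten links.
sample = [200, 200, 404, 301, None, 200, 503, 410, 200, 200]
print(live_ratio(sample))
```

Note how the 503 in the sample lands in the "dead" bucket, which is exactly the caveat above: a site that was briefly down during the check drags the live ratio down.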
Moz seems to have the lowest ratio of live links from the analysis. Ahrefs, despite its huge link database, seems to have most of its links being reported as live. That is particularly surprising because the more links you store, the more bad/spammy links you would expect to find; yet not only does Ahrefs store huge amounts of data, it also seems to demonstrate that the best part of it is live.
Majestic SEO has the highest amount of absolute live URLs from the analysis; and an impressive 300K (approx.) more than Ahrefs. It seems that if a user needs the absolute highest possible number of total live links, then the analysis so far suggests Majestic SEO is the place to go.
An important factor is the number of unique links per provider. Why? Because it gives us a better picture as to how many exclusive links they can crawl, index and analyse that the others cannot. Also, if we add up all the unique links amongst them we should get a nice little number of absolute unique links across the board, hence the maximum coverage that could be gained if all 3 were combined.
|Metric||Ahrefs||Majestic SEO||Moz|
|Unique Live Links||5,208,630||5,486,308||87,179|
|Unique Live Ratio||66%||67%||17%|
It seems that both Ahrefs and Majestic SEO are yet again streets ahead in the numbers. 66% and 67% is a phenomenally high number given the fact we are looking at only 50,000 backlinks per site (as a maximum for this study). Moz’s ratio is a lot lower but this is not necessarily a negative indicator; in fact it might prove to be just the opposite. Such a low percentage could indicate that Moz is cleaning up links in a way that is causing less duplication.
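For anyone wanting to reproduce this kind of comparison, unique and shared links boil down to set operations. The sketch below is illustrative only: the function name and the tiny data set are made up, and it assumes each link has already been reduced to a single comparable string.

```python
def unique_and_common(links_by_provider):
    """Return links only one provider reported, plus pairwise overlaps."""
    unique = {}
    for name, links in links_by_provider.items():
        # Everything reported by any of the other providers.
        others = set().union(
            *(v for k, v in links_by_provider.items() if k != name)
        )
        unique[name] = links - others

    names = sorted(links_by_provider)
    common = {
        (a, b): links_by_provider[a] & links_by_provider[b]
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    return unique, common


# Hypothetical live links per provider (letters stand in for real links).
links = {
    "ahrefs": {"a", "b", "c", "d"},
    "majestic": {"b", "c", "e", "f"},
    "moz": {"c", "d"},
}
unique, common = unique_and_common(links)
combined_coverage = set().union(*links.values())

print({name: len(found) for name, found in unique.items()})
print(len(combined_coverage))
```

In this toy data set the smallest provider ends up with no unique links at all, because everything it found was also found by a bigger one. That is the same effect that pushes a small index towards a low unique ratio.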
As we collected the data we noticed several interesting things that hugely inflate the link numbers; both the “total link” numbers and the “unique live” links. One of these things is the way a link’s structure (HTML wise) can cause different providers to interpret the link differently and in some cases count a single link as 3 individual ones! Scary stuff. We will delve further into this shortly.
Unique Live Links VS Unique Live Indexed Links
A strong indicator as to the quality of a link is whether or not its source URL (page) is indexed in Google. Being indexed means that the page is crawlable (that’s obvious), easy for Google to access from its site structure and, most importantly, that there’s a reasonably strong chance it’s a unique page in terms of content. Being indexed in Google is not the best indicator of how good the content within the page may be, but it does work as a favourable factor. The more of your links that are found in Google’s index, the better. That’s because they are very likely to be counted towards your link power which, at the end of the day, is what you are trying to achieve in the first place. If a link is not in Google’s index there’s a strong chance that the page where the link resides is not up to Google’s standards; and if it is not, the link won’t be counted towards your total link juice.
Using the unique live link count from each provider and checking those for indexation, we get the following:
When it comes to the indexing ratio it seems Moz has the edge. This is not surprising, however, as Moz’s link database is significantly smaller, and they also focus more on higher quality links and probably have strict link filters in place to iron out the noise.
Again, ahrefs and Majestic SEO are very similar in numbers and could suggest that they are picking up and crawling the same types of links, with roughly the same internal filters and rules. (i.e., how is a link treated; is it counted?; how is it counted?)
Thus far, then, if you’re only concerned with analyzing quality, it seems Moz might give users a good indication as to the overall quality of a site’s backlink profile, even though that profiling might only be a small sample of the real total number of links.
We then crosschecked and matched all common links across all the link providers and noticed some interesting facts:
It’s no surprise that Ahrefs and Majestic SEO have a lot of links in common. What interests me is that Ahrefs has more links in common with Moz than Majestic SEO does. Could this be an indication that Ahrefs, like Moz, picks up more of the top-level quality links? Maybe, but the difference is not significant enough to support this claim. In order to answer this question we would need to look at those links in detail; probably on an individual basis. But because we are dealing with well over 27 million links it would be extremely time consuming to do so, therefore we need to take a slightly different approach and tackle this from another angle.
A lot of these numbers do come down to the way each provider interprets a link, the way they go about crawling the web and the way they classify each link, all of which makes a huge difference in terms of what they each report back to users. We will show you some of the rules and filtering we had to carry out in order to clean up some of the “noise” and make this test as accurate as possible.
Source URLs and Links
To further understand how reliable the link numbers are, we can look into how many source URLs each of the providers have picked up instead. Why is this important? Because it will highlight the average number of links that each provider picks up from the same page. The indication here is based on the assumption that too many links detected from the same page could suggest link duplication. This can help to highlight how much duplication/variation can exist when interpreting a single link.
We then repeated this process with a step-up to root domain level, comparing source URLs to root domains to find out the average number of source URLs per root domain. This could again be an indicator of possible duplicates/variations of the same URL being counted multiple times, hence inflating the total number of links.
The higher up the tree we get, the less duplication we expect – and this allows for a fairer comparison, as well as allowing us to pinpoint where the numbers could start to skew due to duplication of URLs and links.
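The two ratios described above can be sketched as follows. This is a rough illustration, not the exact method used for the study: the function name and the sample links are hypothetical, and the hostname is used as a stand-in for the root domain (a proper root-domain extraction would need a public suffix list).

```python
from urllib.parse import urlsplit


def duplication_ratios(links):
    """links: iterable of (source_url, target_url) pairs from one provider."""
    links = list(links)
    if not links:
        return {"links_per_source_url": 0.0, "source_urls_per_domain": 0.0}

    source_urls = {source for source, _ in links}
    # Assumption: hostname approximates the root domain.
    hosts = {urlsplit(source).hostname for source in source_urls}

    return {
        "links_per_source_url": len(links) / len(source_urls),
        "source_urls_per_domain": len(source_urls) / len(hosts),
    }


# Hypothetical data: 4 links, 3 source URLs, 2 domains.
sample_links = [
    ("http://example.com/a", "http://target.com/"),
    ("http://example.com/a", "http://target.com/"),
    ("http://example.com/b", "http://target.com/"),
    ("http://other.org/x", "http://target.com/"),
]
ratios = duplication_ratios(sample_links)
print(ratios)
```

A links-per-source-URL figure well above the other providers' is the warning sign we are looking for: it hints that the same link is being counted more than once on the same page.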
We expect the number of Root Domains to be duplication free, as there is no real way to misread a root domain.
Source URLs can be duplicated. This is because they can be formed with variations of upper and lower case, trailing slash, tracking IDs, Canonical URLs etc… So we do expect to see some inflated numbers here.
Just like Source URLs, links can be duplicated for various reasons. Not only will they inherently be duplicated if the Source URL they reside within is duplicated, but they also bring their own attributes which could further skew the numbers. Attributes such as a link with a title, a picture embedded within a text link, or an anchor text with a picture and a title in the same element can all cause duplication. We have found that the link providers do interpret the same link in multiple ways, thus duplicating and counting the same link twice or more. See table below:
|Total Root Domains||Total Source URLs||Total Links|
The table shows some interesting figures. It seems that each provider is better at finding (or duplicating) certain elements in the overall process than others.
Majestic SEO seems to have found more Root Domains than the others, but Ahrefs has identified more Source URLs than Majestic SEO; and that’s from a smaller pool of Root Domains. Not only that, but Majestic SEO reports more links than ahrefs, even though it found fewer Source URLs than ahrefs, which might mean it’s duplicating more links, whereas ahrefs could be duplicating Source URLs. If we look at the graph by ratios this becomes clearer:
It looks like ahrefs finds more pages per site than the other link providers. At 37 pages per site, they find on average 5 more pages than Majestic SEO.
This could be interpreted in various ways. It’s either a representative number of the crawler’s depth, the length of time spent crawling a site, the frequency of the crawl or it highlights how many more duplicate pages they find and index. This is impossible to know unless the URLs are looked at individually; but because there are almost 10 million in total, we will have to settle for educated assumptions.
However, it seems that Majestic SEO finds a lot more links per page than Moz and ahrefs. This seems to me more of a case where the same link is being counted twice or more, simply because all 3 providers are dealing with the same set of sites – so it’s highly unlikely that Majestic SEO has found more links than the other providers on the same page.
What’s even more interesting is that ahrefs and Moz found exactly the same amount of links per page, thus indicating that Majestic SEO might be treating certain link types or combinations on given pages differently to Moz and Ahrefs.
It’s very interesting to see where the numbers begin to overlap and bloat. But what if we took those ratios and readjusted the total number of links by normalising the ratio of links per page? In other words, what would happen to the number of links if all providers were picking up the same amount of links per page? We can achieve this by taking the total number of links per provider, and readjusting this value based on the normalised ratios. See below.
|Current ratio of links per page||Total Links||New Ratio of links per page||New Total Links||Difference|
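The readjustment can be expressed as a one-line formula: scale the total by the new ratio divided by the current one. The sketch below shows our reading of that step; the function name is ours and the figures are invented purely to show the arithmetic, they are not the study's numbers.

```python
def normalise_total_links(total_links, current_ratio, new_ratio):
    """Rescale a link total as if the provider found new_ratio links per page."""
    adjusted = round(total_links * new_ratio / current_ratio)
    return adjusted, total_links - adjusted


# Hypothetical provider: 1.2M links at 3.0 links per page,
# normalised down to 2.5 links per page.
new_total, difference = normalise_total_links(1_200_000, 3.0, 2.5)
print(f"{new_total:,} links after normalising, {difference:,} removed")
```

A provider whose ratio already matches the normalised value keeps its total unchanged, so only the provider with the inflated links-per-page figure loses links.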
Once we have accounted for and subtracted the extra links from Majestic SEO that we assume are duplicated, we get a new stacking order. Due to this, Majestic SEO loses a massive 1,809,911 links, and drops to no. 2, behind Ahrefs. The number that worries me the most, however, is the amount of Source URLs per site reported from Moz. Even as an average, it is unrealistic to think that Moz can only find 4 pages per site. There are a few possible explanations for this:
- There is some data missing in the API (from Moz’s side) – they purposefully leave data out
- The API is being queried in some incorrect manner, or something else is missing (although we checked and double checked this, and also have no problems getting data for Majestic SEO and Ahrefs)
- Moz is picking up on the extra Source URLs but choosing to omit them because they don’t pass certain quality guidelines?
- Perhaps they do pick up on all pages from a site but in certain circumstances they could fail to index any pages on sites where there could be crawl related restrictions, or sites built using something Moz doesn’t like to crawl. Thus this could skew the number of pages crawled quite badly.
- Moz could handle 302s, 301s, 404s differently to Majestic SEO and Ahrefs. So whenever “Rogerbot” (Moz’s crawler) stops at a certain page, Majestic SEO and Ahrefs may very well decide otherwise.
- Moz will not follow a “no follow” link, or a 403 error.
Food for thought! Anyone from MOZ care to elaborate?
The donut charts above show how the total link numbers are spread across the main 3 areas we were focusing on (dead links, unique live links and live links in common). Bear in mind that the “live link in common” is a metric which is relative to each of the other providers; hence why Moz has such a low “unique live links” percentage, because most of their links have been picked up by the other two providers, which have far bigger numbers.
From the outset it seems that ahrefs looks like it’s the “overall” safe place to do your backlink analysis as it has a little of everything (quality links and volume) – however in absolute number terms, Majestic SEO will give you the greatest coverage as it has the highest number of unique live links out of all three providers.
Data clean up
In the process of compiling all this data we stumbled upon a few issues. What we realised is that the way each provider crawls a site, what they store and how they store it, the cleaning they seem to carry out on their collected data behind closed doors, the way they interpret a link and how they handle link types, duplicate URLs and canonicals all heavily influences what the end user ends up seeing. There is a substantial amount of data cleaning and sorting that happens before the user ever gets to see a figure. Collecting, storing, ordering and serving these links is a very difficult task. Crawling the web for links is difficult, technical and never 100% accurate.
Here are some of the causes we found where links can be counted twice or more from the same source:
The monotonous “ / “
Yes. The trailing slash, or forward slash, has made life very difficult for us indeed. You see, this little fellow pops up all over the place and really complicates matters when it comes to comparing exact match strings. All three providers seem to handle this differently and we encountered many cases where provider A would count a source URL as www.mysite.com/page/; whereas provider B would instead prefer the non-trailing slash version of the same URL www.mysite.com/page. When we then compare these URLs they are treated as being unique; when in reality they are the same link. This also skews our data when looking for common/unique links amongst them.
Problem caused? Inflates the “unique link” numbers as well as the “links in common” numbers dramatically. Also counts and duplicates the same links twice or more, inflating overall numbers considerably.
Case sensitive URLs
Case sensitivity has also been an issue. The link providers handle this very differently as well. So a link built like this www.ThisIsMysite.com/index is treated as being different from its all lowercase version of www.thisismysite.com/index, and this creates duplication. Now, imagine we also throw in the trailing slash issue into the mix? How many possible variations of the same URL can we now have? At least 4. I think you can see where I’m going with this.
Problem caused? Further inflates numbers by duplicating links. Both the “unique link” count as well as the “links in common” are affected. Total link numbers are hurt too.
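The two issues above (trailing slash and case) can be neutralised with a small normalisation step before comparing URLs. This is a minimal sketch under our own assumptions: the function name is hypothetical, and lowercasing the whole URL is a deliberate simplification, since on a case-sensitive server two paths that differ only in case could genuinely be different pages.

```python
def normalise_url(url):
    """Collapse trailing-slash and case variants into one comparison key."""
    # Assumption: case differences and trailing slashes never distinguish pages.
    return url.strip().lower().rstrip("/")


# The four variants of the same URL described above.
variants = [
    "www.ThisIsMysite.com/index",
    "www.ThisIsMysite.com/index/",
    "www.thisismysite.com/index",
    "www.thisismysite.com/index/",
]
distinct = {normalise_url(url) for url in variants}
print(distinct)
```

All four variants collapse to a single key, so they count as one source URL instead of four when looking for unique and common links.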
Text link, Picture link, or both?
This was a really difficult one to spot, but spot it we did. It seems the mark-up of a link and how this mark-up is interpreted can greatly affect how it’s reported to the end-user. Here is an example:
<a href="www.mysite.com"> <img src="logo.jpg" alt="Our Logo" /> Home </a>
And this is how the HTML link above can be interpreted by the link providers:
- Provider A = 1 text link
- Provider B = 2 links. 1 text and 1 image
- Provider C = 2 links. 1 anchor text that reads “Our Logo” and 1 image link
This is where the complication really starts. We have seen this pattern across all three providers in different situations. Sometimes it’s a case where the alt text of an image is considered the actual anchor text. Others split the two out and count a text and an image as two individual links even though they sit within the same element.
We have also seen other ways in which links are ‘coded in’ that throw up different readings from the providers. It is extremely hard to account for them all and we are certain that if we kept digging, we would find even more variations of links/html/mark-up combinations that would be handled in their own way by each provider, in turn adding further mist to the numbers they provide. This also makes it very hard to compare providers like for like.
Problem caused? Further inflates numbers and creates duplication. Combined with some of the other causes above it is safe to say that the numbers and comparison can be somewhat hard to interpret at first.
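To show why the example link should arguably count once, here is a small sketch using Python's built-in `html.parser`. It records one entry per `<a>` element and keeps the text and any image alt text as attributes of that single link. The class name is ours, and this is just one possible counting rule, not how any of the three providers actually work.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Collect one record per <a> element, with its text and image alt text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self._current = {"href": attrs["href"], "text": "", "image_alts": []}
        elif tag == "img" and self._current is not None:
            # An image inside the anchor is part of the same link.
            self._current["image_alts"].append(attrs.get("alt") or "")

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


html = '<a href="www.mysite.com"> <img src="logo.jpg" alt="Our Logo" /> Home </a>'
parser = AnchorCounter()
parser.feed(html)
print(len(parser.links), parser.links)
```

Counted this way, the example yields one link with the anchor text "Home" and one image alt text, rather than two separate links.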
When comparing links we tried our best to clean up as much noise as possible to keep things as consistent as possible. We accounted for the “/” problem by ignoring it and counting links with and without a trailing slash as one. This returned positive results.
We further normalised numbers by discounting case sensitivity where possible; I say this because I am sure we haven’t encountered all possible scenarios and variations of URLs that could cause artificial duplication. We had to process over 27 million links, and sense checking those individually is impossible so we randomly checked links and noted the differences. But just like with anything else in this analysis, the more we looked, the more we found. It was just a matter of where we drew the line.
Conclusion from the numbers
The research was based on 1,000 sites, randomly picked from our database. We had to limit the sites we picked as the API limit on one of the providers was 50,000 links per site. So we had to choose sites that had 50,000 links or less. It would have been nice to be able to analyze bigger sites with probably higher quality links (which would probably provide different results).
There are a number of reasons why this research can never be 100% accurate (as per the above), but we did carry it out as fairly as possible and cleaned the data as much as we could. We tried to look at this data from a user’s perspective and only look at links supplied to us, and compared them as such. However, when querying the data we did make sure we called upon the right parameters:
- Mentions were excluded
- Only fresh indexes were included
- Deleted links were excluded
- All link types were included (as would be served up to a user; Video, pictures, text)
Based on all this, the numbers are mean averages, and without looking at and analyzing them at a link level it is virtually impossible to make accurate judgments. What these numbers do provide is a good idea of where each stands… as well as which might be best for you given your budgets or other constraints.
Summary of Providers
In alphabetical order:
- Ahrefs – they possess a high volume of links, have the deepest crawl per site and the least amount of suspected duplication. They are an overall good contender.
- Majestic SEO – has the highest amount of links available, unique root domains and the highest amount of unique live links. But we suspect some link duplication could be taking place. Excellent for understanding of landscape numbers.
- MOZ – clearly missing link numbers, but they seem to focus on link quality and have the highest ratio of links indexed in Google. Very good for measuring link quality.
Unfortunately there is no be-all and end-all when it comes to a link analysis platform. Ahrefs, Majestic SEO and Moz all have their strengths and weaknesses and thus can help you in various ways, dependent upon what type of analysis you are after. In an ideal world you would want to work with all 3! But as that isn’t realistic for most people, then identifying what you need from your analysis, how far along your SEO campaign is, how much budget you have and how strong your marketplace is will determine which platform you should be using. Either way, you will be looking at solid data whichever way you go.
NB: the data for this research was collected between April and July 25th, 2013; ahrefs switched to using a new index on the 21st August.