"The more I find out, the less I know."

Thursday - October 26, 2006 at 09:37 PM

Traffic Stats: Harder than they Look

Shorter zeFrank: "Hey, why is it that Rocketboom claims 10x the viewership that I have, but Alexa says I get more traffic?"

Shorter Andrew: "Reading an Apache log isn't rocket science, and my numbers are accurate."

As it happens, I know a thing or two about traffic measurement (though I'd hardly call myself an expert), and with all due respect to Andrew, he's answering the wrong question.

Yes, it is true that analyzing an Apache server log isn't that difficult (though it has some pitfalls), but that's not the issue. The issue isn't figuring out how much traffic you get; the issue is proving to a third party (i.e. advertisers) how much traffic you get when you have a financial incentive to inflate the numbers.

I don't want to suggest that Andrew is inflating his numbers--I have no way of knowing one way or the other--but this is a very real problem. It was a problem back with Web 1.0, when advertiser-supported websites were known to inflate their hits by various methods, including forging the log files (it isn't very hard to do, and it's almost impossible to detect).

So a lot of advertisers are rightly skeptical of self-reported traffic statistics from websites.

And I think zeFrank raises a legitimate question when he asks why RB claims ten times the viewers yet at least one independent traffic monitoring site suggests that zeFrank gets more. You can't just casually dismiss a 10x discrepancy like that by saying that Alexa's numbers are crap.

There are lots of reasons why this could happen:

1) Apples and Oranges

There are a gazillion different ways to track web traffic: "hits," "visits," "bytes served," "downloads," etc. None of these are perfect, and definitions of them are not always consistent.
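
To make the apples-and-oranges point concrete, here is a minimal Python sketch that reads a handful of Apache Combined Log Format lines and reports four different "traffic" numbers from the same data. The log lines, addresses, file names, and the `summarize` function are all made up for illustration; they are not from Rocketboom's or zeFrank's logs, and counting distinct client addresses is only a crude stand-in for "visitors."

```python
import re

# Apache Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

# Hypothetical sample data (addresses are from a documentation-only range).
SAMPLE_LOG = """\
192.0.2.1 - - [26/Oct/2006:21:00:01 -0400] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
192.0.2.1 - - [26/Oct/2006:21:00:02 -0400] "GET /style.css HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
192.0.2.1 - - [26/Oct/2006:21:00:03 -0400] "GET /video/ep1.mov HTTP/1.1" 200 9000000 "-" "Mozilla/5.0"
192.0.2.2 - - [26/Oct/2006:21:05:00 -0400] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
192.0.2.2 - - [26/Oct/2006:21:05:01 -0400] "GET /style.css HTTP/1.1" 304 - "-" "Mozilla/5.0"
192.0.2.3 - - [26/Oct/2006:21:10:00 -0400] "GET /video/ep1.mov HTTP/1.1" 200 9000000 "-" "iTunes/7.0"
"""


def summarize(log_text, video_suffixes=(".mov", ".mp4")):
    """Compute several competing traffic metrics from one log."""
    hits = 0
    hosts = set()
    bytes_served = 0
    video_downloads = 0
    for line in log_text.splitlines():
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        hits += 1
        hosts.add(match["host"])
        if match["bytes"] != "-":  # "-" means no body was sent
            bytes_served += int(match["bytes"])
        if match["status"] == "200" and match["path"].endswith(video_suffixes):
            video_downloads += 1
    return {
        "hits": hits,
        "unique_hosts": len(hosts),
        "bytes_served": bytes_served,
        "video_downloads": video_downloads,
    }


if __name__ == "__main__":
    for name, value in summarize(SAMPLE_LOG).items():
        print(f"{name}: {value}")
```

The same six log lines yield four quite different numbers depending on which metric you pick, so two sites quoting "traffic" are not necessarily measuring the same thing.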

It appears that Rocketboom's traffic claim is, essentially, the number of times video files are served. Alexa's ranking is something else--I'm not sure exactly what.

Someone pointed out that on the RB website, every time someone loads a page it loads a video, whereas on zeFrank the videos only load when a visitor actually clicks on the video. zeFrank's behavior is more user-friendly, while RB's will substantially increase the number of video files served (I would guess by a factor of at least two, and likely more). This could account for a significant part of the discrepancy.

2) Sampling Bias (aka Alexa Sucks)

The group of people Alexa samples to gather its statistics is not representative of Internet users as a whole (but then, what is?). This could make a difference if RB and zeFrank attracted substantially different demographics. I don't know if they do, but I seriously doubt that Alexa's non-representative sample could account for an order-of-magnitude error. 25% might be believable, but I'd be more inclined to think that the two shows actually have very similar demographics.

More significant, though, is the fact that Alexa only counts people who visit the site in their web browser. Alexa completely misses anyone who downloads through iTunes or an aggregator like Democracy, or something completely different like RB on TiVo. I don't know what fraction of the total this might be, but I suspect it is large.

What's more, since zeFrank has more interactive stuff on his website, zeFrank viewers are more likely to fire up their browser and go to the comments, etc. Rocketboom is more of an old-media (oops, sorry for the naughty word) model, where you view the video and that's about it. I would guess that people who subscribe to RB in iTunes (or other aggregator) are less likely to visit the website than zeFrank subscribers.

3) Network Caching/Web Proxies

It is not safe to assume that every person who views an online video got it from the host website.

Some ISPs (most notoriously AOL) will cache web pages on their internal network as a way to save bandwidth. When an AOL subscriber requests a page, the AOL proxy server will first check to see if it already has a copy--and if so, the subscriber gets it from the AOL proxy, and the request never hits the host site.

This is a very effective way to save bandwidth, but it can wreak havoc on server statistics. On my company's website, as much as 25% of our traffic comes from AOL, and on the static pages we never see it in our logs (but we know it's there because we still get requests for the dynamic web pages).

However, it is possible to configure your web server to instruct proxy servers (like AOL's) to not cache your page. If Andrew set up his server this way but zeFrank didn't, this could account for a large fraction of the discrepancy.
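
As a rough illustration of what "instructing proxies not to cache" means, the sketch below uses Python's standard `http.server` module to send the usual anti-caching response headers. The `NoCacheHandler` class and the page body are hypothetical; I have no idea how either show's server is actually configured, and a real site would normally set these headers in the web server's own configuration rather than in a script like this.

```python
from http.server import BaseHTTPRequestHandler

# Standard HTTP response headers that ask caches not to store or reuse a page.
NO_CACHE_HEADERS = {
    "Cache-Control": "no-store, no-cache, must-revalidate",  # HTTP/1.1 caches
    "Pragma": "no-cache",  # older HTTP/1.0 proxies
    "Expires": "0",        # an invalid date, which caches treat as already expired
}


class NoCacheHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: every GET response carries the no-cache headers."""

    def do_GET(self):
        body = b"<html><body>Episode page</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        for name, value in NO_CACHE_HEADERS.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, pass the handler to `http.server.HTTPServer(("127.0.0.1", 8000), NoCacheHandler)` and call `serve_forever()`. A well-behaved proxy that sees these headers should forward every request to the origin server, so every view shows up in the logs, at the cost of more bandwidth.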

This works in zeFrank's favor: his viewership may actually be quite a bit larger than he thinks, because a lot of people might be viewing his videos through ISPs which do web caching.

4) Someone's Lying

I started this message by pointing out that the problem isn't figuring out how much traffic you've got. The real problem is being able to prove it to a third party.

Both Andrew and zeFrank are selling advertising on their videos, and so both have a financial incentive to make their traffic numbers look as big as possible. I'm not saying that anyone's being dishonest--I have no way to judge. But it has happened in the past, and there's no doubt it will happen again in the future. If all you're releasing is your own web log analysis, those numbers are automatically suspect if there's money involved. You want to have a trusted third party gathering the data.

It is also possible to be misleading without being out-and-out dishonest: for example, citing your peak daily traffic in a way which implies that it's your average traffic.
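
Here is a small Python sketch of how far apart "peak" and "average" can be. The daily download counts are invented purely for illustration (a fairly steady week with one day where an episode got linked somewhere big), and `peak_vs_average` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical downloads per day for one week; the sixth day is a one-off spike.
DAILY_DOWNLOADS = [40_000, 42_000, 38_000, 41_000, 39_000, 120_000, 43_000]


def peak_vs_average(daily_counts):
    """Return (peak, average, peak-to-average ratio) for a list of daily counts."""
    peak = max(daily_counts)
    average = mean(daily_counts)
    return peak, average, peak / average


if __name__ == "__main__":
    peak, average, ratio = peak_vs_average(DAILY_DOWNLOADS)
    print(f"peak day:    {peak:,}")
    print(f"average day: {average:,.0f}")
    print(f"peak is {ratio:.1f}x the average")
```

With these made-up numbers, quoting the peak day as "our daily traffic" would overstate a typical day by more than a factor of two, without a single false number in the log.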

In summary: this is a much more complicated question than just whether everyone's being honest. There are many different ways to measure traffic, and some of the specific design and configuration decisions made by both zeFrank and RB can lead to wildly different numbers for their respective web traffic even if viewership is similar.

So zeFrank's challenge ("Why does RB claim so many more viewers than I get but Alexa says I'm ahead?") is reasonable. On the other hand, there are many likely reasons which don't imply any dishonesty on anyone's part.

And this is an issue which will continue to come up as long as there's money in this business.
