In the world of nature and in the world of manufacturing, extreme skewness is essentially unheard of. Outliers, though they occur from time to time, are rare, and therefore the “> 30” rule of thumb given in many stat textbooks (some books advise “> 25”) is safe whenever we are dealing with data distributions in most real-world situations.
For readers unfamiliar with what I’m talking about, the subject is Gaussian (a.k.a. normal) distributions and the closely related Central Limit Theorem, hereafter abbreviated CLT.
The rule of thumb says that the CLT can be safely applied whenever the sample size, n, is greater than 30 (or 25, depending on your textbook). The CLT, in turn, says that the sampling distribution of the sample mean is well approximated by a normal distribution centered on the true population mean, with a standard deviation that is smaller than the population’s standard deviation by a factor of the square root of n.
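To see the square-root-of-n claim in action, here is a minimal simulation sketch. It assumes a mildly skewed population (an exponential distribution with mean 1 and standard deviation 1); the function name `sample_mean_spread`, the sample size of 36, and the number of repetitions are my own illustrative choices, not anything prescribed by a textbook.

```python
import math
import random
import statistics

def sample_mean_spread(draw, n, reps=5000, seed=0):
    """Standard deviation of `reps` sample means, each computed from n draws."""
    rng = random.Random(seed)
    means = [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]
    return statistics.stdev(means)

# Assumed population: exponential with rate 1 (mean 1, standard deviation 1).
def draw_exponential(rng):
    return rng.expovariate(1.0)

n = 36
observed = sample_mean_spread(draw_exponential, n)
predicted = 1.0 / math.sqrt(n)  # population sigma divided by sqrt(n)
print(f"observed spread of sample means: {observed:.4f}")
print(f"CLT prediction (sigma / sqrt(n)): {predicted:.4f}")
```

For a well-behaved population like this one, the two numbers should land close together, which is the whole reason the rule of thumb works as often as it does.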
I know what you’re thinking. Bore, snore, zzzzzz. Who cares?
Well, the correct (or incorrect) application of the CLT can have huge consequences in situations that require parameter estimation in the face of uncertainty.
As I said above, the CLT can safely be applied whenever n > 30 in data from most real-world situations. However, two important exceptions are (1) the world of technology and (2) the world of banking and finance. Are you seeing where this could get us into real trouble?
There have been a few human beings taller than 7'6", but there has never been a human being 700 feet tall. However, multiple-order-of-magnitude differences for outliers do occur in the worlds of technology and banking/finance. Amazon.com receives hundreds of millions of page views per day, which is about 7 orders of magnitude higher than the Zocs.org blog you are reading, and a billionaire (of whom there are now about 500 in the U.S.) is 4 orders of magnitude above where most of us are on the wealth scale.
Sensible parameter estimation means that we compute not only a point estimate of whatever it is we are trying to estimate, but also a confidence interval, that is, a range of values that we are fairly sure contains the true parameter. For example, if we want to know President Obama’s job approval rating, we can poll 100 randomly chosen adults and simply ask them their opinion. Some will say they approve, and some will say they disapprove. The margin of error (for a 95% confidence interval) with a survey that small is about plus or minus 10 percentage points, which means that if 42 people in our survey say they approve of President Obama’s job performance, we can be 95% confident that the true percentage, nationwide, is somewhere between 32% and 52%. If we want a more precise poll, we need to interview more people. With a sample of 1000 people, the margin of error would be much smaller, only about plus or minus 3 percentage points.
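The polling arithmetic above can be checked with a few lines of code. This is a sketch using the standard normal-approximation formula for a proportion, z * sqrt(p(1 - p) / n) with z = 1.96 for 95% confidence; the function name `margin_of_error` is just a label I made up for the example.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.42  # 42 of 100 respondents approve, as in the example above
moe_100 = margin_of_error(p_hat, 100)
moe_1000 = margin_of_error(p_hat, 1000)

for n, moe in ((100, moe_100), (1000, moe_1000)):
    low, high = p_hat - moe, p_hat + moe
    print(f"n = {n:4d}: margin = {100 * moe:.1f} points, "
          f"interval = {100 * low:.1f}% to {100 * high:.1f}%")
```

Note that the margin shrinks with the square root of the sample size: ten times as many interviews buys only about a threefold improvement.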
However, we cannot use a random sample of hit counts from 100 web pages to estimate the mean hit count per page for the Internet as a whole, not even with a large margin of error, since the Internet exhibits extreme skewness. That is to say, we could compute a point estimate, but it would be worthless, since the confidence interval we compute will usually not give us a true picture of our level of knowledge. If our sample of 100 happens to include Facebook and Amazon, we will wildly overestimate the parameter, whereas if (as is much more likely) our sample excludes Facebook and Amazon and Yahoo and Google and all the other top sites (which together account for the bulk of page views), we will dramatically underestimate the parameter. There are billions of web pages on the Internet, after all, which means that a random sample of 100 has practically no chance of including those big ones.
Increasing the sample size doesn’t help much, either, since the confidence intervals will still be misleading and just plain WRONG. If you say you’re 95% confident that the true value of a parameter lies between 1.08 and 2.70, and the true value is 36,521, then you’re WRONG. Not just a little bit wrong, but colossally and embarrassingly wrong.
The problem is that a relatively small number of websites account for most of the page views. If you want to estimate the mean hit count per page, or the total number of page hits, you’re going to have to use a measuring technique that accounts for the big players separately from the little ones. And that is feasible, in the case of web pages, since we happen to know who the big players are. But what if the parameter we are trying to estimate is something truly unknowable, such as the probability of a general financial collapse?
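To make the confidence-interval failure described above concrete, here is a rough simulation sketch. It counts how often a textbook 95% interval (sample mean plus or minus 1.96 standard errors) actually contains the true mean. The heavy-tailed population is an assumption on my part: a Pareto distribution with shape 1.1, chosen only as a stand-in for extremely skewed data, not as a fit to real page-view counts. The function name `ci_coverage`, the sample size of 100, and the number of repetitions are likewise illustrative.

```python
import math
import random
import statistics

def ci_coverage(draw, true_mean, n=100, reps=2000, seed=1):
    """Fraction of nominal 95% intervals (mean +/- 1.96 * s / sqrt(n)) covering true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        mean = statistics.fmean(sample)
        half_width = 1.96 * statistics.stdev(sample) / math.sqrt(n)
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / reps

# Mildly skewed population: exponential with mean 1.
mild = ci_coverage(lambda rng: rng.expovariate(1.0), true_mean=1.0)

# Extremely skewed population (assumed for illustration): Pareto with
# shape ALPHA and minimum value 1, whose true mean is ALPHA / (ALPHA - 1).
ALPHA = 1.1
heavy = ci_coverage(lambda rng: rng.paretovariate(ALPHA),
                    true_mean=ALPHA / (ALPHA - 1))

print(f"coverage, mildly skewed data:    {mild:.1%}")
print(f"coverage, extremely skewed data: {heavy:.1%}")
```

With the mildly skewed data, the intervals should cover the true mean at close to the advertised rate. With the Pareto data, most samples miss the rare giant values, so both the sample mean and the sample standard deviation come out too small, and the "95%" intervals cover the true mean far less often than advertised.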
The web page example above was an example from the world of technology. For an example from banking and finance, think of the extreme skewness and extreme outliers seen in wealth and profits. Two people, Bill Gates and Warren Buffett, all by themselves, have approximately as much wealth as the lower 50% of Americans combined (approximately 160 million people). In 2013, the top 10 most profitable corporations in America earned roughly one-seventh of the total profit that was earned by all businesses, and remember that when you count all of the dry cleaners and gas stations and CPAs and tutors and hair salons in America, you’re talking about tens of millions of businesses.
Therefore, the simple rule of thumb (n > 30 or n > 25, depending on your textbook) is not appropriate for the worlds of technology, banking, and finance. What should we call people who use the CLT or Gaussian models to perform economic risk analysis (think: the “geniuses” who gave us the catastrophe of 2008, which we are still digging out from)? Eight million people lost their jobs in the U.S. alone in the aftermath of that catastrophe, and some of the victims are still unemployed or underemployed today. Here are my suggestions for what to call the perpetrators: overconfident, arrogant, ignorant, and above all, overpaid.
Not a single one of them went to jail. The arrogance and overconfidence they exhibited are still with us, and the banking reforms that Congress implemented to try to prevent a similar catastrophe in the future have been less than successful. It’s quite likely that the whole cycle will be repeated, on an even larger scale.
The quants (quantitative analysts) on Wall Street weren’t necessarily arrogant and overconfident, but the people who listened to them uncritically and ignored the associated caveats certainly were. An awful lot of the quantitative analysis involved in the overleveraged investments of the mid-00s was based on Gaussian models and the CLT.
There were other bad assumptions, too: assumption of independent events when computing risk, assumption that real estate appraisals and investment ratings were made in good faith (when frequently they were made with conflicts of interest), assumption of ability of borrowers to repay loans, etc. There was also a good deal of abdication of due diligence in evaluating investments, not to mention outright fraud. But the misapplication of the CLT is right up there.
I would claim that there’s nothing inherently wrong with arrogance and overconfidence.
Shoot, if we didn’t have arrogance and overconfidence, we wouldn’t make much progress as a species. Every significant advance requires someone wildly arrogant and overconfident (and, usually, incredibly lucky) to
The American way is that arrogant and overconfident people should be paid what they deserve. If they are taking extreme risks, they should lose their investment most of the time, and every once in a while, they should have a spectacular success and earn a lot of money. That’s fair. That’s the way it should be.
My objection is that the arrogant and overconfident people who gave us the Great Recession of 2008 didn’t lose their shirts. Their income and wealth are up dramatically since 2008. Effective tax rates, including those resulting from the “temporary” Bush-era tax cuts, are at near-record lows for both wealthy individuals and corporations.
As for banking profits and corporate profits in general, they are both at all-time record highs. The reason? We, the taxpayers of America, bailed out AIG (and, by extension, Goldman Sachs) and the big banks when the whole system was approaching a total meltdown in 2008. You'd think that the least they could do for us would be to give the U.S. Treasury a hugely generous return on investment.
(Yes, yes, I know that the government ultimately made a good profit on the AIG bailout and the Fannie Mae/Freddie Mac overhaul, as well as the bailout of the “too big to fail” megabanks. But considering that none of those organizations would be employing anyone today if we had not bailed them out in 2008, it seems to me that we should be entitled to hundreds of billions or even trillions of dollars in compensation, not the paltry tens of billions we received. And the former head of AIG, Hank Greenberg, now has the gall to sue the U.S. Government for $25 billion, claiming that the bailout, the one that saved his company and kept his shares from being worth exactly $0, was illegal.)
We had no choice but to bail the scoundrels out in 2008. That’s right, we had no choice. We had to bail them out, because the alternative would have been a global financial crisis that would have made the Great Depression look like a summer breeze.
However, we don’t have to let history repeat itself. Fool me once, shame on you (though I apparently can’t send you to jail). Fool me twice, shame on me.