Regression to the Mean
History
In the late 1800s, Francis Galton was studying how the height of a son relates to the height of his father across a large population. He found that if the father was tall compared to the mean (average) height of the population, the son tended to be shorter than the father, and that if the father was short compared with the average male height of the population, the son tended to be taller than the father.
At first glance this seems contrary to what genetics would predict, namely that the son’s inherited traits would make him similar to the father. The son does resemble the father, but only in part. An unusually tall father owes his height partly to inherited factors and partly to chance, and only the inherited part is passed on. His son therefore tends to be taller than average but less extreme than the father, landing somewhere between the father’s height and the population average. It is not that one tall person in the family forces the next to be shorter in order to balance the books; each son’s height is simply pulled, on average, toward the mean of the general population. Galton described this as “regression toward mediocrity” and went on to develop some sophisticated math tools and techniques that grew into what is now called “regression analysis”.
At first this sounds like a remarkable discovery, but on closer examination it is just common sense. Let’s look at what would happen if this “regression toward mediocrity” did not happen. Let us assume that a son’s expected height were exactly his father’s height. This would mean that the occasional tall father would have tall sons. Unless they all grew to exactly the same height and then stopped, we can guess that the occasional son would be taller than his father. But if we follow the same expectation, that son would also have tall sons, some of them taller still. If we extend this for a few dozen generations, we end up with people dozens of feet tall.
This would also mean that heights in the general population would not stay within a stable range around a level average but would spread ever wider, with the tallest family lines growing taller with each generation. Since that has not happened in all of history, there must be something wrong with the simple expectation of genetic inheritance. Galton’s discovery does, in fact, apply to the general population, and it has been found to apply to nearly everything that has an “average” value for some aspect of its description.
>>It should be noted that over the past few centuries there has been a very slow rise in the average height of the general population - men and women - but it is due to an overall improvement in nutrition and health care rather than to genetics.<<
Law of Math
Regression to the mean is a statistical phenomenon that shows up throughout nature. It occurs whenever a measurement (for example, the height of a man) is partly the result of chance: individuals who measure far from the average tend, on the next related measurement, to come out closer to it. The net effect is that the follow-up to a very low measurement tends to be higher, and the follow-up to a very high measurement tends to be lower. It is important to note that regression is always toward the population mean of a group. That implies that there is a reference value, called the “mean”, that is an intrinsic part of every group of anything that can be measured.
This intrinsic reference value, if known, allows you to describe each and every individual in the group as being either above or below that value - above or below the mean or average of the group. A good example of this is the testing of college students. Every student is tested and given a score that is relative to the scores of the entire population that takes the test. If you placed in the 90th “percentile”, that means you scored as well as or better than 90 percent of the people tested; only about 10 percent did better. In this case, you are not being compared to a perfect score on the test; rather, the comparison is against the scores made by everyone else who took the test. Scoring of this kind is called “grading on the curve”. The one student who makes the highest score is the “curve setter”, and all the rest are scaled according to how many made each score, so that in the end you have the familiar “bell curve” of scores for the entire population. The center of that bell curve is the intrinsic reference value for that test and that population of students.
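It has been said that regression toward the mean is a phenomenon that is similar to several everyday expressions such as “law of averages”, “things will even out” or “we are due for a good day after a string of bad ones”. And one that I would like to add is “it can’t possibly get worse (or better) than this!” Basically, what all these phrases are saying is that “extreme experiences tend to be followed by less extreme experiences”. One caution applies: a string of bad days does not make a good day any more likely than usual. It only means that the next day will most likely be closer to average, which feels like an improvement after an extreme run.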
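The father-and-son pattern described above can be imitated with a small simulation. This is only a sketch under stated assumptions: the average height of 69 inches, the spread of 2.5 inches and the father-son correlation of 0.5 are illustrative numbers chosen for the example, not Galton's data, and the function name simulate_pairs is made up for this illustration.

```python
import random
import statistics


def simulate_pairs(n, mean=69.0, sd=2.5, r=0.5, seed=1):
    """Return n (father, son) height pairs in inches.

    mean, sd and r are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        inherited = rng.gauss(0, 1)  # father's deviation from the mean
        chance = rng.gauss(0, 1)     # the son's own share of luck
        father = mean + sd * inherited
        # The son keeps only a fraction r of the father's deviation;
        # the rest of his height comes from chance.
        son = mean + sd * (r * inherited + (1 - r ** 2) ** 0.5 * chance)
        pairs.append((father, son))
    return pairs


pairs = simulate_pairs(20000)

# Look only at unusually tall fathers (over 6 feet).
tall = [(f, s) for f, s in pairs if f > 72.0]
avg_father = statistics.mean([f for f, _ in tall])
avg_son = statistics.mean([s for _, s in tall])

# The population as a whole keeps the same spread from one generation to the next.
sd_fathers = statistics.stdev([f for f, _ in pairs])
sd_sons = statistics.stdev([s for _, s in pairs])

print("tall fathers average:", round(avg_father, 1))
print("their sons average:  ", round(avg_son, 1))
print("spread of fathers and sons:", round(sd_fathers, 2), round(sd_sons, 2))
```

In a run of this sketch, the sons of tall fathers come out taller than the population average but shorter than their fathers, while the overall spread of heights stays about the same in both generations. Nothing forces any individual son to be shorter; the pull toward the mean appears only in the averages.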
Formal Math
Because regression can be applied to so many aspects of life and events, it has grown into a highly developed branch of statistics called Regression Analysis. It uses some very sophisticated methods to look at what might otherwise be viewed as almost totally random data. I will not turn this into a mathematics textbook, but I think it is important to understand that certain kinds of information can be described with the precision of calculated numbers. Such calculations put relative quantities on the choices that we are faced with in our daily lives. When properly applied, they can be used to show us the favorable choice from among some very complex alternatives, for instance in the lottery, betting, sports, politics and hundreds of other areas that require us to evaluate choices.
Below I have summarized some of the mathematical procedures that can be applied to the decision-making process. Don’t be confused or distressed by the complexity of these functions. You will see that, like the basic concept of regression, much of it is logical and common sense once you know how to look at it.
Finding the Mean
If you have a lot of data and want to find the closest consistent pattern of the data, you can apply a technique called “curve fitting”. This is the basic function of regression analysis - to regress the data into a mean or average value and then be able to describe that average value in a formula. The result is a regression or prediction line or curve. It is called a prediction line because it can be used to predict the response to data you have not yet generated. For instance, the prediction line for a coin toss is 1 in 2 or 50-50. As we saw in the test runs of thousands of tosses, this prediction line could be a very accurate indicator of the response of future coin flips. In more complex regression, it is possible to determine the average response of medical studies, voter responses, accident and crime data or consumer buying patterns.
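The simplest form of curve fitting is a straight line fitted by “least squares”, which picks the line that keeps the squared distances between the line and the data as small as possible. The sketch below shows that calculation. The weekly sales figures are made up for illustration, and the function names fit_line and predict are hypothetical helpers, not part of any particular statistics package.

```python
def fit_line(xs, ys):
    """Least-squares straight line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much x varies, and how much x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Read a predicted value off the fitted line."""
    return intercept + slope * x


# Made-up data: units of a product sold in each of five weeks.
weeks = [1, 2, 3, 4, 5]
units = [12, 15, 15, 19, 21]

slope, intercept = fit_line(weeks, units)
print("prediction line: units =", round(intercept, 2), "+", round(slope, 2), "* week")
print("predicted units for week 6:", round(predict(6, slope, intercept), 1))
```

The line is called a prediction line because, once the slope and intercept are known, it can be read at a point (week 6 here) for which no data has been collected yet.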
In the case of the buying patterns of consumers, as the quantity of sample data (the number of products being studied) and the number of times each is recorded (the number of buyers in the study) increase, the accuracy of the prediction line improves. This gets so accurate that it becomes profitable for supermarkets to pay you, with discounts, so that they can get information about your buying patterns. They do this by getting you to register with their “buyer’s club” or their “discount club”, but what they actually do is collect a lot of data about you and then record your every purchase so they are better equipped to market to your needs and appeal to your buying patterns.
Curve fitting, or the creation of the prediction line, is what the horse racing bettor does in his head when he analyzes the past records of each horse before predicting which one will win the next race. Using regression analysis, that process can be quantified so that you have a number assigned to the chances of each horse in the next race. If the calculated chances are good ones, then betting in each race on the horse with the highest calculated chance of winning gives you a consistent, testable method rather than a hunch, though no calculation can guarantee a winner.
Finding Multiple Independent Variables
How Good is your Prediction Line?
In some cases, the real-world data that you are trying to analyze appears to be very random and sporadic. What this means is that there is an average, but each event or value might be very close to or very far away from that average. In the stock market, this is called volatility, the measure of how wildly the value of a stock swings from one day to the next; it is usually calculated as a standard deviation. A separate question is how closely your prediction line follows the actual data, and for that you can calculate a value called the Correlation Coefficient (CC). Its size runs from 0 to 1 (a minus sign only means that one quantity falls as the other rises). If CC = 1, then you have an exact match and you can predict every single event with perfect accuracy. This might occur if you discover that every person in a particular store who buys diapers has a baby. It is an obvious conclusion, but one that you can use if you are the store owner by offering everyone who buys diapers a coupon on bulk buying of baby food.
A value of CC = .5 is easy to misread as a 50-50 chance of predicting the next event, but a correlation coefficient is not a probability. It means that the prediction line captures part of the pattern in the data and misses the rest: squaring the CC gives .25, so only about a quarter of the variation in the data is accounted for by the line. How useful a prediction is also depends on what it is applied to. A 50% chance of heads on a coin toss is not that useful for betting, whereas a 50% chance of winning a large lottery prize would be a much more usable figure, because the payoff is so much larger than the stake. The Correlation Coefficient tells you how far you can trust the prediction capabilities of the regression analysis you have done.
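For readers who want to see the arithmetic, here is a minimal sketch of the usual correlation coefficient (Pearson's r). Formally it runs from -1 to +1; its size, ignoring the sign, is the 0 to 1 closeness discussed above. The data sets are made up for illustration and the function name correlation is a hypothetical helper written for this example.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Made-up examples.
exact = correlation([1, 2, 3, 4], [10, 20, 30, 40])         # points exactly on a rising line
reverse = correlation([1, 2, 3, 4], [40, 30, 20, 10])       # points exactly on a falling line
noisy = correlation([1, 2, 3, 4, 5], [12, 15, 15, 19, 21])  # close to a line, not on it

print("exact rising line: ", round(exact, 3))
print("exact falling line:", round(reverse, 3))
print("noisy data:        ", round(noisy, 3))
print("share of variation explained by the line:", round(noisy ** 2, 3))
```

Squaring the coefficient gives the share of the variation in the data that the prediction line accounts for, which is often the more practical figure when judging how much to trust a prediction.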
Measure a Small Group - Apply the Results to a Large Group
There is a whole field of study called Statistical Inference that takes sample data and uses it to infer or predict what a larger group will do. This is the essence of the marketing analysis that is done with focus groups and public opinion polls. In fact, some very fancy math is used to determine the sample size needed to achieve a reasonable degree of accuracy. You decide on the degree of accuracy you wish to achieve (stated as a margin of error at a chosen confidence level, which together define the Confidence Interval) and then compute how many data points or samples you need to collect to achieve that degree of confidence in your resulting prediction.
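As a rough illustration of that sample-size calculation, the sketch below uses the standard textbook formula for estimating a percentage from a simple random sample, n = z² × p × (1 − p) / e². It assumes a very large population and takes p = 0.5, the most cautious guess about how opinion is split; the function name sample_size is made up for this example, and real polls apply further adjustments that are not shown here.

```python
import math
from statistics import NormalDist


def sample_size(margin_of_error, confidence=0.95, p=0.5):
    """Samples needed to estimate a proportion to within +/- margin_of_error.

    Assumes a simple random sample from a very large population.
    p = 0.5 is the most cautious assumption about the true proportion.
    """
    # z is how many standard deviations cover the chosen confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    return math.ceil(n)  # round up: you cannot survey part of a person


print("within 3 points, 95% confidence:", sample_size(0.03))
print("within 5 points, 95% confidence:", sample_size(0.05))
print("within 3 points, 99% confidence:", sample_size(0.03, confidence=0.99))
```

Under these assumptions, a margin of error of 3 percentage points at 95% confidence calls for a little over a thousand respondents, which is why so many national polls report samples of about that size. Tightening the margin or raising the confidence level increases the required sample quickly.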
This aspect of regression and statistics is perhaps the one you have come into contact with most often without knowing it. Besides the focus groups and public opinion polls often used in politics, there are surveys and buyer pattern analyses that are carried out on a small scale and then applied to a larger group. We sometimes call these “pilot studies” or “sample testing”. Drug testing relies on the same method: a drug is tried on a sample of patients, and the results are used to “predict” the responses of everyone who will eventually take the drug.
Returning to the Correlation Coefficient: if CC = 0, then there is no straight-line relationship between the plotted data and your prediction line, and the line predicts no better than simply guessing the average every time. Keep in mind that CC = 0 rules out only a straight-line pattern. Data that looks random to a straight line can sometimes still be described by fancier formulas or more sophisticated regression analysis.
In all cases, the regression must relate one quantity to another. The quantity you start from is called the independent or predictor variable, and the quantity you are trying to estimate from it is called the dependent or response variable. In all our discussion, we have used predictor and response variables but have not called them that. In Galton’s study, the father’s height was the predictor and the son’s height was the response. The coin toss has no predictor variable in this sense, but the same comparison applies: the 50% figure for heads or the 50% figure for tails is the predicted value, and the 50.082% heads and 49.918% tails that we averaged in our test flips are the observed responses. In the real world, the observed response may never exactly equal the predicted value, no matter how long you spend flipping coins. However, it should be noted that before we flipped a single coin, we could know that the RESPONSE of the flips would be VERY CLOSE to the PREDICTED value - and it was.
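The coin-toss comparison is easy to try for yourself. The sketch below simulates fair coin flips with the computer's random number generator; it is a stand-in for real coins and will not reproduce the exact 50.082% figure from the test flips mentioned above. The function name heads_fraction is made up for this example.

```python
import random


def heads_fraction(n_flips, seed=0):
    """Simulate n_flips tosses of a fair coin and return the fraction of heads."""
    rng = random.Random(seed)
    # Each flip comes up heads with probability one half.
    heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return heads / n_flips


predicted = 0.5  # the predicted value for a fair coin

for n in (10, 1000, 100000):
    observed = heads_fraction(n)
    print(n, "flips: observed", round(observed, 4),
          "differs from predicted by", round(abs(observed - predicted), 4))
```

With only a handful of flips the observed fraction can sit well away from 50%, but as the number of flips grows it settles closer and closer to the predicted value, usually without ever landing on it exactly.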