Tag Archive: statistics

Random Effects

My last post was on the fixed effects model. There, I established that the fixed effects model allows variables not included in the regression to be correlated with the variables that are included, which is why the results of the regression cannot be used to assess the effects of the unobserved variables. The random effects model, on the other hand, assumes that the unobserved variables are uncorrelated with the observed variables, which makes it possible to use the regression to investigate the effects of variables not included in the regression.

In my past posts on the panel data model and its specific variations, I explained that the general form of the panel data model is y_{it} = \alpha_i + \beta'x_{it} + \epsilon_{it}, and the general form of the fixed effects model is y_{it} = D \alpha_i + \beta'x_{it} + \epsilon_{it}. With the random effects model, the general form is y_{it} = \alpha + \beta'x_{it} + u_i + \epsilon_{it}. In this model, \alpha is a constant common to all units, and u_i is a random disturbance specific to each cross-sectional unit that stays constant over time.
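To see why the distinction matters, here is a minimal numpy sketch (entirely hypothetical simulated data, not my capstone numbers) in which the unit effects u_i are deliberately correlated with the regressor. Pooled OLS, which ignores u_i, comes out biased, while the within (fixed effects) estimator, which demeans each unit over time, recovers the true coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 58, 8        # matching the capstone panel: 58 counties, 8 years
beta = 2.0          # true coefficient (chosen for this illustration)

# Unit effects u_i deliberately correlated with the regressor x_it:
u = rng.normal(size=N)
x = u[:, None] + rng.normal(size=(N, T))
y = beta * x + u[:, None] + rng.normal(scale=0.5, size=(N, T))

# Pooled OLS ignores u_i and is biased when u_i and x_it are correlated:
xd, yd = x - x.mean(), y - y.mean()
b_pooled = (xd * yd).sum() / (xd * xd).sum()

# Within (fixed effects) estimator: demean each unit over time,
# which sweeps out u_i regardless of its correlation with x_it:
xw = x - x.mean(axis=1, keepdims=True)
yw = y - y.mean(axis=1, keepdims=True)
b_within = (xw * yw).sum() / (xw * xw).sum()

print(round(b_pooled, 2), round(b_within, 2))  # pooled is biased upward (~2.5); within is near 2.0
```

If the u_i had instead been drawn independently of x_it, both estimators would center on the true beta, which is exactly the situation the random effects model assumes.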

In choosing whether to use a fixed effects model or a random effects model, one must first test to see if individual effects exist. This is done using a Lagrange Multiplier (LM) test. If they do indeed exist, then a Hausman test can be used. The Hausman test uses a hypothesis test to determine whether the coefficient estimates from the fixed effects model and the random effects model differ significantly. If they do not, then a random effects model may be used. If they do, the more restrictive fixed effects model must be used.
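The Hausman statistic itself is H = (b_FE - b_RE)'[Var(b_FE) - Var(b_RE)]^{-1}(b_FE - b_RE), compared against a chi-squared distribution. A small sketch, with made-up estimates and covariance matrices purely for illustration (these are not numbers from any real regression):

```python
import numpy as np
from scipy.stats import chi2

def hausman(b_fe, b_re, V_fe, V_re):
    """Hausman statistic H = d'(V_fe - V_re)^{-1} d, with d = b_fe - b_re,
    compared against chi-squared with len(d) degrees of freedom."""
    d = b_fe - b_re
    stat = float(d @ np.linalg.inv(V_fe - V_re) @ d)
    df = len(d)
    return stat, df, chi2.sf(stat, df)

# Hypothetical estimates and covariances for two regressors:
b_fe = np.array([1.20, 0.50])
b_re = np.array([1.15, 0.52])
V_fe = np.diag([0.010, 0.020])
V_re = np.diag([0.005, 0.010])

stat, df, p = hausman(b_fe, b_re, V_fe, V_re)
print(round(stat, 2), round(p, 2))  # 0.54 0.76 -> fail to reject: random effects is fine here
```

A large p-value means the two sets of estimates agree, so the more efficient random effects model can be kept; a small p-value pushes you back to fixed effects.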

Panel Data

For my capstone I intend to look at the mathematics behind regression analysis using panel data, or panel data modeling. A panel of data consists of two components: a cross-section and a time-series. For instance, the data I am using consists of individual observations for each of the 58 counties of California over a span of 8 years (2000-2007). This means that each variable in the regression has 464 (58 \cdot 8) observations. An advantage of this approach is the ability to account for variability over time as well as across the cross-section. Also, it allows for analysis of data with a limited number of observations over time (provided there are substantial cross-sectional observations) or a limited number of observations over the cross-section (provided there are sufficient time-series observations).

The general form of a panel data model is y_{it} = \alpha_i + \beta'x_{it} + \epsilon_{it}. In the model, i represents the cross-sectional units, t represents the time-series units, y represents the dependent variable, x represents the independent variables, \alpha represents the individual effects coefficients, \beta' represents the set of coefficients for the independent variables, and \epsilon represents the error terms. This is just the general form of the panel data model. The specific variations of the model that I will be looking at will be discussed in a later blog post.
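As a quick illustration of the notation (with made-up coefficients and simulated data, not my actual county data), the general model can be simulated and then stacked into the usual "long" form with one row per (i, t) pair:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 58, 8                      # 58 counties, 8 years (2000-2007)

alpha = rng.normal(size=N)        # individual effects alpha_i
beta = np.array([1.5, -0.8])      # coefficients beta' (two regressors, hypothetical)
x = rng.normal(size=(N, T, 2))    # independent variables x_it
eps = rng.normal(scale=0.3, size=(N, T))

y = alpha[:, None] + x @ beta + eps   # y_it = alpha_i + beta' x_it + eps_it

# Stacked "long" form: one row per (i, t) pair -> 58 * 8 = 464 observations
y_long = y.reshape(-1)
x_long = x.reshape(-1, 2)
print(y_long.shape, x_long.shape)     # (464,) (464, 2)
```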

Sportswriter Gregg Doyel, in his recent column Numbers Don’t Lie: Sabermetrics Should Win AL Cy Young, explains why Felix Hernandez (and sabermetrics itself) should win the American League Cy Young Award. For those who don’t know, Felix Hernandez is a pitcher for the Seattle Mariners who, despite having the best stats in the American League (AL) (a 2.27 ERA and 232 strikeouts), had an underwhelming 13-12 record this season. Sabermetrics is the study of baseball through baseball statistics and objective evidence. The Cy Young Award is given to the best pitcher in each league (one from the American League and one from the National League).

Doyel argues that Hernandez will win the AL Cy Young because he is the best pitcher in the league, a fact backed up by the mathematics of sabermetrics. In his opinion, it is not fair for a pitcher to lose the award because of factors beyond their control, such as playing on a team with very little offense (sorry Mariners). A pitcher cannot win a game if his team does not score runs.

I hope that Doyel is right. An award reserved for the best pitcher in baseball should go to the best pitcher in baseball, regardless of the success of their team as a whole. With that being said, go Mariners!

In the video, Peter Donnelly Shows How Stats Fool Juries, Peter Donnelly delivers a talk on how statistics can be deceptive. After a few jokes about the social awkwardness of statisticians, Donnelly moves on to an example. He describes a scenario where you flip a fair coin until the pattern HTH emerges (H being heads and T being tails). You then flip the coin again until the pattern HTT emerges. Then he asks the audience what they believe to be true:

a) The average number of flips for HTH to emerge is less than the average number of flips for HTT to emerge.

b) The average number of flips is equal for both.

c) The opposite of a) is true.

Most people in the audience answered b). The correct answer, however, is c): on average, HTH takes longer to emerge than HTT (about 10 flips versus 8). This is because, as Donnelly explains, the pattern HTH can overlap with itself (the five-flip sequence HTHTH contains two occurrences of HTH), whereas HTT cannot. Occurrences of HTH therefore come in clumps, and the average wait for the first one is longer.
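A quick simulation (a Python sketch assuming a fair coin) estimates the two averages, which come out near 10 flips for HTH and 8 for HTT:

```python
import random

random.seed(0)

def flips_until(pattern):
    """Flip a fair coin until `pattern` appears; return the number of flips."""
    seq = ""
    while not seq.endswith(pattern):
        seq += random.choice("HT")
    return len(seq)

trials = 20_000
avg_hth = sum(flips_until("HTH") for _ in range(trials)) / trials
avg_htt = sum(flips_until("HTT") for _ in range(trials)) / trials
print(round(avg_hth, 1), round(avg_htt, 1))  # close to 10 and 8
```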

This is the first example he gives of how topics in statistics can often be deceptive. He then moves on to a more relevant example (one covered in MATH 341): the probability of having a disease given a positive result on a test with 99% accuracy. He illustrates that while a positive result may make it seem that there is a 99% probability that you have the disease, the true probability depends on how many people are tested and on the actual prevalence of the disease. If a million people are tested and there is a .01% probability of having the disease, then only 100 people actually have it, while 1% of the 999,900 healthy people (9,999 of them) receive false positives. Furthermore, of the 100 people who do have the disease, only 99% will test positive. This makes the probability that one actually has the disease, given a positive test result, considerably small (less than 1%).
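The arithmetic from the example is easy to check directly:

```python
# Population screening arithmetic from the talk: 1,000,000 people tested,
# 0.01% disease prevalence, and a test that is 99% accurate both ways.
tested = 1_000_000
prevalence = 0.0001              # 0.01%
accuracy = 0.99

sick = tested * prevalence                       # 100 people actually have the disease
true_pos = sick * accuracy                       # 99 of them test positive
false_pos = (tested - sick) * (1 - accuracy)     # 9,999 healthy people also test positive

p_sick_given_pos = true_pos / (true_pos + false_pos)
print(round(p_sick_given_pos, 4))                # about 0.0098, i.e. under 1%
```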

This example relates to how statistics can be used to deceive a jury. As a matter of fact, in the wrong hands, statistics can be incredibly dangerous. This was illustrated by a true story about a pediatrician who testified against a woman accused of killing two of her babies. The pediatrician mistakenly claimed that the chance of having two infants die from Sudden Infant Death Syndrome (SIDS) is 1 in 73,000,000. One of his many mistakes was assuming that two SIDS deaths in the same family are independent events, when shared genetic and environmental factors mean they are not. The woman was convicted, and was not released until her second appeal.
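The independence error is easy to reproduce. The 1-in-73,000,000 figure reportedly came from squaring a single-family SIDS probability of about 1 in 8,543; squaring is only valid if the two deaths are independent:

```python
# Squaring the single-death probability, as the testimony reportedly did,
# is only valid if the two deaths are independent events -- which they
# are not, since siblings share genetic and environmental risk factors.
p_one = 1 / 8_543
p_two_if_independent = p_one ** 2
print(f"{1 / p_two_if_independent:,.0f}")  # 72,982,849 -> about 1 in 73,000,000
```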

See the video of Peter Donnelly’s talk below:

Statistical Surveys

The Numbers Guy’s blog post, The Census’s 21st-Century Challenges, discusses the difficulty of conducting a nationwide survey. Part of the challenge lies in getting the survey to the people: some people do not have a permanent address, or have moved recently without their address being updated. Another problem lies in actually getting those who receive the survey to participate.

The U.S. census claims to be mandatory, but lacks the enforcement to be so. That’s not to say that calling it mandatory is useless: it’s estimated that replacing the word “voluntary” with “mandatory” increased survey responses by about 20%.