(1) 10 minutes. Below is a qq plot against a normal cdf of a sample of 200 observations. Does the plot
show the sample has approximately a normal distribution, a distribution with fatter tails than the
normal, a distribution with thinner tails than the normal, or a distribution with a fatter tail than the
normal on one side and a thinner tail on the other? Explain your answer.
[Figure: Normal Q-Q Plot]
The qq plot (quantile-quantile plot) plots the quantiles of one distribution against those of another. The
$p$'th quantile of $X$ is the value $q_p$ of the random variable $X$ such that $P[X \le q_p] = p$. The plot shows
that, e.g. at the left end, the quantiles of the normal (the horizontal axis) are more widely spaced than
those of the sample distribution, so the slope of the plot is low relative to the slope in the middle of the
distribution. In one of the exercises, you computed a qq plot of a t-distributed sample against a normal,
and that plot was steeper at the far left and far right than in the middle. A t distribution is fatter-tailed
than a normal. This question’s plot shows flatter slope at the far left and far right than in the middle. So
it corresponds to a thinner-tailed distribution.
If you didn’t remember the shape from the plot on the exercise, trying to figure out whether the plot
is fat or thin tailed by thinking it through from first principles is a bit tricky. The qq plot plots y against x,
where the relation between y and x is determined by
\[ F_x(x) = F_y(y), \]
with $F_x$ and $F_y$ being the cdf's of $x$ and $y$. The slope of the $y(x)$ function is then easily shown to be
\[ y'(x) = \frac{f_x(x)}{f_y(y(x))}. \]
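The slope formula comes from differentiating the defining relation with respect to $x$, treating $y$ as a function of $x$. A short derivation, assuming both cdf's are differentiable with densities $f_x$ and $f_y$ and that $f_y(y(x)) > 0$:

```latex
\begin{align*}
F_x(x) &= F_y(y(x)) \\
\frac{d}{dx} F_x(x) &= \frac{d}{dx} F_y(y(x)) \\
f_x(x) &= f_y(y(x))\, y'(x) \\
y'(x) &= \frac{f_x(x)}{f_y(y(x))}
\end{align*}
```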
If you think that a relatively fat-tailed $y$ distribution is one in which $f_y/f_x$ is big in the tails (a natural first
stab at a definition of “fat tail”), you might think that a low slope at the extremes of the distribution goes
with a fat tail, when in fact the opposite is true. The reason is that we are not taking the ratio of $f_x$ to $f_y$
at $y = x$, but the ratio at $F_x(x) = F_y(y)$. One definition of relative tail fatness is that $x$ has fatter tails than
$y$ if $f_x(x)/F_x(x)$ is smaller than $f_y(y)/F_y(y)$ when $x$ and $y$ are both close to the lower limit of the support
of their respective distributions. (At the other extreme, the comparison is based on $f_x(x)/(1 - F_x(x))$.)
by Christopher A. Sims. This document is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License
So if $x$ is a t with one degree of freedom, this ratio behaves, as $x$ approaches $-\infty$, like $-2x/(1 + x^2)$
and thus approaches zero, whereas for a normal it behaves like $-x$, and thus approaches infinity. So
the t has fatter tails, and for large negative $x$, at points where $F_x(x) = F_y(y)$ (with $y$ normal) we will
have $f_y/f_x = f_y F_x/(f_x F_y)$ very large. This ratio is the slope of the qq plot when the normal is on the
horizontal axis. I.e., the slope will be very steep at the extremes in the qq plot
when a t(1) is plotted against a normal. The plot shown on the exam was of a sample from the uniform
distribution against a normal.
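The tail behaviour described above can be checked numerically from the slope formula. The following is a minimal sketch, not part of the exam: it assumes theoretical (not sampled) quantiles, puts the standard normal on the horizontal axis, and uses helper names (`qq_slope`, `uniform_slope`, `cauchy_slope`) that are purely illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, placed on the horizontal axis


def qq_slope(p, sample_quantile, sample_pdf):
    """Slope of the theoretical qq plot at probability level p.

    Horizontal axis: normal quantile x with Phi(x) = p.
    Vertical axis: quantile y of the other distribution with F(y) = p.
    Slope dy/dx = phi(x) / f(y).
    """
    x = STD_NORMAL.inv_cdf(p)
    y = sample_quantile(p)
    return STD_NORMAL.pdf(x) / sample_pdf(y)


def uniform_slope(p):
    # Uniform(0, 1): the quantile is p itself and the density is 1 on (0, 1).
    return qq_slope(p, lambda q: q, lambda y: 1.0)


def cauchy_slope(p):
    # t with 1 degree of freedom (Cauchy): closed-form quantile and density.
    return qq_slope(
        p,
        lambda q: math.tan(math.pi * (q - 0.5)),
        lambda y: 1.0 / (math.pi * (1.0 + y * y)),
    )


for p in (0.01, 0.5, 0.99):
    print(f"p={p:4.2f}  uniform slope={uniform_slope(p):8.4f}  "
          f"t(1) slope={cauchy_slope(p):8.4f}")
```

The uniform slope is smaller in the tails than in the middle (thin tails), while the t(1) slope is far larger in the tails than in the middle (fat tails).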
(2) 30 minutes. Here are the results from estimating a logit model from the Stock-Watson “Names” data
set. These data resulted from sending out resumés to prospective employers with names that were
likely to suggest the applicants were black, and/or female. The variable call_back is one for
those resumés that generated a call back from the employer, 0 otherwise. black is 1 for resumés
that were likely to be identified as black, zero otherwise, and female is one for resumés likely to
be identified as female, 0 otherwise. There was also a dummy variable that was 1 for employers in
Chicago. (There were a lot of other variables, but we are keeping this simple.) The logit regression
has call_back as dependent variable, black, female and chicago as independent variables.
Here is the summary output from the maximum likelihood fit:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   -2.2116      0.1246  -17.743   < 2e-16 ***
black         -0.4431      0.1075   -4.120  3.78e-05 ***
chicago                    0.1096   -4.198  2.70e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(a) Find an approximate 95% probability interval for the coefficient on black and use it to give
a 95% probability interval for the difference in probability of a call back between a black male
not in Chicago and a non-black male, also not in Chicago.
The 2.5% tails of the $N(0, 1)$ distribution are at $\pm 1.96$. Thus the 95% interval for the coefficient on black is
\[ -.4431 \pm (1.96 \cdot .1075) = (-.6538, -.2324). \]
For a male not in Chicago, the coefficients on female and Chicago are not needed to compute the
probabilities. The estimated probabilities for a non-black and a black applicant, respectively, are
\[ \frac{e^{-2.2116}}{1 + e^{-2.2116}} = .0987, \qquad \frac{e^{-2.2116 - .4431}}{1 + e^{-2.2116 - .4431}} = .0657. \]
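The step from a coefficient interval to an interval for the difference in probabilities can be sketched as follows. This is an illustration, not the exam's own calculation: it assumes the intercept is held fixed at its point estimate of -2.2116, so that only uncertainty about the black coefficient moves the difference, and the function name `logistic` is just a label chosen here.

```python
import math


def logistic(index):
    """Inverse of the logit link: probability implied by a linear index."""
    return 1.0 / (1.0 + math.exp(-index))


# Point estimates and standard error quoted in the answer.
INTERCEPT = -2.2116
B_BLACK = -0.4431
SE_BLACK = 0.1075

# 95% interval for the coefficient on black.
coef_lo = B_BLACK - 1.96 * SE_BLACK
coef_hi = B_BLACK + 1.96 * SE_BLACK

# Call-back probabilities for a male not in Chicago.
p_nonblack = logistic(INTERCEPT)
p_black = logistic(INTERCEPT + B_BLACK)

# Assumption: the intercept is held fixed at its point estimate. The logistic
# function is increasing, so interval endpoints map to interval endpoints.
diff_lo = logistic(INTERCEPT + coef_lo) - p_nonblack
diff_hi = logistic(INTERCEPT + coef_hi) - p_nonblack

print(f"coefficient interval: ({coef_lo:.4f}, {coef_hi:.4f})")
print(f"p(non-black) = {p_nonblack:.4f}, p(black) = {p_black:.4f}")
print(f"difference interval: ({diff_lo:.4f}, {diff_hi:.4f})")
```

Under that assumption the difference in call-back probability (black minus non-black) lies roughly between -.045 and -.019.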
(b) The estimated covariance matrix of these coefficients is
             (Intercept)       female         black       chicago
(Intercept)  0.015537058 -0.012133751 -0.0046002636 -0.0027857895
female      -0.012133751  0.017814209 -0.0002528960 -0.0037471701
black       -0.004600264 -0.000252896  0.0115659895  0.0001393946
chicago     -0.002785789 -0.003747170  0.0001393946  0.0120026554
Use it to construct a chi-squared test statistic for the hypothesis that the female and Chicago
coefficients are both zero. Three significant figure accuracy is fine, and you will get full credit if
you set up a correct numerical matrix expression for the statistic, even if you don’t carry out all
the matrix multiplications, inversions, etc. Explain how to determine whether the hypothesis
is rejected at the 95% level by a frequentist test. Explain with a hand sketch what region of
the space of coefficients on the two tested variables has probability pchisq(s,df), where s
is your statistic, df is the appropriate degrees of freedom for the chi-squared distribution, and
pchisq() is a function delivering the probability of a chi-squared variable being less than its
first argument (s in the case at hand).
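The test statistic is
\[ s = \begin{pmatrix} \hat\beta_f & \hat\beta_c \end{pmatrix}
\begin{pmatrix} .01781 & -.003747 \\ -.003747 & .012 \end{pmatrix}^{-1}
\begin{pmatrix} \hat\beta_f \\ \hat\beta_c \end{pmatrix}, \]
where $\hat\beta_f$ and $\hat\beta_c$ are the estimated coefficients on female and chicago.
The inverse of the matrix in the center is
\[ \frac{1}{.01781 \cdot .012 - .003747^2}
\begin{pmatrix} .012 & .003747 \\ .003747 & .01781 \end{pmatrix}. \]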
And the test statistic evaluates to 18.655, which, being distributed as $\chi^2(2)$ under the null hypothesis,
rejects the null at any reasonable significance level. The probability of a $\chi^2(2)$ variable exceeding
this level is .0001.
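The quadratic form and the chi-squared comparison can be sketched in a few lines. This is an illustrative sketch rather than the exam's own computation: `wald_stat_2` and `chi2_df2_cdf` are hypothetical helper names, and the coefficient estimates are left as inputs. It relies on the fact that a chi-squared variable with 2 degrees of freedom has cdf $1 - e^{-s/2}$, so `chi2_df2_cdf(s)` plays the role of `pchisq(s, 2)`.

```python
import math


def wald_stat_2(b, v):
    """Quadratic form b' V^{-1} b for a 2-vector b and a 2x2 covariance V."""
    b1, b2 = b
    (v11, v12), (v21, v22) = v
    det = v11 * v22 - v12 * v21
    # Inverse of a 2x2 matrix: swap the diagonal, negate the off-diagonal.
    inv = ((v22 / det, -v12 / det), (-v21 / det, v11 / det))
    return (b1 * (inv[0][0] * b1 + inv[0][1] * b2)
            + b2 * (inv[1][0] * b1 + inv[1][1] * b2))


def chi2_df2_cdf(s):
    """P[chi-squared(2) <= s]; with 2 degrees of freedom this is 1 - exp(-s/2)."""
    return 1.0 - math.exp(-s / 2.0)


# 5% critical value for chi-squared(2): solve exp(-c/2) = 0.05 for c.
critical_value = -2.0 * math.log(0.05)

s = 18.655  # the statistic reported in the answer
tail_prob = math.exp(-s / 2.0)
print(f"critical value = {critical_value:.3f}, reject: {s > critical_value}")
print(f"P[chi2(2) > s] = {tail_prob:.2e}, cdf at s = {chi2_df2_cdf(s):.6f}")
```

A frequentist test at the 5% significance level rejects when the statistic exceeds the critical value, which is about 5.99 for 2 degrees of freedom.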
The highest posterior density region of the parameter space for these two coefficients is an ellipse
centered at the coefficient estimates and slightly negatively sloped (because the coefficients have
negative covariance). The ellipse with probability .9999 corresponding to our test statistic passes
through the null hypothesis point (0,0). A plot of it (more precise than you were required to produce)
is below. Note that it includes some negative values of the coefficient on female (as we might
have expected because of its smaller t statistic) but scarcely any positive values of the chicago coefficient.
(3) 20 minutes. We would like to estimate the coefficients in the equation
\[ y_j = \beta_0 + \beta_1 x_j + \varepsilon_j. \]
The data are i.i.d. and $\varepsilon_j$ has mean zero, but we do not actually have data on $x$. We have instead
two error-ridden proxies for $x$:
\[ z_j = x_j + \nu_j, \]
\[ w_j = x_j + \xi_j. \]
We are willing to assume that all the data are jointly i.i.d. across $j$, that $\nu_j$, $\xi_j$ and $\varepsilon_j$ all have zero
mean conditional on $x_j$, and that $\nu_j$, $\xi_j$, and $\varepsilon_j$ are uncorrelated with one another and have constant
variance across $j$.
(a) Can we obtain consistent estimates of $\beta_0$ and $\beta_1$ by replacing $x_j$ with $z_j$ in the equation and
then using $w_j$ as an instrument for $z_j$? Would the results be different if we did the reverse,
using $z_j$ as an instrument for $w_j$?
The original equation implies that when we replace $x_j$ by $z_j$ we get
\[ y_j = \beta_0 + \beta_1 z_j - \beta_1 \nu_j + \varepsilon_j = \beta_0 + \beta_1 z_j + \zeta_j. \]
The error term in this equation, $\zeta_j = -\beta_1 \nu_j + \varepsilon_j$, is clearly, under our assumptions, correlated with
$z_j$, so OLS will not work. However, $w_j$ is correlated with $z_j$ because of their common dependence on
$x_j$, and neither of $w_j$'s two components, $x_j$ and $\xi_j$, is correlated with $\zeta_j$, so the model meets the basic
requirements for validity of instrumental variables. The same arguments with the roles of $z$ and $w$
reversed imply that we could use $z_j$ as an instrument for $w_j$ instead. The two estimates would not
be the same, though, since they are
\[ (Z'W)^{-1} Z'Y \quad \text{and} \quad (W'Z)^{-1} W'Y, \]
where $Z$ is the matrix consisting of the constant vector and the $z_j$'s and $W$ is the matrix consisting of
the constant vector and the $w_j$'s. The asymptotic covariance matrix would not even match between
the two (though you were not asked to check that).
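A small simulation illustrates the argument. It is a sketch under assumptions that are not in the exam: all of $x_j$, $\nu_j$, $\xi_j$, $\varepsilon_j$ are drawn as independent standard normals, the true values are set to $\beta_0 = 1$ and $\beta_1 = 2$, and the function names are illustrative. Because the regression includes a constant, each slope estimate reduces to a ratio of sample covariances.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (n - 1)


def simulate(n=20000, beta0=1.0, beta1=2.0, seed=0):
    """Draw (y, z, w) from the measurement-error model with unit variances."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [xi + rng.gauss(0, 1) for xi in x]   # z = x + nu
    w = [xi + rng.gauss(0, 1) for xi in x]   # w = x + xi
    y = [beta0 + beta1 * xi + rng.gauss(0, 1) for xi in x]
    return y, z, w


y, z, w = simulate()

# Slope estimates as covariance ratios (valid because a constant is included).
ols_on_z = cov(z, y) / cov(z, z)      # attenuated toward zero
iv_w_for_z = cov(w, y) / cov(w, z)    # regressor z, instrument w
iv_z_for_w = cov(z, y) / cov(z, w)    # regressor w, instrument z

print(f"OLS on z:             {ols_on_z:.3f}")
print(f"IV (w instruments z): {iv_w_for_z:.3f}")
print(f"IV (z instruments w): {iv_z_for_w:.3f}")
```

With these settings OLS on $z$ settles near half the true slope, while both IV estimates land near the true slope of 2 without coinciding, since their numerators use different sample covariances.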
(b) We could get a more accurate measure of $x_j$ by taking an average of $w_j$ and $z_j$. What about
replacing $x_j$ with that average and using $z_j$ as an instrument for it? Would that give different
results? Better results?
It would certainly give different results, and not better results. The residual once $x_j$ was replaced by
$(z_j + w_j)/2$ would depend on the noise terms in both $z$ and $w$, so neither $z$ nor $w$ would be usable
as an instrument.
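To see the failure concretely, one can work out the probability limit of the slope estimate. The following is a sketch, writing $a_j = (z_j + w_j)/2$ and using $\sigma_x^2$ and $\sigma_\nu^2$ (notation introduced here, not in the exam) for the variances of $x_j$ and $\nu_j$:

```latex
\begin{align*}
a_j &= x_j + \tfrac{1}{2}(\nu_j + \xi_j), \\
y_j &= \beta_0 + \beta_1 a_j
      + \underbrace{\varepsilon_j - \tfrac{1}{2}\beta_1(\nu_j + \xi_j)}_{\text{new error term}}, \\
\hat\beta_1^{IV}
  = \frac{\widehat{\operatorname{Cov}}(z_j, y_j)}{\widehat{\operatorname{Cov}}(z_j, a_j)}
  &\xrightarrow{p} \frac{\beta_1 \sigma_x^2}{\sigma_x^2 + \tfrac{1}{2}\sigma_\nu^2}
  \neq \beta_1 \quad \text{when } \beta_1 \neq 0,\ \sigma_\nu^2 > 0 .
\end{align*}
```

The instrument $z_j$ shares the noise $\nu_j$ with the new error term, which is what breaks consistency.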
(4) 10 minutes. Suppose we have i.i.d. data on $y_j, x_j$ for $j = 1, \dots, n$ and wish to estimate the regression (with no constant term)
\[ y_j = \alpha x_j + \varepsilon_j. \]
Suppose we know that $\varepsilon_j \mid x_j \sim N(0, \sigma^2 x_j^2)$, where $\sigma^2$ is an unknown parameter. Explain why
ordinary least squares estimation is inefficient in this case and explain how to implement a more
efficient estimator.
Ordinary least squares is maximum likelihood and unbiased when the errors are homoskedastic,
i.e. $\varepsilon_j \sim N(0, \sigma^2)$ with the same $\sigma^2$ value for all $j$. Here the error variance varies, so GLS gives a lower
variance. The weighted least squares version of GLS would tell us here to divide the data $(y_j, x_j)$ for
each $j$ by $x_j$, then apply least squares to the transformed data. This results in the equation
\[ \frac{y_j}{x_j} = \alpha + \nu_j, \]
where $\nu_j = \varepsilon_j / x_j$ and $\nu_j \sim N(0, \sigma^2)$. Applying least squares to this homoskedastic equation is just
taking the sample mean of $y_j / x_j$.
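The efficiency gain can be illustrated with a small Monte Carlo comparison. This is a sketch under assumptions not stated in the exam: $x_j$ is drawn as standard normal, the true values are set to $\alpha = 1.5$ and $\sigma = 1$, and the function names are illustrative.

```python
import random
import statistics


def one_sample(rng, n, alpha, sigma):
    """Draw x ~ N(0,1) and y = alpha*x + eps with sd(eps | x) = sigma*|x|."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [alpha * x + sigma * abs(x) * rng.gauss(0, 1) for x in xs]
    return xs, ys


def ols(xs, ys):
    # No-constant least squares: sum(x*y) / sum(x^2).
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def wls(xs, ys):
    # Divide through by x; least squares on a constant is the mean of y/x.
    return statistics.fmean([y / x for x, y in zip(xs, ys)])


rng = random.Random(0)
alpha, sigma, n, reps = 1.5, 1.0, 50, 2000
ols_draws, wls_draws = [], []
for _ in range(reps):
    xs, ys = one_sample(rng, n, alpha, sigma)
    ols_draws.append(ols(xs, ys))
    wls_draws.append(wls(xs, ys))

print(f"OLS: mean={statistics.fmean(ols_draws):.3f} "
      f"var={statistics.variance(ols_draws):.4f}")
print(f"WLS: mean={statistics.fmean(wls_draws):.3f} "
      f"var={statistics.variance(wls_draws):.4f}")
```

Both estimators are centered on the true $\alpha$, but the sampling variance of the mean of $y_j/x_j$ is visibly smaller than that of OLS.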