Computing
Downloading
Notes
| |
Overview
| In the first Bell Curve analysis, Herrnstein & Murray (HM) used a model--scores on the Armed Forces Qualification Test (AFQT), a socioeconomic status index (SES), and AGE--to "predict" whether National Longitudinal Survey of Youth (NLSY) cases lived above or below the POVERTY level. Since the NLSY is a longitudinal study, HM were able to take AFQT and SES data from the beginning of the NLSY and use them, along with their AGE covariate, to predict POVERTY a decade later. They computed the three model variables themselves (Appendix 2, pp. 593-604). They used a logistic regression to determine whether their model was related to their outcome measure, because that measure was binary, with cases being either above or below the POVERTY level. This is a standard confirmatory approach to research. |
|
The variables HM used are a mix of ones from the NLSY--a longitudinal
study described more fully in the Subjects section of this web page--and
ones they computed themselves: If we're referring to pedigree, then
we're referring to the NLSY, a venerable dataset; if we're talking about
what was fed into a statistical package to produce the exact numbers shown
in the book's Appendix 4, that's another story. Appendix 2 describes
HM's computations (pp. 593-604). They evidently computed AFQT, SES, and
AGE themselves, taking POVERTY directly from the NLSY data.
AFQT, SES, and AGE constitute the model HM used to predict POVERTY--and every other dependent measure shown in Appendix 4. |
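To make the shape of such a model concrete, here is a minimal sketch of how a logistic model turns three standardized predictors into a probability of being below the POVERTY level. The function name `poverty_probability` and every coefficient value are hypothetical, chosen only so the example runs; they are not HM's estimates.

```python
import math

def poverty_probability(afqt, ses, age, coef):
    """Logistic model: weighted sum of predictors mapped onto a 0-1 probability."""
    logit = (coef["intercept"]
             + coef["afqt"] * afqt
             + coef["ses"] * ses
             + coef["age"] * age)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients (NOT taken from The Bell Curve)
coef = {"intercept": -2.0, "afqt": -0.8, "ses": -0.3, "age": -0.1}

# A case at the sample mean on all three standardized predictors
p_average = poverty_probability(0.0, 0.0, 0.0, coef)

# The same case, one standard deviation below the mean on AFQT
p_low_afqt = poverty_probability(-1.0, 0.0, 0.0, coef)
```

With a negative AFQT coefficient, as assumed here, lowering AFQT raises the predicted probability; the binary outcome itself is never predicted directly, only its probability.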
| AFQT. This variable was computed from scores on a battery of subtests, collectively called the ASVAB (Armed Services Vocational Aptitude Battery), a test developed by the U.S. Defense Department (DOD) for new military recruits.
At the time this test was administered to the NLSY, the DOD was computing its AFQT by adding the scores on three subtests (ARITHMETIC REASONING, WORD KNOWLEDGE, and PARAGRAPH COMPREHENSION) to 1/2 the score of another (NUMERICAL OPERATIONS):
AFQT = ARITHMETIC REASONING + WORD KNOWLEDGE + PARAGRAPH COMPREHENSION + (.5 * NUMERICAL OPERATIONS)
Since the latter subtest was speeded--with subjects asked to complete more problems than they possibly could in the allotted time--the DOD decided in 1989 to replace scores from it with those from another, MATHEMATICAL REASONING. Smart move. HM used DOD's 1989 protocol to compute (note 2) their AFQT variable. HM next standardized it (note 3), expressing individual subjects' AFQT scores as their distance from the sample mean in standard deviation units, with most values being in the ± 3 range. 11878 cases had valid values on the AFQT. |
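As a rough illustration of the two steps just described, the sketch below adds three subtest scores to half of a fourth and then standardizes the result. It follows the pre-1989 formula given in the text, not HM's actual code; the subtest scores are invented, the helper names `afqt_composite` and `standardize` are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def afqt_composite(arithmetic, word, paragraph, numerical):
    """Pre-1989 composite: three subtests plus half of NUMERICAL OPERATIONS."""
    return arithmetic + word + paragraph + 0.5 * numerical

def standardize(values):
    """Express each value as its distance from the sample mean in SD units."""
    m = mean(values)
    sd = pstdev(values)  # population SD; an assumption, HM's choice is not stated
    return [(v - m) / sd for v in values]

# Invented subtest scores: (arithmetic, word, paragraph, numerical)
subjects = [(20, 30, 12, 40), (10, 18, 8, 30), (25, 33, 14, 44), (15, 25, 10, 36)]

raw = [afqt_composite(*s) for s in subjects]
z = standardize(raw)
```

After standardizing, the scores average 0 and have a standard deviation of 1, which is why most values land in the ± 3 range.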
||
| SES. Scores were based on the NLSY's INCOME, EDUCATION, and OCCUPATION variables (which were asked of different household members).
HM note that 7447 cases had values on all four variables; 3612, three; 679, two; and the remaining 138, one (p. 599). When four of these measures were available, HM used all four to compute SES; when three were available, they used three; two, two; and one, one. They calculated SES by adding the available measures and averaging them:
SES = (sum of the available measures) / (number of available measures)
| ||
|
HM then standardized these resulting scores: there were 7447 + 3612 + 679 + 138 = 11876
cases with valid values on SES.
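The averaging rule can be sketched as follows: use however many measures a case has, and skip the case only if it has none. This is an illustration, not HM's code. It assumes the components are already on a comparable scale; the component values are invented and the name `ses_index` is hypothetical. The counts are the ones HM report on p. 599.

```python
def ses_index(components):
    """Average whichever components are present (None marks a missing value)."""
    present = [c for c in components if c is not None]
    if not present:
        return None  # no measures at all: no SES score
    return sum(present) / len(present)

# Invented component scores, assumed to be on a comparable scale
four = ses_index([0.5, -0.1, 0.3, 0.1])       # all four measures available
two = ses_index([None, 0.4, None, -0.2])      # only two available
none = ses_index([None, None, None, None])    # nothing available

# Cases by number of available measures, as reported by HM (p. 599)
counts = {4: 7447, 3: 3612, 2: 679, 1: 138}
total_valid = sum(counts.values())
```

Summing the reported counts reproduces the 11876 cases with valid SES values.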
AGE. HM standardized the NLSY's DATE OF BIRTH variable to obtain the measure of AGE used in their analysis. All cases have valid values on this variable. |
How HM got from that many to the 3367 used in their POVERTY analysis is outlined in the Subjects section. The AFQT and SES variables are simple sums or averages, the typical stuff of social research. I used the variables contained in the online file described below (called "NATION.TXT") to replicate HM's analysis. The preceding just summarizes their Bell Curve description of how they computed the variables they used in their POVERTY analysis.
|
You can order the NLSY dataset, as HM did: contact the BLS and order it
for about $20. |
Alternatively,
snailmail either address below:
| NLS User Services | National Longitudinal Surveys Bureau of Labor Statistics |
| 921 Chatham Lane, Suite 200 | 2 Massachusetts Ave. NE, Suite 4945 |
| Columbus, OH 43221 | Washington, DC 20212-0001 |
Either way, you'll end up with the same dataset (note 4) HM started out with: 12686 cases.
| A more direct alternative is to download the data file NATION.TXT (46 variables and 4 constants, 3.3 MB) from the Bell Curve Home Page--and don't forget the variable names and the Codebook: the raw data, the numbers themselves, mean nothing unless you know what they represent. Also available for download are other NLSY data files used in other Bell Curve analyses. It is my understanding that this online data resource is available compliments of the second author, Murray. |
1 Nietzsche's observation that there is no "immaculate perception" is as applicable to social research as to our day-to-day lives.
* No interpretation is simple, because all interpretation depends on perception. Look at the two figures below.
| How about the Necker cube to your left: Is it going away from you? coming towards you? at eye-level? above you? all of the above? |
| What do you see to your right: a dark vase on a light background? two light faces on a dark background? both? | |
| No interpretation is context independent. I use the simple figures above to illustrate how easily we see the same things differently. Our interpretations are influenced by minuscule details like where we're standing (physically, intellectually, emotionally) relative to what we're looking at. This is as true for an interpretation of a statistical analysis as for any other interpretation. |
| *Interpretations are also influenced by factors not immediately pertaining to the task-at-hand. We are all rooted in communities of knowledge and belief, which for the most part go unarticulated: What constitutes evidence? What are valid ways of knowing, of gaining knowledge (epistemology)? What for that matter is knowledge? What kinds of questions are legitimate to ask? | Fundamental to HM's
interpretation of their POVERTY analysis was their belief in "g"--standing
for "general intelligence"--which they believed was "by now beyond
significant technical dispute" (p. 22).
They so believed, but
others do not:
No interpretations, not even technical ones, are facts: All are subject to disagreement, and "g" is no exception. | |
|
| *All interpretations, even technical ones, are influenced as much by what is not known at the time as what is. | One thing HM didn't know when their book was published was how to interpret the R-square value from a logistic regression: | The book noted that "the usual measure of goodness of fit for multiple regressions [is] R-square" (p. 617). Murray, however, learned otherwise and in 1995 referred to R-square as "ersatz and unsatisfactory." Had he known when the book was published what he later learned, Murray might have looked at the classification table and interpreted the statistical output from The Bell Curve POVERTY analysis differently. |
|
||
My point here is simply that we're all human, not all knowing: all interpretations--even statistical ones--are situational and fallible (I refer the reader to Karatoni for a contemporary consideration of these issues). In summary, HM's interpretation of their POVERTY analysis results was based in part on what they already believed--that "g" exists--and on their not knowing how to interpret the statistic R-square in the context of a logistic regression (which is not to say they actually considered it). I speak of routine analysis here--something quite different from Enron math.
2 No analysis is better than the code that generated it.
|
The steps needed to compute the AFQT in JMP, HM's package, are shown in the left column below.
I use the vanilla expression a + b + c + (.5 * d) to represent their
computational formula.
|
|
|
| The step needed to compute the AFQT in several other packages, all with command-line interfaces, is shown in the right column below. | ||
|
To compute the AFQT in JMP, the package HM used, you'd need to:
| To compute the AFQT score in
a command-line package, you'd type:
The old BMDP and SPlus packages had similar one-liners.
|
Although actually computing anything using the GUI package would require numerous steps, output from it and other packages would be, quantitatively at least, damned close--with one's squared z-score equaling another's chi-square. Despite the number of steps, the computational formulae for the AFQT and SES variables are just simple sums and averages.
3 Shown below is the "standard normal curve." The z-score row runs from -4 to +4; the curve peaks at 0, the center of the distribution, with cases above the mean having positive values and those below, negative. Scores in the z-, T-, and SAT rows are identical except for a change of scale (a multiplier and an added constant): actually, the "standard normal curve" is a whole family of curves or distributions. Although scrunched towards the center, a percentile equivalent of 50 is the same as the median and the mode--and a z of 0, a T of 50, and an SAT of 500.
Gaussian or Bell Curve Showing Some Standardized Scales.
Although the bell curve never quite touches the X-axis, one expects that, with a large enough sample, about 99.7% of all cases will fall within ±3 standard deviations of the mean--about 95% within ±2, and about 68% within ±1. |
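The share of a normal distribution lying within a given number of standard deviations of the mean can be checked with the error function from Python's standard library. The sketch below also converts a z-score to the other two scales, assuming the conventional scalings (T with mean 50 and standard deviation 10, SAT-type with mean 500 and standard deviation 100); the function names are hypothetical.

```python
import math

def share_within(k):
    """Share of a normal distribution within ±k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

def z_to_t(z):
    """T scale: mean 50, SD 10 (conventional scaling, assumed here)."""
    return 50 + 10 * z

def z_to_sat(z):
    """SAT-type scale: mean 500, SD 100 (conventional scaling, assumed here)."""
    return 500 + 100 * z

# Shares within 1, 2, and 3 standard deviations, rounded to four places
shares = {k: round(share_within(k), 4) for k in (1, 2, 3)}

# The center of the distribution on each scale
center = (z_to_t(0), z_to_sat(0))
```

Because each scale is a straight-line rescaling of z, a case keeps the same percentile rank no matter which row of the figure it is read from.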
By definition, half of the scores on such scales are above and half below the mean--
|
although former U.S. "President Dwight Eisenhower express[ed] astonishment and alarm on discovering that fully half of all Americans have below average intelligence" (Sagan, 1996). |
4 Some raw data from 10 individual NLSY subjects are shown below (I truncated down to two decimal places, Whitehead's fallacy of misplaced precision and all). These subjects are divided into two columns of 5 each; the one to the right consists of subjects who did not have missing values on any of the 4 analysis variables and the one to the left, those who did [missing values are indicated by a " ."]. The 5 subjects in the right column were among those included in the POVERTY analysis.
MISSING VALUES? Left five columns: YES. Right five columns: NO.
| ID | AFQT | SES | AGE | POVERTY | ID | AFQT | SES | AGE | POVERTY |
|---|---|---|---|---|---|---|---|---|---|
| 1161 | -0.14 | -0.31 | 1.41 | . | 1290 | -0.13 | -0.67 | 1.17 | 1 |
| 1757 | . | . | 1.46 | . | 4194 | -0.07 | -0.27 | 1.59 | 0 |
| 3004 | . | . | 1.62 | . | 6047 | -1.85 | -0.80 | 1.41 | 1 |
| 7095 | -1.20 | -0.12 | 1.52 | . | 6078 | -0.36 | 0.77 | 1.60 | 1 |
| 7252 | . | . | 1.53 | 1 | 6174 | 0.48 | -0.65 | 1.56 | 0 |
The preceding numbers signify nothing--being only 10 cases, an opportunistic sample selected to show why not all eligible subjects would be included in the POVERTY analysis (HM used the common and accepted practice of casewise deletion, which happened to be their statistical package's default).
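Casewise deletion is easy to show with the ten cases from the table above: a case stays in the analysis only if it has a value on every variable. The sketch uses `None` in place of the " ." marker; the function name `complete_cases` is hypothetical, and this is an illustration rather than what HM's package did internally.

```python
# (ID, AFQT, SES, AGE, POVERTY); None stands in for the " ." missing marker
rows = [
    (1161, -0.14, -0.31, 1.41, None),
    (1757, None, None, 1.46, None),
    (3004, None, None, 1.62, None),
    (7095, -1.20, -0.12, 1.52, None),
    (7252, None, None, 1.53, 1),
    (1290, -0.13, -0.67, 1.17, 1),
    (4194, -0.07, -0.27, 1.59, 0),
    (6047, -1.85, -0.80, 1.41, 1),
    (6078, -0.36, 0.77, 1.60, 1),
    (6174, 0.48, -0.65, 1.56, 0),
]

def complete_cases(records):
    """Casewise deletion: keep a record only if none of its values is missing."""
    return [r for r in records if all(v is not None for v in r)]

kept = complete_cases(rows)
kept_ids = [r[0] for r in kept]
```

Note that the check is for `None`, not for falsy values, so a POVERTY code of 0 counts as a valid value and the case is kept.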
|
From a didactic perspective, note though that all AGE scores are positive numbers, meaning only that these subjects are older than average, and most of the AFQT and SES scores are negative, indicating below average. |
The goal of this web page is to make The Bell Curve POVERTY
analysis transparent and thus public.