SUBJECTS
---

Overview

In the first Bell Curve analysis, Herrnstein and Murray (HM) empirically tested whether their model--scores on the Armed Forces Qualification Test (AFQT), a socioeconomic index (SES), and AGE--predicted the POVERTY status of cases in the National Longitudinal Survey of Youth (NLSY). The NLSY comprises two independent subsamples, the X-Sectional and the Supplemental, representative of non-institutionalized people living in the U.S. who were born between the first day of 1957 and the last day of 1964 (and who were between 14 and 22 when first interviewed), with "economically disadvantaged youth ... oversampled" in the latter. 1 Of these, HM used only the 3367 white non-students without missing values in the NLSY X-Sectional subsample. It is surprising that they picked this higher-income subsample, since the book expressed interest in low-income whites (pp. 617, 793). 2

This section examines the NLSY, summarizes how HM derived their analysis cases, and identifies the parallel NLSY Supplemental cases. Note 1 compares the NLSY to other U.S. national longitudinal studies; Note 2 delineates the standard ways that HM dealt with the problems of missing values; and Note 3 empirically examines the distributions of HM's four variables in the two NLSY subsamples (from NATION.TXT).

 

NLSY Pedigree

The NLSY interviews are conducted by the National Opinion Research Center at the University of Chicago (some by phone, some in person, and some computer assisted); user services are provided by the Center for Human Resource Research at Ohio State University. The most recent interviews, the 19th round, occurred in the year 2000.
Although key questions are repeated, interview content varies from year to year. Funds from the Department of Defense (DOD), for example, supported administration of the Armed Services Vocational Aptitude Battery (ASVAB).

There's information online about the NLSY sample composition, its variables (described in the data section), and the different interview modes. Additionally, there's a FAQ, bibliography and a series of discussion papers, all of which can be e-requested (I've received a few without difficulty).

TABLE 1 below shows the ethnic composition of each of the NLSY subsamples, first as a bar graph and then as a table (click here to see how I got these numbers). Rows are ethnicity and columns are subsamples [there was also a smaller Military component]. Row percentages show that about 55% of the NLSY sample is White, 16% Hispanic, 25% Black, and less than 5% "Other"; column percentages show that the X-Sectional subsample forms 48% of the original NLSY, with the Supplemental accounting for another 42% and the Military about 10%. Whites comprise most of the NLSY--especially in the X-Sectional subsample--while Blacks and Hispanics together form the majority of cases in the Supplemental subsample.

 

Table 1. NLSY cases (N=12686) by subsample and ETHNICITY [data from the file NATION.TXT].

Histogram

NLSY: Table 1 histogram

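Frequencies and total percentages
_________________________________________________________________
ETHNICITY           SUBSAMPLE                          ROW TOTAL
                MIL         SUP         X-SEC
_________________________________________________________________
WHITE            871        1467        4603             6941
   % =           6.9        11.6        36.3             54.7
HISPANIC          78        1480         444             2002
   % =           0.6        11.7         3.5             15.8
BLACK            251        2172         751             3174
   % =           2.0        17.1         5.9             25.0
OTHER             80         176         313              569
   % =           0.6         1.4         2.5              4.5
_________________________________________________________________
COLUMN TOTAL    1280        5295        6111            12686
   % =          10.1        41.7        48.2            100.0
_________________________________________________________________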
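The marginal percentages quoted for Table 1 can be rechecked from the cell counts alone. The following is a minimal sketch, not the procedure originally used to produce the table; the dictionary layout and the helper name `pct` are illustrative assumptions, while the counts are copied from Table 1.

```python
# Cell counts copied from Table 1 (rows: ethnicity; columns: subsample).
counts = {
    "WHITE":    {"MIL": 871, "SUP": 1467, "X-SEC": 4603},
    "HISPANIC": {"MIL": 78,  "SUP": 1480, "X-SEC": 444},
    "BLACK":    {"MIL": 251, "SUP": 2172, "X-SEC": 751},
    "OTHER":    {"MIL": 80,  "SUP": 176,  "X-SEC": 313},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Row totals describe the ethnic composition of the whole NLSY.
row_totals = {eth: sum(row.values()) for eth, row in counts.items()}

# Column totals describe the size of each subsample.
col_totals = {
    sub: sum(row[sub] for row in counts.values())
    for sub in ("MIL", "SUP", "X-SEC")
}


def pct(n, total):
    """Percentage of `total`, rounded to one decimal place (hypothetical helper)."""
    return round(100 * n / total, 1)


row_pct = {eth: pct(n, grand_total) for eth, n in row_totals.items()}
col_pct = {sub: pct(n, grand_total) for sub, n in col_totals.items()}

# Composition within a subsample (not shown in Table 1 itself).
white_share_xsec = pct(counts["WHITE"]["X-SEC"], col_totals["X-SEC"])
black_hispanic_share_sup = pct(
    counts["BLACK"]["SUP"] + counts["HISPANIC"]["SUP"], col_totals["SUP"]
)

print(grand_total, row_pct, col_pct)
print(white_share_xsec, black_hispanic_share_sup)
```

The last two figures are within-subsample shares, which is what the statements about Whites in the X-Sectional subsample and about Blacks and Hispanics in the Supplemental subsample rest on.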

Only the whites, 55% of the NLSY, were of interest to HM in the context of the question they were asking about POVERTY. Of these, they also excluded cases without valid values on their four analysis variables--as is commonly done--and those who were students in the year POVERTY was measured, which makes sense (except for their not considering the post-graduation student-loan repayment period). HM's actual POVERTY analysis sample (N=3367) was about a quarter of the NLSY.

Analysis Cases

I used these 3367 X-Sectional cases--white non-students without missing values--to replicate (click here for documentation) HM's POVERTY analysis. Figure 1 below identifies the cases HM used in that published analysis [bottom row center] and a second, similar, independent set of NLSY cases [bottom row left]. The four POVERTY analysis variables were distributed similarly across the two subsamples but differed in level. 3 It's surprising that HM used data only from the higher-income subsample, because the book expressed particular interest in low-income whites (pp. 617, 793).

 

Figure 1. Cases HM used in their published POVERTY analysis from the NLSY X-Sectional (N=3367) subsample and another independent group of cases.

 

                ETHNICITY variable coded white (N=6941)
                      /                     \
          Missing (N=2328)          Not_Missing (N=4613)
                                      /             \
                          Student (N=121)     Not_Student (N=4492)
                                             /         |          \
                    Supplemental (N=1067)  X-Sectional (N=3367)  Military (N=58)
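The counts in Figure 1 can be checked for internal consistency: each split should add back up to its parent. This is only an arithmetic sketch using the numbers printed in the figure and in Table 1; the variable names are illustrative.

```python
# Counts copied from Figure 1 and Table 1.
nlsy_total = 12686
white = 6941
missing, not_missing = 2328, 4613
student, not_student = 121, 4492
by_subsample = {"Supplemental": 1067, "X-Sectional": 3367, "Military": 58}

# Each level of the tree should partition the level above it.
splits_ok = (
    missing + not_missing == white
    and student + not_student == not_missing
    and sum(by_subsample.values()) == not_student
)

# Share of the full NLSY that ended up in HM's published analysis.
hm_share = round(100 * by_subsample["X-Sectional"] / nlsy_total, 1)

print(splits_ok, hm_share)
```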

 

Footnotes

1 The NLSY is one of several national longitudinal surveys conducted by the U.S. Bureau of Labor Statistics. The one HM used in their POVERTY analysis is the NLS Youth79 row in the chart below: it had 12686 cases in total (third column), and its subjects were 14 to 22 years old (second column) when first interviewed in 1979 (fourth column).

Group           INITIAL AGE   SIZE     SURVEY YEARS   STATUS
Older Men       45-59          5,020   1966 - 1990    Ceased
Mature Women    30-44          5,083   1967 -         Continuing
Young Men       14-24          5,225   1966 - 1981    Ceased
Young Women     14-24          5,159   1968 -         Continuing
NLS Youth79     14-22         12,686   1979 -         Continuing
NLSY Children   Birth-22       7,035   1986 -         Continuing

HM used only NLSY whites in their POVERTY analysis.

2 Of these, they only used whites without "missing values" on any of the analysis variables. "Missing values" are not at all like missing trains or friends. Missing values create messy problems. Even defining them is difficult. Should the "score" of someone incapable of answering any questions on a test be a zero (none right) or a missing value (no data on this person)? Would, for example, an Alzheimer's patient who can't complete a cognitive functioning test be more appropriately treated as having a score of 0 or as not having a valid score? Famous statistician Gertrude Cox once noted that the best solution to the problem of missing values was not to have them. She is absolutely correct (and I've never encountered a real-world data set without them).

Let me illustrate: Suppose there were test data on 9 students: Albert, Betty, Carl, Dilbert, Egbert, Frank, Gil,* Hortense, and Imogene. Suppose further that the data consisted of scores on Test X and Test Y.

*My advisor, Gilbert Sax, named Albert to Imogene to "humanize" his lectures on applied quantitative testing and measurement (I changed his "Gertrude" to "Gil"). Gil's advisor--no question about it: I'm dropping names--was Joy Guilford, who is mentioned in HM's discussion of "g:" I can't remember ever discussing it with Gil (because of its irrelevance to contemporary--as opposed to early 20th century--psychometrics).

ID   Test X   Test Y
A    1        1
B    2        4
C    1        .
D    1        .
E    2        8
F    1        6
G    2        .
H    2        .
I    2        .
These student data are listed above: 9 rows (9 students/subjects/cases in our sample) and 3 columns--the first an ID and the other two, scores on Tests X and Y. The dots (.) in the Test Y column signify missing values:

Cases C (Carl) and H (Hortense) were home sick the day Test Y was administered; D was helping with the spud harvest; G was thinking about stealing hubcaps and so didn't bother with the questions, and I is a very conscientious student (but her cat died the night before and she was quite upset). In short, for whatever the reason, all 5 students have missing values on Test Y.

Anyone wanting to analyze or scale the two test scores from these 9 students must decide, tacitly or otherwise, how to handle the 5 who, for whatever the reason, didn't have scores on Test Y.

One common way is to exclude data from cases with any missing values (called "casewise deletion"). In the present example, data from only 4 cases (A & B and E & F)--the ones with valid values on both analysis variables--would be used.
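Casewise deletion on the nine-student example can be sketched in a few lines. This is an illustration only, not HM's code; `None` stands in for the dot that marks a missing value, and the function name `casewise_delete` is an assumption.

```python
# Scores from the nine-student example: (Test X, Test Y).
# None stands in for a missing value (the "." in the listing).
scores = {
    "A": (1, 1), "B": (2, 4), "C": (1, None),
    "D": (1, None), "E": (2, 8), "F": (1, 6),
    "G": (2, None), "H": (2, None), "I": (2, None),
}


def casewise_delete(data):
    """Keep only cases that have a valid value on every variable."""
    return {
        case: values
        for case, values in data.items()
        if all(v is not None for v in values)
    }


complete = casewise_delete(scores)
print(sorted(complete))  # ['A', 'B', 'E', 'F']
```

Note that the five dropped cases all have valid Test X scores; casewise deletion discards those too.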

An alternative is to substitute the missing values with values on other measures. Do scores on some other test give comparable values? Let these other scores be used when Y is not available. Do you have a measure at times 5 and 7 but not 6? Let time 6 be the average of times 5 and 7 (and call the process "interpolation").
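The interpolation case can be sketched the same way. The measurements below are made-up numbers used only to show the arithmetic, and `interpolate_midpoint` is a hypothetical helper name.

```python
def interpolate_midpoint(before, after):
    """Fill a gap with the average of its two neighbours."""
    return (before + after) / 2


# Hypothetical measurements at times 5, 6 and 7; time 6 is missing.
series = {5: 10.0, 6: None, 7: 14.0}

if series[6] is None:
    series[6] = interpolate_midpoint(series[5], series[7])

print(series[6])  # 12.0
```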

Using casewise deletion, as HM did, is commonplace.

Missing values also pose problems when constructing variables. One way to deal with them is to use whatever's there: Suppose we have three variables (V1, V2, and V3) and that we have all three for some cases but only one or two for the other cases. For cases with all three, we could create a new variable as follows:

(V1 + V2 + V3) / 3;

for those with only two variables present,

(V1 + V2) / 2,
(V1 + V3) / 2,
(V2 + V3) / 2;

and

V1 or V2 or V3

for those with only one valid variable. This is how HM constructed their SES measure, by using different combinations of parental INCOME, OCCUPATION, and EDUCATION for different cases. This, too, is standard practice.
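The "use whatever's there" rule amounts to averaging over the valid components of each case. The sketch below shows that rule in general form; it is not HM's actual SES construction, the function name `available_mean` is an assumption, and the input values are made up.

```python
def available_mean(*values):
    """Mean of the non-missing values; None if every value is missing."""
    valid = [v for v in values if v is not None]
    if not valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical cases with three, two, one and zero valid components.
print(available_mean(1.0, 2.0, 3.0))     # (V1 + V2 + V3) / 3 -> 2.0
print(available_mean(1.0, None, 3.0))    # (V1 + V3) / 2      -> 2.0
print(available_mean(None, None, 0.5))   # V3 alone           -> 0.5
print(available_mean(None, None, None))  # nothing to use     -> None
```

The first two cases get the same score even though they are built from different components.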
Dilemmas can crop up no matter what. Using only cases with valid values on all variables may result in too small a sample. But not requiring valid values on all variables introduces measurement error (since the scale would measure different things about different people). Similarly, the estimate is only as good as the estimator. Potentially messy interpretive problems no matter which way you go.

In summary, HM dealt with the problems of missing values by using whatever combination of SES-related variables was available to create their SES variable and by using casewise deletion when conducting their logistic POVERTY analysis. Both are standard practice. I replicated their POVERTY analysis by using only X-Sectional (N=3367) non-student whites without missing values.

3 The following 3 figures compare data from NLSY Supplemental(N=1067) [left column] white non-students without missing values with similar cases in the X-SECTIONAL(N=3367) [right column] subsample--on each of the model variables: AFQT, SES, AGE, and POVERTY. Figure 2 (click here for documentation) and Figure 3 (click here) show that the univariate and bivariate distributions of the variables were similar across subsamples. Figure 4 (here) shows that the levels of the variables differed across the two.

 

Figure 2. Univariate distributions of the POVERTY analysis model variables for NLSY white non-student cases without missing values in the Supplemental(N=1067) and X-SECTIONAL(N=3367) subsamples [data from the file NATION.TXT].
AFQT and SES scores are roughly normally distributed and similar for both subsamples. The distribution of the AGE variable is also similar across the two, but odd: it approaches a uniform distribution and ranges over ±1.7, whereas NLSY subjects were between 14 and 22 years old when first interviewed, a range of 9 years. AGE did, though, correlate perfectly with another variable from NATION.TXT named "DOB," which ranged between 64.9999 (AGE = -1.73) and 57.00274 (AGE = 1.73), with the non-decimal part obviously referring to subjects' birth year. I don't understand the meaning of the decimal values, and the NLSY codebook sheds no light ... likely as not an "error of routine analysis" or interpretation was made somewhere between the original BLS data CD and the analysis file NATION.TXT that resided on HM's hard drive when they conducted their POVERTY analysis: such errors are commonplace, alas. In any case, the relationship between AGE and DOB illustrates why the common belief that standardizing a distribution normalizes it is a misperception.
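The point about standardizing can be checked with a small simulation. This sketch assumes, purely for illustration, birth dates spread evenly over 1957 through 1964; it does not use NATION.TXT. Standardizing such a variable leaves it uniform, with extremes near ±1.73 (the square root of 3), rather than turning it into a normal curve.

```python
import statistics

# Hypothetical birth dates spread evenly from 57.0 to 65.0 (decimal years).
n = 1001
dob = [57 + 8 * i / (n - 1) for i in range(n)]

mean = statistics.fmean(dob)
sd = statistics.pstdev(dob)

# Standardize: subtract the mean, divide by the standard deviation.
z = [(x - mean) / sd for x in dob]

z_min = round(min(z), 2)
z_max = round(max(z), 2)

# Share of cases within one standard deviation of the mean,
# compared with what a normal distribution would give.
share_within_1sd = sum(abs(v) < 1 for v in z) / n
normal = statistics.NormalDist()
normal_within_1sd = normal.cdf(1) - normal.cdf(-1)

print(z_min, z_max)  # -1.73 1.73
print(share_within_1sd, normal_within_1sd)
```

In this sketch a later birth date gets a higher standardized score; the AGE variable in NATION.TXT runs the other way (later DOB, lower AGE), which flips the sign but not the shape.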

 

 

Figure 3. Bivariate plots of the POVERTY analysis model variables for NLSY white non-student cases without missing values in the Supplemental(N=1067) and X-SECTIONAL(N=3367) subsamples [data from the file NATION.TXT].
Bivariate plots for the two subsamples are similar. The first two rows plot AFQT and SES against the suspect variable AGE: nothing going on there. There is, though, a marked positive linear relationship between AFQT and SES scores--their intercorrelation (r = .49) being considerably higher than either's with the dependent POVERTY variable (-0.22 and -0.17 respectively).

 

Figure 4. Differences in POVERTY between white non-student cases without missing values in the NLSY Supplemental(N=1067) and X-Sectional(N=3367) subsamples [data from the file NATION.TXT].
Bell Curve Data: Figure 4

 

To the left are boxplots: the "box" represents cases falling between the 25th and the 75th percentiles (that is, the interquartile range); the line through it, the median; and the tails (called "whiskers"), most remaining scores. Scores outside the whiskers are outliers (e.g., HM coded subjects whose INCOME was less than 1k as "-4"). Means and (standard deviations) for the two groups are shown below:

 

Means and standard deviations rounded to two decimal places.
___________________________________________
                Supplemental   X-SECTIONAL
                (N=1067)       (N=3367)
___________________________________________
AGE              .19 (1.02)    -.09 (.97)
AFQT            -.07 (1.03)     .22 (.90)
SES             -.42 (1.06)     .22 (.87)
POVERTY RATE     .15            .07
___________________________________________

 

The means above and the medians to our left tell the same story. NLSY cases who were white, not students, and didn't have missing values in the X-Sectional(N=3367) subsample HM used in their POVERTY analysis were

*younger, had

*higher AFQT and SES scores, and a

*lower POVERTY RATE.

Since POVERTY is a dichotomous variable--with 0 coded as above and 1 as below some threshold--its mean equals the proportion of cases living below the POVERTY level. That works out to 244 [.0724681 * 3367] cases in the X-SECTIONAL subsample--the one HM used in their published analysis--and 157 [.1471415 * 1067] in the Supplemental subsample living below the POVERTY level (conversely, 244/3367 equals about 7 percent and 157/1067 about 15 percent of the cases in the two subsamples). The classification table in the Analysis section shows that none of the former 244--in the published analysis--and not many of the latter 157 were correctly predicted by HM's model.
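The arithmetic behind those counts is short enough to spell out. The sketch below uses only the means and sample sizes quoted in the text; the dictionary layout is illustrative, and the 0/1 list is rebuilt from the counts rather than read from NATION.TXT.

```python
# Means of the 0/1 POVERTY variable and sample sizes, as quoted in the text.
subsamples = {
    "X-Sectional": {"n": 3367, "poverty_mean": 0.0724681},
    "Supplemental": {"n": 1067, "poverty_mean": 0.1471415},
}

for name, s in subsamples.items():
    # Mean of a 0/1 variable times N gives the number of cases coded 1.
    s["poor"] = round(s["n"] * s["poverty_mean"])
    s["pct"] = round(100 * s["poor"] / s["n"], 1)
    print(name, s["poor"], s["pct"])

# Going the other way: the mean of a 0/1 variable is the proportion of 1s.
n_x = subsamples["X-Sectional"]["n"]
poverty = [1] * 244 + [0] * (n_x - 244)
mean_check = sum(poverty) / len(poverty)
print(round(mean_check, 7))
```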

In summary, observed differences in POVERTY are consistent with differences in the two sampling designs, e.g., the one HM didn't use purposefully oversampled low-income whites. Regarding the three model variables, Figure 2 showed that AFQT and SES were normally distributed (but the suspect AGE was not) and Figure 3 that they were highly correlated with each other (but not with AGE). This poses problems from both statistical and measurement perspectives: the statistical problem is obvious, the measurement one perhaps less so. The problem is one of "discriminant validity": if the phenomena that the AFQT and SES measure are conceptually different, as HM posit in the book, then the variables measuring them shouldn't be so highly correlated [that high correlation--replicated in two independent groups of subjects--could be used as evidence consistent with the argument that the two variables measure the same phenomena]. Warts and all, these are the cases and variables I used, from the file NATION.TXT, to replicate HM's published POVERTY analysis.


 
