The Public Health Disparities
Geocoding Project Monograph

Geocoding and Monitoring US Socioeconomic Inequalities in Health:
An introduction to using area-based socioeconomic measures

ANALYTIC METHODS

Our primary analytic approach for describing socioeconomic gradients by area-based socioeconomic measures has been to use geocodes to append area-based socioeconomic data to case records, to stratify these records into discrete categories based on ABSM, and to aggregate numerators and denominators over areas, within levels defined by ABSM. This method avoids the problem of unstable rates arising from small areas by assuming that cases and population denominators from areas with similar socioeconomic characteristics can be legitimately combined into the same strata. An alternative approach, which preserves the spatial information of the geocodes, is discussed in the section on multilevel analyses.

The following steps are used to generate age-standardized disease rates stratified by area-based socioeconomic measures, once the case data have been geocoded and appropriate ABSMs have been generated from census data.

1. Aggregate the case data into numerators (age cells within areas/geocodes).
2. Aggregate population denominator data into age cells within areas/geocodes.
3. Merge the numerators and denominators with ABSMs, by area/geocode.
4. Aggregate over areas into strata defined by categorical ABSM and age category.
5. Generate age-standardized rates and other summary measures.

Clicking on "Case Example & SAS Programming" will take you to a step by step comparison of the analytic methods, the relevant task of the Case Example, and sample SAS code.

Aggregating Numerator Data

Data from public health databases are typically formatted such that each record represents one person (or case report). Once these data have been geocoded, they need to be aggregated before linking to denominator and ABSM data. Before aggregating, however, one should exclude all records that are not geocoded, do not meet the case definition, or are missing data on the important covariates (e.g. age, in the case of simple age-standardized analyses; age, sex, and race/ethnicity in the case of more complex stratified analyses).

One can think of the basic unit of aggregation as a cell, defined by age and other covariates, within an area/geocode. Once aggregated, this cell within an area can be linked to a relevant population denominator. The cell contains a count of all cases within that area that meet the specified age and other covariate criteria. Since our goal is eventually to create rates, we call this count of cases the “numerator.”

Example:
We intend to age-standardize using five broad age categories (0-14, 15-24, 25-44, 45-64, 65+). Therefore, we need to aggregate the records in each census tract into cells defined by the corresponding ages. As an example, consider the following 23 records from census tracts 25009250500 and 25009250800.

  Before aggregating:

Record #   Geocode        Age at death
1          25009250500    <1
2          25009250500    <1
3          25009250500    <1
4          25009250500    17
5          25009250500    19
6          25009250500    27
7          25009250500    38
8          25009250500    40
9          25009250500    40
10         25009250500    44
11         25009250800    <1
12         25009250800    <1
13         25009250800    5
14         25009250800    22
15         25009250800    24
16         25009250800    26
17         25009250800    31
18         25009250800    36
19         25009250800    36
20         25009250800    40
21         25009250800    43
22         25009250800    43
23         25009250800    43

  After aggregating:

Geocode        Age category   Number of deaths (numerator)
25009250500    0-14           3
25009250500    15-24          2
25009250500    25-44          5
25009250800    0-14           3
25009250800    15-24          2
25009250800    25-44          8
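
The aggregation above can be sketched in a few lines of Python. This is only an illustration, not the project's SAS code: the names `AGE_BANDS`, `age_category`, and `numerators` are hypothetical, and ages recorded as "<1" are assumed to be coded as 0.

```python
from collections import Counter

# Age bands used for standardization: (label, lowest age, highest age).
# The upper bound of 200 for "65+" is an arbitrary ceiling for this sketch.
AGE_BANDS = [("0-14", 0, 14), ("15-24", 15, 24), ("25-44", 25, 44),
             ("45-64", 45, 64), ("65+", 65, 200)]


def age_category(age):
    """Return the band label for an age in completed years ("<1" coded as 0)."""
    for label, low, high in AGE_BANDS:
        if low <= age <= high:
            return label
    raise ValueError(f"age out of range: {age}")


# The 23 example records as (geocode, age at death).
records = (
    [("25009250500", a) for a in (0, 0, 0, 17, 19, 27, 38, 40, 40, 44)]
    + [("25009250800", a)
       for a in (0, 0, 5, 22, 24, 26, 31, 36, 36, 40, 43, 43, 43)]
)

# One count per (geocode, age category) cell: the numerator.
numerators = Counter((geocode, age_category(age)) for geocode, age in records)

for (geocode, category), deaths in sorted(numerators.items()):
    print(geocode, category, deaths)
```

Cells with no records simply do not appear in `numerators`; they are set to zero later, when numerators are merged with denominators.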
Aggregating Denominator Data

Denominator data at the census tract level typically come from the decennial census. In 1990, the US Census reported population counts by age in 31 categories (<1, 1-2, 3-4, 5, 6, 7-9, 10-11, 12-13, 14, 15, 16, 17, 18, 19, 20, 21, 22-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-61, 62-64, 65-69, 70-74, 75-79, 80-84, 85+). In the 1990 US Census STF3, age specific population counts were reported in table P013. Variable P0130001 gave the count of residents <1 year old, P0130002 gave the count of residents 1-2 years old, etc.

For the purposes of age standardization, these age categories need to be re-aggregated to match the age categories used for categorizing case data (numerators, above) and the age categories from the standard million reference population. Additionally, when using case data from multiple years, in order to calculate an average annual incidence rate, one needs to use a person-time denominator (population count multiplied by number of years of case data). For example, in the case of the Massachusetts all-cause mortality data, we have three years' worth of cases (1989-1991). Therefore, we multiply the population count in each age category by 3.

Example:
For census tract 25009250800 in 1990, we wish to age standardize using the same five broad age categories as in the numerator example above (0-14, 15-24, 25-44, 45-64, 65+):

  Before:

Census variable   Ages (years)   Population count
P0130001          <1             115
P0130002          1-2            243
P0130003          3-4            197
P0130004          5              92
P0130005          6              59
P0130006          7-9            237
P0130007          10-11          160
P0130008          12-13          141
P0130009          14             77
P0130010          15             62
P0130011          16             54
P0130012          17             94
P0130013          18             65
P0130014          19             89
P0130015          20             101
P0130016          21             128
P0130017          22-24          387
P0130018          25-29          571
P0130019          30-34          746
P0130020          35-39          422
P0130021          40-44          354
P0130022          45-49          317
P0130023          50-54          176
P0130024          55-59          174
P0130025          60-61          65
P0130026          62-64          214
P0130027          65-69          158
P0130028          70-74          316
P0130029          75-79          178
P0130030          80-84          112
P0130031          85+            69

In order to collapse these variables into the five broad age categories, we have to sum up census variables as follows:

  After:

Age category   Census variables summed          Population count   Person-time denominator (x 3 years of case data)
0-14           SUM OF (P0130001 -- P0130009)    1321               3963
15-24          SUM OF (P0130010 -- P0130017)    980                2940
25-44          SUM OF (P0130018 -- P0130021)    2093               6279
45-64          SUM OF (P0130022 -- P0130026)    946                2838
65+            SUM OF (P0130027 -- P0130031)    833                2499
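
As an illustration of the collapsing step, the following Python sketch sums the 31 census age variables into the five broad categories and multiplies by the three years of case data. The list `p013` holds the counts for tract 25009250800 from the table above; the names `RANGES`, `YEARS`, `population`, and `person_time` are hypothetical.

```python
# 1990 STF3 table P013 counts for census tract 25009250800,
# in variable order P0130001 .. P0130031.
p013 = [115, 243, 197, 92, 59, 237, 160, 141, 77,   # <1 .. 14
        62, 54, 94, 65, 89, 101, 128, 387,           # 15 .. 22-24
        571, 746, 422, 354,                          # 25-29 .. 40-44
        317, 176, 174, 65, 214,                      # 45-49 .. 62-64
        158, 316, 178, 112, 69]                      # 65-69 .. 85+

# Inclusive, 1-based ranges of census variables for each broad age category.
RANGES = {"0-14": (1, 9), "15-24": (10, 17), "25-44": (18, 21),
          "45-64": (22, 26), "65+": (27, 31)}

YEARS = 3  # years of case data (1989-1991)

# Sum the census variables within each range, then convert to person-time.
population = {cat: sum(p013[first - 1:last])
              for cat, (first, last) in RANGES.items()}
person_time = {cat: count * YEARS for cat, count in population.items()}

for cat in RANGES:
    print(cat, population[cat], person_time[cat])
```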
Merging Numerators with Denominators and ABSM

Once the numerators and denominators have the same structure (AREAKEY x AGECAT), they can be merged together, along with the ABSM data (by AREAKEY). For age cells within areas where no cases were reported, we set the numerator to zero.

Example:

Before merging with ABSM:

Numerator dataset:

Geocode/Areakey   Age category   Number of deaths (numerator)
25009250500       0-14           3
25009250500       15-24          2
25009250500       25-44          5
25009250500       45-64          7
25009250500       65+            26
25009250800       0-14           4
25009250800       15-24          3
25009250800       25-44          8
25009250800       45-64          13
25009250800       65+            132

Denominator dataset:

Geocode/Areakey   Age category   Person-time denominator (x 3 years of case data)
25009250500       0-14           4152
25009250500       15-24          1953
25009250500       25-44          3489
25009250500       45-64          1233
25009250500       65+            1212
25009250800       0-14           3963
25009250800       15-24          2940
25009250800       25-44          6279
25009250800       45-64          2838
25009250800       65+            2499

After merging with ABSM:

Geocode        Age category   Poverty   Numerator   Denominator
25009250500    1              4         3           4152
25009250500    2              4         2           1953
25009250500    3              4         5           3489
25009250500    4              4         7           1233
25009250500    5              4         26          1212
25009250800    1              3         4           3963
25009250800    2              3         3           2940
25009250800    3              3         8           6279
25009250800    4              3         13          2838
25009250800    5              3         132         2499
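
A minimal Python sketch of this merge follows, assuming dictionaries keyed by (AREAKEY, AGECAT). The function name `merge_cells` and the dictionary layout are hypothetical; the point it illustrates is that the denominator dataset drives the merge, so that age cells with no reported cases receive a numerator of zero.

```python
AGE_CODES = {"0-14": 1, "15-24": 2, "25-44": 3, "45-64": 4, "65+": 5}
AGE_CATS = list(AGE_CODES)

# Example values from the numerator and denominator datasets.
deaths_by_tract = {
    "25009250500": [3, 2, 5, 7, 26],
    "25009250800": [4, 3, 8, 13, 132],
}
person_time_by_tract = {
    "25009250500": [4152, 1953, 3489, 1233, 1212],
    "25009250800": [3963, 2940, 6279, 2838, 2499],
}
poverty = {"25009250500": 4, "25009250800": 3}  # ABSM category by AREAKEY

# Long format keyed by (AREAKEY, AGECAT).
numerators = {(tract, cat): n
              for tract, counts in deaths_by_tract.items()
              for cat, n in zip(AGE_CATS, counts)}
denominators = {(tract, cat): n
                for tract, counts in person_time_by_tract.items()
                for cat, n in zip(AGE_CATS, counts)}


def merge_cells(numerators, denominators, absm):
    """Merge on (AREAKEY, AGECAT); cells without cases get numerator 0."""
    rows = []
    for areakey, agecat in sorted(denominators,
                                  key=lambda k: (k[0], AGE_CODES[k[1]])):
        rows.append({
            "areakey": areakey,
            "agecat": AGE_CODES[agecat],
            "poverty": absm.get(areakey),  # None if ABSM data are missing
            "numerator": numerators.get((areakey, agecat), 0),
            "denominator": denominators[(areakey, agecat)],
        })
    return rows


merged = merge_cells(numerators, denominators, poverty)
for row in merged:
    print(row)
```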
Aggregating OVER areas into ABSM strata

Next, in order to generate rates for categories of a specific ABSM, it is necessary to aggregate OVER areas into strata defined by AGECAT and ABSM. Numerators and denominators from census tracts with missing ABSM data for a particular ABSM are typically excluded from that analysis.

Example:
In Suffolk County, Massachusetts, there are a total of 189 census tracts. We wish to examine all-cause mortality rates by poverty, with poverty categorized into four strata (0-4.9%, 5-9.9%, 10-19.9%, and 20-100%).

 
ABSM: CT Poverty        Number of census tracts
0.0-4.9%                10
5.0-9.9%                37
10.0-19.9%              56
20.0-100.0%             83
Missing poverty data    3

Thus, to obtain the mortality rates in the least impoverished stratum (0.0-4.9% below poverty), we need to aggregate the cases and the population at risk OVER the ten census tracts in that stratum (preserving the age structure WITHIN each poverty stratum so that we can age standardize in the following step, below). For the next poverty stratum (5.0-9.9%) we need to aggregate the cases and the population denominator over 37 census tracts, and so on. Cases and population at risk in the three census tracts with missing poverty data are excluded from the analysis.

This yields the following table:

ABSM: CT poverty   Age category   Numerator   Denominator
0.0-4.9%           0-14           1           10,608
0.0-4.9%           15-24          5           9,984
0.0-4.9%           25-44          54          29,190
0.0-4.9%           45-64          106         16,710
0.0-4.9%           65+            657         15,825
5.0-9.9%           0-14           40          69,939
5.0-9.9%           15-24          39          64,065
5.0-9.9%           25-44          252         179,595
5.0-9.9%           45-64          792         90,042
5.0-9.9%           65+            4,535       80,916
10.0-19.9%         0-14           101         88,989
10.0-19.9%         15-24          93          93,147
10.0-19.9%         25-44          531         224,793
10.0-19.9%         45-64          962         100,479
10.0-19.9%         65+            3,944       71,955
20.0-100.0%        0-14           182         155,193
20.0-100.0%        15-24          170         217,593
20.0-100.0%        25-44          831         288,882
20.0-100.0%        45-64          1,291       108,588
20.0-100.0%        65+            3,645       72,720
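
The aggregation over areas can be sketched as follows in Python. Because the tract-level data for all 189 Suffolk County tracts are not reproduced on this page, the sketch uses only the two example tracts from the merging step, so each stratum here contains a single tract; the function name `aggregate_over_areas` and the tuple layout are hypothetical.

```python
from collections import defaultdict

# Merged rows: (AREAKEY, AGECAT code, poverty category, numerator, denominator).
# A poverty value of None would mark a tract with missing ABSM data.
merged = [
    ("25009250500", 1, 4, 3, 4152), ("25009250500", 2, 4, 2, 1953),
    ("25009250500", 3, 4, 5, 3489), ("25009250500", 4, 4, 7, 1233),
    ("25009250500", 5, 4, 26, 1212),
    ("25009250800", 1, 3, 4, 3963), ("25009250800", 2, 3, 3, 2940),
    ("25009250800", 3, 3, 8, 6279), ("25009250800", 4, 3, 13, 2838),
    ("25009250800", 5, 3, 132, 2499),
]


def aggregate_over_areas(rows):
    """Sum numerators and denominators within (ABSM stratum, AGECAT) cells.

    Rows from areas with missing ABSM data are excluded.
    """
    totals = defaultdict(lambda: [0, 0])
    for _areakey, agecat, absm, numerator, denominator in rows:
        if absm is None:
            continue
        cell = totals[(absm, agecat)]
        cell[0] += numerator
        cell[1] += denominator
    return {key: tuple(value) for key, value in totals.items()}


strata = aggregate_over_areas(merged)
for (absm, agecat), (numerator, denominator) in sorted(strata.items()):
    print(absm, agecat, numerator, denominator)
```

Keeping AGECAT in the key preserves the age structure within each stratum, which the age standardization step requires.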
Generating Rates and Other Summary Measures/Measures of Effect


1. Age-standardized incidence rates

The standard practice of public health departments in reporting population rates of mortality and disease incidence is to calculate age-standardized rates, which facilitate comparisons between regions or subgroups of interest. The age-standardized rate is interpretable as the rate that would be observed in a population if that population had the same age distribution as a given reference population. Standardization by the direct method involves taking a weighted average of the age-specific incidence rates observed in the area or subgroup of interest, where the weights come from a standard age distribution, such as the year 2000 standard million [1].

"Standard million" reference populations are available based on the US population age distribution for 1940, 1970, 1980, 1990, and 2000. Here we present the standard million in 11 age categories.

Standard million reference population

Age (years)   Year 1940   Year 1970   Year 1980   Year 1990   Year 2000
<1            15,343      17,150      15,598      12,936      13,818
1-4           64,718      67,265      56,565      60,863      55,317
5-14          170,355     200,511     154,238     141,584     145,565
15-24         181,677     174,405     187,542     147,860     138,646
25-34         162,066     122,567     163,683     173,600     135,573
35-44         139,237     113,616     113,155     151,095     162,613
45-54         117,811     114,265     100,641     101,416     134,834
55-64         80,294      91,481      95,799      85,030      87,247
65-74         48,426      61,192      68,775      72,802      66,037
75-84         17,303      30,112      34,116      40,429      44,842
85+           2,770       7,436       9,888       12,385      15,508

For our project, we used five broad age categories to age standardize, in order to obtain more stable rates in each age stratum, particularly for outcomes with sparse data. The relationship between our five categories and the standard eleven categories is illustrated in the table below.

Age in 11 categories   Year 2000 standard million   Age in 5 categories   Year 2000 standard million
<1                     13,818                       <15                   214,700
1-4                    55,317
5-14                   145,565
15-24                  138,646                      15-24                 138,646
25-34                  135,573                      25-44                 298,186
35-44                  162,613
45-54                  134,834                      45-64                 222,081
55-64                  87,247
65-74                  66,037                       65+                   126,387
75-84                  44,842
85+                    15,508

Example:
To calculate the age-standardized all cause mortality rates in each of the four poverty strata in Suffolk County, we start with the age-specific mortality data. In each poverty stratum, the age standardized mortality rate is calculated as a weighted sum of the age-specific mortality rates, with the weights for each age stratum defined by the Year 2000 standard million.

 
Columns: wj is the weight (standard million count divided by 1,000,000), IRj is the age-specific incidence rate per 100,000, and IRst is the age-standardized rate per 100,000 for the poverty stratum.

ABSM: CT poverty   Age category   Numerator   Denominator   Year 2000 standard million   wj      IRj       IRst
0.0-4.9%           0-14           1           10,608        214,700                      0.215   9.4       729.7
0.0-4.9%           15-24          5           9,984         138,646                      0.139   50.1
0.0-4.9%           25-44          54          29,190        298,186                      0.298   185.0
0.0-4.9%           45-64          106         16,710        222,081                      0.222   634.4
0.0-4.9%           65+            657         15,825        126,387                      0.126   4,151.7
5.0-9.9%           0-14           40          69,939        214,700                      0.215   57.2      966.2
5.0-9.9%           15-24          39          64,065        138,646                      0.139   60.9
5.0-9.9%           25-44          252         179,595       298,186                      0.298   140.3
5.0-9.9%           45-64          792         90,042        222,081                      0.222   879.6
5.0-9.9%           65+            4,535       80,916        126,387                      0.126   5,604.6
10.0-19.9%         0-14           101         88,989        214,700                      0.215   113.5     1,014.0
10.0-19.9%         15-24          93          93,147        138,646                      0.139   99.8
10.0-19.9%         25-44          531         224,793       298,186                      0.298   236.2
10.0-19.9%         45-64          962         100,479       222,081                      0.222   957.4
10.0-19.9%         65+            3,944       71,955        126,387                      0.126   5,481.2
20.0-100.0%        0-14           182         155,193       214,700                      0.215   117.3     1,019.3
20.0-100.0%        15-24          170         217,593       138,646                      0.139   78.1
20.0-100.0%        25-44          831         288,882       298,186                      0.298   287.7
20.0-100.0%        45-64          1,291       108,588       222,081                      0.222   1,188.9
20.0-100.0%        65+            3,645       72,720        126,387                      0.126   5,012.4
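
The direct standardization in this table can be reproduced with a short Python sketch. It assumes the weighted-sum definition given in the text, with weights equal to the year 2000 standard million counts divided by their total; the names `STANDARD_MILLION_2000`, `STRATA`, and `age_standardized_rate` are hypothetical.

```python
# Year 2000 standard million collapsed to the five broad age categories
# (0-14, 15-24, 25-44, 45-64, 65+).
STANDARD_MILLION_2000 = [214_700, 138_646, 298_186, 222_081, 126_387]

# (numerator, denominator) per age category for each Suffolk County stratum.
STRATA = {
    "0.0-4.9%": [(1, 10_608), (5, 9_984), (54, 29_190),
                 (106, 16_710), (657, 15_825)],
    "5.0-9.9%": [(40, 69_939), (39, 64_065), (252, 179_595),
                 (792, 90_042), (4_535, 80_916)],
    "10.0-19.9%": [(101, 88_989), (93, 93_147), (531, 224_793),
                   (962, 100_479), (3_944, 71_955)],
    "20.0-100.0%": [(182, 155_193), (170, 217_593), (831, 288_882),
                    (1_291, 108_588), (3_645, 72_720)],
}


def age_standardized_rate(cells, standard, per=100_000):
    """Directly standardized rate: sum over age cells of w_j * IR_j."""
    total = sum(standard)
    return per * sum((s / total) * (cases / person_time)
                     for (cases, person_time), s in zip(cells, standard))


rates = {stratum: age_standardized_rate(cells, STANDARD_MILLION_2000)
         for stratum, cells in STRATA.items()}

for stratum, rate in rates.items():
    print(f"{stratum}: {rate:.1f} per 100,000")
```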

Example:
In the following analysis of mortality due to homicide and legal intervention among Hispanic women in Massachusetts, the lower confidence limit on the rate in the 5.0-9.9% poverty stratum is negative when the traditional normal approximation method is used. In contrast, the gamma distribution yields a more reasonable lower confidence limit.


                                       Confidence limits (lower, upper)
ABSM: CT poverty   Rate per 100,000    Normal approximation   "Gamma" interval   Deaths   Person-time at risk
0.0-4.9%           0.0                 (0.0, 0.0)             (0.0, 9.2)         0        40,182
5.0-9.9%           3.5                 (-0.5, 7.5)            (0.7, 10.3)        3        67,458
10.0-19.9%         3.8                 (0.1, 7.5)             (1.0, 9.7)         4        87,336
20.0-100.0%        4.2                 (1.4, 7.0)             (1.9, 8.0)         11       228,288


Example:
In the analysis of mortality due to homicide and legal intervention among Hispanic women in Massachusetts, the estimated rate in the least impoverished group is zero, since no deaths were reported in census tracts with 0-4.9% below poverty. In the table below, the normal approximation method yields a confidence interval of (0,0) for the rate in the least impoverished group (as well as the “impossible” negative lower limit on the rate in the 5.0-9.9% poverty stratum, as we saw above). The gamma method also yields a (0,0) interval for the rate in the least impoverished group, so we have corrected the entry for the upper confidence limit as described above. Using the “exact” upper limit on the count of 3.689, we divide this by the denominator (40,182) to give an upper limit of 9.2 per 100,000.
                                                                Confidence limits (lower, upper)
ABSM: CT poverty   IRst (age standardized rate per 100,000)    Normal approximation   "Gamma" interval   Deaths   Person-time at risk
0.0-4.9%           0.0                                          (0.0, 0.0)             (0.0, 9.2)         0        40,182
5.0-9.9%           3.5                                          (-0.5, 7.5)            (0.7, 10.3)        3        67,458
10.0-19.9%         3.8                                          (0.1, 7.5)             (1.0, 9.7)         4        87,336
20.0-100.0%        4.2                                          (1.4, 7.0)             (1.9, 8.0)         11       228,288
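
The upper limit of 3.689 quoted for a count of zero is the exact two-sided 95% Poisson limit: with no observed events, the upper limit mu solves exp(-mu) = 0.025, so mu = -ln(0.025). The Python sketch below reproduces the 9.2 per 100,000 figure under that assumption; the function name `zero_count_upper_limit` is hypothetical.

```python
import math


def zero_count_upper_limit(person_time, alpha=0.05, per=100_000):
    """Exact Poisson upper limit for an observed count of zero.

    Returns (limit on the count, limit on the rate per `per` person-time).
    """
    upper_count = -math.log(alpha / 2)  # solves exp(-mu) = alpha / 2
    return upper_count, upper_count / person_time * per


upper_count, upper_rate = zero_count_upper_limit(40_182)
print(f"upper limit on count: {upper_count:.3f}")
print(f"upper limit on rate:  {upper_rate:.1f} per 100,000")
```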

Example:
To compare the age-standardized incidence rates in the most and least impoverished census tracts in Suffolk County, we start with the age-specific data for these two strata (note: for ease of presentation, we present variances in scientific notation in the table below):

Columns: wj is the weight, IRj the age-specific rate, Var(IRj) the variance of the age-specific rate, IRst the age-standardized rate, and Var(IRst) the variance of the age-standardized rate.

ABSM: CT Poverty   Age category   Numerator   Denominator   wj       IRj        Var(IRj)    IRst       Var(IRst)
0.0-4.9%           0-14           1           10,608        0.2147   0.000094   8.887E-09   0.007297   6.76E-08
0.0-4.9%           15-24          5           9,984         0.1386   0.000501   5.016E-08
0.0-4.9%           25-44          54          29,190        0.2982   0.001850   6.338E-08
0.0-4.9%           45-64          106         16,710        0.2221   0.006344   3.796E-07
0.0-4.9%           65+            657         15,825        0.1264   0.041517   2.623E-06
20.0-100.0%        0-14           182         155,193       0.2147   0.001173   7.557E-09   0.010193   1.77E-08
20.0-100.0%        15-24          170         217,593       0.1386   0.000781   3.591E-09
20.0-100.0%        25-44          831         288,882       0.2982   0.002877   9.958E-09
20.0-100.0%        45-64          1,291       108,588       0.2221   0.011889   1.095E-07
20.0-100.0%        65+            3,645       72,720        0.1264   0.050124   6.893E-07
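
The variances in this table are consistent with treating each age-specific count as Poisson, so that Var(IRj) = dj / nj^2 and Var(IRst) = sum of wj^2 * Var(IRj). The Python sketch below reproduces the tabulated values under that assumption, and then forms a rate difference and rate ratio between the two strata. The normal-approximation interval for the difference is an assumption of this sketch, not a formula taken from this page, and all names are hypothetical.

```python
import math

# Weights from the year 2000 standard million (five broad age categories).
WEIGHTS = [0.214700, 0.138646, 0.298186, 0.222081, 0.126387]

# (numerator, denominator) by age category.
LEAST_POOR = [(1, 10_608), (5, 9_984), (54, 29_190),
              (106, 16_710), (657, 15_825)]        # 0.0-4.9%
MOST_POOR = [(182, 155_193), (170, 217_593), (831, 288_882),
             (1_291, 108_588), (3_645, 72_720)]    # 20.0-100.0%


def standardized_rate_and_variance(cells, weights):
    """Return (IRst, Var(IRst)) assuming Poisson counts in each age cell."""
    rate = sum(w * d / n for w, (d, n) in zip(weights, cells))
    variance = sum(w ** 2 * d / n ** 2 for w, (d, n) in zip(weights, cells))
    return rate, variance


rate_low, var_low = standardized_rate_and_variance(LEAST_POOR, WEIGHTS)
rate_high, var_high = standardized_rate_and_variance(MOST_POOR, WEIGHTS)

# Rate difference with a normal-approximation 95% interval (assumed here).
ird = rate_high - rate_low
se_ird = math.sqrt(var_low + var_high)
ird_ci = (ird - 1.96 * se_ird, ird + 1.96 * se_ird)

# Rate ratio (point estimate only).
irr = rate_high / rate_low

print(f"IRst least poor: {rate_low:.6f}, variance {var_low:.3g}")
print(f"IRst most poor:  {rate_high:.6f}, variance {var_high:.3g}")
print(f"rate difference per 100,000: {ird * 1e5:.1f} "
      f"({ird_ci[0] * 1e5:.1f}, {ird_ci[1] * 1e5:.1f})")
print(f"rate ratio: {irr:.2f}")
```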


5. Relative Index of Inequality (RII)

Comparisons of socioeconomic gradients based on categorical ABSM may be complicated by differences in the population distributions of area-based socioeconomic measures. For example, it may be expected that the classifications producing smaller groups at the margins would lead to larger incidence rate ratios, comparing the most deprived to the most affluent, because finer discrimination of extremes of socioeconomic position is achieved. The relative index of inequality (RII) has been proposed as a measure which explicitly addresses this problem [5-7]. Assuming ordinality of the ABSM categories, the RII is calculated by regressing the incidence rate in each ABSM category on the total proportion of the population that is more deprived in the socioeconomic hierarchy. Because the RII combines information about the magnitude of the socioeconomic gradient with information about the distribution of the socioeconomic variable in the population, it can be conceptualized as a measure of "total population input".

In practice, this latter quantity is represented by the cumulative distribution function (cdf). We approximate the cdf for the jth level of a given ABSM by summing the proportion of the population represented by the categories ABSM_1, …, ABSM_(j-1), and adding one-half the proportion of the population represented by the category ABSM_j.

Example:
In order to calculate the RII for poverty and all cause mortality in Massachusetts, we begin by calculating the approximate cumulative distribution function as follows:

ABSM: CT poverty   Population denominator   Proportion   Formula                        Approximate cdf
0.0-4.9%           7,626,117                0.423        =0.423/2                       0.211
5.0-9.9%           5,508,912                0.305        =0.423+0.305/2                 0.576
10.0-19.9%         2,782,194                0.154        =0.423+0.305+0.154/2           0.805
20.0-100.0%        2,120,208                0.118        =0.423+0.305+0.154+0.118/2     0.941
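
The midpoint rule used in this table can be written as a small Python function. This is a sketch of the arithmetic only; the name `approximate_cdf` is hypothetical, and the population denominators are those shown in the table.

```python
# Population denominators by poverty stratum, least to most impoverished.
populations = [7_626_117, 5_508_912, 2_782_194, 2_120_208]


def approximate_cdf(populations):
    """Proportion in all lower categories plus half of the category's own."""
    total = sum(populations)
    cdf = []
    below = 0.0
    for count in populations:
        share = count / total
        cdf.append(below + share / 2)
        below += share
    return cdf


cdf = approximate_cdf(populations)
for value in cdf:
    print(f"{value:.3f}")
```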

Example:
The "Expected deaths" column shows the expected number of cases in each poverty stratum.

ABSM: CT Poverty   IRst (age standardized rate per 100,000)   Observed deaths   Population denominator   Expected deaths   Approximate cdf
0-4.9%             757.0                                      57,256            7,626,117                57,731.7          0.211
5-9.9%             840.3                                      52,583            5,508,912                46,291.7          0.576
10-19.9%           915.9                                      27,730            2,782,194                25,482.0          0.805
20-100%            1,035.3                                    17,842            2,120,208                21,950.7          0.941
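
The expected deaths in this table agree, to within rounding of the displayed rates, with the age-standardized rate applied to the stratum's population denominator. The Python sketch below checks that reading; it is an inference from the tabulated numbers rather than a formula stated on this page, and the variable names are hypothetical.

```python
# (stratum, IRst per 100,000, population denominator) from the table.
strata = [
    ("0-4.9%", 757.0, 7_626_117),
    ("5-9.9%", 840.3, 5_508_912),
    ("10-19.9%", 915.9, 2_782_194),
    ("20-100%", 1_035.3, 2_120_208),
]

# Expected deaths = standardized rate x person-time at risk.
expected = {label: rate / 100_000 * population
            for label, rate, population in strata}

for label, value in expected.items():
    print(f"{label}: {value:,.1f}")
```

Small differences from the tabulated values arise because the rates shown are rounded to one decimal place.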

 
 

Example:
To calculate the population attributable fraction (PAF) of all-cause mortality due to poverty, we begin by tabulating the cases and population person-time at risk in each poverty stratum j within each age group i. Within each age group, the case fraction CFij is the number of cases in that poverty stratum, divided by the total number of cases within the age group. The incidence rate ratio IRRij for a particular poverty stratum, relative to the reference category of the least impoverished group, is calculated by dividing the rate in that poverty stratum by the rate in the least impoverished group. For each age stratum, we calculate a separate age-specific PAF, as seen in the PAFi column of the table below. These age-specific PAFs range from 5% to 23%.

Age category (i)   ABSM: CT poverty (j)   Cases    Person-time denominator   Rate per 100,000   Case fraction (CFij)   Incidence rate ratio (IRRij)   Population attributable fraction (PAFi)
0-14               0-4.9% (reference)     303      727,947                   41.6               40.7%                  1.00                           0.1626
0-14               5.0-9.9%               253      461,958                   54.8               34.0%                  1.32
0-14               10.0-19.9%             113      206,214                   54.8               15.2%                  1.32
0-14               20.0-100.0%            75       100,716                   74.5               10.1%                  1.79
0-14               Total cases:           744
15-24              0-4.9% (reference)     377      510,645                   73.8               40.6%                  1.00                           0.0506
15-24              5.0-9.9%               323      349,518                   92.4               34.8%                  1.25
15-24              10.0-19.9%             152      179,928                   84.5               16.4%                  1.14
15-24              20.0-100.0%            76       153,273                   49.6               8.2%                   0.67
15-24              Total cases:           928
25-44              0-4.9% (reference)     1,569    1,201,002                 130.6              34.7%                  1.00                           0.2266
25-44              5.0-9.9%               1,392    873,072                   159.4              30.7%                  1.22
25-44              10.0-19.9%             933      405,366                   230.2              20.6%                  1.76
25-44              20.0-100.0%            633      200,457                   315.8              14.0%                  2.42
25-44              Total cases:           4,527
45-64              0-4.9% (reference)     5,314    763,464                   696.0              39.7%                  1.00                           0.2210
45-64              5.0-9.9%               4,429    461,451                   959.8              33.1%                  1.38
45-64              10.0-19.9%             2,287    191,934                   1,191.6            17.1%                  1.71
45-64              20.0-100.0%            1,369    82,674                    1,655.9            10.2%                  2.38
45-64              Total cases:           13,399
65+                0-4.9% (reference)     19,470   376,002                   5,178.2            38.8%                  1.00                           0.0725
65+                5.0-9.9%               17,784   314,181                   5,660.4            35.4%                  1.09
65+                10.0-19.9%             8,734    146,091                   5,978.5            17.4%                  1.15
65+                20.0-100.0%            4,248    63,594                    6,679.9            8.5%                   1.29
65+                Total cases:           50,236
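
The age-specific PAFs in this table are reproduced by the standard case-fraction formula, PAFi = sum over j of CFij * (IRRij - 1) / IRRij. The Python sketch below applies that formula to the tabulated cases and person-time; it is offered as a check on the arithmetic, with hypothetical names, and small differences in the fourth decimal place reflect rounding.

```python
# (cases, person-time) by poverty stratum within each age group;
# the first entry in each list is the reference (least impoverished) stratum.
AGE_GROUPS = {
    "0-14": [(303, 727_947), (253, 461_958), (113, 206_214), (75, 100_716)],
    "15-24": [(377, 510_645), (323, 349_518), (152, 179_928), (76, 153_273)],
    "25-44": [(1_569, 1_201_002), (1_392, 873_072),
              (933, 405_366), (633, 200_457)],
    "45-64": [(5_314, 763_464), (4_429, 461_451),
              (2_287, 191_934), (1_369, 82_674)],
    "65+": [(19_470, 376_002), (17_784, 314_181),
            (8_734, 146_091), (4_248, 63_594)],
}


def age_specific_paf(cells):
    """PAF_i = sum_j CF_ij * (IRR_ij - 1) / IRR_ij, reference = first cell."""
    total_cases = sum(cases for cases, _ in cells)
    ref_cases, ref_person_time = cells[0]
    ref_rate = ref_cases / ref_person_time
    paf = 0.0
    for cases, person_time in cells:
        case_fraction = cases / total_cases
        irr = (cases / person_time) / ref_rate
        paf += case_fraction * (irr - 1) / irr
    return paf


pafs = {age: age_specific_paf(cells) for age, cells in AGE_GROUPS.items()}
for age, paf in pafs.items():
    print(f"{age}: {paf:.4f}")
```

Note that a stratum with a rate below the reference rate (IRR below 1, as in the 20.0-100.0% stratum for ages 15-24) contributes a negative term.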
 

To aggregate these PAFs across age strata, we weight the contribution of each age stratum by the proportion of cases in that age stratum. As seen in the table below, this results in an aggregated population attributable fraction PAFagg of 11%.

Age category (i)   Cases     Population attributable fraction (PAFi)
0-14               744       0.1626
15-24              928       0.0506
25-44              4,527     0.2266
45-64              13,399    0.2210
65+                50,236    0.0725
Total cases:       69,834

Aggregated population attributable fraction (PAFagg)
= (744*0.1626 + 928*0.0506 + 4527*0.2266 + 13399*0.2210 + 50236*0.0725) / 69834
= 0.1116
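
The case-weighted average can be checked with a few lines of Python; the dictionary names are hypothetical and the inputs are the rounded values from the table.

```python
# Cases and age-specific PAFs by age category, as tabulated.
cases = {"0-14": 744, "15-24": 928, "25-44": 4_527,
         "45-64": 13_399, "65+": 50_236}
paf = {"0-14": 0.1626, "15-24": 0.0506, "25-44": 0.2266,
       "45-64": 0.2210, "65+": 0.0725}

# Weight each age-specific PAF by that age group's share of all cases.
total_cases = sum(cases.values())
paf_agg = sum(cases[age] * paf[age] for age in cases) / total_cases

print(f"total cases: {total_cases}")
print(f"PAFagg: {paf_agg:.4f}")
```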
REFERENCES
1. Breslow NE, Day NE (eds). Statistical Methods in Cancer Research, Vol. II: The Design and Analysis of Cohort Studies. Oxford, UK: Oxford University Press, 1987.
2. Anderson RN, Rosenberg HM. Age standardization of death rates: implementation of the year 2000 standard; National Vital Statistics Reports: Vol 37, No. 3. Hyattsville, MD: National Center for Health Statistics, 1998.
3. Fay MP, Feuer EJ. Confidence intervals for directly standardized rates: a method based on the gamma distribution. Statistics in Medicine 1997;16:791-801.
4. Rothman KJ, Greenland S. Modern Epidemiology. 2nd Edition. Philadelphia: Lippincott-Raven, 1998.
5. Pamuk ER. Social class inequality in mortality from 1921 to 1972 in England and Wales. Popul Stud 1985;39:17-31.
6. Wagstaff A, Paci P, van Doorslaer E. On the measurement of inequalities in health. Soc Sci Med 1991;33:545-57.
7. Davey Smith G, Hart C, Hole D, et al. Education and occupational social class: which is the more important indicator of mortality risk? J Epidemiol Community Health 1998;52:153-60.
8. Hanley JA. A heuristic approach to the formulas for population attributable fraction. J Epidemiol Community Health 2001;55:508-14.
This work was funded by the National Institutes of Health (1RO1HD36865-01) via the National Institute of Child Health & Human Development (NICHD) and the Office of Behavioral & Social Science Research (OBSSR).
Copyright © 2004 by the President and Fellows of Harvard College - The Public Health Disparities Geocoding Project.