ANALYTIC
METHODS








Aggregating Numerator Data

Data from public health databases are typically formatted such that each record represents one person (or case report). Once these data have been geocoded, they need to be aggregated before linking to denominator and ABSM data. Before aggregating, however, one should exclude all records that are not geocoded, do not meet the case definition, or are missing data on the important covariates (e.g., age, in the case of simple age-standardized analyses; age, sex, and race/ethnicity in the case of more complex stratified analyses).
One
can think of the basic unit of aggregation as a cell, defined
by age and other covariates, within an area/geocode. Once aggregated,
this cell within an area can be linked to a relevant population
denominator. The cell contains a count of all cases within that
area that meet the specified age and other covariate criteria.
Since our goal is eventually to create rates, we call this count
of cases the “numerator.”
Example: We intend to age-standardize in 5 broad age categories: 0-14, 15-24, 25-44, 45-64, and 65+. Therefore, we need to aggregate the records in each census tract into cells defined by the corresponding ages. As an example, consider the following 23 records from census tracts 25009250500 and 25009250800.


Before aggregating:

Record #  Geocode      Age at death
--------  -----------  ------------
1         25009250500  <1
2         25009250500  <1
3         25009250500  <1
4         25009250500  17
5         25009250500  19
6         25009250500  27
7         25009250500  38
8         25009250500  40
9         25009250500  40
10        25009250500  44
11        25009250800  <1
12        25009250800  <1
13        25009250800  5
14        25009250800  22
15        25009250800  24
16        25009250800  26
17        25009250800  31
18        25009250800  36
19        25009250800  36
20        25009250800  40
21        25009250800  43
22        25009250800  43
23        25009250800  43

After aggregating:

Geocode      Age category  Number of deaths (numerator)
-----------  ------------  ----------------------------
25009250500  0-14          3
25009250500  15-24         2
25009250500  25-44         5
25009250800  0-14          3
25009250800  15-24         2
25009250800  25-44         8
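
The aggregation step above can be sketched in a few lines of Python. This is only an illustration, not the project's SAS code: the names `AGE_BANDS`, `age_category`, and `numerators` are hypothetical, and ages recorded as "<1" are assumed to be coded as 0.

```python
from collections import Counter

# Broad age bands from the example: (label, lowest age, highest age).
# The upper bound of 200 for "65+" is an arbitrary cap chosen for this sketch.
AGE_BANDS = [("0-14", 0, 14), ("15-24", 15, 24), ("25-44", 25, 44),
             ("45-64", 45, 64), ("65+", 65, 200)]


def age_category(age):
    """Return the band label for an age in whole years ('<1' coded as 0)."""
    for label, low, high in AGE_BANDS:
        if low <= age <= high:
            return label
    raise ValueError(f"age out of range: {age}")


# (geocode, age at death) for the 23 example records, '<1' entered as 0.
records = (
    [("25009250500", a) for a in (0, 0, 0, 17, 19, 27, 38, 40, 40, 44)]
    + [("25009250800", a)
       for a in (0, 0, 5, 22, 24, 26, 31, 36, 36, 40, 43, 43, 43)]
)

# One cell per (geocode, age category); the value is the numerator.
numerators = Counter((geo, age_category(age)) for geo, age in records)
```

Each key of `numerators` is one cell (area by age category), which is the structure needed for the later merge with denominators.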

Aggregating Denominator Data

Denominator data at the census tract level typically come from the decennial census. In 1990, the US Census reported population counts by age in 31 categories (<1, 1-2, 3-4, 5, 6, 7-9, 10-11, 12-13, 14, 15, 16, 17, 18, 19, 20, 21, 22-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-61, 62-64, 65-69, 70-74, 75-79, 80-84, 85+). In the 1990 US Census STF3, age-specific population counts were reported in table P013. Variable P0130001 gave the count of residents <1 year old, P0130002 gave the count of residents 1-2 years old, etc.
For the purposes of age standardization, these age categories need to be reaggregated to match the age categories used for categorizing case data (numerators, above) and the age categories from the standard million reference population. Additionally, when using case data from multiple years, in order to calculate an average annual incidence rate, one needs to use a person-time denominator (population count multiplied by number of years of case data). For example, in the case of the Massachusetts all-cause mortality data, we have three years' worth of cases (1989-1991). Therefore, we multiply the population count in each age category by 3.
Example: For census tract 25009250800 in 1990, we wish to age-standardize using the same five broad age categories as in the numerator example above (0-14, 15-24, 25-44, 45-64, 65+):


Before:

Census variable  Ages (years)  Population count
---------------  ------------  ----------------
P0130001         <1            115
P0130002         1-2           243
P0130003         3-4           197
P0130004         5             92
P0130005         6             59
P0130006         7-9           237
P0130007         10-11         160
P0130008         12-13         141
P0130009         14            77
P0130010         15            62
P0130011         16            54
P0130012         17            94
P0130013         18            65
P0130014         19            89
P0130015         20            101
P0130016         21            128
P0130017         22-24         387
P0130018         25-29         571
P0130019         30-34         746
P0130020         35-39         422
P0130021         40-44         354
P0130022         45-49         317
P0130023         50-54         176
P0130024         55-59         174
P0130025         60-61         65
P0130026         62-64         214
P0130027         65-69         158
P0130028         70-74         316
P0130029         75-79         178
P0130030         80-84         112
P0130031         85+           69

In
order to collapse these variables into the five broad age categories,
we have to sum up census variables as follows:


After:

Age category  Census variables summed       Population count  Person-time denominator (x 3 years of case data)
------------  ----------------------------  ----------------  ------------------------------------------------
0-14          SUM OF (P0130001 - P0130009)  1321              3963
15-24         SUM OF (P0130010 - P0130017)  980               2940
25-44         SUM OF (P0130018 - P0130021)  2093              6279
45-64         SUM OF (P0130022 - P0130026)  946               2838
65+           SUM OF (P0130027 - P0130031)  833               2499
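
The collapse from 31 census age categories to five broad bands, followed by the person-time multiplication, might look like the following Python sketch. The variable names (`p013`, `BANDS`, `YEARS_OF_CASE_DATA`) are hypothetical; the counts are those of tract 25009250800 from the table above.

```python
# 1990 STF3 table P013 counts for tract 25009250800,
# in variable order P0130001 .. P0130031.
p013 = [115, 243, 197, 92, 59, 237, 160, 141, 77,   # <1 through 14
        62, 54, 94, 65, 89, 101, 128, 387,          # 15 through 22-24
        571, 746, 422, 354,                         # 25-29 through 40-44
        317, 176, 174, 65, 214,                     # 45-49 through 62-64
        158, 316, 178, 112, 69]                     # 65-69 through 85+

# First and last P013 variable number (1-based, inclusive) in each band.
BANDS = {"0-14": (1, 9), "15-24": (10, 17), "25-44": (18, 21),
         "45-64": (22, 26), "65+": (27, 31)}

YEARS_OF_CASE_DATA = 3  # 1989-1991

# Sum the census variables that fall inside each band.
population = {band: sum(p013[first - 1:last])
              for band, (first, last) in BANDS.items()}

# Person-time denominator: population count times years of case data.
person_time = {band: count * YEARS_OF_CASE_DATA
               for band, count in population.items()}
```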

Merging Numerators with Denominators and ABSM

Once the numerators and denominators have the same structure (AREAKEY x AGECAT), they can be merged together, along with the ABSM data (by AREAKEY). For age cells within areas where no cases were reported, we set the numerator to zero.
Example:


Before merging with ABSM:

Numerator dataset:

Geocode/Areakey  Age category  Number of deaths (numerator)
---------------  ------------  ----------------------------
25009250500      0-14          3
25009250500      15-24         2
25009250500      25-44         5
25009250500      45-64         7
25009250500      65+           26
25009250800      0-14          4
25009250800      15-24         3
25009250800      25-44         8
25009250800      45-64         13
25009250800      65+           132

Denominator dataset:

Geocode/Areakey  Age category  Person-time denominator (x 3 years of case data)
---------------  ------------  ------------------------------------------------
25009250500      0-14          4152
25009250500      15-24         1953
25009250500      25-44         3489
25009250500      45-64         1233
25009250500      65+           1212
25009250800      0-14          3963
25009250800      15-24         2940
25009250800      25-44         6279
25009250800      45-64         2838
25009250800      65+           2499

After merging with ABSM:

Geocode      Age category  Poverty  Numerator  Denominator
-----------  ------------  -------  ---------  -----------
25009250500  1             4        3          4152
25009250500  2             4        2          1953
25009250500  3             4        5          3489
25009250500  4             4        7          1233
25009250500  5             4        26         1212
25009250800  1             3        4          3963
25009250800  2             3        3          2940
25009250800  3             3        8          6279
25009250800  4             3        13         2838
25009250800  5             3        132        2499
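
A minimal Python sketch of this merge follows, assuming the three datasets are held as dictionaries. The function name `merge_cells` and the dictionary layout are hypothetical; the age category codes 1-5 and poverty codes follow the merged table above. The denominator dataset drives the merge, so a cell with no reported cases receives a numerator of zero.

```python
AGE_CODES = {"0-14": 1, "15-24": 2, "25-44": 3, "45-64": 4, "65+": 5}


def merge_cells(numerators, denominators, absm):
    """Join numerators, denominators, and ABSM on AREAKEY x AGECAT.

    Every denominator cell is kept; cells without reported cases get a
    numerator of 0. The ABSM value is looked up by AREAKEY alone.
    Returns rows of (areakey, age code, ABSM, numerator, denominator).
    """
    ordered = sorted(denominators.items(),
                     key=lambda item: (item[0][0], AGE_CODES[item[0][1]]))
    return [(areakey, AGE_CODES[agecat], absm.get(areakey),
             numerators.get((areakey, agecat), 0), person_time)
            for (areakey, agecat), person_time in ordered]


numerators = {
    ("25009250500", "0-14"): 3, ("25009250500", "15-24"): 2,
    ("25009250500", "25-44"): 5, ("25009250500", "45-64"): 7,
    ("25009250500", "65+"): 26,
    ("25009250800", "0-14"): 4, ("25009250800", "15-24"): 3,
    ("25009250800", "25-44"): 8, ("25009250800", "45-64"): 13,
    ("25009250800", "65+"): 132,
}
denominators = {
    ("25009250500", "0-14"): 4152, ("25009250500", "15-24"): 1953,
    ("25009250500", "25-44"): 3489, ("25009250500", "45-64"): 1233,
    ("25009250500", "65+"): 1212,
    ("25009250800", "0-14"): 3963, ("25009250800", "15-24"): 2940,
    ("25009250800", "25-44"): 6279, ("25009250800", "45-64"): 2838,
    ("25009250800", "65+"): 2499,
}
poverty = {"25009250500": 4, "25009250800": 3}

merged = merge_cells(numerators, denominators, poverty)
```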

Aggregating OVER Areas into ABSM Strata

Next, in order to generate rates for categories of a specific ABSM, it is necessary to aggregate OVER areas into strata defined by AGECAT and ABSM. Numerators and denominators from census tracts with missing data for a particular ABSM are typically excluded from that analysis.

Example: In Suffolk County, Massachusetts, there are a total of 189 census tracts. We wish to examine all-cause mortality rates by poverty, with poverty categorized into 4 strata (0-4.9%, 5-9.9%, 10-19.9%, and 20-100%).


ABSM: CT poverty      Number of census tracts
--------------------  -----------------------
0.0-4.9%              10
5.0-9.9%              37
10.0-19.9%            56
20.0-100.0%           83
Missing poverty data  3

Thus, to obtain the mortality rates in the least impoverished stratum (0.0-4.9% below poverty), we need to aggregate the cases and the population at risk OVER the ten census tracts in that stratum (preserving the age structure WITHIN each poverty stratum so that we can age-standardize in the following step, below). For the next poverty stratum (5.0-9.9%) we need to aggregate the cases and the population denominator over 37 census tracts, and so on. Cases and population at risk in the three census tracts with missing poverty data are excluded from the analysis.

This yields the following table:

ABSM: CT poverty  Age category  Numerator  Denominator
----------------  ------------  ---------  -----------
0.0-4.9%          0-14          1          10,608
0.0-4.9%          15-24         5          9,984
0.0-4.9%          25-44         54         29,190
0.0-4.9%          45-64         106        16,710
0.0-4.9%          65+           657        15,825
5.0-9.9%          0-14          40         69,939
5.0-9.9%          15-24         39         64,065
5.0-9.9%          25-44         252        179,595
5.0-9.9%          45-64         792        90,042
5.0-9.9%          65+           4,535      80,916
10.0-19.9%        0-14          101        88,989
10.0-19.9%        15-24         93         93,147
10.0-19.9%        25-44         531        224,793
10.0-19.9%        45-64         962        100,479
10.0-19.9%        65+           3,944      71,955
20.0-100.0%       0-14          182        155,193
20.0-100.0%       15-24         170        217,593
20.0-100.0%       25-44         831        288,882
20.0-100.0%       45-64         1,291      108,588
20.0-100.0%       65+           3,645      72,720


Generating Rates and Other Summary Measures/Measures of Effect

1. Age-standardized incidence rates

The standard practice of public health departments in reporting population rates of mortality and disease incidence is to calculate age-standardized rates, which facilitate comparisons between regions or subgroups of interest. The age-standardized rate is interpretable as the rate that would be observed in a population if that population had the same age distribution as a given reference population. Standardization by the direct method involves taking a weighted average of the age-specific incidence rates observed in the area or subgroup of interest, where the weights come from a standard age distribution, such as the year 2000 standard million [1]. "Standard million" reference populations are available based on the US population age distribution for 1940, 1970, 1980, 1990, and 2000. Here we present the standard million in 11 age categories.


Standard million reference population

Age (years)  Year 1940  Year 1970  Year 1980  Year 1990  Year 2000
-----------  ---------  ---------  ---------  ---------  ---------
<1           15,343     17,150     15,598     12,936     13,818
1-4          64,718     67,265     56,565     60,863     55,317
5-14         170,355    200,511    154,238    141,584    145,565
15-24        181,677    174,405    187,542    147,860    138,646
25-34        162,066    122,567    163,683    173,600    135,573
35-44        139,237    113,616    113,155    151,095    162,613
45-54        117,811    114,265    100,641    101,416    134,834
55-64        80,294     91,481     95,799     85,030     87,247
65-74        48,426     61,192     68,775     72,802     66,037
75-84        17,303     30,112     34,116     40,429     44,842
85+          2,770      7,436      9,888      12,385     15,508

For
our project, we used five broad age categories to age standardize,
in order to obtain more stable rates in each age stratum, particularly
for outcomes with sparse data. The relationship between our five
categories and the standard eleven categories is illustrated in
the table below.


Age in 11 categories  Year 2000 standard million  Age in 5 categories  Year 2000 standard million
--------------------  --------------------------  -------------------  --------------------------
<1                    13,818                      <15                  214,700
1-4                   55,317
5-14                  145,565
15-24                 138,646                     15-24                138,646
25-34                 135,573                     25-44                298,186
35-44                 162,613
45-54                 134,834                     45-64                222,081
55-64                 87,247
65-74                 66,037                      65+                  126,387
75-84                 44,842
85+                   15,508


Example: To calculate the age-standardized all-cause mortality rates in each of the four poverty strata in Suffolk County, we start with the age-specific mortality data. In each poverty stratum, the age-standardized mortality rate is calculated as a weighted sum of the age-specific mortality rates, with the weights for each age stratum defined by the Year 2000 standard million.


ABSM: CT poverty  Age category  Numerator  Denominator  Year 2000 standard million  w_j (weight)  IR_j (incidence rate per 100,000)  IR_st (age-standardized rate per 100,000)
----------------  ------------  ---------  -----------  --------------------------  ------------  ---------------------------------  -----------------------------------------
0.0-4.9%          0-14          1          10,608       214,700                     0.215         9.4                                729.7
0.0-4.9%          15-24         5          9,984        138,646                     0.139         50.1
0.0-4.9%          25-44         54         29,190       298,186                     0.298         185.0
0.0-4.9%          45-64         106        16,710       222,081                     0.222         634.4
0.0-4.9%          65+           657        15,825       126,387                     0.126         4,151.7
5.0-9.9%          0-14          40         69,939       214,700                     0.215         57.2                               966.2
5.0-9.9%          15-24         39         64,065       138,646                     0.139         60.9
5.0-9.9%          25-44         252        179,595      298,186                     0.298         140.3
5.0-9.9%          45-64         792        90,042       222,081                     0.222         879.6
5.0-9.9%          65+           4,535      80,916       126,387                     0.126         5,604.6
10.0-19.9%        0-14          101        88,989       214,700                     0.215         113.5                              1,014.0
10.0-19.9%        15-24         93         93,147       138,646                     0.139         99.8
10.0-19.9%        25-44         531        224,793      298,186                     0.298         236.2
10.0-19.9%        45-64         962        100,479      222,081                     0.222         957.4
10.0-19.9%        65+           3,944      71,955       126,387                     0.126         5,481.2
20.0-100.0%       0-14          182        155,193      214,700                     0.215         117.3                              1,019.3
20.0-100.0%       15-24         170        217,593      138,646                     0.139         78.1
20.0-100.0%       25-44         831        288,882      298,186                     0.298         287.7
20.0-100.0%       45-64         1,291      108,588      222,081                     0.222         1,188.9
20.0-100.0%       65+           3,645      72,720       126,387                     0.126         5,012.4
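
As a check on the table above, direct standardization can be sketched in Python as a weighted sum of the age-specific rates. The function and variable names here are hypothetical; the weights are the Year 2000 standard million counts divided by their total, and the data are the least and most impoverished Suffolk County strata.

```python
# Year 2000 standard million collapsed to the five broad age categories.
STANDARD_2000 = {"0-14": 214_700, "15-24": 138_646, "25-44": 298_186,
                 "45-64": 222_081, "65+": 126_387}


def direct_standardized_rate(cells, standard=STANDARD_2000, per=100_000):
    """Weighted sum of age-specific rates, weights taken from `standard`.

    `cells` maps age category -> (cases, person-time denominator).
    """
    total = sum(standard.values())
    return per * sum(standard[age] / total * cases / person_time
                     for age, (cases, person_time) in cells.items())


least_impoverished = {"0-14": (1, 10_608), "15-24": (5, 9_984),
                      "25-44": (54, 29_190), "45-64": (106, 16_710),
                      "65+": (657, 15_825)}
most_impoverished = {"0-14": (182, 155_193), "15-24": (170, 217_593),
                     "25-44": (831, 288_882), "45-64": (1_291, 108_588),
                     "65+": (3_645, 72_720)}

rate_least = direct_standardized_rate(least_impoverished)  # about 729.7
rate_most = direct_standardized_rate(most_impoverished)    # about 1,019.3
```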

Example: In the following analysis of mortality due to homicide and legal intervention among Hispanic women in Massachusetts, the lower confidence limit on the rate in the 5.0-9.9% poverty stratum is negative when the traditional normal approximation method is used. In contrast, the gamma distribution yields a more reasonable lower confidence limit.


                                    Confidence limits
                                    Normal approximation  "Gamma" interval
ABSM: CT poverty  Rate per 100,000  Lower      Upper      Lower    Upper    Deaths  Person-time at risk
----------------  ----------------  -----      -----      -----    -----    ------  -------------------
0.0-4.9%          0.0               0.0        0.0        0.0      9.2      0       40,182
5.0-9.9%          3.5               -0.5       7.5        0.7      10.3     3       67,458
10.0-19.9%        3.8               0.1        7.5        1.0      9.7      4       87,336
20.0-100.0%       4.2               1.4        7.0        1.9      8.0      11      228,288


Example: In the analysis of mortality due to homicide and legal intervention among Hispanic women in Massachusetts, the estimated rate in the least impoverished group is zero, since there were no deaths reported in census tracts with 0-4.9% below poverty. In the table below, the normal approximation method yields a confidence interval of (0, 0) for the rate in the least impoverished group (as well as the "impossible" negative lower limit on the rate in the 5.0-9.9% poverty stratum, as we saw above). The gamma method also yields a (0, 0) interval for the rate in the least impoverished group, so we have corrected the entry for the upper confidence limit as described above. Using the "exact" upper limit on the count of 3.689, we divide this by the denominator (40,182) to give an upper limit of 9.2 per 100,000.


                                                             Confidence limits
                                                             Normal approximation  "Gamma" interval
ABSM: CT poverty  IR_st (age-standardized rate per 100,000)  Lower      Upper      Lower    Upper    Deaths  Person-time at risk
----------------  -----------------------------------------  -----      -----      -----    -----    ------  -------------------
0.0-4.9%          0.0                                        0.0        0.0        0.0      9.2      0       40,182
5.0-9.9%          3.5                                        -0.5       7.5        0.7      10.3     3       67,458
10.0-19.9%        3.8                                        0.1        7.5        1.0      9.7      4       87,336
20.0-100.0%       4.2                                        1.4        7.0        1.9      8.0      11      228,288

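
The corrected upper limit for the zero-death stratum can be reproduced with a short Python sketch. It assumes, as an interpretation not spelled out in the text above, that the "exact" limit of 3.689 is the two-sided 95% Poisson upper limit for an observed count of zero, i.e. the mean mu solving exp(-mu) = 0.025. The function name is hypothetical.

```python
import math


def zero_count_upper_limit(person_time, alpha=0.05, per=100_000):
    """Upper confidence limit for a rate when zero events were observed.

    Assumes a Poisson count: P(X = 0 | mu) = exp(-mu) = alpha / 2,
    so the upper limit on the count is mu = -ln(alpha / 2).
    Returns (upper limit on the count, upper limit on the rate).
    """
    upper_count = -math.log(alpha / 2)
    return upper_count, per * upper_count / person_time


upper_count, upper_rate = zero_count_upper_limit(40_182)
# upper_count is about 3.689; upper_rate is about 9.2 per 100,000
```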
Example: To compare the age-standardized incidence rates in the most and least impoverished census tracts in Suffolk County, we start with the age-specific data for these two strata (note: for ease of presentation, we present variances in scientific notation in the table below):


ABSM: CT poverty  Age category  Numerator  Denominator  w_j (weight)  IR_j (age-specific rate)  Var(IR_j) (variance of the age-specific rate)  IR_st (age-standardized rate)  Var(IR_st) (variance of the age-standardized rate)
----------------  ------------  ---------  -----------  ------------  ------------------------  ---------------------------------------------  -----------------------------  --------------------------------------------------
0.0-4.9%          0-14          1          10,608       0.2147        0.000094                  8.887E-09                                      0.007297                       6.76E-08
0.0-4.9%          15-24         5          9,984        0.1386        0.000501                  5.016E-08
0.0-4.9%          25-44         54         29,190       0.2982        0.001850                  6.338E-08
0.0-4.9%          45-64         106        16,710       0.2221        0.006344                  3.796E-07
0.0-4.9%          65+           657        15,825       0.1264        0.041517                  2.623E-06
20.0-100.0%       0-14          182        155,193      0.2147        0.001173                  7.557E-09                                      0.010193                       1.77E-08
20.0-100.0%       15-24         170        217,593      0.1386        0.000781                  3.591E-09
20.0-100.0%       25-44         831        288,882      0.2982        0.002877                  9.958E-09
20.0-100.0%       45-64         1,291      108,588      0.2221        0.011889                  1.095E-07
20.0-100.0%       65+           3,645      72,720       0.1264        0.050124                  6.893E-07
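
The variance columns in the table above are consistent with treating each age-specific count as Poisson, so that Var(IR_j) = d_j / n_j^2 and Var(IR_st) is the sum of w_j^2 * Var(IR_j). The formulas are inferred here from the tabulated values rather than quoted from the text, and the names in this Python sketch are hypothetical.

```python
# Weights as printed in the table (Year 2000 standard, four decimals).
WEIGHTS = {"0-14": 0.2147, "15-24": 0.1386, "25-44": 0.2982,
           "45-64": 0.2221, "65+": 0.1264}


def standardized_rate_and_variance(cells, weights=WEIGHTS):
    """Return (IR_st, Var(IR_st)) for one ABSM stratum.

    `cells` maps age category -> (deaths d_j, person-time n_j).
    Assumes Poisson counts, so Var(IR_j) = d_j / n_j**2.
    """
    rate = sum(weights[age] * d / n for age, (d, n) in cells.items())
    variance = sum(weights[age] ** 2 * d / n ** 2
                   for age, (d, n) in cells.items())
    return rate, variance


least = {"0-14": (1, 10_608), "15-24": (5, 9_984), "25-44": (54, 29_190),
         "45-64": (106, 16_710), "65+": (657, 15_825)}
most = {"0-14": (182, 155_193), "15-24": (170, 217_593),
        "25-44": (831, 288_882), "45-64": (1_291, 108_588),
        "65+": (3_645, 72_720)}

rate_least, var_least = standardized_rate_and_variance(least)
rate_most, var_most = standardized_rate_and_variance(most)
```

With these inputs the sketch returns rates close to 0.007297 and 0.010193 and variances close to 6.76E-08 and 1.77E-08, matching the table to the printed precision.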
5. Relative Index of Inequality (RII)

Comparisons of socioeconomic gradients based on categorical ABSM may be complicated by differences in the population distributions of area-based socioeconomic measures. For example, it may be expected that classifications producing smaller groups at the margins would lead to larger incidence rate ratios, comparing the most deprived to the most affluent, because finer discrimination of extremes of socioeconomic position is achieved. The relative index of inequality (RII) has been proposed as a measure which explicitly addresses this problem [5-7]. Assuming ordinality of the ABSM categories, the RII is calculated by regressing the incidence rate in each ABSM category on the total proportion of the population that is more deprived in the socioeconomic hierarchy. Because the RII combines information about the magnitude of the socioeconomic gradient with information about the distribution of the socioeconomic variable in the population, it can be conceptualized as a measure of "total population input".

In practice, this latter quantity is represented by the cumulative distribution function (cdf). We approximate the cdf for the jth level of a given ABSM by summing the proportion of the population represented by the categories $\mathrm{ABSM}_1, \ldots, \mathrm{ABSM}_{j-1}$, and adding one-half the proportion of the population represented by the category $\mathrm{ABSM}_j$.
Example:
In order to calculate the RII for poverty and all cause mortality
in Massachusetts, we begin by calculating the approximate cumulative
distribution function as follows:


ABSM: CT poverty  Population denominator  Proportion  Formula                       Approximate cdf
----------------  ----------------------  ----------  ----------------------------  ---------------
0.0-4.9%          7,626,117               0.423       = 0.423/2                     0.211
5.0-9.9%          5,508,912               0.305       = 0.423+0.305/2               0.576
10.0-19.9%        2,782,194               0.154       = 0.423+0.305+0.154/2         0.805
20.0-100.0%       2,120,208               0.118       = 0.423+0.305+0.154+0.118/2   0.941
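
The midpoint approximation to the cdf shown in the table can be sketched in Python as follows. The function name `midpoint_cdf` is hypothetical; the inputs are the four Massachusetts population denominators, ordered from least to most impoverished.

```python
def midpoint_cdf(populations):
    """Approximate cdf at each ordered category.

    For category j: the sum of the population proportions of
    categories 1..j-1, plus one-half the proportion of category j.
    """
    total = sum(populations)
    cdf, below = [], 0.0
    for count in populations:
        share = count / total
        cdf.append(below + share / 2)
        below += share
    return cdf


# Population denominators by CT poverty stratum, least to most impoverished.
populations = [7_626_117, 5_508_912, 2_782_194, 2_120_208]
approx_cdf = midpoint_cdf(populations)
```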
Example: The "Expected deaths" column shows the expected number of cases in each poverty stratum.

ABSM: CT poverty  IR_st (age-standardized rate per 100,000)  Observed deaths  Population denominator  Expected deaths  Approximate cdf
----------------  -----------------------------------------  ---------------  ----------------------  ---------------  ---------------
0-4.9%            757.0                                      57,256           7,626,117               57,731.7         0.211
5-9.9%            840.3                                      52,583           5,508,912               46,291.7         0.576
10-19.9%          915.9                                      27,730           2,782,194               25,482.0         0.805
20-100%           1,035.3                                    17,842           2,120,208               21,950.7         0.941
Example: To calculate the population attributable fraction of all-cause mortality due to poverty, we begin by tabulating the cases and population person-time at risk in each poverty stratum j within each age group i. Within each age group, the case fraction CF_ij is the number of cases in that poverty stratum, divided by the total number of cases within the age group. The incidence rate ratio IRR_ij for a particular poverty stratum, relative to the reference category of the least impoverished group, is calculated by dividing the rate in that poverty stratum by the rate in the least impoverished group. For each age stratum, we calculate a separate age-specific PAF, shown in the PAF_i column of the table below. These age-specific PAFs range from 5% to 23%.


Age category (i)  ABSM: CT poverty (j)  Cases   Person-time denominator  Rate per 100,000  Case fraction (CF_ij)  Incidence rate ratio (IRR_ij)  Population attributable fraction (PAF_i)
----------------  --------------------  ------  -----------------------  ----------------  ---------------------  -----------------------------  ----------------------------------------
0-14              0-4.9% (reference)    303     727,947                  41.6              40.7%                  1.00                           0.1626
                  5.0-9.9%              253     461,958                  54.8              34.0%                  1.32
                  10.0-19.9%            113     206,214                  54.8              15.2%                  1.32
                  20.0-100.0%           75      100,716                  74.5              10.1%                  1.79
                  Total cases:          744

15-24             0-4.9% (reference)    377     510,645                  73.8              40.6%                  1.00                           0.0506
                  5.0-9.9%              323     349,518                  92.4              34.8%                  1.25
                  10.0-19.9%            152     179,928                  84.5              16.4%                  1.14
                  20.0-100.0%           76      153,273                  49.6              8.2%                   0.67
                  Total cases:          928

25-44             0-4.9% (reference)    1,569   1,201,002                130.6             34.7%                  1.00                           0.2266
                  5.0-9.9%              1,392   873,072                  159.4             30.7%                  1.22
                  10.0-19.9%            933     405,366                  230.2             20.6%                  1.76
                  20.0-100.0%           633     200,457                  315.8             14.0%                  2.42
                  Total cases:          4,527

45-64             0-4.9% (reference)    5,314   763,464                  696.0             39.7%                  1.00                           0.2210
                  5.0-9.9%              4,429   461,451                  959.8             33.1%                  1.38
                  10.0-19.9%            2,287   191,934                  1,191.6           17.1%                  1.71
                  20.0-100.0%           1,369   82,674                   1,655.9           10.2%                  2.38
                  Total cases:          13,399

65+               0-4.9% (reference)    19,470  376,002                  5,178.2           38.8%                  1.00                           0.0725
                  5.0-9.9%              17,784  314,181                  5,660.4           35.4%                  1.09
                  10.0-19.9%            8,734   146,091                  5,978.5           17.4%                  1.15
                  20.0-100.0%           4,248   63,594                   6,679.9           8.5%                   1.29
                  Total cases:          50,236
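
The age-specific PAFs in the table can be reproduced with the case-fraction form of the attributable fraction, PAF_i = sum over j of CF_ij * (IRR_ij - 1) / IRR_ij. This formula is not stated in the text above; it is assumed here because it matches the tabulated values, and the function name in this Python sketch is hypothetical.

```python
def age_specific_paf(strata):
    """Population attributable fraction within one age group.

    `strata` is a list of (cases, person-time) pairs, one per poverty
    stratum, with the reference (least impoverished) stratum first.
    Assumes PAF = sum_j CF_j * (IRR_j - 1) / IRR_j.
    """
    total_cases = sum(cases for cases, _ in strata)
    reference_rate = strata[0][0] / strata[0][1]
    paf = 0.0
    for cases, person_time in strata:
        case_fraction = cases / total_cases
        irr = (cases / person_time) / reference_rate
        paf += case_fraction * (irr - 1) / irr
    return paf


# Ages 0-14 and 15-24, poverty strata ordered from least to most impoverished.
paf_0_14 = age_specific_paf([(303, 727_947), (253, 461_958),
                             (113, 206_214), (75, 100_716)])
paf_15_24 = age_specific_paf([(377, 510_645), (323, 349_518),
                              (152, 179_928), (76, 153_273)])
```

Under this assumption the two results come out close to the tabulated 0.1626 and 0.0506. A stratum with a rate below the reference (IRR below 1, as in the most impoverished 15-24 stratum) contributes a negative term.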

To aggregate these PAFs across age strata, we weight the contribution of each age stratum by the proportion of cases in that age stratum. As seen in the table below, this results in an aggregated population attributable fraction PAF_agg of 11%.


Age category (i)  Cases   Population attributable fraction (PAF_i)
----------------  ------  ----------------------------------------
0-14              744     0.1626
15-24             928     0.0506
25-44             4,527   0.2266
45-64             13,399  0.2210
65+               50,236  0.0725
Total cases:      69,834

Aggregated population attributable fraction (PAF_agg):
(744*0.1626 + 928*0.0506 + 4527*0.2266 + 13399*0.2210 + 50236*0.0725) / 69834 = 0.1116
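
The case-weighted aggregation is a one-line computation; the Python sketch below uses hypothetical names and the rounded PAF_i values from the table. Because the inputs are rounded to four decimals, the result agrees with the reported 0.1116 only to about three decimal places.

```python
# (cases, age-specific PAF) for each age category, from the table above.
age_strata = {"0-14": (744, 0.1626), "15-24": (928, 0.0506),
              "25-44": (4_527, 0.2266), "45-64": (13_399, 0.2210),
              "65+": (50_236, 0.0725)}

total_cases = sum(cases for cases, _ in age_strata.values())

# Weight each age-specific PAF by that age group's share of all cases.
paf_agg = sum(cases * paf for cases, paf in age_strata.values()) / total_cases
```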





1. Breslow NE, Day NE (eds). Statistical Methods in Cancer Research, Vol. II: The Design and Analysis of Cohort Studies. Oxford, UK: Oxford University Press, 1987.
2. Anderson RN, Rosenberg HM. Age standardization of death rates: implementation of the year 2000 standard. National Vital Statistics Reports, Vol 47, No. 3. Hyattsville, MD: National Center for Health Statistics, 1998.
3. Fay MP, Feuer EJ. Confidence intervals for directly standardized rates: a method based on the gamma distribution. Statistics in Medicine 1997;16:791-801.
4. Rothman KJ, Greenland S. Modern Epidemiology. 2nd Edition. Philadelphia: Lippincott-Raven, 1998.
5. Pamuk ER. Social class inequality in mortality from 1921 to 1972 in England and Wales. Popul Stud 1985;39:17-31.
6. Wagstaff A, Paci P, van Doorslaer E. On the measurement of inequalities in health. Soc Sci Med 1991;33:545-57.
7. Davey Smith G, Hart C, Hole D, et al. Education and occupational social class: which is the more important indicator of mortality risk? J Epidemiol Community Health 1998;52:153-60.
8. Hanley JA. A heuristic approach to the formulas for population attributable fraction. J Epidemiol Community Health 2001;55:508-514.
