Our primary analytic approach for describing socioeconomic gradients by area-based socioeconomic measures (ABSMs) has been to use geocodes to append area-based socioeconomic data to case records, to stratify these records into discrete categories based on the ABSM, and to aggregate numerators and denominators over areas, within the levels defined by the ABSM. This method avoids the problem of unstable rates arising from small areas by assuming that cases and population denominators from areas with similar socioeconomic characteristics can be legitimately combined into the same strata. An alternative approach, which preserves the spatial information of the geocodes, is discussed in the section on multilevel analyses.
The following steps are used to generate age-standardized disease rates stratified by area-based socioeconomic measures, once the case data have been geocoded and appropriate ABSMs have been generated from census data.
• Aggregate the case data into numerators (age cells within areas/geocodes).
• Aggregate population denominator data into age cells within areas/geocodes.
• Merge the numerators and denominators with ABSMs, by area/geocode.
• Aggregate over areas into strata defined by categorical ABSM and age category.
• Generate age-standardized rates and other summary measures.
[See the “Tools” section for a step-by-step comparison of the analytic methods, the relevant task of the Case Example, and sample SAS code.]
Aggregating Numerator Data
Data from public health databases are typically formatted such that each record represents one person (or case report). Once these data have been geocoded, they need to be aggregated before linking to denominator and ABSM data. Before aggregating, however, one should exclude all records that are not geocoded, do not meet the case definition, or are missing data on the important covariates (e.g. age, in the case of simple age-standardized analyses; age, sex, and race/ethnicity in the case of more complex stratified analyses).
One can think of the basic unit of aggregation as a cell, defined by age and other covariates, within an area/geocode. Once aggregated, this cell within an area can be linked to a relevant population denominator. The cell contains a count of all cases within that area that meet the specified age and other covariate criteria. Since our goal is eventually to create rates, we call this count of cases the “numerator.”
We intend to age-standardize in 5 broad age categories, 0-14, 15-24, 25-44, 45-64, 65+. Therefore, we need to aggregate the records in each census tract into cells defined by the corresponding ages. As an example, consider the following 23 records from census tracts 25009250500 and 25009250800.
|Record #||Geocode||Age of death|
|Geocode||Age category||Number of deaths (numerator)|
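The aggregation of individual records into numerator cells can be sketched in Python as follows. This is only an illustration, not the project's SAS code: the two geocodes come from the text, but the ages are hypothetical, and `age_category` and `aggregate_numerators` are names invented for this sketch.

```python
from collections import Counter

# Lower bounds of the five broad age categories used in the text.
AGE_BREAKS = [(0, "0-14"), (15, "15-24"), (25, "25-44"), (45, "45-64"), (65, "65+")]

def age_category(age):
    """Return the label of the broad age category that contains `age`."""
    label = None
    for lower, name in AGE_BREAKS:
        if age >= lower:
            label = name
    return label

def aggregate_numerators(records):
    """Count cases in each (geocode, age category) cell."""
    return Counter((geocode, age_category(age)) for geocode, age in records)

# Hypothetical geocoded death records: (census tract geocode, age at death).
records = [
    ("25009250500", 72), ("25009250500", 81), ("25009250500", 38),
    ("25009250800", 66), ("25009250800", 12), ("25009250800", 90),
]
numerators = aggregate_numerators(records)
```

Cells that never appear in `numerators` correspond to age groups with no reported cases; they are set to zero when the numerators are merged with the denominators.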
Aggregating Denominator Data
Denominator data at the census tract level typically come from the decennial census. In 1990, the US Census reported population counts by age in 31 categories (<1, 1-2, 3-4, 5, 6, 7-9, 10-11, 12-13, 14, 15, 16, 17, 18, 19, 20, 21, 22-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-61, 62-64, 65-69, 70-74, 75-79, 80-84, 85+). In the 1990 US Census STF3, age specific population counts were reported in table P013. Variable P0130001 gave the count of residents <1 year old, P0130002 gave the count of residents 1-2 years old, etc.
For the purposes of age standardization, these age categories need to be re-aggregated to match the age categories used for categorizing case data (numerators, above) and the age categories from the standard million reference population. Additionally, when using case data from multiple years, in order to calculate an average annual incidence rate, one needs to use a person-time denominator (population count multiplied by the number of years of case data). For example, in the case of the Massachusetts all-cause mortality data, we have three years' worth of cases (1989-1991). Therefore, we multiply the population count in each age category by 3.
For census tract 25009250800 in 1990, we wish to age standardize using the same five broad age categories as in the numerator example above (0-14, 15-24, 25-44, 45-64, 65+):
|Census variable||Ages (years)||Population count|
In order to collapse these variables into the five broad age categories, we have to sum up census variables as follows:
|Age category||Population count||Person-time denominator (x 3 years of case data)|
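A minimal Python sketch of this collapsing step follows. It assumes the 31 variables of table P013 are named P0130001 through P0130031 in the order of the age groups listed above (the text confirms only the first two names), and the population counts are hypothetical.

```python
# Number of consecutive P013 variables that fall in each broad age category,
# following the 31 census age groups listed in the text:
# 0-14 covers <1 through 14 (9 groups), 15-24 covers 15 through 22-24 (8 groups),
# 25-44 has 4 groups, 45-64 has 5 groups, and 65+ has 5 groups.
GROUP_SIZES = [("0-14", 9), ("15-24", 8), ("25-44", 4), ("45-64", 5), ("65+", 5)]

def collapse_denominators(census_row, years=3):
    """Sum the census age variables into five categories and form person-time."""
    result = {}
    index = 1
    for label, size in GROUP_SIZES:
        count = sum(census_row[f"P013{i:04d}"] for i in range(index, index + size))
        result[label] = {"population": count, "person_time": count * years}
        index += size
    return result

# Hypothetical tract with 100 residents in every census age group.
tract = {f"P013{i:04d}": 100 for i in range(1, 32)}
denominators = collapse_denominators(tract)
```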
Merging numerators with denominators and ABSM
Once the numerators and denominators have the same structure (AREAKEY x AGECAT), they can be merged together, along with the ABSM data (by AREAKEY). For age cells within areas where no cases were reported, we set the numerator to zero.
|Before merging with ABSM:|
|Geocode/Areakey||Age category||Number of deaths (numerator)|
|Geocode/Areakey||Age category||Person-time denominator (x 3 years of case data)|
|After merging with ABSM:|
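The merge can be sketched in Python as below, assuming numerators and denominators are keyed by (AREAKEY, AGECAT) and the ABSM by AREAKEY. All values are hypothetical and `merge_cells` is an illustrative name.

```python
def merge_cells(numerators, denominators, absm):
    """Join numerators, denominators, and ABSM values on AREAKEY x AGECAT."""
    merged = []
    for (areakey, agecat), person_time in sorted(denominators.items()):
        merged.append({
            "areakey": areakey,
            "agecat": agecat,
            # Age cells with no reported cases get a numerator of zero.
            "numerator": numerators.get((areakey, agecat), 0),
            "denominator": person_time,
            # None marks an area whose ABSM value is missing.
            "poverty": absm.get(areakey),
        })
    return merged

# Hypothetical inputs.
denominators = {
    ("25009250500", "45-64"): 2400,
    ("25009250500", "65+"): 1500,
    ("25009250800", "65+"): 1800,
}
numerators = {("25009250500", "65+"): 2, ("25009250800", "65+"): 3}
absm = {"25009250500": 4.2, "25009250800": 12.7}  # percent below poverty
merged = merge_cells(numerators, denominators, absm)
```

Driving the loop from the denominators rather than the numerators is what guarantees that cells without cases are kept.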
Aggregating OVER areas in ABSM strata
Next, in order to generate rates for categories of a specific ABSM, it is necessary to aggregate OVER areas into strata defined by AGECAT and ABSM. Numerators and denominators from census tracts with missing ABSM data for a particular ABSM are typically excluded from that analysis.
Example: In Suffolk County, Massachusetts, there are a total of 189 census tracts. We wish to examine all cause mortality rates by poverty, with poverty categorized into 4 strata (0-4.9%, 5-9.9%, 10-19.9%, and 20-100%).
|ABSM: CT Poverty||Number of census tracts|
|Missing poverty data||3|
Thus, to obtain the mortality rates in the least impoverished stratum (0.0-4.9% below poverty), we need to aggregate the cases and the population at risk OVER the ten census tracts in that stratum (preserving the age structure WITHIN each poverty stratum so that we can age standardize in the following step, below). For the next poverty stratum (5.0-9.9%) we need to aggregate the cases and the population denominator over 37 census tracts, and so on. Cases and population at risk in the three census tracts with missing poverty data are excluded from the analysis.
This yields the following table:
|ABSM: CT poverty||Age category||Numerator||Denominator|
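As a rough Python sketch of this step, the function below sums numerators and denominators over areas within each poverty stratum and age category, skipping areas with missing poverty data. The cut points follow the four strata named in the text; the cell values and area keys are hypothetical.

```python
from collections import defaultdict

def poverty_stratum(percent_below_poverty):
    """Assign a census tract to one of the four poverty strata."""
    if percent_below_poverty < 5:
        return "0-4.9%"
    if percent_below_poverty < 10:
        return "5-9.9%"
    if percent_below_poverty < 20:
        return "10-19.9%"
    return "20-100%"

def aggregate_over_areas(cells):
    """Sum numerators and denominators within (poverty stratum, age category)."""
    totals = defaultdict(lambda: [0, 0])
    for cell in cells:
        if cell["poverty"] is None:  # areas with missing ABSM data are excluded
            continue
        key = (poverty_stratum(cell["poverty"]), cell["agecat"])
        totals[key][0] += cell["numerator"]
        totals[key][1] += cell["denominator"]
    return {key: tuple(value) for key, value in totals.items()}

# Hypothetical merged cells for four census tracts.
cells = [
    {"areakey": "A", "agecat": "65+", "numerator": 2, "denominator": 1500, "poverty": 4.2},
    {"areakey": "B", "agecat": "65+", "numerator": 1, "denominator": 900, "poverty": 3.0},
    {"areakey": "C", "agecat": "65+", "numerator": 3, "denominator": 1800, "poverty": 12.7},
    {"areakey": "D", "agecat": "65+", "numerator": 4, "denominator": 600, "poverty": None},
]
strata = aggregate_over_areas(cells)
```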
Generating Rates and Other Summary Measures/Measures of Effect
1. Age-standardized incidence rates
The standard practice of public health departments in reporting population rates of mortality and disease incidence is to calculate age-standardized rates, which facilitates comparisons between regions or subgroups of interest. The age-standardized rate is interpretable as the rate that would be observed in a population if that population had the same age distribution as a given reference population. Standardization by the direct method involves taking a weighted average of the age-specific incidence rates observed in the area or subgroup of interest, where the weights come from a standard age distribution, such as the year 2000 standard million.
“Standard million” reference populations are available based on the US population age distribution for 1940, 1970, 1980, 1990, and 2000. Here we present the standard million in 11 age categories.
|Age (years)||Standard million reference population|
|Year 1940||Year 1970||Year 1980||Year 1990||Year 2000|
For our project, we used five broad age categories to age standardize, in order to obtain more stable rates in each age stratum, particularly for outcomes with sparse data. The relationship between our five categories and the standard eleven categories is illustrated in the table below.
|Age in 11 categories||Year 2000 standard million||Age in 5 categories||Year 2000 standard million|
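The collapse from eleven to five categories can be sketched in Python as follows. The year 2000 values are entered here from the published US standard million as I recall it, so they should be checked against the table above; the mapping dictionary is an illustrative construction.

```python
# Year 2000 US standard million in the standard eleven age categories.
STANDARD_2000_11 = {
    "<1": 13818, "1-4": 55317, "5-14": 145565, "15-24": 138646,
    "25-34": 135573, "35-44": 162613, "45-54": 134834, "55-64": 87247,
    "65-74": 66037, "75-84": 44842, "85+": 15508,
}

# Which of the eleven categories fall into each of the five broad categories.
BROAD_CATEGORIES = {
    "0-14": ["<1", "1-4", "5-14"],
    "15-24": ["15-24"],
    "25-44": ["25-34", "35-44"],
    "45-64": ["45-54", "55-64"],
    "65+": ["65-74", "75-84", "85+"],
}

def collapse_standard(standard, mapping):
    """Sum the standard population over the fine categories in each broad one."""
    return {broad: sum(standard[fine] for fine in fines)
            for broad, fines in mapping.items()}

STANDARD_2000_5 = collapse_standard(STANDARD_2000_11, BROAD_CATEGORIES)
```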
If $y_j$ represents the number of cases in age group $j$ of the group or region of interest and $n_j$ represents the population associated with that age group, then the standardized rate for the group or region is
$$IR_{st} = \frac{\sum_j w_j \, (y_j / n_j)}{\sum_j w_j}$$
where $w_j$ is the weight associated with category $j$ in the reference (standardizing) population (e.g. the population size or the proportion of the total population). The estimated variance of the standardized rate is given by:
$$Var(IR_{st}) = \frac{\sum_j w_j^2 \, (y_j / n_j^2)}{\left(\sum_j w_j\right)^2}$$
(When the $w_j$s are proportions, then $\sum_j w_j = 1$ and $\left(\sum_j w_j\right)^2 = 1$.)
To calculate the age-standardized all cause mortality rates in each of the four poverty strata in Suffolk County, we start with the age-specific mortality data. In each poverty stratum, the age standardized mortality rate is calculated as a weighted sum of the age-specific mortality rates, with the weights for each age stratum defined by the Year 2000 standard million.
|ABSM: CT poverty||Age category||Numerator||Denominator||Year 2000 standard million||$w_j$ (weight)||$IR_j$ (incidence rate per 100,000)||$IR_{st}$ (age standardized rate per 100,000)|
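Direct standardization and its variance can be sketched in Python as below. The weights are the year 2000 standard million collapsed to five categories (the eleven published values summed within each broad category); the case counts and person-time are hypothetical, and `direct_standardize` is an illustrative name.

```python
# Year 2000 standard million in the five broad age categories.
STANDARD_2000 = {"0-14": 214700, "15-24": 138646, "25-44": 298186,
                 "45-64": 222081, "65+": 126387}

def direct_standardize(cells, weights=STANDARD_2000):
    """Return (standardized rate, variance) for one stratum.

    `cells` maps age category -> (cases y, person-time n).
    """
    total_weight = sum(weights.values())
    rate = sum(weights[a] * y / n for a, (y, n) in cells.items()) / total_weight
    variance = (sum(weights[a] ** 2 * y / n ** 2 for a, (y, n) in cells.items())
                / total_weight ** 2)
    return rate, variance

# Hypothetical numerators and person-time denominators for one poverty stratum.
stratum = {"0-14": (12, 60000), "15-24": (20, 45000), "25-44": (90, 110000),
           "45-64": (400, 70000), "65+": (2100, 40000)}
rate, variance = direct_standardize(stratum)
rate_per_100k = rate * 100000
```

When every age-specific rate is identical, the standardized rate equals that common rate whatever the weights, which is a convenient sanity check.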
2. Confidence intervals for directly standardized rates
Traditional confidence limits for directly standardized rates are based on the normal distribution and require large cell counts. In our analyses, we found that they can also occasionally result in “impossible” lower limits that are less than zero. Because of this, we adopted an alternate method for calculating the confidence limits based on the inverse gamma function. This method assumes that the direct standardized rate is a linear combination of independent Poisson random variables. Assuming that this linear combination also follows a Poisson distribution, the age-standardized rate E(X) = x follows a gamma distribution Γ(a,b) as follows:
$$X \sim \Gamma(a, b), \qquad a = \frac{x^2}{v}, \qquad b = \frac{v}{x}$$
where x is the age-standardized rate ($IR_{st}$ as estimated above) and v is its variance, as described above. Converting this to the gamma distribution in its standard form, i.e. where b=1, this yields
$$Z = \frac{X}{b} \sim \Gamma\left(\frac{x^2}{v}, 1\right)$$
which greatly simplifies calculations. Then the lower 100(1-α)% confidence limit for Z is given by
$$L(Z) = \Gamma^{-1}_{x^2/v}(\alpha/2)$$
and the upper 100(1-α)% confidence limit for Z is given by
$$U(Z) = \Gamma^{-1}_{(x+k)^2/(v+k^2)}(1-\alpha/2)$$
where $\Gamma^{-1}_{a}(p)$ denotes the $p$th quantile of the standard gamma distribution with shape parameter $a$, and $k$ is a continuity correction necessitated by using a continuous distribution to estimate confidence limits for a discrete random variable.
Increasing the number of events by 1 in an age stratum i results in a $w_i/n_i$ increase in the age-standardized rate. If $w_i/n_i$ is constant for all age intervals, then $k = w_i/n_i$. However, since the $w_i$ and $n_i$ typically vary across age strata, it is unclear what value of k to use. A very conservative upper limit can be obtained by using the maximum value of $w_i/n_i$. However, following the recommendation of the NCHS, we used a close approximation that alleviates the need to calculate k:
$$U(Z) = \Gamma^{-1}_{x^2/v + 1}(1-\alpha/2)$$
To transform these intervals to obtain the desired confidence limits for X, we use $L(X) = (v/x)\,L(Z)$ and $U(X) = (v/x)\,U(Z)$.
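The gamma interval with the NCHS approximation can be sketched in Python using only the standard library. The quantile routine below is a simple series-plus-bisection implementation written for this illustration; it is adequate for the sparse-data situations where the method matters (small shape parameters), but for large counts a library quantile function (for example SAS `GAMINV`) would be the practical choice. All function names are illustrative, and the rate must be greater than zero.

```python
import math

def gamma_cdf(x, shape):
    """Regularized lower incomplete gamma function P(shape, x), by power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / shape
    total = term
    n = 1
    while term > 1e-17 * total:
        term *= x / (shape + n)
        total += term
        n += 1
    return total * math.exp(shape * math.log(x) - x - math.lgamma(shape))

def gamma_quantile(p, shape):
    """Quantile of the standard gamma distribution (scale 1), by bisection."""
    lo, hi = 0.0, shape + 10.0
    while gamma_cdf(hi, shape) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if gamma_cdf(mid, shape) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def gamma_interval(rate, variance, alpha=0.05):
    """Gamma-based limits for a directly standardized rate (requires rate > 0)."""
    shape = rate ** 2 / variance   # a = x^2 / v
    scale = variance / rate        # b = v / x
    lower = scale * gamma_quantile(alpha / 2, shape)
    # Approximation that avoids computing k: add 1 to the shape for the upper limit.
    upper = scale * gamma_quantile(1 - alpha / 2, shape + 1)
    return lower, upper

# Single-stratum check: a Poisson count of 10 has rate 10 and variance 10.
lower, upper = gamma_interval(10.0, 10.0)
```

For a single Poisson count the interval reduces to the familiar exact Poisson limits, which is how the check at the bottom of the sketch can be verified against published tables.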
In the following analysis of mortality due to homicide and legal intervention among Hispanic women in Massachusetts, the lower confidence limit on the rate in the 5.0-9.9% poverty stratum is negative when the traditional normal approximation method is used. In contrast, the gamma distribution yields a more reasonable lower confidence limit.
|ABSM: CT poverty||Rate per 100,000||Confidence limits: normal approximation||Confidence limits: “gamma” interval||Deaths||Person-time at risk|
3. Confidence intervals for IRst = 0
When the observed rate is zero (i.e. there were zero cases), the gamma method is unable to produce confidence limits for the direct standardized rates. In this situation, we adopt the following convention for the confidence limit. The lower limit is simply set to zero. For the upper limit, we assume that the number of cases (i.e. the count) follows a Poisson distribution, and use the formula for the “exact” upper confidence limit of a Poisson random variable:
$$U(Y) = \tfrac{1}{2}\,\chi^2_{2(y+1)}(1-\alpha/2)$$
where y is the count, i.e. zero, and $\chi^2_{d}(p)$ denotes the $p$th quantile of the chi-square distribution with $d$ degrees of freedom. When α = 0.05 (i.e. for a 95% confidence limit) this simplifies to $U(Y) = \tfrac{1}{2}\,\chi^2_{2}(0.975) = 3.689$.
We can then divide this upper limit on the count by the population denominator to give the upper limit on the rate.
In the analysis of mortality due to homicide and legal intervention among Hispanic women in Massachusetts, the estimated rate in the least impoverished group is zero, since there were no deaths reported in census tracts with 0-4.9% below poverty. In the table below, the normal approximation method yields a confidence interval of (0,0) for the rate in the least impoverished group, as well as an “impossible” negative lower limit on the rate in the 5.0-9.9% poverty stratum, as we saw above. The gamma method also yields a (0,0) interval for the rate in the least impoverished group, so we have corrected the entry for the upper confidence limit as described above. Using the “exact” upper limit on the count of 3.689, we divide this by the denominator (40,182) to give an upper limit of 9.2 per 100,000.
|ABSM: CT poverty||$IR_{st}$ (age standardized rate per 100,000)||Confidence limits: normal approximation||Confidence limits: “gamma” interval||Deaths||Person-time at risk|
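The zero-case convention can be checked with a few lines of Python. For a count of zero, half the chi-square quantile with 2 degrees of freedom has the closed form -ln(α/2), so no statistical library is needed; the function name is illustrative.

```python
import math

def zero_count_upper_limit(person_time, alpha=0.05, per=100000):
    """Exact Poisson upper limit for a count of zero, as a count and as a rate."""
    # For y = 0 the limit is 0.5 * chi-square quantile (2 df) = -ln(alpha / 2).
    upper_count = -math.log(alpha / 2)
    upper_rate = upper_count / person_time * per
    return upper_count, upper_rate

# Person-time denominator from the least impoverished stratum in the text.
upper_count, upper_rate = zero_count_upper_limit(40182)
```

With the denominator of 40,182 this reproduces the upper limit on the count of 3.689 and the upper limit on the rate of 9.2 per 100,000 quoted above.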
4. Age-standardized incidence rate difference and rate ratio
Two commonly used measures for comparing incidence rates from two different groups are the incidence rate difference (IRD) and the incidence rate ratio (IRR). The incidence rate difference compares the rates on the absolute scale, and summarizes the excess rate comparing the larger to the smaller rate. The incidence rate ratio compares the rates on a relative scale, summarizing the size of one rate relative to the other rate.
To compare two age-standardized incidence rates on the absolute scale, the age-standardized incidence rate difference ($IRD_{st}$) is the rate in one group minus the rate in the other, i.e. $IRD_{st} = IR_{st,1} - IR_{st,2}$. The variance of this age-standardized incidence rate difference is simply the sum of the estimated variances of the two age-standardized rates,
$$Var(IRD_{st}) = Var(IR_{st,1}) + Var(IR_{st,2})$$
To compare age-standardized rates from two different groups or regions on the relative scale, the age-standardized incidence rate ratio ($IRR_{st}$) is simply $IR_{st,1}/IR_{st,2}$. Confidence intervals can be calculated using the variance estimator:
$$Var(\log IRR_{st}) = \frac{Var(IR_{st,1})}{IR_{st,1}^2} + \frac{Var(IR_{st,2})}{IR_{st,2}^2}$$
To compare the age-standardized incidence rates in the most and least impoverished census tracts in Suffolk County, we start with the age-specific data for these two strata (note: for ease of presentation, we present variances in scientific notation in the table below):
|ABSM: CT Poverty||Age category||Numerator||Denominator||$w_j$ (weight)||$IR_j$ (age specific rate)||$Var(IR_j)$ (variance of the age specific rate)||$IR_{st}$ (age standardized rate)||$Var(IR_{st})$ (variance of the age standardized rate)|
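The age-standardized rate difference is simply 1,019.3 per 100,000 − 729.7 per 100,000 = 289.6 per 100,000 (or, in scientific notation, $2.896 \times 10^{-3}$).
Using the formula above, we calculate the variance of the age-standardized rate difference, $IRD_{st}$.
Then the lower and upper confidence limits are derived as follows:
$$IRD_{st} \pm z_{1-\alpha/2}\sqrt{Var(IRD_{st})}$$
where $z_{1-\alpha/2}$ is the standard normal quantile (1.96 for 95% limits), or, expressed per 100,000, 232.2 to 346.9 per 100,000.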
The age-standardized rate ratio is simply 1,019.3 per 100,000/729.7 per 100,000 = 1.40.
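Using the formula above, we calculate the variance of the log of the age-standardized rate ratio, $\log(IRR_{st})$.
Then the lower and upper confidence limits are derived as follows:
$$\exp\left(\log IRR_{st} \pm z_{1-\alpha/2}\sqrt{Var(\log IRR_{st})}\right)$$
where $z_{1-\alpha/2}$ is the standard normal quantile (1.96 for 95% limits).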
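The rate difference and rate ratio calculations can be sketched in Python as follows. The two rates are the Suffolk County values quoted above; the variances are hypothetical placeholders (the actual values are not reproduced here), so the resulting confidence limits are illustrative only.

```python
import math

def rate_difference(r1, v1, r2, v2, z=1.96):
    """Rate difference with normal-approximation confidence limits."""
    ird = r1 - r2
    se = math.sqrt(v1 + v2)
    return ird, ird - z * se, ird + z * se

def rate_ratio(r1, v1, r2, v2, z=1.96):
    """Rate ratio with confidence limits computed on the log scale."""
    irr = r1 / r2
    se_log = math.sqrt(v1 / r1 ** 2 + v2 / r2 ** 2)
    lower = math.exp(math.log(irr) - z * se_log)
    upper = math.exp(math.log(irr) + z * se_log)
    return irr, lower, upper

# Age-standardized rates from the text, expressed per person-year.
r_most, r_least = 1019.3e-5, 729.7e-5
# Hypothetical variances, for illustration only.
v_most, v_least = 5.0e-8, 3.5e-8

ird, ird_lo, ird_hi = rate_difference(r_most, v_most, r_least, v_least)
irr, irr_lo, irr_hi = rate_ratio(r_most, v_most, r_least, v_least)
```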
5. Relative Index of Inequality (RII)
Comparisons of socioeconomic gradients based on categorical ABSM may be complicated by differences in the population distributions of area-based socioeconomic measures. For example, it may be expected that the classifications producing smaller groups at the margins would lead to larger incidence rate ratios, comparing the most deprived to the most affluent, because finer discrimination of extremes of socioeconomic position is achieved. The relative index of inequality (RII) has been proposed as a measure which explicitly addresses this problem. Assuming ordinality of the ABSM categories, the RII is calculated by regressing the incidence rate in each ABSM category on the total proportion of the population that is more deprived in the socioeconomic hierarchy. Because the RII combines information about the magnitude of the socioeconomic gradient with information about the distribution of the socioeconomic variable in the population, it can be conceptualized as a measure of “total population input”.
In practice, this latter quantity is represented by the cumulative distribution function (cdf). We approximate the cdf for the jth level of a given ABSM by summing the proportion of the population represented by the categories 1, …, j−1, and adding one-half the proportion of the population represented by the category j.
In order to calculate the RII for poverty and all cause mortality in Massachusetts, we begin by calculating the approximate cumulative distribution function as follows:
|ABSM: CT poverty||Population denominator||Proportion||Formula||Approximate cdf|
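The approximate cdf can be computed with a short Python function. This is a sketch with hypothetical population denominators, ordered from the first to the last ABSM category; `approximate_cdf` is an illustrative name.

```python
def approximate_cdf(populations):
    """Midpoint cdf: population share below each category plus half its own share."""
    total = sum(populations)
    cdf = []
    below = 0.0
    for population in populations:
        share = population / total
        cdf.append(below + share / 2)
        below += share
    return cdf

# Hypothetical population denominators for four ordered poverty strata.
cdf_values = approximate_cdf([400000, 300000, 200000, 100000])
```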
In order to compare RII meaningfully across groups with differing age composition, we developed an age-standardized RII, standardized to the year 2000 standard million, as follows. Let $observed_{ij}$ be the observed number of cases in the ith age group and the jth category of ABSM, and $pop_{ij}$ be the population at risk in the corresponding category. First, we calculate the age-standardized rate $IR_{st,j}$ in each stratum j defined by ABSM, as described above. For each stratum j, we estimate the expected number of cases in stratum j, $expected_j$, by multiplying the age-standardized rate $IR_{st,j}$ by the population denominator, $pop_j = \sum_i pop_{ij}$. We determine the “marginal” cumulative distribution function, cdf(ABSM), of the ABSM over the entire population, as noted above.
The “Expected deaths” column shows the expected number of cases in each poverty stratum.
|ABSM: CT Poverty||IRst (age standardized rate per 100,000)||Observed deaths||Population denominator||Expected deaths||Approximate cdf|
To calculate the age-standardized RIIst, we fit the following Poisson model for the expected cases:
$$\log(expected_j) = \log(pop_j) + \beta_0 + \beta_1 \, cdf_j(ABSM)$$
where $expected_j$ and $pop_j$ are the expected cases and the population denominator in ABSM stratum j, and $cdf_j(ABSM)$ is the approximate cumulative distribution function evaluated for that stratum.
Exponentiation of $\beta_1$ yields the RII, which is interpretable as an incidence rate ratio comparing the rates in the bottom to the top of the socioeconomic hierarchy. A larger RII indicates a greater degree of inequality across a socioeconomic hierarchy, which may be due to a steep socioeconomic gradient or large inequalities in the distribution of the ABSM itself.
Fitting this model to the data presented above yields a $\hat{\beta}_1$ of 0.379. Exponentiating this, we obtain an RII of 1.46. In the figures below, we can see how the RII for poverty is obtained. In the left figure, the height of the light blue bars represents the all-cause mortality rate per 100,000 in each of the four poverty strata (0-4.9, 5-9.9, 10-19.9, 20-100%), with the width of each bar proportional to the population size of the poverty stratum (in order from least to most impoverished). Open circles are plotted along the x-axis at the interpolated midpoints of each bar, approximating the cumulative distribution function of CT level poverty. The solid line represents the fitted RII line. In the left figure, this line is not a straight line since the fitted line comes from a Poisson model. The right figure shows the plotted points and fitted RII line on the log scale, where the line is truly straight.
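A Poisson model of this form, with the log population as an offset, can be fitted with a few Newton-Raphson steps. The Python sketch below is one possible implementation rather than the procedure used in the project; the populations and cdf values are hypothetical, and the expected cases are generated from a slope of 0.379 purely to show that the fit recovers it.

```python
import math

def fit_rii(expected, pop, cdf, iterations=50):
    """Fit log(expected_j) = log(pop_j) + b0 + b1 * cdf_j by Newton-Raphson."""
    b0 = math.log(sum(expected) / sum(pop))  # start from the overall rate
    b1 = 0.0
    for _ in range(iterations):
        mu = [n * math.exp(b0 + b1 * c) for n, c in zip(pop, cdf)]
        # Score vector of the Poisson log-likelihood.
        g0 = sum(y - m for y, m in zip(expected, mu))
        g1 = sum((y - m) * c for y, m, c in zip(expected, mu, cdf))
        # Information matrix entries.
        h00 = sum(mu)
        h01 = sum(m * c for m, c in zip(mu, cdf))
        h11 = sum(m * c * c for m, c in zip(mu, cdf))
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < 1e-12:
            break
    return b0, b1, math.exp(b1)

# Hypothetical strata; expected cases generated with a true slope of 0.379.
pop = [100000.0, 200000.0, 300000.0, 150000.0]
cdf = [0.1, 0.3, 0.6, 0.9]
expected = [n * math.exp(-4.6 + 0.379 * c) for n, c in zip(pop, cdf)]
b0, b1, rii = fit_rii(expected, pop, cdf)
```

Exponentiating the fitted slope gives exp(0.379), which rounds to the RII of 1.46 reported above.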
6. Population Attributable Fraction
The population attributable fraction (PAF) is a useful summary measure for characterizing the public health impact of an exposure on population patterns of health and disease. It is defined as “the fraction of all cases (exposed and unexposed) that would not have occurred if exposure had not occurred.” For a polytomous exposure, the population attributable fraction is a weighted sum of the attributable fractions for each level of the exposure, with the weights defined by the case fractions (number of exposed cases divided by overall number of cases):
$$PAF = \sum_j CF_j \, \frac{IRR_j - 1}{IRR_j}$$
where $CF_j$ is the case fraction and $IRR_j$ is the incidence rate ratio for exposure level j relative to the reference level.
In order to aggregate multiple PAFs over several age strata i=1,…,I, note that
$$PAF_{agg} = \frac{\sum_{i=1}^{I} cases_i \, PAF_i}{\sum_{i=1}^{I} cases_i}$$
that is, a weighted average of stratum specific PAFs, with the number of cases in each age stratum as weights.
To calculate the population attributable fraction of all-cause mortality due to poverty, we begin by tabulating the cases and population person-time at risk in each poverty stratum j within each age group i. Within each age group, the case fraction is the number of cases in that poverty stratum, divided by the total number of cases within the age group. The incidence rate ratio for a particular poverty stratum, relative to the reference category of the least impoverished group, is calculated by dividing the rate in that poverty stratum by the rate in the least impoverished group. For each age stratum, we calculate a separate age-specific PAF, as seen in the final column of the table below. These age-specific PAFs range from 5% to 23%.
|Age category (i)||ABSM: CT poverty (j)||Cases||Person-time denominator||Rate per 100,000||Case Fraction ($CF_{ij}$)||Incidence rate ratio ($IRR_{ij}$)||Population attributable fraction ($PAF_i$)|
To aggregate these PAFs across age strata, we weight the contribution of each age stratum by the proportion of cases in that age stratum. As seen in the table below, this results in an aggregated population attributable fraction of 11%.
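|Age category (i)||Cases||Population attributable fraction ($PAF_i$)||Aggregated population attributable fraction ($PAF_{agg}$)|
|0-14||744||0.1626||(744*0.1626 + 928*0.0506 + 4527*0.2266 + 13399*0.2210 + 50236*0.0725) / 69834 = 0.1117|
|15-24||928||0.0506|| |
|25-44||4527||0.2266|| |
|45-64||13399||0.2210|| |
|65+||50236||0.0725|| |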
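The two PAF calculations can be sketched in Python as below. The case counts and age-specific PAFs are the ones appearing in the aggregation formula above; the function names are illustrative.

```python
def age_specific_paf(case_fractions, rate_ratios):
    """PAF for one age stratum: sum over exposure levels of CF * (IRR - 1) / IRR."""
    return sum(cf * (irr - 1) / irr for cf, irr in zip(case_fractions, rate_ratios))

def aggregate_paf(cases, pafs):
    """Case-weighted average of the stratum-specific PAFs."""
    return sum(c * p for c, p in zip(cases, pafs)) / sum(cases)

# Cases and age-specific PAFs for the five age categories, 0-14 through 65+.
cases = [744, 928, 4527, 13399, 50236]
pafs = [0.1626, 0.0506, 0.2266, 0.2210, 0.0725]
overall_paf = aggregate_paf(cases, pafs)
```

The reference exposure level has an incidence rate ratio of 1 and therefore contributes nothing to `age_specific_paf`, so it can be included or omitted from the inputs without changing the result.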
References
1. Breslow NE, Day NE (eds). Statistical Methods in Cancer Research, Vol. II: The Design and Analysis of Cohort Studies. Oxford, UK: Oxford University Press, 1987.
2. Anderson RN, Rosenberg HM. Age standardization of death rates: implementation of the year 2000 standard. National Vital Statistics Reports, Vol. 47, No. 3. Hyattsville, MD: National Center for Health Statistics, 1998.
3. Fay MP, Feuer EJ. Confidence intervals for directly standardized rates: a method based on the gamma distribution. Statistics in Medicine 1997;16:791-801.
4. Rothman KJ, Greenland S. Modern Epidemiology. 2nd Edition. Philadelphia: Lippincott-Raven, 1998.
5. Pamuk ER. Social class inequality in mortality from 1921 to 1972 in England and Wales. Popul Stud 1985;39:17-31.
6. Wagstaff A, Paci P, van Doorslaer E. On the measurement of inequalities in health. Soc Sci Med 1991;33:545-57.
7. Davey Smith G, Hart C, Hole D, et al. Education and occupational social class: which is the more important indicator of mortality risk? J Epidemiol Community Health 1998;52:153-60.
8. Hanley JA. A heuristic approach to the formulas for population attributable fraction. J Epidemiol Community Health 2001;55:508-514.