Guidelines for Working with Small Numbers
The following guidelines were adapted from those originally developed by the Assessment Operations Group of the Washington State Department of Health. CoHID gratefully acknowledges this group's willingness to share their hard work to benefit the citizens of Colorado. Persons who are interested in more technical details or who have questions regarding data analysis are encouraged to consult statisticians and epidemiologists with expertise in the analysis of public health data, or to contact the Health Statistics Section, Colorado Department of Public Health and Environment, by phone (303-692-2160) or by email (cohid@state.co.us).
Scope of the "Guidelines for Working with Small Numbers"
Public health data are typically presented in tabular form or released as files of record-level data. The following guidelines address data presented in tabular form, on paper or available electronically, produced for or readily accessed by the public, for uses other than mandated activities of state and local health departments. These guidelines apply to total population data, not to survey data based on a proportion of the population.

Why are small numbers a concern in public health assessment?
Public health policy decisions are fuelled by information. Often, this information is in the form of statistical data. Questions concerning health outcomes and related health behaviors and environmental factors are often studied within small subgroups of a population. Continuing improvements in the performance and availability of computing resources, including geographic information systems, and the need to better understand the relationships between environment, behavior, and consequent health effects have led to increased demand for data on small populations. These demands are often at odds with the need to preserve privacy and data confidentiality. Small numbers also raise statistical issues concerning the accuracy, and thus usefulness, of the data. In general, problems with confidentiality arise when there are small denominators (the population size represented in a specific cell in a table), and problems with data reliability arise when there are small numerators (the cases in a specific cell in a table).

What constitutes a breach of confidentiality?
A breach of confidentiality occurs when analysts release information in a way that identifies an individual and reveals confidential information about that person. The following guidelines provide some cues to situations that present high risk for a breach of confidentiality and some suggestions on how to reduce this risk.
In addition to these guidelines, analysts should become familiar with the confidentiality requirements for specific databases as defined in state laws and regulations. Please note that state laws and regulations supersede the guidance provided in this document.

Why do we question the reliability of statistics based on small numbers?
Estimates based on a random sample of a population are subject to error due to sampling variability. Rates and percentages based on a full population count are also subject to random variation. The random variation may be substantial when the measure, such as a rate or percentage, has a small number of events in the numerator. Typically, rates based on large numbers provide stable estimates of the true, underlying rate. Conversely, rates based on small numbers may fluctuate dramatically from year to year, or differ considerably from one small place to another small place, even when there is no meaningful difference. Meaningful analysis of differences in rates between geographic areas or over time requires that the random variation in rates be quantified; this is especially important when rates or percentages have small numerators.

Guidelines for working with small numbers
These guidelines address both confidentiality and statistical issues in working with small numbers. A first step in using these guidelines is to determine whether the data set(s) you are working with contain confidential information. If so, the following section on protecting confidentiality needs to be carefully reviewed. Otherwise, you need only concern yourself with the statistical issues section.

In general, problems with confidentiality arise when there are small denominators (the population size represented in a specific cell in a table), and problems with data reliability arise when there are small numerators (the cases in a specific cell in a table). In larger populations, it is more difficult to identify individuals from data released in tables.
For example, if there are 5,000 individuals in a specific age-race-sex group in a single county, the likelihood of identifying a single individual from data in a published table is quite small. In smaller populations, it is more likely that an individual might be identifiable. Further, even in larger populations, it is conceivable that a single individual might be identifiable if there are only one or two individuals with some special characteristic. For example, in a modest-sized community, it may be common knowledge that there is only one child who is frequently hospitalized, and a table showing that this community has one case of pediatric HIV-AIDS could unintentionally disclose confidential information. Thus, it is desirable to have rules for privacy protection that consider both denominator size and numerator size. Rules to address statistical reliability can be limited to consideration of numerator size.

Examine denominator size for each cell. Prior to disseminating tables that contain confidential information, analysts should first consider the size of the denominators: the population size represented in each cell in the table. Generally, tabular data based on denominators greater than 300 persons per cell present minimal risk for individual identification. The risk of violating confidentiality increases substantially when data are tabulated for small subgroups of the population within small geographic areas. The analyst should exercise caution if the population size is between 100 and 300, and extreme caution is warranted when the population is less than 100.

Examine numerator size for each cell. Second, data analysts should consider the number of events in each cell of a table to be released (i.e., the numerator for a rate calculation). If the count of cases or events in a cell is less than three, the data analyst needs to consider whether a breach of confidentiality is likely.
A count of no events in the cell is clearly no threat to confidentiality, but a count of one or two events may be.

How to reduce risk of confidentiality breach
The general approach to privacy protection involves what has been termed "computational disclosure control," which includes both aggregation of data values in the dataset before analysis and cell suppression in a table after analysis (Sweeney 1997).

Aggregation. Aggregation of data values is appropriate for fields with large numbers of values, such as dates, diagnoses, and geographic areas; it is the primary method used to collapse a dataset in order to create tables with no small numbers as denominators or numerators in cells. The following table shows examples.
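As a separate illustration (not the table referred to above), the denominator and numerator checks and the idea of aggregation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the sample counts, and the treatment of exactly 100 and exactly 300 persons as "caution" are illustrative choices, not part of the guidelines.

```python
from collections import defaultdict

def denominator_risk(population):
    """Classify a cell by population size using the thresholds in the text.
    Assumption: 'between 100 and 300' is read as inclusive of both ends."""
    if population > 300:
        return "minimal"
    if population >= 100:
        return "caution"
    return "extreme caution"

def numerator_needs_review(count):
    """Counts of one or two call for a confidentiality review; zero does not."""
    return 1 <= count <= 2

def aggregate(cells, band_of):
    """Collapse fine-grained cells into coarser bands.
    cells maps key -> (events, population); band_of maps key -> band label."""
    totals = defaultdict(lambda: [0, 0])
    for key, (events, population) in cells.items():
        band = band_of(key)
        totals[band][0] += events
        totals[band][1] += population
    return {band: tuple(values) for band, values in totals.items()}

# Hypothetical single-year-of-age cells: age -> (events, population)
by_age = {0: (1, 90), 1: (0, 85), 2: (2, 95), 3: (1, 80), 4: (0, 88)}

# Collapse all five ages into one band, then re-screen the combined cell.
by_band = aggregate(by_age, lambda age: "0-4")
for band, (events, population) in by_band.items():
    print(band, events, population,
          denominator_risk(population), numerator_needs_review(events))
```

In this made-up example every single-year cell has a denominator under 100, while the combined 0-4 band has 4 events among 438 persons and passes both checks.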
Cell Suppression. When it is not possible, or not desirable, to create a table with no small numbers as denominators or numerators in cells, cell suppression is used together with complementary suppression. "Primary" cell suppression withholds the data (numerator) in the cell that fails to meet the threshold; it is followed by suppression of three other cells in order to avoid inadvertent disclosure through back-calculation. Note that cell suppression is a method of last resort, because complementary suppression often unavoidably withholds data values that could otherwise be released, and because of the amount of manual labor necessary to implement the method. The following table shows an example, supposing that the cell in the upper left (0-34 Black) did not meet the threshold for release, whether because of numerator or denominator size.
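To see why three complementary cells are needed, the back-calculation risk described in this section can be checked with a short Python sketch. It assumes a two-way table whose row and column totals are published, and it tests only for exact recovery by subtraction; the function name is hypothetical, and a real disclosure audit would also consider how tightly the remaining cells can be bounded.

```python
def cells_recoverable_by_subtraction(suppressed):
    """Return the suppressed cells that can be back-calculated exactly.

    suppressed is a set of (row, column) positions withheld from a table
    whose row and column totals are published (an assumption here).
    A cell is recoverable when it is the only unknown left in its row or
    column, since the total minus the known cells gives its value.
    Recovered cells become known, which may expose further cells.
    """
    unknown = set(suppressed)
    recovered = set()
    changed = True
    while changed:
        changed = False
        for row, col in sorted(unknown):
            alone_in_row = sum(1 for r, _ in unknown if r == row) == 1
            alone_in_col = sum(1 for _, c in unknown if c == col) == 1
            if alone_in_row or alone_in_col:
                unknown.remove((row, col))
                recovered.add((row, col))
                changed = True
                break  # restart the scan with the updated set of unknowns
    return recovered

# Primary suppression alone: the upper-left cell is fully recoverable.
print(cells_recoverable_by_subtraction({(0, 0)}))

# Primary cell plus three complementary cells forming a rectangle:
# no cell is the sole unknown in its row or column.
print(cells_recoverable_by_subtraction({(0, 0), (0, 1), (1, 0), (1, 1)}))
```

With one, two, or three suppressed cells in such a table, this check recovers every one of them; four cells arranged as a rectangle is the smallest pattern it cannot unwind.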
Other Methods. When neither of these methods (aggregation of data values to reduce granularity, and cell suppression) is satisfactory, two alternatives remain. The first, and better, choice is to combine multiple years of data (which is a form of aggregation). The effect will be to increase the effective population size, since the (usually unstated) denominator in rate calculations is actually "person-time," and the numerators are likely to rise correspondingly as well. The second alternative is to omit certain fields from analysis entirely. A recent example involved the release of asthma data: it was not possible to achieve adequately large cell denominators in annual county-level data showing both age-specific and gender-specific counts and rates. An advisory group opted to omit the gender-specific data and display only tables of age-specific data, on the grounds that no intervention programs targeted groups differently on the basis of gender, but most intervention programs target age groups differently.

Group identification. In addition to individual identification, analysts need to be alert to risks for group identification. Here, something confidential is revealed about a group of individuals identifiable by their age, race, or other reported characteristics. While this type of disclosure has received less attention than individual disclosure, it represents an emerging concern and should be considered when deciding whether to publish data. Note that this is more of a problem when the prevalence in the group is high (over 80%) than when it is low.

In summary: The following practices can help assess and reduce confidentiality risks:
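As a brief aside on the multi-year approach described under "Other Methods," pooling years amounts to summing events and summing person-years before computing the rate. The sketch below uses made-up counts and a hypothetical function name, and it assumes each year's population estimate stands in for one year of person-time.

```python
def pooled_rate(yearly_events, yearly_population, per=100_000):
    """Combine several years into one rate.

    Returns (total events, person-years, rate per `per` person-years).
    Assumption: each annual population figure approximates the
    person-time at risk during that year.
    """
    if len(yearly_events) != len(yearly_population):
        raise ValueError("need one population figure per year of events")
    total_events = sum(yearly_events)
    person_years = sum(yearly_population)
    return total_events, person_years, per * total_events / person_years

# Hypothetical three-year series for one small county.
events, person_years, rate = pooled_rate([3, 5, 4], [4000, 4100, 3900])
print(events, person_years, rate)
```

Here no single year has more than 5 events, while the pooled cell has 12 events over 12,000 person-years.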
How to address the statistical issues
Increase numerator size. In preparing a data table for dissemination, it is recommended that analysts first examine the counts in each cell of the table. If rates are desired and the numerator of any cell is less than 20, an effort should be made to increase the size of the numerator. (Use of 20 events as the threshold for reliability is consistent with standard CDC practice.) Techniques to accomplish this include the following:
Include confidence intervals. The inclusion of confidence intervals for rates is strongly recommended regardless of the number of health events, but it is especially important when the count is less than 20. Generally, rates with fewer than 20 events in the numerator have very wide confidence intervals. For example, an infant death rate of 10 per 1,000, based on 20 deaths out of a population of 2,000 live births, has a Poisson-based 95% confidence interval of roughly 6 to 15 per 1,000. Clearly, this is not very precise information, and users of the data need to know this. In instances where it is not feasible to incorporate confidence intervals into a data table, it is recommended that analysts:
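The infant death example above can be reproduced with a short Python sketch. The guidelines say only that the interval is "Poisson-based"; the exact (tail-probability) method used here is one common choice and is an assumption, as are the function names. The sketch uses only the standard library and is intended for the small counts this document is concerned with.

```python
import math

def poisson_cdf(k, mu):
    """Return P(X <= k) for X ~ Poisson(mu)."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mu / i  # P(X = i) from P(X = i - 1)
        total += term
    return total

def _bisect(func, lo, hi, iterations=200):
    """Find a root of func on [lo, hi]; func must change sign on the interval."""
    f_lo = func(lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_poisson_ci(count, alpha=0.05):
    """Exact confidence limits for the mean of a Poisson count."""
    bracket = count + 10 * math.sqrt(count + 1) + 10  # generous upper search bound
    if count == 0:
        lower = 0.0
    else:
        # Lower limit: the mean at which P(X >= count) equals alpha / 2.
        lower = _bisect(lambda mu: 1 - poisson_cdf(count - 1, mu) - alpha / 2,
                        0.0, bracket)
    # Upper limit: the mean at which P(X <= count) equals alpha / 2.
    upper = _bisect(lambda mu: poisson_cdf(count, mu) - alpha / 2, 0.0, bracket)
    return lower, upper

def rate_with_ci(count, population, per=1000):
    """Return (rate, lower limit, upper limit), each per `per` population."""
    lower, upper = exact_poisson_ci(count)
    scale = per / population
    return count * scale, lower * scale, upper * scale

rate, low, high = rate_with_ci(20, 2000)
print(f"{rate:.1f} per 1,000 (95% CI {low:.1f} to {high:.1f})")
```

For 20 deaths among 2,000 live births this gives limits of about 6.1 and 15.4 per 1,000, consistent with the rounded interval quoted in the text.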
Suppress rates. Suppress rates based on very small numbers (i.e., fewer than 5 health events), reporting only the count (numerator) (see endnote 2). When rates are suppressed, tables should be constructed such that an indicator (e.g., asterisk) appears in the cell and a legend under the table explains the reason for suppression.

Suppress confidence intervals. When rates based on very small numbers (i.e., fewer than 5 health events) are suppressed, confidence intervals should also be suppressed. When confidence intervals are suppressed, tables should be constructed such that an indicator (e.g., asterisk) appears in the cell and a legend under the table explains the reason for suppression.

Confidential data/information: Information that an individual or establishment has provided in a relationship of trust, with the expectation that it will not be divulged in an identifiable form. The confidentiality of specific data elements or information in individual databases or record systems is defined by state laws and regulations and/or policies and procedures developed for those systems.

Confidentiality breach: An unauthorized release of identifiable or confidential data/information, which may result from a security failure, intentional inappropriate behavior, human error, or natural disaster. A breach of confidentiality may or may not result in harm to one or more individuals.

Individually identifiable data/information: Data/information that identifies, or is reasonably likely to be used to identify, an individual or an establishment protected under confidentiality laws. Identifiable data/information may include, but is not limited to, name, address, telephone number, social security number, and medical record number. Data elements used to identify an individual or protected establishment can vary depending on the geographic location and other variables (e.g., rarity of person's health condition or patient demographics).
For purposes of this guideline, "identifiable information" includes potentially identifiable information.

Number of events: The number of persons or events represented in any given cell of tabulated data (e.g., numerator).

Population size: The total number of persons or events included in the calculation of an event rate (e.g., denominator).

Public use dataset: A "de-identified" dataset with all individually identifiable data/information removed, and with remaining data fields modified (through cell suppression, aggregation of data values, or field omission) such that it is not possible to create any tables which have any cells sized less than 100 in the denominator.

Rate: A measure of the frequency of an event per population unit.

Sensitive personal information: Whereas confidential personal information means information collected about a person that is readily identifiable to that specific individual, sensitive personal information extends beyond that to information which may be inferred about individuals, where that information is associated with some stigma. Examples are certain diseases, health conditions, or health practices. The sensitivity of certain personal information may vary between communities.

Cox LH. Protecting Confidentiality in Small Population Health and Environmental Statistics. Statistics in Medicine 1996;15:1895-1905.
Dever GA. Outcome assessment: Small area analysis and quality improvement methods. In: Improving Outcomes in Public Health Practice. Gaithersburg, MD: Aspen, 1997: 341-77.
Gold EB. Confidentiality and Privacy Protection in Epidemiologic Research. In: Coughlin SS, Beauchamp TL [eds]. Ethics and Epidemiology. New York: Oxford, 1996: 128-41.
Kleinman JC. Assessing Stability of Rates and Changes in Rates. In: Statistical Notes for Health Planners. National Center for Health Statistics, Number 2, July 1976: 9-12.
NCHS Staff Manual on Confidentiality. Dept. of Health and Human Services, Public Health Service, National Center for Health Statistics. Hyattsville, MD. September 1984.
Sweeney L. Weaving Technology and Policy Together to Maintain Confidentiality. Journal of Law, Medicine & Ethics 1997;25:98-110.

Endnotes:
1 Except for: (1) death records for children under the age of 12 who die as the result of AIDS transmitted at birth, (2) fetal death records (where, like birth certificates, portions are confidential), and (3) death records for infants who are born alive following an induced termination of pregnancy (in which case the mother's identity is protected).
2 Geographic modeling, including the use of Bayesian "smoothing," has been used as an alternative to suppression of rates. A discussion of this method is beyond the scope of these guidelines.
Colorado Department of Public Health and Environment Health Statistics Section 4300 Cherry Creek Drive South Denver, Colorado 80246-1530