Data Linkage and Confidentiality

J. Michael Dean
August 23, 1995

I had the opportunity to review Generic Data and would like to share my thoughts on the subject. I am currently using probabilistic linkage techniques to link motor vehicle crash records with ambulance run records and hospital-based medical records. The resulting data set is then stripped of identifiers, both the obvious ones and others such as ZIP codes and birthdates, that would permit misuse of the information. The purpose of my own research is to relate medical and economic outcomes to crash characteristics such as seat belt use.

We have obtained data from the data owners under strict constraints. We never release raw data files; only the final linked data file is available for others to access. As we constructed the file, we consulted with each of the data owners and went through it field by field to obtain everyone's approval that each specific field would be acceptable in a public database. Thus, while our research center holds data files that contain identifiers, the data file that results from probabilistic linkage and is made available for use is sterilized.

My worry is that people will legislate the sterilization of data files in order to prevent the possibility of probabilistic linkage. This would be a mistake, because probabilistic linkage is a valuable tool that allows researchers and policy makers to answer new questions without collecting new data sets. One example is relating seat belt use and car damage to the number of dollars spent on medical care over the following 12 months in a population of 1.2 million people. To accomplish linkage, identifiers need to be in the original files. I would even encourage names, etc. to be maintained in such files.

Societal protection should be obtained by outlawing malicious USE of data. Thus, I have agreed not to publish data that relates to specific hospitals or ambulance companies. If the purpose of probabilistic linkage and subsequent data analysis is proper, and if personal confidentiality is respected, then there is no downside. If a person links data and then misuses it, that person should be liable for civil damages or, potentially, criminal penalties.

I like to use the following analogy. Sterilizing data sets by legislation in order to prevent misuse of the data, as has been done in numerous states, is like amputating one's legs so that jaywalking cannot be committed. We should punish misuse of data, but not eliminate the identifiers that permit the most value to be obtained from such data.

J. Michael Dean, M.D.
University of Utah School of Medicine
mdean@msscc.med.utah.edu

Please mail comments to ddil@episun1.harvard.edu.
