Once protected health information (PHI) has been de-identified, it is no longer PHI, and the restrictions and requirements of federal and state privacy laws no longer apply. However, if a re-identification code is added to the data, certain privacy and security rules apply to the code.
There are two methods of de-identification: 1) use of statistical methods proven to render information not individually identifiable, and 2) deletion of 18 specified identifiers.
A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable may de-identify data by:
- Applying such principles and methods and determining that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
- Documenting the methods and results of the analysis that justify such determination.
Further guidance from DHHS regarding implementing this method is below.
Deletion of 18 Identifiers
To de-identify using this method, the following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:
- All geographic subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code if, according to the current publicly available data from the Bureau of the Census:
- The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and
- The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
- Currently, 036, 059, 063, 102, 203, 556, 592, 790, 821, 823, 830, 831, 878, 879, 884, 890, and 893 are all recorded as "000".
- All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
- Telephone numbers;
- Fax numbers;
- Electronic mail addresses;
- Social security numbers;
- Medical record numbers;
- Health plan beneficiary numbers;
- Account numbers;
- Certificate/license numbers;
- Vehicle identifiers and serial numbers, including license plate numbers;
- Device identifiers and serial numbers;
- Web Universal Resource Locators (URLs);
- Internet Protocol (IP) address numbers;
- Biometric identifiers, including finger and voice prints;
- Full face photographic images and any comparable images; and
- Any other unique identifying number, characteristic, or code, except as permitted by the re-identification rules, below; and
The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.
A covered entity may assign a code or other means of record identification to allow information de-identified under this section to be re-identified by the covered entity, provided that:
- Derivation. The code or other means of record identification is not derived from or related to information about the individual and is not otherwise capable of being translated so as to identify the individual; and
- Security. The covered entity does not use or disclose the code or other means of record identification for any other purpose, and does not disclose the mechanism for re-identification.
Further Guidance on Statistical Methods of De-identification
DHHS has provided some guidance regarding statistical methods of de-identification:
As requested by some commenters, we include in the final rule a requirement that covered entities (not following the safe harbor approach) apply generally accepted statistical and scientific principles and methods for rendering information not individually identifiable when determining if information is de-identified. Although such guidance will change over time to keep up with technology and the current availability of public information from other sources, as a starting point the Secretary approves the use of the following as guidance to such generally accepted statistical and scientific principles and methods:
- Statistical Policy Working Paper 22 - Report on Statistical Disclosure Limitation Methodology (prepared by the Subcommittee on Disclosure Limitation Methodology, Federal Committee on Statistical Methodology, Office of Management and Budget) and
- the Checklist on Disclosure Potential of Proposed Data Releases (prepared by the Confidentiality and Data Access Committee, Federal Committee on Statistical Methodology, Office of Management and Budget).
We agree with commenters that such guidance will need to be updated over time and we will provide such guidance in the future.
According to the Statistical Policy Working Paper 22, the two main sources of disclosure risk for de-identified records about individuals are the existence of records with very unique characteristics (e.g., unusual occupation or very high salary or age) and the existence of external sources of records with matching data elements which can be used to link with the de-identified information and identify individuals (e.g., voter registration records or driver's license records). The risk of disclosure increases as the number of variables common to both types of records increases, as the accuracy or resolution of the data increases, and as the number of external sources increases. As outlined in Statistical Policy Working Paper 22, an expert disclosure analysis would also consider the probability that an individual who is the target of an attempt at re-identification is represented on both files, the probability that the matching variables are recorded identically on the two types of records, the probability that the target individual is unique in the population for the matching variables, and the degree of confidence that a match would correctly identify a unique person.
Statistical Policy Working Paper 22 also describes many techniques that can be used to reduce the risk of disclosure that should be considered by an expert when de-identifying health information. In addition to removing all direct identifiers, these include the obvious choices based on the above causes of the risk; namely, reducing the number of variables on which a match might be made and limiting the distribution of the records through a "data use agreement" or "restricted access agreement" in which the recipient agrees to limits on who can use/receive the data. The techniques also include more sophisticated manipulations: recoding variables into fewer categories to provide less precise detail (including rounding of continuous variables); setting top-codes and bottom-codes to limit details for extreme values; disturbing the data by adding noise by swapping certain variables between records, replacing some variables in random records with mathematically imputed values or averages across small random groups of records, or randomly deleting or duplicating a small sample of records; and replacing actual records with synthetic records that preserve certain statistical properties of the original data.