Identifying Information in HIPAA: What Makes Data PHI—and How to De‑Identify It
Protected Health Information Overview
Under HIPAA, Protected Health Information (PHI) is Individually Identifiable Health Information (IIHI) that is created or received by a covered entity or its business associate and relates to a person’s health, care, or payment. Information is “individually identifiable” when it either directly identifies someone or there is a reasonable basis to believe it could identify them.
PHI status depends on context. The same data element can be PHI when handled by a covered entity (such as a hospital, health plan, or clearinghouse) or a business associate, but not necessarily when used by an unrelated party outside HIPAA’s scope. When data is properly de-identified, it is no longer PHI and falls outside HIPAA Privacy Rule requirements.
Direct Identifiers vs. Indirect Identifiers
Direct Identifiers (for example, name or Social Security number) can single out a person on their own. Indirect Identifiers (also called quasi-identifiers), such as full ZIP code, precise dates, or device IDs, may not point to someone alone but can enable identification when combined with other data. Recognizing both types is central to any De-Identification Risk Assessment.
De-Identification Methods
The HIPAA Privacy Rule permits two pathways to de-identification: Safe Harbor and Expert Determination. Both aim to reduce the likelihood that data could identify an individual, but they work differently and support different levels of data utility.
Common Transformations
- Suppression and generalization (for example, removing an exact birth date or replacing it with age bands).
- Pseudonymization via random codes that are not derived from identifiers, with the code-to-identity mapping withheld from data recipients.
- Aggregation and sampling to limit uniqueness.
- Perturbation or noise infusion where appropriate to reduce linkage risk.
Whatever method you choose, document your approach, justify the residual risk as very small (for Expert Determination) or ensure all enumerated identifiers are removed (for Safe Harbor), and implement Privacy Safeguards to keep risk low over time.
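Two of the transformations listed above, generalization and pseudonymization, can be sketched in a few lines of Python. This is a minimal illustration, not a compliance tool: the names `age_band` and `Pseudonymizer`, the 10-year band width, and the 16-character code length are all assumptions chosen for the example.

```python
import secrets
from datetime import date


def age_band(birth_date: date, as_of: date, width: int = 10) -> str:
    """Generalize a birth date to an age band such as '40-49'."""
    # Age in whole years as of the reference date.
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    age = as_of.year - birth_date.year - (0 if had_birthday else 1)
    if age >= 90:
        return "90+"  # pool the oldest ages into one category
    low = (age // width) * width
    high = min(low + width - 1, 89)
    return f"{low}-{high}"


class Pseudonymizer:
    """Assigns random codes that are not derived from the identifier."""

    def __init__(self):
        # The crosswalk is the re-identification key. Store it separately
        # from the released data and never share it with recipients.
        self._crosswalk = {}

    def code_for(self, identifier: str) -> str:
        # Reuse the same code for the same person so records stay linkable.
        if identifier not in self._crosswalk:
            self._crosswalk[identifier] = secrets.token_hex(8)
        return self._crosswalk[identifier]
```

The codes come from a random generator rather than a hash of the identifier, because a hash of a known value such as a medical record number can be recomputed by anyone who holds that value.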
Safe Harbor Criteria
Safe Harbor requires removing specific identifiers of the individual or of relatives, employers, or household members, and ensuring you have no actual knowledge that remaining data could identify someone. The 18 identifiers are:
- Names.
- Geographic subdivisions smaller than a state (street address, city, county, precinct, full ZIP code, and equivalent geocodes). The first three digits of ZIP may remain only if the combined area has more than 20,000 people; otherwise replace with 000.
- All elements of dates (except year) directly related to an individual (for example, birth, admission, discharge, death), plus all ages over 89 and any date elements (including year) that reveal such an age; you may group these ages into a single “90 or older” category.
- Telephone numbers.
- Fax numbers.
- Email addresses.
- Social Security numbers.
- Medical record numbers.
- Health plan beneficiary numbers.
- Account numbers.
- Certificate/license numbers.
- Vehicle identifiers and serial numbers, including license plates.
- Device identifiers and serial numbers.
- Web URLs.
- IP address numbers.
- Biometric identifiers (for example, finger and voice prints).
- Full-face photographs and comparable images.
- Any other unique identifying number, characteristic, or code (other than an internal code that is not derivable from identifiers and is kept confidential).
Safe Harbor is straightforward and defensible, but it can reduce data utility by removing granular geography and dates. If you need more precision while keeping risk very small, consider Expert Determination.
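The ZIP code, date, and age rules in the list above are mechanical enough to sketch in code. The following is a hedged illustration only: the function names are hypothetical, the set of low-population ZIP prefixes must be supplied by the caller from current Census data, and treating a malformed ZIP as "000" is a design assumption rather than something the rule prescribes.

```python
from datetime import date
from typing import AbstractSet


def safe_harbor_zip(zip_code: str, low_population_zip3: AbstractSet[str]) -> str:
    """Keep the first three ZIP digits unless the prefix area is too small."""
    zip3 = zip_code.strip()[:3]
    # Assumption: malformed input is treated like a low-population prefix.
    if len(zip3) < 3 or not zip3.isdigit() or zip3 in low_population_zip3:
        return "000"
    return zip3


def safe_harbor_year(d: date) -> int:
    """Reduce a date to its year. The caller must still withhold any
    year that would reveal an age over 89."""
    return d.year


def safe_harbor_age(age: int) -> str:
    """Report ages over 89 as a single pooled category."""
    return "90 or older" if age >= 90 else str(age)
```

For example, with a caller-supplied set containing the prefix "036", `safe_harbor_zip("03601", {"036"})` returns "000" while `safe_harbor_zip("02139-4307", {"036"})` returns "021".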
Expert Determination Process
Expert Determination relies on a qualified expert who applies generally accepted statistical and scientific principles to conclude that the risk of identification, alone or in combination with reasonably available data, is very small. The expert must document the methods and results.
Typical Steps
- Define use cases and adversary models (for example, prosecutor, journalist, or marketer with auxiliary data).
- Profile data for uniqueness and linkage risk (for example, rare combinations of indirect identifiers).
- Apply de-identification techniques (generalization, suppression, noise, swapping) tailored to utility needs.
- Quantify residual risk using metrics like record uniqueness, k-anonymity–style measures, or simulated linkage tests.
- Validate results and produce a written report detailing controls, assumptions, and re-assessment triggers.
Choose Expert Determination when you need finer-grained dates, sub-state geography, or longitudinal linkage via stable, non-derivable codes under tight governance.
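The uniqueness and k-anonymity-style measures mentioned in the steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and the quasi-identifier fields (`zip3`, `birth_year`, `sex`) are hypothetical, and a real expert analysis would also account for the external data an adversary could plausibly obtain.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Smallest equivalence class size (the dataset's k); 0 if empty."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def unique_fraction(records, quasi_identifiers):
    """Share of records that are unique on the quasi-identifiers."""
    if not records:
        return 0.0
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    uniques = sum(1 for n in sizes.values() if n == 1)
    return uniques / len(records)


# Made-up sample records for illustration only.
sample = [
    {"zip3": "021", "birth_year": 1980, "sex": "F"},
    {"zip3": "021", "birth_year": 1980, "sex": "F"},
    {"zip3": "021", "birth_year": 1975, "sex": "M"},
    {"zip3": "945", "birth_year": 1980, "sex": "F"},
    {"zip3": "945", "birth_year": 1980, "sex": "F"},
]
qi = ["zip3", "birth_year", "sex"]
print(k_anonymity(sample, qi), unique_fraction(sample, qi))
```

A low k or a high unique fraction flags records that need further generalization or suppression. A passing value on its own does not establish that risk is very small; that conclusion belongs to the expert.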
Limited Data Sets Explained
A Limited Data Set (LDS) is still PHI but with certain Direct Identifiers removed. It can include dates (for example, admission, discharge, date of birth) and geographic data at the city, state, and ZIP level. An LDS may be used only for research, public health, or health care operations and must be governed by a Data Use Agreement (DUA).
What Must Be Removed
Names; postal address information other than town/city, state, and ZIP code; telephone and fax numbers; email addresses; Social Security, medical record, health plan beneficiary, and account numbers; certificate/license numbers; vehicle and device identifiers and serial numbers; URLs and IP addresses; biometric identifiers; and full-face photographs and comparable images. Unlike Safe Harbor, the LDS list does not cover dates, city, state, ZIP code, or the catch-all category of other unique identifying numbers, characteristics, or codes.
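As a rough sketch, building an LDS from structured records amounts to dropping the direct-identifier fields while leaving dates and city, state, and ZIP in place. The field names below are hypothetical and would need to be mapped to your own schema; the function name `to_limited_data_set` is likewise an assumption for illustration.

```python
# Hypothetical field names for the direct identifiers an LDS must exclude.
LDS_EXCLUDED_FIELDS = frozenset({
    "name", "street_address", "phone", "fax", "email", "ssn",
    "medical_record_number", "health_plan_beneficiary_number",
    "account_number", "license_number", "vehicle_id", "device_id",
    "url", "ip_address", "biometric_id", "photo",
})


def to_limited_data_set(record: dict) -> dict:
    """Return a copy of the record without the excluded fields.
    Dates and city/state/ZIP fields pass through unchanged."""
    return {k: v for k, v in record.items() if k not in LDS_EXCLUDED_FIELDS}


# Made-up record for illustration only.
example = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "city": "Springfield",
    "zip": "01101",
    "admission_date": "2024-02-09",
    "diagnosis_code": "E11.9",
}
print(to_limited_data_set(example))
```

A field-name filter like this does not inspect free-text fields such as clinical notes, which can embed names or phone numbers and need separate handling.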
Data Use Agreements
- Specify permitted uses and disclosures, and prohibit the recipient from identifying or contacting the individuals.
- Limit who may use or receive the data and require appropriate Privacy Safeguards.
- Mandate reporting of violations, mitigation steps, and termination for breaches.
- Require downstream controls (no further disclosure without authorization consistent with the DUA).
Because an LDS remains PHI, it is subject to the Privacy Rule and the “minimum necessary” standard, even with a DUA in place.
Risks of Re-Identification
Re-identification risk arises when indirect identifiers are linkable to outside data (for example, voter rolls, commercial datasets) or when rare combinations make a person stand out. Small cell sizes, precise timestamps, and detailed locations heighten risk.
Managing the Risk
- Perform an initial and periodic De-Identification Risk Assessment, especially when data, recipients, or external data landscapes change.
- Minimize indirect identifiers, cap precision (for example, 3-digit ZIP where allowed, month/year instead of full dates), and control small cell counts in outputs.
- Enforce access limits, auditing, and purpose restrictions through policy and contracts (DUAs or BAAs).
- Use technical Privacy Safeguards such as secure enclaves, differential privacy for analytic releases where appropriate, and robust key management for pseudonymous codes.
- Educate users about prohibited re-identification attempts and monitor for anomalies.
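Two of the controls above, capping date precision and suppressing small cells, can be sketched briefly. This is an illustrative sketch: the function names are hypothetical, and the default threshold of 11 is an assumed policy choice, not a value set by HIPAA.

```python
from datetime import date


def to_month(d: date) -> str:
    """Cap date precision at year and month, e.g. '2024-02'."""
    return f"{d.year:04d}-{d.month:02d}"


def suppress_small_cells(counts: dict, threshold: int = 11) -> dict:
    """Replace any count below the threshold with None so that small
    groups are not published. The threshold is a policy assumption."""
    return {
        cell: (n if n >= threshold else None)
        for cell, n in counts.items()
    }


# Made-up monthly admission counts for illustration only.
monthly = {"2024-01": 25, "2024-02": 3}
print(suppress_small_cells(monthly))
```

When a table also publishes row or column totals, a single suppressed cell can often be recovered by subtraction, so totals or additional cells may need to be suppressed as well.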
HIPAA Privacy Rule Compliance
Compliance blends sound de-identification with governance. Establish written policies and procedures aligned to HIPAA Regulatory Standards, designate responsible roles, and train your workforce. Apply the minimum necessary principle, maintain Business Associate Agreements for vendors handling PHI, and log disclosures as required.
Operationalize safeguards: access controls, encryption in transit and at rest, environment segregation, and audit trails. For Limited Data Sets, ensure DUAs are executed and enforced. Document Safe Harbor removals or retain the expert’s report for Expert Determination. Review controls regularly and update them as data, technology, and use cases evolve.
Conclusion
To handle identifying information in HIPAA responsibly, first decide whether data is PHI, then choose Safe Harbor for simplicity or Expert Determination for precision. Use Limited Data Sets with DUAs when PHI is necessary for defined purposes. Continuously manage re-identification risk with technical, administrative, and contractual safeguards to keep privacy strong while preserving data utility.
FAQs
What constitutes identifying information under HIPAA?
Identifying information is Individually Identifiable Health Information that is created or received by a covered entity or business associate and relates to health, care, or payment. It includes Direct Identifiers (for example, names, SSNs) and Indirect Identifiers (for example, detailed dates, full ZIP codes) that, alone or combined with other data, could reasonably identify a person.
How does the Safe Harbor method ensure de-identification?
Safe Harbor requires removing 18 specific identifiers and ensuring you have no actual knowledge that the remaining data could identify an individual. When all required elements are removed—such as names, exact addresses, most date details, and device or network identifiers—the dataset is considered de-identified under HIPAA.
What is the role of expert determination in de-identifying data?
A qualified expert evaluates the dataset and applies generally accepted statistical and scientific principles to reduce and measure identification risk. If the expert concludes the risk is very small and documents the methods and results, the dataset qualifies as de-identified without strictly following the Safe Harbor list.
Can de-identified data be re-identified and how is the risk managed?
Re-identification is possible if remaining indirect identifiers can be linked to outside data. Manage this risk by limiting precision, suppressing small cells, using pseudonymous codes carefully, enforcing DUAs or BAAs, auditing use, and conducting periodic De-Identification Risk Assessments to confirm residual risk stays very small.