How to De-Identify PHI Under HIPAA: Requirements, Risks, and Controls

Kevin Henry

HIPAA

May 01, 2024

De-identifying Protected Health Information (PHI) lets you share data while honoring the HIPAA Privacy Rule. This guide explains how to de-identify PHI under HIPAA, what to remove, how to manage re-identification risks, and which controls and Data Sharing Protocols keep releases safe.

You’ll learn when to use Safe Harbor or expert-driven Statistical De-Identification, how Limited Data Sets differ from fully de-identified data, and how to operationalize a strong Data Use Agreement to sustain compliance over time.

De-Identification Methods

HIPAA permits two pathways to de-identify PHI so it no longer identifies an individual: Safe Harbor and Expert Determination. Both aim to lower re-identification risk to an acceptable level, but they do so in different ways and support different use cases.

Safe Harbor (rule-based)

Under Safe Harbor, you remove a specific list of direct and quasi-identifiers. If none remain and you have no actual knowledge that the remaining information could be used, alone or in combination with other information, to identify a person, the dataset is considered de-identified for HIPAA purposes. This approach is straightforward and repeatable but can reduce data utility.

Expert Determination (risk-based)

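A qualified expert, meaning a person with appropriate knowledge of and experience with generally accepted statistical and scientific methods, applies Statistical De-Identification techniques and determines that the risk of re-identification by an anticipated recipient is “very small,” given reasonably available external data and the contemplated use. The expert documents the methods and results that justify this determination. This path preserves more analytical value but requires modeling, documentation, and ongoing controls.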

Choosing a method

Use Safe Harbor for broad releases where simplicity and low administrative burden matter most.
Use Expert Determination when you need finer-grained detail (for example, full dates or granular geography) and can implement stronger safeguards.
In both paths, maintain governance artifacts: methods, assumptions, risk metrics, and release conditions.

Safe Harbor Identifiers

To satisfy Safe Harbor, remove the following 18 identifiers of the individual or of relatives, employers, or household members:

Names.
All geographic subdivisions smaller than a state, including street address, city, county, precinct, and ZIP code. You may keep the initial three digits of a ZIP code if the area formed by all ZIP codes sharing those three digits contains more than 20,000 people; otherwise, change the three digits to 000.
All elements of dates (except year) directly related to an individual, including birth date, admission date, discharge date, and date of death. Also remove all ages over 89 and any date elements (including year) that indicate such an age, or aggregate them into a single category of 90 or older.
Telephone numbers.
Fax numbers.
Email addresses.
Social Security numbers.
Medical record numbers.
Health plan beneficiary numbers.
Account numbers.
Certificate or license numbers.
Vehicle identifiers and serial numbers, including license plate numbers.
Device identifiers and serial numbers.
Web URLs.
IP address numbers.
Biometric identifiers, including finger and voice prints.
Full-face photographs and comparable images.
Any other unique identifying number, characteristic, or code.
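
The geographic, date, and age rules are the ones most often implemented incorrectly, so a short sketch may help. The Python below is illustrative only: the function names and the `RESTRICTED_ZIP3` set are assumptions made for this example, and the prefixes in that set are placeholders that you would need to replace with values derived from current Census data.

```python
from datetime import date

# Hypothetical placeholder set: 3-digit ZIP prefixes whose combined area
# holds 20,000 or fewer people. Build the real set from current Census data.
RESTRICTED_ZIP3 = {"036", "059", "102"}


def generalize_zip(zip_code: str) -> str:
    """Keep the first three ZIP digits, or return 000 for restricted prefixes."""
    zip3 = zip_code.strip()[:3]
    return "000" if zip3 in RESTRICTED_ZIP3 else zip3


def generalize_date(d: date) -> int:
    """Keep only the year of a date that relates to the individual."""
    return d.year


def top_code_age(age: int) -> str:
    """Report every age over 89 as the single category '90+'."""
    return "90+" if age > 89 else str(age)
```

Note that a retained year can itself reveal an age over 89 (a birth year, for example), so a production pipeline would also need to suppress or top-code those years; this sketch does not do that.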

Implementation tips

Scan free text for residual PHI; redact or tokenize consistently.
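If you maintain a re-identification code or key, ensure it is not derived from or related to information about the individual and cannot otherwise be translated to identify them. Do not use or disclose the code for any other purpose, and do not disclose the mechanism for re-identification.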
Validate outputs with sampling and automated checks before release.
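
The tips above can be sketched in a few lines of Python. This is a minimal illustration, not a complete scrubber: the three regular expressions, the `redact` function, and the `Tokenizer` class are all assumptions made for this example, and real clinical notes need far broader coverage (names, addresses, dates, record numbers) plus human review.

```python
import re
import secrets

# Illustrative patterns only; they cover a small fraction of real PHI.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> str:
    """Replace each pattern match with a bracketed label such as [SSN]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


class Tokenizer:
    """Assigns each identifier a random token, consistently within one run.

    Tokens come from a random source rather than a hash of the identifier,
    so they are not derived from personal information. The internal mapping
    is the re-identification key and must be stored and protected separately.
    """

    def __init__(self) -> None:
        self._key: dict[str, str] = {}

    def token_for(self, identifier: str) -> str:
        if identifier not in self._key:
            self._key[identifier] = secrets.token_hex(8)
        return self._key[identifier]
```

Random tokens are used here instead of a hash because a hash of a medical record number is derived from that number and can often be reversed by enumerating candidate values.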

Limited Data Set

A Limited Data Set (LDS) is still PHI; it is not de-identified data. It removes direct identifiers but may retain certain fields useful for research, public health, and health care operations, such as city, state, ZIP code, and elements of dates.

Direct identifiers removed

Names; full postal addresses (other than town/city, state, and ZIP); phone and fax numbers; email addresses.
SSNs; medical record, health plan, and account numbers; certificate/license numbers.
Vehicle and device identifiers; URLs and IP addresses; biometric identifiers; full-face photos or comparable images.

Data that may remain

Town/city, state, and ZIP code.
Dates related to the individual (for example, admission, discharge, date of birth, and date of death).

You must execute a Data Use Agreement before sharing an LDS and apply appropriate Data Sharing Protocols to control access, uses, and onward disclosures.
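
As a rough sketch, building an LDS can be treated as a column filter. The column names below are hypothetical and would need to be mapped to your own schema; the function name is likewise an assumption for this example.

```python
# Hypothetical column names standing in for the direct identifiers listed above.
LDS_DIRECT_IDENTIFIERS = {
    "name", "street_address", "phone", "fax", "email", "ssn",
    "mrn", "health_plan_id", "account_number", "license_number",
    "vehicle_id", "device_id", "url", "ip_address", "biometric_id", "photo",
}


def to_limited_data_set(record: dict) -> dict:
    """Drop direct identifiers; city, state, ZIP, and dates pass through."""
    return {k: v for k, v in record.items() if k not in LDS_DIRECT_IDENTIFIERS}
```

A denylist like this lets unexpected columns through, so in practice an allowlist of approved fields is usually the safer design.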

Re-Identification Risks

Even when direct identifiers are removed, individuals can sometimes be re-identified by linking quasi-identifiers (for example, birth date, gender, and geography) with external data. Risk depends on uniqueness, availability of outside datasets, and the context of release.
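
To make the linkage idea concrete, here is a small Python sketch that joins a released dataset to an external list on three quasi-identifiers. The field names and the `link` function are assumptions for this example, and it should only ever be run against synthetic data.

```python
from collections import defaultdict

# Quasi-identifiers assumed to appear in both datasets.
QUASI_IDS = ("birth_date", "gender", "zip")


def link(released: list[dict], external: list[dict]) -> list[tuple[str, str]]:
    """Return (name, diagnosis) pairs for released records that match
    exactly one external record on every quasi-identifier."""
    index = defaultdict(list)
    for person in external:
        index[tuple(person[q] for q in QUASI_IDS)].append(person["name"])

    matches = []
    for rec in released:
        names = index.get(tuple(rec[q] for q in QUASI_IDS), [])
        if len(names) == 1:  # a unique match is a candidate re-identification
            matches.append((names[0], rec["diagnosis"]))
    return matches
```

Records that match several external people are ambiguous to the attacker, which is why reducing uniqueness lowers risk.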

Common risk vectors

Linkage attacks using public records or commercial data.
Small-cell disclosures in tables and dashboards.
Longitudinal trails that reveal identity through patterns over time.
Free-text notes, images, or rare diagnoses that act as fingerprints.

Re-identification risk mitigation controls

Generalize and suppress high-risk values; enforce k-anonymity, l-diversity, or t-closeness where appropriate.
Add calibrated noise or apply differential privacy for queryable systems.
Use enclave or trusted-researcher access with output checking rather than open release.
Limit data sharing to the minimum necessary; restrict linkage and prohibit re-identification in contracts.
Monitor cumulative risk across multiple releases; rotate or retire datasets as risk evolves.
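
The first control in the list can be sketched as follows: generalize a quasi-identifier, then suppress any record whose quasi-identifier combination still appears fewer than k times. The function names, the 10-year band width, and the choice to suppress rather than generalize further are all assumptions for this example.

```python
from collections import Counter


def generalize_age(age: int, width: int = 10) -> str:
    """Map an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def enforce_k_anonymity(records: list[dict], quasi_ids: tuple, k: int):
    """Keep only records whose quasi-identifier combination occurs at least
    k times. Returns (kept_records, number_suppressed)."""

    def key(record: dict) -> tuple:
        return tuple(record[q] for q in quasi_ids)

    counts = Counter(key(r) for r in records)
    kept = [r for r in records if counts[key(r)] >= k]
    return kept, len(records) - len(kept)
```

k-anonymity alone does not prevent attribute disclosure when everyone in a group shares the same sensitive value, which is the gap that l-diversity and t-closeness aim to address.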

Expert Determination Challenges

Expert Determination hinges on defining “very small” risk for your context. That requires setting measurable thresholds, modeling reasonably available external data, and stress-testing assumptions against realistic attacker capabilities.

Technical hurdles

Selecting fit-for-purpose risk metrics (e.g., prosecutor, journalist, and marketer models) and validating them on the actual population.
Balancing utility and privacy while handling outliers, rare events, and small cells.
Managing dataset drift: new linkable sources can raise risk after release.
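
As a minimal sketch of one such metric, a commonly used formulation of the prosecutor model (where the attacker is assumed to know the target is in the dataset) takes each record's risk as one divided by the size of its quasi-identifier group. The function name and the returned keys below are assumptions for this example; journalist and marketer estimates also need population-level group sizes, which this sketch does not model.

```python
from collections import Counter


def risk_summary(records: list[dict], quasi_ids: tuple) -> dict:
    """Summarize prosecutor-style risk from quasi-identifier group sizes."""
    if not records:
        raise ValueError("records must not be empty")

    sizes = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    n = len(records)
    return {
        # Worst case: the record in the smallest group.
        "max_risk": 1 / min(sizes.values()),
        # Mean of 1/group_size over all records, which equals groups / n.
        "avg_risk": len(sizes) / n,
        # Share of records that are unique on the quasi-identifiers.
        "unique_fraction": sum(1 for s in sizes.values() if s == 1) / n,
    }
```

An expert would compare figures like these against thresholds chosen and justified for the specific release context; no particular threshold is implied here.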

Operational hurdles

Qualifying the expert’s knowledge and experience and documenting methods, parameters, and validation steps.
Embedding controls in your Data Sharing Protocols: access tiers, approvals, retention limits, and output review.
Defining reassessment triggers: material changes in the data, the purpose, or the external data landscape.

Compliance Challenges

Compliance goes beyond the de-identification step. You must govern purpose limitations, minimum necessary disclosures, vendor oversight, and auditability to maintain compliance and trust.

Frequent pitfalls

Releasing free text or images without robust PHI scrubbing.
Repeated releases that, in combination, increase linkage risk.
Insufficient vetting of recipients or weak contractual controls.
Overlooking state privacy requirements and Personally Identifiable Information (PII) obligations alongside HIPAA.

Program essentials

Data inventory and classification covering PHI, PII, and Limited Data Sets.
Standard operating procedures for de-identification, validation, and approval.
Training, access logging, incident response, and periodic risk reviews.

Data Use Agreements

A Data Use Agreement (DUA) is mandatory for sharing a Limited Data Set and is strongly recommended for expert-determined releases. It defines the guardrails that keep use aligned with the HIPAA Privacy Rule and your organizational risk appetite.

Core DUA terms

Permitted purposes and users; prohibition on uses beyond the stated purpose.
No re-identification or attempts to contact individuals.
Prohibition on further disclosure; required flow-down to agents and subcontractors.
Administrative, technical, and physical safeguards proportionate to risk.
Reporting of improper uses or disclosures and timely mitigation.
Data Sharing Protocols: access method, retention period, output checks, and publication rules.
Return or destruction of data at term end, with certification.
Audit rights, sanctions for violations, and dispute terms.

Operationalizing DUAs

Gate access through an approval workflow tied to IRB or data governance review.
Use tiered environments (sandbox, enclave, production) with least-privilege access.
Automate enforcement: masking policies, query controls, and export restrictions.
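
One way to picture automated enforcement is a pre-export check that encodes a few DUA terms as data. Everything in this sketch is hypothetical: the `DuaPolicy` fields, the `check_export` function, and the minimum cell size are assumptions for illustration, and your agreement determines the actual terms.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DuaPolicy:
    permitted_users: frozenset  # users named in the agreement
    expires: date               # end of the agreement term
    min_cell_size: int          # smallest count allowed in exported tables


def check_export(policy: DuaPolicy, user: str, today: date,
                 cell_counts: list[int]) -> list[str]:
    """Return the reasons an export violates the policy (empty if allowed)."""
    problems = []
    if user not in policy.permitted_users:
        problems.append("user not permitted")
    if today > policy.expires:
        problems.append("agreement expired")
    # Non-zero counts below the threshold are treated as small cells.
    small = [c for c in cell_counts if 0 < c < policy.min_cell_size]
    if small:
        problems.append(f"{len(small)} cell(s) below minimum size")
    return problems
```

Returning a list of reasons rather than a single boolean makes it easier to log why a request was blocked and to route it for manual review.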

Conclusion

To de-identify PHI under HIPAA effectively, align your method with the use case, remove or transform risky fields, and manage residual risk with contractual and technical controls. Pair solid methodology with a strong DUA and living Data Sharing Protocols to sustain compliance and analytical value.

FAQs

What are the two main HIPAA de-identification methods?

HIPAA recognizes Safe Harbor and Expert Determination. Safe Harbor removes a fixed set of identifiers. Expert Determination uses Statistical De-Identification to show a “very small” re-identification risk based on modeling, transformations, and documented controls.

What identifiers must be removed under Safe Harbor?

The 18 categories include names; most geographic details below state (with a ZIP code three-digit exception and 000 rule); all elements of dates except year plus ages over 89; phone and fax numbers; emails; SSNs; medical record and health plan numbers; account and license numbers; vehicle and device IDs; URLs and IPs; biometric identifiers; full-face photos; and any other unique identifying numbers or characteristics.

How is re-identification risk assessed in expert determination?

An expert measures how easily records could be linked to individuals using reasonably available external data. They quantify risk (e.g., via k-anonymity and related metrics), apply transformations like generalization, suppression, and noise, and document that the residual risk is very small for the defined purpose and sharing context.

What are the main compliance challenges when sharing de-identified data?

Key challenges include governing purpose limits and minimum necessary disclosures, controlling cumulative linkage risk across releases, managing recipient obligations through a Data Use Agreement, securing environments and outputs, and aligning HIPAA obligations with broader PII and state privacy requirements.
