HIPAA De-Identification Checklist: Steps, Permitted Data, and Common Pitfalls

Kevin Henry

HIPAA

May 01, 2024

HIPAA De-Identification Methods

Under HIPAA, once Protected Health Information (PHI) is properly de-identified, it is no longer regulated as PHI. You have two permitted paths: the Safe Harbor method and the Expert Determination method. Each aims to reduce the risk of re-identification to an acceptably low level while preserving utility for analytics, research, or operations.

Two permitted paths at a glance

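Safe Harbor: remove all 18 specified identifiers, apply the special rules for dates, ZIP codes, and ages 90+, and have no actual knowledge that the remaining information could be used to identify an individual. This is rule-driven and straightforward to audit.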
Expert Determination: a qualified expert applies a Statistical Risk Assessment and certifies that the risk of re-identification is very small, documenting methods and results.

Select the method based on your use case, dataset complexity, and tolerance for utility vs. risk. For static, routine releases, Safe Harbor often suffices; for complex, high-dimensional data (e.g., longitudinal EHR, images, or geospatial attributes), Expert Determination offers flexibility with documented rigor.

Safe Harbor Identifiers

To satisfy Safe Harbor, you must remove the following 18 identifiers of the individual and of the individual's relatives, employers, or household members:

Names
Geographic subdivisions smaller than a state (street address, city, county, precinct, ZIP), except the initial three digits of a ZIP code if the area formed by combining all ZIP codes with those three digits contains more than 20,000 people; otherwise use 000
All elements of dates (except year) directly related to an individual, such as birth, admission, discharge, and death dates; also all ages over 89 and any date elements (including year) that reveal such an age, which may be grouped into a single category of “90 or older”
Telephone numbers
Fax numbers
Email addresses
Social Security numbers
Medical record numbers
Health plan beneficiary numbers
Account numbers
Certificate/license numbers
Vehicle identifiers and serial numbers, including license plates
Device identifiers and serial numbers
Web URLs
IP address numbers
Biometric identifiers (including finger and voice prints)
Full-face photographs and comparable images
Any other unique identifying number, characteristic, or code

Permitted data under Safe Harbor

State-level geography; three-digit ZIPs only when the population threshold is met (otherwise 000)
Year (for birth, admission, discharge, death) without month or day, unless the year would reveal an age over 89
Age in years if under 90; otherwise report as “90 or older”
Clinical variables not inherently identifying (e.g., lab values, diagnoses) when they do not uniquely identify an individual
A non-derivable re-identification code retained internally by the disclosing entity, provided the mapping mechanism is not shared with recipients

Be cautious with rare conditions, outlier values, or granular geolocation, which can function as quasi-identifiers even if they are not on the explicit list.
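
The field-level rules above can be sketched in a few small functions. This is a minimal illustration, not a complete Safe Harbor implementation: the function names are hypothetical, and the `RESTRICTED_ZIP3` values are placeholders, since the real list of low-population three-digit prefixes has to be built from current Census data.

```python
from datetime import date

# Placeholder values for illustration only. The real set of 3-digit prefixes
# covering 20,000 or fewer people must come from current Census data.
RESTRICTED_ZIP3 = {"036", "059", "102"}


def safe_harbor_zip(zip_code: str) -> str:
    """Keep the first three digits, or return '000' for restricted or malformed ZIPs."""
    prefix = zip_code.strip()[:3]
    if len(prefix) < 3 or not prefix.isdigit() or prefix in RESTRICTED_ZIP3:
        return "000"
    return prefix


def safe_harbor_year(d: date) -> int:
    """Reduce a date to its year, dropping month and day."""
    return d.year


def safe_harbor_age(age_years: int) -> str:
    """Report ages under 90 as-is and group everything else."""
    return str(age_years) if age_years < 90 else "90 or older"


def safe_harbor_birth_year(birth: date, age_years: int):
    """Return the birth year, or None when it would reveal an age over 89."""
    return birth.year if age_years < 90 else None


print(safe_harbor_zip("94110"))   # 941
print(safe_harbor_zip("10213"))   # 000 (prefix is in the placeholder restricted set)
print(safe_harbor_age(93))        # 90 or older
```

These helpers only cover ZIP codes, dates, and ages; the remaining identifiers still have to be removed, and free text or images need separate handling.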

Expert Determination Method

Expert Determination allows you to retain more detail by quantifying risk. A qualified expert applies statistical or scientific principles to conclude that the re-identification risk is very small, given your data, context, and controls.

Core elements

Context definition: who will access the data, for what purpose, and under what safeguards
Statistical Risk Assessment: evaluate identifiability via k-anonymity, l-diversity, t-closeness, record linkage simulations, and small-cell analyses
Transformations: generalization, suppression, noise addition, aggregation, and controlled retention of dates or geography where justified
Documentation: the expert’s rationale, methods, assumptions, and results, plus residual risk and release conditions
Ongoing review: reassess when data scope, external data availability, or use conditions change

This pathway supports high-utility releases (e.g., year-month dates, limited geocodes, longitudinal timelines) with evidence-backed controls and auditable decisions.
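
One of the measures an expert might compute is k-anonymity: the size of the smallest group of records that share the same quasi-identifier values. The sketch below assumes records are plain dictionaries and that the quasi-identifier columns (`age_band`, `zip3`, `sex`) are hypothetical names chosen for illustration; a real assessment would combine this with other analyses.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


records = [
    {"age_band": "30-39", "zip3": "941", "sex": "F", "dx": "asthma"},
    {"age_band": "30-39", "zip3": "941", "sex": "F", "dx": "diabetes"},
    {"age_band": "40-49", "zip3": "941", "sex": "M", "dx": "asthma"},
    {"age_band": "40-49", "zip3": "941", "sex": "M", "dx": "flu"},
    {"age_band": "40-49", "zip3": "100", "sex": "M", "dx": "flu"},
]

k = k_anonymity(records, ["age_band", "zip3", "sex"])
print(k)  # 1: the last record is alone in its group
```

A result of 1 means at least one record is unique on the chosen quasi-identifiers and is a candidate for further generalization or suppression.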

Common Pitfalls in De-Identification

Leaving free text unredacted: clinical notes may contain names, full dates, addresses, or device serials
Overlooking images and metadata: DICOM headers, EXIF data, and full-face photos often carry identifiers
Small cell sizes: publishing highly granular breakdowns (e.g., rare disease by facility and week) can enable singling out
Inconsistent transformations: applying different generalization rules across tables enables linkage
Reusing pseudonymous keys across releases: cross-release consistency raises re-identification risk
Confusing “Limited Data Sets” with de-identified data: a Limited Data Set is still PHI and requires a Data Use Agreement
Vendor gaps: processing PHI without appropriate Business Associate Agreements (BAAs) or without verifying a vendor’s safeguards
Derived identifiers: hash codes or record IDs derived from PHI (e.g., a hash of an SSN or medical record number) do not qualify as permitted re-identification codes, which must not be derived from information about the individual
Ignoring data drift: new external datasets or changing population dynamics can increase linkage risk over time
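
To illustrate the free-text pitfall, here is a minimal pattern-based scrubber. It is a sketch under simple assumptions (US-style phone numbers, numeric dates, hypothetical pattern labels) and only catches well-formatted identifiers; names, addresses, and unusual formats would pass through, so it is a starting point rather than a complete redaction tool.

```python
import re

# Illustrative patterns only; real notes need far broader coverage.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def scrub(text: str) -> str:
    """Replace each pattern match with a bracketed label such as [PHONE]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Pt called from 415-555-0132 on 3/14/2023; SSN 123-45-6789, email jane.doe@example.com."
print(scrub(note))
# Pt called from [PHONE] on [DATE]; SSN [SSN], email [EMAIL].
```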

Data De-Identification Techniques

Data Masking

Masking replaces sensitive values with obfuscated surrogates (e.g., redaction, partial masking, format-preserving masking). It protects direct identifiers while retaining data format for testing or analytics.
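
As a small illustration of partial, format-preserving masking, the hypothetical helper below hides all but the last few alphanumeric characters while keeping separators in place. Note that a partially masked value is still a fragment of the identifier, so this kind of masking on its own would likely not satisfy Safe Harbor; it is better suited to test data or display purposes.

```python
def mask_keep_last(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask every alphanumeric character except the last `visible` ones, keeping separators."""
    total = sum(c.isalnum() for c in value)
    to_mask = max(total - visible, 0)
    out = []
    for c in value:
        if c.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(c)
    return "".join(out)


print(mask_keep_last("415-555-0132"))  # ***-***-0132
```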

Pseudonymization

Pseudonymization swaps identifiers for stable tokens keyed to a secret mapping. Under HIPAA, if you include a code, it must not be derived from PHI and the mapping must remain undisclosed to recipients; otherwise the dataset can remain identifiable.
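
A minimal sketch of that requirement, assuming a hypothetical `Pseudonymizer` class: codes are generated randomly rather than computed from the identifier, and the mapping lives only in memory held by the disclosing entity. Creating a separate instance per release gives each release its own codes.

```python
import secrets


class Pseudonymizer:
    """Assigns random codes to identifiers; the mapping stays with the disclosing entity."""

    def __init__(self):
        # identifier -> code. Keep this private and never ship it with the data.
        self._codes = {}

    def code_for(self, identifier: str) -> str:
        if identifier not in self._codes:
            # Random value, not a hash or other function of the identifier.
            self._codes[identifier] = secrets.token_hex(8)
        return self._codes[identifier]


release_a = Pseudonymizer()
release_b = Pseudonymizer()

code_a = release_a.code_for("MRN-001")
print(code_a == release_a.code_for("MRN-001"))  # True: stable within one release
print(code_a == release_b.code_for("MRN-001"))  # almost certainly False: new codes per release
```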

Anonymization

Anonymization removes or irreversibly transforms identifiers so that re-identification is not reasonably possible. In practice, Expert Determination often combines multiple methods to reach a “very small” residual risk.

Data Tokenization

Tokenization replaces identifiers with random tokens stored in a secure vault. Use it to segment risk: analytics systems see tokens, while re-identification (when permitted) occurs only through controlled vault access.
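
The vault idea can be sketched as follows. The `TokenVault` class, its role names, and the in-memory storage are assumptions made for illustration; a real vault would be a hardened, audited service rather than a Python dictionary.

```python
import secrets


class TokenVault:
    """Maps values to random tokens and restricts reverse lookup to authorized roles."""

    def __init__(self, authorized_roles):
        self._authorized = set(authorized_roles)
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return the existing token for a value, or mint a new random one."""
        if value not in self._value_to_token:
            token = secrets.token_urlsafe(12)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str, role: str) -> str:
        """Return the original value only for roles allowed to re-identify."""
        if role not in self._authorized:
            raise PermissionError(f"role {role!r} may not detokenize")
        return self._token_to_value[token]


vault = TokenVault(authorized_roles={"privacy_officer"})
token = vault.tokenize("MRN-001")
print(vault.detokenize(token, role="privacy_officer"))  # MRN-001
```

Analytics code would only ever see `token`; a call to `detokenize` with any other role raises `PermissionError`.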

Generalization and Suppression

Generalization coarsens values (e.g., age bands, year-only dates, state-level geography). Suppression removes risky attributes or outlier records that defeat group anonymity guarantees.
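
The two steps can be combined in a short sketch: ages are coarsened into bands, and records whose generalized group is smaller than a chosen threshold are suppressed. The field names (`age`, `state`, `dx`) and the threshold `k=2` are illustrative assumptions, not recommended values.

```python
from collections import Counter


def age_band(age: int, width: int = 10) -> str:
    """Coarsen an age into a band; ages 90 and above share one top category."""
    if age >= 90:
        return "90+"
    low = (age // width) * width
    high = min(low + width - 1, 89)
    return f"{low}-{high}"


def generalize_and_suppress(records, k: int = 2):
    """Generalize ages, then drop records whose (age_band, state) group has fewer than k members."""
    generalized = [
        {"age_band": age_band(r["age"]), "state": r["state"], "dx": r["dx"]}
        for r in records
    ]
    sizes = Counter((g["age_band"], g["state"]) for g in generalized)
    return [g for g in generalized if sizes[(g["age_band"], g["state"])] >= k]


rows = [
    {"age": 34, "state": "CA", "dx": "asthma"},
    {"age": 37, "state": "CA", "dx": "flu"},
    {"age": 41, "state": "CA", "dx": "flu"},
    {"age": 92, "state": "NV", "dx": "asthma"},
]
kept = generalize_and_suppress(rows, k=2)
print(len(kept))  # 2: the two records in the 30-39 / CA group survive
```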

Perturbation and Differential Privacy

Perturbation adds calibrated noise to prevent exact reconstruction of individuals. Differential privacy offers formal guarantees when releasing aggregates, curbing risks from repeated queries and linkage attacks.
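
For a simple counting query, one common way to add calibrated noise is the Laplace mechanism: a count changes by at most 1 when one person is added or removed, so noise is drawn from a Laplace distribution with scale 1/epsilon. The sketch below uses only the standard library and hypothetical function names; a production release should rely on a vetted differential privacy library, since naive floating-point sampling has known weaknesses.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random = None) -> float:
    """Return a count with Laplace noise of scale 1/epsilon (sensitivity of a count is 1)."""
    if rng is None:
        rng = random.Random()
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Smaller epsilon means more noise and stronger privacy.
print(noisy_count(120, epsilon=1.0))
print(noisy_count(120, epsilon=0.1))
```

Each query spends part of a privacy budget, so repeated releases of the same statistic should be tracked rather than answered indefinitely.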

Re-Identification Risks

Even after de-identification, risk persists through quasi-identifiers and external data. Adversaries can link gender, birth year, and coarse location to voter rolls, social media, or commercial data in a “mosaic effect.”

Linkage attacks: joining de-identified records to public registries, breach corpora, or news reports
Uniqueness: rare combinations (e.g., very old age plus uncommon diagnosis) can single out individuals
Temporal trails: dense visit timelines or device telemetry create recognizable patterns
Unstructured content: notes, images, and audio often leak identifiers
Cumulative releases: multiple snapshots gradually erode privacy through consistency checks

Mitigate by limiting granularity, enforcing minimum cell sizes, rotating pseudonymous keys between releases, and conducting periodic Statistical Risk Assessments.
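
A linkage attack can be simulated in a crude way by joining released records to an external dataset on shared quasi-identifiers and counting how many released rows match exactly one external person. The function name, field names, and data below are hypothetical, and the count is only a rough indicator, not a formal risk estimate.

```python
from collections import defaultdict


def unique_links(released, external, keys):
    """Count released rows whose quasi-identifiers match exactly one external record."""
    index = defaultdict(list)
    for person in external:
        index[tuple(person[k] for k in keys)].append(person)
    hits = 0
    for row in released:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:
            hits += 1
    return hits


released = [
    {"sex": "F", "birth_year": 1950, "state": "VT", "dx": "rare condition"},
    {"sex": "M", "birth_year": 1985, "state": "CA", "dx": "flu"},
]
external = [
    {"name": "A. Example", "sex": "F", "birth_year": 1950, "state": "VT"},
    {"name": "B. Example", "sex": "M", "birth_year": 1985, "state": "CA"},
    {"name": "C. Example", "sex": "M", "birth_year": 1985, "state": "CA"},
]

print(unique_links(released, external, ["sex", "birth_year", "state"]))  # 1
```

Here the first released row links to a single named person, while the second is ambiguous between two candidates.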

De-Identification Process Steps

Define the use case: specify users, purposes, retention, and required data utility.
Inventory PHI: map data elements, provenance, unstructured fields, images, and metadata.
Select a pathway: Safe Harbor for rule-based removal; Expert Determination for nuanced, higher-utility needs.
Design transformations: choose generalization, suppression, Data Masking, Pseudonymization, Anonymization, or Data Tokenization aligned to risk and utility.
Implement controls: remove 18 identifiers (Safe Harbor) or apply expert-approved transformations with documented parameters.
Validate with a Statistical Risk Assessment: test k-anonymity/small cells, attempt linkage, and review edge cases (rare diagnoses, geolocation, images, free text).
Document decisions: record methods, assumptions, test results, and permitted data retained (e.g., year, 3-digit ZIP where allowed).
Set governance and contracts: restrict access, enable auditing, and ensure vendors handling PHI have appropriate Business Associate Agreements (BAAs).
Release and monitor: version datasets, monitor for drift and new external data sources, and refresh the assessment on material changes.
Plan re-identification handling: if you retain a code, ensure it is non-derivable, keep the mapping separate, and never disclose the re-identification mechanism to recipients.

Summary

Use Safe Harbor for predictable, auditable removal of the 18 identifiers and their special rules; choose Expert Determination when you need finer detail backed by a documented, very-small-risk finding. Combine robust techniques, contracts, and ongoing assessments to keep utility high and re-identification risk low.

FAQs

What are the main methods for HIPAA de-identification?

The two recognized methods are Safe Harbor and Expert Determination. Safe Harbor removes 18 specific identifiers and applies rules for ZIP codes, dates, and ages 90+. Expert Determination relies on a qualified expert to assess and document that re-identification risk is very small, often allowing more granular data with appropriate safeguards.

How does the Safe Harbor Method protect patient privacy?

It protects privacy by eliminating direct identifiers and constraining quasi-identifiers: city and street are removed, ZIPs are limited to three digits only when population thresholds are met, dates are reduced to year, and ages over 89 are grouped. This standardization curbs singling out and linkage while preserving general analytic utility.

What risks exist in re-identifying de-identified data?

Primary risks include linkage to external datasets, uniqueness of rare attribute combinations, temporal pattern recognition, and leaks in unstructured data like images or notes. Repeated releases and consistent pseudonyms can also erode privacy over time unless mitigated with aggregation, noise, and periodic Statistical Risk Assessments.

How can organizations avoid common de-identification pitfalls?

Inventory and sanitize unstructured fields, enforce minimum cell sizes, apply consistent transformations, avoid derivation of codes from PHI, and prevent key reuse across releases. Distinguish de-identified datasets from Limited Data Sets, use Data Use Agreements where applicable, and ensure vendors are covered under appropriate Business Associate Agreements (BAAs). Periodically reassess risk as data or external context evolves.
