Examples and Best Practices: Removing HIPAA Identifiers for De-Identification


Kevin Henry

HIPAA

May 03, 2024

8 minutes read

HIPAA De-Identification Methods

Overview

HIPAA allows two lawful paths to de-identify Protected Health Information so that it can be shared with a “very small” risk of re-identification. The first is the Safe Harbor Rule, which removes specific identifiers. The second is the Expert Determination Method, which uses statistical and scientific principles to manage risk while preserving more data utility.

Safe Harbor Rule

Under Safe Harbor, you remove all 18 enumerated identifiers and ensure you do not have actual knowledge that the remaining information could identify an individual. This approach is straightforward, scalable, and well-suited when precise dates or fine-grained locations are not essential for your analysis.

Expert Determination Method

With Expert Determination, a qualified expert performs a Re-Identification Risk Assessment, applies proven Anonymization Techniques, documents the methods, and concludes the residual risk is very small. This path supports richer data (for example, shifted dates or coarser geographies) when Safe Harbor would strip too much detail.

Choosing a method and example workflow

  • Define your use case and required variables.
  • Inventory direct identifiers and quasi-identifiers (e.g., dates, ZIP codes, combinations of age, sex, and geography).
  • Pick Safe Harbor for simplicity, or Expert Determination if you need more precision.
  • Apply transformations, validate risk, and document every step for audit readiness.

Example: Replace names with a random study ID not derived from PHI, convert birth date 02/14/1962 to year 1962, change age 92 to 90+, and reduce ZIP 02139 to its first three digits (021), or to 000 if the combined area for that 3-digit prefix has 20,000 or fewer people.
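The example above can be sketched in a few lines of Python. This is a minimal illustration, not a complete Safe Harbor implementation: the record layout, the function names, and the contents of RESTRICTED_ZIP3 are assumptions made for the example, and the real list of low-population ZIP prefixes has to be built from current Census data.

```python
import secrets
from datetime import date

# Illustrative placeholder: 3-digit ZIP prefixes whose combined population is
# 20,000 or fewer. Build the real set from current Census Bureau data.
RESTRICTED_ZIP3 = {"036", "059", "102"}


def safe_harbor_zip(zip_code: str) -> str:
    """Keep the first three digits, or return 000 for low-population areas."""
    prefix = zip_code[:3]
    return "000" if prefix in RESTRICTED_ZIP3 else prefix


def safe_harbor_age(age: int) -> str:
    """Aggregate ages over 89 into a single 90+ category."""
    return "90+" if age >= 90 else str(age)


def deidentify_record(record: dict) -> dict:
    """Return a new record holding only transformed fields (the name is dropped)."""
    is_90_plus = record["age"] >= 90
    return {
        # Random study ID, not derived from any field of the record.
        "study_id": secrets.token_hex(8),
        # A birth year that reveals an age over 89 must not be kept either.
        "birth_year": None if is_90_plus else record["birth_date"].year,
        "age": safe_harbor_age(record["age"]),
        "zip3": safe_harbor_zip(record["zip"]),
    }


# Made-up sample record for demonstration only.
patient = {"name": "Jane Doe", "birth_date": date(1962, 2, 14), "age": 62, "zip": "02139"}
clean = deidentify_record(patient)
```

Building a new dictionary, rather than deleting keys from the original, means any field you forgot to handle is dropped by default instead of leaking through.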

Safe Harbor Identifiers to Remove

  • Names.
  • All geographic subdivisions smaller than a state (street address, city, county, precinct, ZIP, geocodes), except the initial three digits of a ZIP code if the combined area has more than 20,000 people; otherwise use 000.
  • All elements of dates (except year) directly related to an individual, including birth, admission, discharge, and death dates; and ages over 89, which must be aggregated as 90 or older.
  • Telephone numbers.
  • Fax numbers.
  • Email addresses.
  • Social Security numbers.
  • Medical record numbers.
  • Health plan beneficiary numbers.
  • Account numbers.
  • Certificate and license numbers.
  • Vehicle identifiers and serial numbers, including license plates.
  • Device identifiers and serial numbers.
  • Web URLs.
  • IP address numbers.
  • Biometric identifiers, including finger- and voiceprints.
  • Full-face photographs and comparable images.
  • Any other unique identifying number, characteristic, or code (except a re-identification code that is not derived from PHI and whose key is kept separately).

De-Identification Tools and Techniques

Direct removal and irretrievable deletion

Drop columns containing direct identifiers and overwrite any residual values so they cannot be reconstructed from logs or backups. Never keep masked versions of prohibited identifiers for Safe Harbor (for example, last four digits of SSN are not permitted to remain).

Generalization and aggregation

Coarsen quasi-identifiers to reduce uniqueness while retaining analytic value: convert precise dates to years or relative days from an index event, bin ages into ranges, and map locations to larger areas. Under Safe Harbor, the only geography you may keep is the state and a qualifying 3-digit ZIP; county-level data requires Expert Determination.
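As a rough sketch of two of these generalizations, the helpers below bin ages into ranges and convert a date into days relative to an index event. The function names, the 10-year band width, and the sample dates are assumptions chosen for illustration.

```python
from datetime import date


def age_band(age: int, width: int = 10, top: int = 90) -> str:
    """Map an exact age to a range such as 40-49; ages at or above `top` become 90+."""
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    high = min(low + width - 1, top - 1)
    return f"{low}-{high}"


def relative_day(event: date, index_event: date) -> int:
    """Express a date as the number of days since the index event (Day 0)."""
    return (event - index_event).days


admission = date(2024, 3, 1)   # made-up index event
follow_up = date(2024, 3, 8)   # made-up later visit
print(age_band(47), relative_day(follow_up, admission))
```

Relative days keep the intervals between events intact, which is usually what a time-to-event analysis needs, while removing the calendar dates themselves.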

Data Masking and tokenization

Use tokenization to replace identifiers with random IDs that are not derived from PHI; store the key separately with strict access controls. Pattern-preserving masking can help testing, but do not rely on it to satisfy Safe Harbor for prohibited identifiers.
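A minimal tokenization sketch, assuming an in-memory lookup table for the example. The Tokenizer class and its method names are hypothetical; in practice the key table would live in a separate, access-controlled store rather than in the process that handles the de-identified data.

```python
import secrets


class Tokenizer:
    """Replace identifiers with random tokens that are not derived from PHI."""

    def __init__(self) -> None:
        # identifier -> token. This is the re-identification key and must be
        # stored apart from the de-identified dataset.
        self._key: dict[str, str] = {}
        self._issued: set[str] = set()

    def token_for(self, identifier: str) -> str:
        """Return a stable random token for the identifier."""
        if identifier not in self._key:
            token = secrets.token_urlsafe(12)
            while token in self._issued:  # guard against an unlikely collision
                token = secrets.token_urlsafe(12)
            self._issued.add(token)
            self._key[identifier] = token
        return self._key[identifier]


tokenizer = Tokenizer()
first = tokenizer.token_for("MRN-0012345")   # made-up medical record number
again = tokenizer.token_for("MRN-0012345")
other = tokenizer.token_for("MRN-0067890")
```

Because the token comes from a random generator rather than a hash of the identifier, no one can recompute it from the identifier alone; the lookup table is the only link back to the individual.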

Anonymization Techniques for structured data

  • k-anonymity: ensure every record shares quasi-identifier values with at least k−1 others.
  • l-diversity and t-closeness: add protection against attribute disclosure by diversifying sensitive values within groups and keeping distributions close to the overall dataset.
  • Microaggregation: replace values with group centroids to blur uniqueness.
  • Noise addition and perturbation: add small, documented noise to continuous variables to reduce linkage risk.
  • Data Suppression: drop high-risk cells, rare codes, or outliers after evaluating analytic impact.
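To make one of the techniques above concrete, here is a simple one-dimensional microaggregation sketch: values are sorted, split into groups of at least k, and each value is replaced by its group mean. The function name and the fixed-size grouping strategy are assumptions for illustration; dedicated statistical disclosure control tools use more refined grouping methods.

```python
from statistics import mean


def microaggregate(values: list[float], k: int = 3) -> list[float]:
    """Replace each value with the mean of its group of at least k neighbors."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the values in ascending order, so neighbors are grouped together.
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Fold a short trailing group into the previous one so no group is below k.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    result = [0.0] * len(values)
    for group in groups:
        centroid = mean(values[i] for i in group)
        for i in group:
            result[i] = centroid
    return result


lengths_of_stay = [10, 12, 11, 50, 52, 51, 90]  # made-up values
blurred = microaggregate(lengths_of_stay, k=3)
```

The overall total and mean of the column are preserved, but an outlier such as 90 is pulled toward its neighbors, so review the analytic impact before using this on skewed variables.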

Unstructured data and images

Apply NLP-driven redaction to clinical notes to detect names, contact details, places, dates, and ID patterns. For images, remove full-face features and strip metadata (e.g., EXIF) that may contain hidden identifiers.
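The sketch below shows only the pattern-matching part of note redaction, using regular expressions for a few well-structured identifiers. It is a deliberately simplified stand-in: the patterns are assumptions for US-style formats, and names and places cannot be caught this way, which is why an NLP-based de-identifier and manual sampling are still needed.

```python
import re

# Assumed patterns for US-style formats; extend and test them on your own notes.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed category label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Made-up note text for demonstration only.
note = "Pt seen 02/14/2024. Call 617-555-0100 or jane.doe@example.com. SSN 123-45-6789."
print(redact(note))
```

Replacing matches with a category label, rather than deleting them, keeps the note readable and makes it easier to audit what the redaction step removed.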

Challenges in Data De-Identification

Mosaic effect and external linkages

Even when direct identifiers are removed, combinations of quasi-identifiers can uniquely identify someone when linked with public or commercial datasets. Plan for this mosaic effect in your risk model.

Small cells and rare conditions

Rare diagnoses, uncommon procedures, or small geographic areas increase re-identification risk. Use cell suppression, category collapsing, or controlled blurring to mitigate these edge cases.
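Category collapsing can be sketched as follows. The function name, the minimum count of 5, and the sample codes are assumptions for the example; set the threshold according to your own policy or your expert's recommendation.

```python
from collections import Counter


def collapse_rare(values: list[str], min_count: int = 5, other: str = "OTHER") -> list[str]:
    """Replace categories that occur fewer than min_count times with a catch-all label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Made-up diagnosis codes: two common categories and two rare ones.
codes = ["J45"] * 6 + ["E11"] * 5 + ["Q87"] + ["G71"] * 2
collapsed = collapse_rare(codes, min_count=5)
print(Counter(collapsed))
```

In this sample the catch-all group ends up with only three records, which is still below the threshold, so collapsing alone is not enough here; the remaining small cell would need suppression or a broader grouping.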

Longitudinal and free-text data

Repeated visits, timestamp patterns, care pathways, and narrative notes can leak identity. Use consistent date shifting, event sequencing (Day 0, Day 7), and robust redaction to reduce traceability.
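A minimal sketch of consistent date shifting, assuming an in-memory offset table and a maximum shift of 180 days (both chosen for illustration). Each patient gets one random offset that is reused for all of that patient's dates. Shifted dates are still dates, so this approach belongs under Expert Determination rather than Safe Harbor.

```python
import secrets
from datetime import date, timedelta


class DateShifter:
    """Shift every date for a patient by the same random, nonzero offset."""

    def __init__(self, max_days: int = 180) -> None:
        self._max_days = max_days
        # patient ID -> offset in days. Treat this table like a
        # re-identification key and store it separately.
        self._offsets: dict[str, int] = {}

    def shift(self, patient_id: str, original: date) -> date:
        if patient_id not in self._offsets:
            offset = 0
            while offset == 0:  # redraw until the offset actually moves the date
                offset = secrets.randbelow(2 * self._max_days + 1) - self._max_days
            self._offsets[patient_id] = offset
        return original + timedelta(days=self._offsets[patient_id])


shifter = DateShifter(max_days=180)
visit_1 = shifter.shift("patient-A", date(2024, 1, 1))   # made-up dates
visit_2 = shifter.shift("patient-A", date(2024, 1, 8))
```

Because both visits move by the same amount, the seven-day gap between them survives, while the true calendar dates do not appear in the output.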

Best Practices for Compliance

  • Adopt a written de-identification policy that distinguishes between the Safe Harbor Rule and the Expert Determination Method and defines approval workflows for each.
  • Minimize data at the source; collect and share only what you need for the stated purpose.
  • Catalog data elements and classify each as direct identifier, quasi-identifier, or sensitive attribute.
  • Use segregation of duties: engineering applies transformations; an independent expert validates the Re-Identification Risk Assessment.
  • Generate audit artifacts: transformation specs, risk reports, test results, and sign-offs.
  • Control the environment: access controls, encryption, data use agreements, and prohibition on re-linkage attempts.
  • For re-identification keys, use random codes not derived from PHI and manage them in a separate, secured system.

Assessing Re-Identification Risk

Define attacker models and quasi-identifiers

Model realistic adversaries (prosecutor, journalist, marketer) and list quasi-identifiers they might know (age band, sex, 3-digit ZIP, admission year). This clarifies what to measure and where to transform.

Compute risk metrics

  • Equivalence class size (k): count records sharing the same quasi-identifiers; increase generalization or suppression until k meets your target.
  • Uniqueness rate: percentage of records unique on quasi-identifiers; drive this toward zero.
  • Attribute disclosure: apply l-diversity or t-closeness tests for sensitive fields (e.g., HIV status).
  • Replicability and consistency: ensure transformations (e.g., date shifting) are consistent within a patient yet unpredictable externally.
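The first three metrics above can be computed with a short script. This is a sketch under stated assumptions: records are plain dictionaries, the field names and sample values are made up, and l-diversity is measured in its simplest form (the number of distinct sensitive values per equivalence class).

```python
from collections import Counter, defaultdict


def risk_summary(records: list[dict], quasi_ids: list[str], sensitive: str) -> dict:
    """Compute minimum class size (k), uniqueness rate, and minimum distinct l."""
    class_sizes: Counter = Counter()
    sensitive_values = defaultdict(set)
    for record in records:
        key = tuple(record[q] for q in quasi_ids)
        class_sizes[key] += 1
        sensitive_values[key].add(record[sensitive])
    unique_records = sum(1 for size in class_sizes.values() if size == 1)
    return {
        "min_k": min(class_sizes.values()),
        "uniqueness_rate": unique_records / len(records),
        "min_l": min(len(values) for values in sensitive_values.values()),
    }


# Made-up sample data: three equivalence classes of sizes 3, 2, and 1.
records = [
    {"age_band": "40-49", "sex": "F", "zip3": "021", "dx": "flu"},
    {"age_band": "40-49", "sex": "F", "zip3": "021", "dx": "asthma"},
    {"age_band": "40-49", "sex": "F", "zip3": "021", "dx": "flu"},
    {"age_band": "60-69", "sex": "M", "zip3": "100", "dx": "diabetes"},
    {"age_band": "60-69", "sex": "M", "zip3": "100", "dx": "diabetes"},
    {"age_band": "50-59", "sex": "F", "zip3": "021", "dx": "flu"},
]
summary = risk_summary(records, ["age_band", "sex", "zip3"], "dx")
```

Here the last record is unique on its quasi-identifiers, and the second class has a single diagnosis value, so anyone known to be in that class has their diagnosis disclosed even though the class itself holds two records.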

Document “very small” risk

Under the Expert Determination Method, the expert must document methods, assumptions, thresholds, tests, and results supporting a conclusion that residual risk is very small. Reassess whenever data, context, or external linkages change.

Maintaining Data Utility

Design for the analysis

Start from analytic goals. Preserve signal-carrying features (e.g., comorbidity indices, length of stay, relative timing) while generalizing or masking only what materially drives risk.

Utility-preserving transformations

  • Dates: convert to study days relative to an index event, or to year plus quarter under Expert Determination (Safe Harbor permits the year only).
  • Geography: keep the state or a qualifying 3-digit ZIP under Safe Harbor; roll up to county or health service area only under Expert Determination.
  • Clinical codes: group to higher-level categories (e.g., CCS groupers) to reduce sparsity and small cells.
  • Tiered releases: provide highly de-identified public data and richer, access-controlled data under agreements.

Measure and iterate

Quantify data loss with downstream model accuracy, bias checks, and coverage metrics. If utility falls short, adjust generalization levels, swap techniques (e.g., microaggregation instead of suppression), and re-run your risk assessment.
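Two very simple utility checks are sketched below: coverage (the share of values that survived suppression) and the relative shift in a column's mean. The function names and sample values are assumptions for illustration, and these checks complement rather than replace downstream model accuracy and bias testing.

```python
from statistics import mean


def coverage(values: list) -> float:
    """Share of values that were not suppressed (suppressed cells are None)."""
    return sum(v is not None for v in values) / len(values)


def relative_mean_shift(original: list[float], transformed: list) -> float:
    """Relative change in the mean after transformation, ignoring suppressed cells.

    Assumes the original mean is nonzero.
    """
    kept = [v for v in transformed if v is not None]
    return abs(mean(kept) - mean(original)) / abs(mean(original))


# Made-up length-of-stay values; the outlier is suppressed in the released column.
original = [2, 3, 4, 5, 30]
released = [2, 3, 4, 5, None]
print(coverage(released), relative_mean_shift(original, released))
```

In this sample, suppressing one outlier keeps 80% of the values but moves the mean by roughly 60%, a sign that a different technique may preserve more signal for that variable.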

Conclusion

Effective de-identification balances privacy and usefulness. Use the Safe Harbor Rule for simplicity, or the Expert Determination Method for flexibility, apply sound Anonymization Techniques with targeted Data Suppression, and document a rigorous Re-Identification Risk Assessment to protect individuals while keeping your dataset actionable.

FAQs

What are the 18 HIPAA identifiers that must be removed?

The 18 are: names; all geographic subdivisions smaller than a state (street address, city, county, precinct, ZIP, geocodes), except the first three digits of a ZIP code when the combined area has more than 20,000 people; all elements of dates (except year) related to an individual, and ages over 89 (aggregate to 90+); telephone numbers; fax numbers; email addresses; Social Security numbers; medical record numbers; health plan beneficiary numbers; account numbers; certificate/license numbers; vehicle identifiers and serial numbers, including license plates; device identifiers and serial numbers; web URLs; IP addresses; biometric identifiers (e.g., finger- and voiceprints); full-face photos and comparable images; and any other unique identifying number, characteristic, or code, other than a permitted re-identification code that is not derived from PHI.

How does the Safe Harbor method ensure data de-identification?

Safe Harbor removes the 18 specific identifiers and requires you to avoid actual knowledge that the remaining data could identify someone. When applied correctly, direct identifiers are eliminated and quasi-identifiers are limited (e.g., dates to years, ZIP to 3-digit or 000), reducing linkage risk to a very small likelihood consistent with HIPAA.

What tools can help with de-identification?

Use a combination of capabilities: statistical disclosure control tools for k-anonymity, l-diversity, and microaggregation; NLP-based redaction for free-text notes; tokenization services for generating non-PHI study IDs; and data masking utilities within your ETL or analytics platform. Commonly used options include ARX (for anonymization modeling), sdcMicro (R), Amnesia, and open-source text de-identifiers (e.g., NLM Scrubber, MITRE MIST). Evaluate each against your governance, logging, and validation needs before adoption.

How can re-identification risk be effectively assessed?

Define attacker knowledge, enumerate quasi-identifiers, and compute k-anonymity and uniqueness rates. Test for attribute disclosure (l-diversity/t-closeness), check small cells, and simulate linkage using representative external data. Document assumptions, thresholds, and results, then re-run the assessment after any data or context change to maintain a “very small” risk profile.
