General Data Protection Regulation: Pseudonymization vs. Anonymization

06 Jun

What is Pseudonymization?

The General Data Protection Regulation (GDPR) is now in effect, with strong requirements to protect the personal data of European Union (EU) data subjects “by design and by default.” Though the GDPR doesn’t contain detailed technical requirements for data security, it does call out the use of pseudonymization as an appropriate mechanism for data protection. So, what is pseudonymization?

Pseudonymization is defined in Article 4(5) of the GDPR as:

The processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.

Stated another way, pseudonymization is the replacement of identifying or sensitive data with a pseudonym. This is conceptually the same as tokenization, the replacement of sensitive data with a token, a technology the payment card industry has used for years to protect payment card information.

Pseudonymization Versus Anonymization

In addition to pseudonymization, the GDPR also makes a reference to anonymous information in Recital 26:

The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.

So, what’s the difference between anonymized and pseudonymized information?

Imagine a continuum of personal data. At one end is fully identifiable, explicit personal data, which includes things like full names and social security numbers. At the other end is anonymized data with no identifiable personal information. Pseudonymized data lies somewhere in the middle. A data subject’s pseudonymized data can be re-identified, or associated with that individual, by replacing the pseudonyms with the actual data. Fully anonymized data, on the other hand, cannot be linked back to the individual(s) it corresponds to.
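To make the re-identification point concrete, here is a minimal Python sketch. It is an illustration only, not a prescribed implementation: the field names, the `pseu_` prefix, and the in-memory `lookup_table` are all assumptions made for the example. The lookup table plays the role of the separately kept “additional information” from Article 4(5).

```python
import secrets

# Hypothetical choice of which fields count as identifying.
IDENTIFYING_FIELDS = ("full_name", "ssn")

def pseudonymize(record, lookup):
    """Return a copy of record with identifying fields replaced by random pseudonyms.

    Each pseudonym -> original value pair is written to `lookup`, which stands
    in for the separately stored additional information.
    """
    result = dict(record)
    for field in IDENTIFYING_FIELDS:
        if field in result:
            pseudonym = "pseu_" + secrets.token_hex(8)
            lookup[pseudonym] = result[field]
            result[field] = pseudonym
    return result

def reidentify(record, lookup):
    """Swap pseudonyms back for the original values where the lookup knows them."""
    return {
        key: lookup.get(value, value) if isinstance(value, str) else value
        for key, value in record.items()
    }

lookup_table = {}  # in practice, stored apart from the pseudonymized data
original = {"full_name": "Jane Doe", "ssn": "000-00-0000", "purchase_total": 42.50}

pseudonymized = pseudonymize(original, lookup_table)
restored = reidentify(pseudonymized, lookup_table)   # with the lookup table
unchanged = reidentify(pseudonymized, {})            # without it
```

With the lookup table, the original record comes back. Without it, the pseudonyms stay as they are. Anonymized data would correspond to the case where no such table exists anywhere.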

The Future of Privacy Forum (FPF) has put together an excellent infographic detailing the continuum of data de-identification:


As noted in Recital 26, anonymous data is not subject to the data protection obligations created by the GDPR. So why not just anonymize all the sensitive data in your organization’s systems? For one, fully anonymizing a data set is a difficult task. Second, by definition, anonymous data can’t be linked back to identifiable individuals, which renders it useless for almost anything but very high-level data aggregation and analysis.

Benefits of Data Pseudonymization

In contrast to anonymized data, pseudonymized data retains some statistical utility relative to the level of pseudonymization. And in contrast to explicitly identified data, pseudonymized data provides obvious data protection benefits. Because of this, the GDPR provides several incentives for organizations to implement pseudonymization.

Both Article 25 and Recital 78 make reference to “appropriate technical and organisational measures” for data protection and cite pseudonymization as one of those measures. Recital 78 also cites “pseudonymising personal data as soon as possible” as a method that can be used for demonstrating compliance with the GDPR.

Like many other data protection compliance frameworks, the GDPR advocates a risk-based approach. Under Article 32, organizations are directed to “ensure a level of security appropriate to the risk” and again, pseudonymization is described as an appropriate technical measure.

Another incentive to pseudonymize data appears in Article 34(1) regarding the obligation to notify affected data subjects in the event of a data breach when it “is likely to result in a high risk to the rights and freedoms of natural persons.” If the breached data has been appropriately pseudonymized, the risk is lower, potentially mitigating the need for notification. Many U.S. breach notification laws make similar allowances for pseudonymized data sets.

Pseudonymization may also enable processing of personal data beyond the purpose for which it was originally collected. The GDPR requires that personal data be collected only for “specified, explicit and legitimate purposes,” although further processing may be permissible if that processing is compatible with the original purpose. Article 6(4) describes the factors that must be taken into account when determining if further processing is compatible, including “the existence of appropriate safeguards, which may include encryption or pseudonymisation.”

Methods of Pseudonymization

There are multiple methods for pseudonymizing data, including data masking, encryption, and tokenization. At a high level, encryption entails the use of a key to encode or protect a data set. Consequently, encryption is mathematically reversible and subject to the complexities of key management. Tokenization, by comparison, involves replacing identifying or sensitive data with a mathematically unrelated value. Therefore, the tokens cannot be mathematically reversed. Both encryption and tokenization can be format preserving, and tokens may optionally include elements of the original value for data processing purposes. Data masking is a process for obfuscating data that is typically accomplished via encryption.
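As a rough illustration of a format-preserving token that keeps part of the original value, consider the following Python sketch. The function names, the `vault` dictionary, and the choice to retain the last four digits are assumptions made for this example, not a description of any particular product.

```python
import secrets

def tokenize_card_number(pan, vault, keep_last=4):
    """Replace a card number with random digits of the same length.

    The token is drawn at random, so it has no mathematical relationship to
    the original value; the only way back is a lookup in `vault`.
    """
    if not pan.isdigit() or len(pan) <= keep_last:
        raise ValueError("expected a digit string longer than keep_last")
    split = len(pan) - keep_last
    suffix = pan[split:]  # retained digits of the original value
    while True:
        random_part = "".join(secrets.choice("0123456789") for _ in range(split))
        token = random_part + suffix
        # Retry on the rare collision with an existing token or the original.
        if token not in vault and token != pan:
            vault[token] = pan
            return token

def detokenize(token, vault):
    """Recover the original value; possible only with access to the vault."""
    return vault[token]

card_vault = {}
card_token = tokenize_card_number("4111111111111111", card_vault)
```

The token has the same length and character set as the input, so downstream systems that expect a 16-digit value keep working. Unlike ciphertext, there is no key that turns the token back into the card number.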

The most suitable method of pseudonymization will depend on the specific use case and needs of an organization. From a compliance standpoint, however, it’s worth noting that tokenization is the only method that enables an organization to completely remove sensitive or identifying data from its systems by utilizing a cloud-based tokenization provider. This is a significant differentiator from both a compliance and a data security perspective.

Pseudonymization in the Cloud

As mentioned above, the definition of pseudonymization in the GDPR requires that the additional information needed to identify a data subject be “kept separately” and be “subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person.” Utilizing cloud-based tokenization for pseudonymization allows an organization to keep identifying attributes in a cloud token vault completely separate from the remaining data. Again, contrast this with encryption, where the encryption keys are typically stored in the same environment as the encrypted data.

Cloud-based tokenization also allows an organization to tokenize data before it is stored, in line with the measure Recital 78 cites for demonstrating compliance with the GDPR: “pseudonymising personal data as soon as possible.”
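The separation described in this section can be sketched in a few lines of Python. Everything here is hypothetical: `TokenVault` is an in-memory stand-in for a separately hosted vault, not any provider’s actual API, and `app_db` is a plain list standing in for the organization’s own database.

```python
import secrets

class TokenVault:
    """In-memory stand-in for a separately hosted token vault (hypothetical)."""

    def __init__(self):
        self._store = {}

    def tokenize(self, value):
        token = "tok_" + secrets.token_urlsafe(16)
        self._store[token] = value
        return token

    def detokenize(self, token):
        return self._store[token]

vault = TokenVault()  # in practice, a separate environment with its own access controls
app_db = []           # the organization's own system

def ingest(customer):
    """Tokenize the identifying attribute before the record is stored."""
    stored = {
        "email": vault.tokenize(customer["email"]),
        "country": customer["country"],  # non-identifying attribute kept in the clear
    }
    app_db.append(stored)
    return stored

row = ingest({"email": "jane@example.com", "country": "DE"})
```

In this arrangement the application database never holds the email address, only the token. Someone who obtains `app_db` alone has nothing to look the token up against.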

TokenEx can help your organization meet the data protection obligations created by the GDPR by pseudonymizing personal data at the ingestion point. The TokenEx data protection platform provides flexible technologies and methodologies to make tokenizing, encrypting and data vaulting work with any acceptance channel your organization uses.

John Noltensemeyer, CIPP/E/US CIPM, CISSP, ISA, is a Privacy and Compliance Solutions Architect for TokenEx.