5.3 Privacy preservation, anonymization and feature extraction

The term personal data relates not only to data used in the actual analysis, but also to any other piece of information connected to the dataset that could somehow identify a person (e.g., a reference to a person in a file on a local computer, or a printed document with a participant's contact information stored in a safe). In the US, the HIPAA Privacy Rule (US Code of Federal Regulations, 45 CFR Parts 160 and 164, https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html) lists 18 elements as direct identifiers, including the following data types commonly used when performing a study: names, zip codes, all elements of dates (except year), telephone numbers, email addresses, social security numbers, account numbers, vehicle identifiers and serial numbers (including license plate numbers), full-face photographic images and any comparable images. The GDPR is less explicit, defining personal data as “any information relating to an identified or identifiable natural person”.

De-identified data are obfuscated data, making a person’s identity less obvious and minimizing the risk of unintended disclosure, whereas anonymized data are data that cannot be traced back to an individual by any means (http://support.sas.com/resources/papers/proceedings15/1884-2015.pdf). A Data Provider must strike a balance between identifiable and de-identified data, obfuscating the participant’s identity with a variety of techniques. Possible methods include record suppression, randomization, pseudo-identification, masking and sub-sampling (see “Primer on data privacy” in the link above). It is important to underline that even if data are de-identified, they are still personal data according to the GDPR.
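
As an illustration only (not taken from the referenced primer), the sketch below shows what suppression, masking and sub-sampling might look like for a simple tabular trip record; the field names (driver_name, zip_code, birth_date) and the release fraction are hypothetical.

```python
# Illustrative de-identification sketch (hypothetical field names, not project code).
import random

def deidentify(record):
    """Return a copy of a record with direct identifiers obfuscated."""
    out = dict(record)
    out.pop("driver_name", None)                     # suppression of a direct identifier
    out["zip_code"] = record["zip_code"][:3] + "**"  # masking by truncation
    out["birth_date"] = record["birth_date"][:4]     # keep only the year (HIPAA-style)
    return out

def subsample(records, fraction=0.1, seed=1):
    """Release only a random fraction of the records (sub-sampling)."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

trips = [{"driver_name": "Jane Doe", "zip_code": "41296",
          "birth_date": "1980-05-17", "mean_speed_kmh": 72}]
released = [deidentify(r) for r in subsample(trips, fraction=1.0)]
print(released)  # [{'zip_code': '412**', 'birth_date': '1980', 'mean_speed_kmh': 72}]
```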

To adhere to the GDPR, it is important for organisations to act according to Art. 25 on data protection by design and by default. This means that any organisation managing personal data must implement technical and organisational measures from the earliest stage of data processing. This could mean that any data transferred are encrypted, safely and securely stored, and that any direct identifiers (e.g., driver name or vehicle registration number) are replaced in a pseudonymization step (e.g., with a driver id or vehicle id). In addition, the European Union Agency for Network and Information Security (ENISA) proposes several strategies for “Privacy by design in the era of big data” (https://www.enisa.europa.eu/publications/big-data-protection/at_download/fullReport), described in Table 13.
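
A minimal pseudonymization sketch is shown below, assuming records arrive as Python dicts and that the project holds a secret key outside the dataset; the field names and the key are hypothetical. A keyed hash (HMAC) gives stable driver and vehicle ids without storing a mapping table, although a lookup table kept under access control is an equally common design choice.

```python
# Minimal pseudonymization sketch (hypothetical field names and key, not project code).
import hmac, hashlib

SECRET_KEY = b"project-pseudonymization-key"  # hypothetical; must be kept outside the dataset

def pseudonym(value: str, prefix: str) -> str:
    """Derive a stable, non-reversible pseudonym from a direct identifier."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"

record = {"driver_name": "Jane Doe", "vehicle_reg": "ABC123", "trip_length_km": 14.2}
record["driver_id"] = pseudonym(record.pop("driver_name"), "driver")
record["vehicle_id"] = pseudonym(record.pop("vehicle_reg"), "vehicle")
print(record)  # direct identifiers replaced by driver_id / vehicle_id pseudonyms
```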

Table 13: Privacy-by-design strategies

Privacy-by-design strategy | Description
Minimize | The amount of personal data should be restricted to the minimum amount possible (data minimization).
Hide | Personal data and their interrelations should be hidden from plain view.
Separate | Personal data should be processed in a distributed fashion, in separate compartments whenever possible.
Aggregate | Personal data should be processed at the highest level of aggregation and with the least possible detail in which they are still useful.
Inform | Data subjects should be adequately informed whenever their personal data are processed (transparency).
Control | Participants should be provided agency over the processing of their personal data.
Enforce | A privacy policy compatible with legal requirements should be in place and enforced.
Demonstrate | Data controllers must be able to demonstrate compliance with the privacy policy in force and with any applicable legal requirements.
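
As a concrete illustration of the Minimize and Aggregate strategies (not prescribed by ENISA), the sketch below reduces raw GPS fixes to counts per coarse grid cell and hour, so that exact positions never leave the processing step; the cell size, time resolution and synthetic coordinates are assumptions.

```python
# Illustrative minimization/aggregation of GPS fixes (synthetic data, assumed resolutions).
from collections import Counter

def to_cell(lat, lon, cell_deg=0.05):
    """Snap a coordinate to a coarse grid cell index (roughly 5 km in latitude)."""
    return (int(lat // cell_deg), int(lon // cell_deg))

def aggregate(fixes):
    """fixes: iterable of (lat, lon, hour) -> counts per (cell, hour) bucket."""
    return Counter((to_cell(lat, lon), hour) for lat, lon, hour in fixes)

fixes = [(57.7089, 11.9746, 8), (57.7101, 11.9750, 8), (57.6900, 11.9500, 9)]
print(aggregate(fixes))  # counts per ((lat_cell, lon_cell), hour): two fixes share a cell at hour 8
```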

The actors must take the two concepts of data protection by design and by default, and privacy by design, and balance them against the research needs. It is important to document the decisions and their implementation, and to have the organisational means to monitor compliance with the required data protection and privacy measures.

A completely anonymized dataset may be less useful and less valuable for analysis. In any case, legal and ethical restrictions on how long one is allowed to keep a personal dataset may force its deletion.

For rich media (such as video or images), feature extraction is a method of preserving privacy while still allowing valuable datasets to be used for analysis. Feature extraction can translate rich media data into measures, thus removing the identifiable elements. Efficient feature extraction could address two major issues when rich data have been collected in a project: current datasets could be shared, and features could be extracted from the data before they are purged.
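
A minimal sketch of the idea follows; it assumes OpenCV is available and uses a stock Haar-cascade face detector purely as a stand-in for whatever extractors a project would actually run (e.g., gaze or head-pose estimation). Only derived measures are kept, never the frames themselves.

```python
# Illustrative privacy-preserving feature extraction from video (OpenCV assumed available).
import cv2

def extract_features(video_path):
    """Yield per-frame measures (frame index, face count) without storing any pixels."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        yield {"frame": frame_idx, "face_count": len(faces)}  # derived measure only
        frame_idx += 1
    cap.release()

# Example: persist only the derived measures, after which the raw video can be purged.
# measures = list(extract_features("drive_001.mp4"))  # hypothetical file name
```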

The first decision to make is which features should be extracted from the data; if the extraction is being performed prior to data deletion, Data Owners, Providers and Consumers must collaborate on this difficult task. Finally, the project must decide whether it has the extensive computational resources required to extract features from a large dataset.

The main benefit of feature extraction is the possibility of enhancing existing datasets with new attributes or measures, previously only available from costly manual video coding processes. GPS traces are also considered personal data, albeit indirect, as they can potentially reveal where people live and work and even their children’s schools. Similarly, detailed travel diaries covering long periods of time cannot be made public if they contain addresses, even though a person making a single trip in the diary could be anyone living or working at those addresses. Many approaches are being explored to ensure personal integrity, e.g., k-anonymity and differential privacy (see the ENISA report linked above); a simple k-anonymity check is sketched below. The trade-off between anonymization and maintaining the usefulness of the data for research is difficult.
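
The sketch below shows one way such a check could be operationalized: a simple k-anonymity test over a set of quasi-identifiers. The column names, the coarse home/work cells and the value of k are assumptions; differential privacy would require a different, noise-adding mechanism (see the ENISA report).

```python
# Illustrative k-anonymity check over quasi-identifiers (hypothetical column names).
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k=5):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

records = [
    {"home_cell": "57.70/11.95", "work_cell": "57.68/11.97", "trip_km": 12.4},
    {"home_cell": "57.70/11.95", "work_cell": "57.68/11.97", "trip_km": 9.8},
]
print(is_k_anonymous(records, ["home_cell", "work_cell"], k=2))  # True
print(is_k_anonymous(records, ["home_cell", "work_cell"], k=5))  # False
```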