Big data, big traps: How massive stores of personal information can be misused
Advances in information and communications technology have led to the phenomenon of “big data”. Vast quantities of data are generated, gathered, stored, linked and analysed with phenomenal ease and efficiency.
Much of this big data is personal data generated online through our social interactions, our relationship with organisations, and our connection with smart devices which record and process data.
This includes “content” data such as tweets, texts, emails, phone calls, social network posts, photos and videos; as well as “contextual” data (called metadata) relating to these communications, for example, the time, origin, destination and duration of a communication.
It also includes internet transactions such as our web search, purchase and browsing histories. Together, it can reveal the details of our personal, political, social, financial, and working lives.
No doubt big data can bring enormous economic and societal benefits as companies and governments use it to unleash powerful analytic capabilities. They are connecting data from different sources to find patterns and generate new insights for optimising customer relationships, targeting behavioural advertising, combating criminal activities, improving health care and many other aspects of our lives.
While these efforts are to be welcomed, they have potential ramifications for privacy and data protection.
At its core, big data analytics uncovers correlations in data. However, correlation does not necessarily imply causation. Hence, while clinical researchers have found a correlation between skipping breakfast and obesity, it would be wrong for us to conclude that eating breakfast will prevent obesity.
It is possible that a research participant was physically inactive, and that this was why he did not feel hungry in the morning and at the same time tended to gain weight. Encouraging him to eat breakfast would only aggravate the problem.
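The confounding at work here can be sketched in a few lines of code. The numbers below are entirely made up for illustration: inactivity drives both breakfast-skipping and weight gain, while skipping breakfast has, by construction, no effect on weight at all. A clear correlation still appears.

```python
import random

random.seed(0)

# Hypothetical model: physical inactivity (the confounder) raises both the
# chance of skipping breakfast and the chance of weight gain. Skipping
# breakfast has NO direct effect on weight in this simulation.
n = 10_000
skipped, gained = [], []
for _ in range(n):
    inactive = random.random() < 0.5
    skipped.append(random.random() < (0.7 if inactive else 0.2))  # inactivity -> skipping
    gained.append(random.random() < (0.6 if inactive else 0.1))   # inactivity -> weight gain

def corr(xs, ys):
    """Pearson correlation between two 0/1 series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

print(f"correlation(skip breakfast, weight gain) = {corr(skipped, gained):.2f}")
# A clearly positive correlation appears even though skipping breakfast
# does not cause weight gain anywhere in this model.
```

The correlation is real; the causal story behind it is not, which is exactly why intervening on breakfast alone would not help.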
Another example showing how big data can be misleading is the Street Bump community project initiated by the city of Boston in the United States in 2012 to help residents improve their neighbourhood streets. As volunteer drivers travelled the streets, the mobile app Street Bump identified potholes by recording "bump" data, providing the city with real-time information with which to fix them.
However, the results recorded were skewed in favour of wealthier neighbourhoods with greater smartphone penetration. Had the skewed data not been adjusted, existing social inequities would have been perpetuated.
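The skew, and the adjustment that corrects it, can be illustrated with invented figures. Assume a wealthy area has fewer potholes but far more app users than a low-income area; the raw report counts then point the repair crews at the wrong neighbourhood until they are divided by a penetration estimate.

```python
# Hypothetical figures: (actual potholes, smartphone/app penetration rate).
areas = {
    "wealthy": (50, 0.8),
    "low-income": (100, 0.2),
}

results = {}
for area, (potholes, penetration) in areas.items():
    raw = potholes * penetration            # reports scale with app usage, not need
    results[area] = {"raw": raw, "adjusted": raw / penetration}
    print(f"{area}: raw reports={raw:.0f}, penetration-adjusted={results[area]['adjusted']:.0f}")

# The raw counts rank the wealthy area as worse off (40 vs 20 reports),
# even though the low-income area actually has twice as many potholes.
```

This is the general lesson of Street Bump: more data is not automatically more representative data.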
Second, big data may be used in profiling, with its attendant risks. For example, some insurance companies tried to use credit reports and lifestyle data as proxies for the analysis of blood and urine samples when determining eligibility and setting premiums.
This had the advantage of offering a more convenient and affordable service: the customer could complete the transaction online by answering a number of apparently neutral questions, and was relieved of painful and costly lab tests.
However, such predictive modelling always entails some margin of error. Perfectly healthy applicants may be rejected, or accepted only at a higher premium, without ever knowing why, and with no way to access and correct any misleading information about them.
Similarly, in the fight against terrorism, the use of blacklists based on statistical inferences is bound to result in false positives and false negatives. It offers no absolute guarantee that terrorist passengers will be intercepted, while some innocent passengers will inevitably be prevented from boarding a plane.
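Why are false positives unavoidable? Because genuine threats are vanishingly rare, even a highly accurate screen flags mostly innocent people. The numbers below are hypothetical, chosen only to show the base-rate arithmetic.

```python
# Hypothetical screening parameters.
base_rate = 1 / 1_000_000          # assumed prevalence of actual threats
sensitivity = 0.99                 # P(flagged | threat)
false_positive_rate = 0.01         # P(flagged | innocent)

# Bayes' theorem: probability a flagged passenger is actually a threat.
p_flagged = sensitivity * base_rate + false_positive_rate * (1 - base_rate)
p_threat_given_flagged = sensitivity * base_rate / p_flagged

print(f"P(actual threat | flagged) = {p_threat_given_flagged:.6f}")
# With these assumptions, roughly one in every 10,000 flagged passengers
# is a real threat; the other 9,999 are false positives.
```

However the parameters are tuned, as long as the base rate is tiny, almost everyone stopped at the gate will be innocent.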
You can only hope that you will not someday be one of the unfortunate ones in the latter category.
Third, the use of big data could be creepy. The retail giant Target, through analysing its customers’ purchasing patterns, was able to identify two dozen products which could be used as proxies for predicting pregnancy so that it could send relevant coupons to the target customer.
This predictive capability was uncovered following a complaint by the father of a teenage girl, who found out that his daughter was three months pregnant from the increased volume of pregnancy-related advertisements from Target arriving in the mail.
That Target "data-mined" its way into the customer's womb is a clear intrusion on privacy.
The Snowden revelations in 2013 offered perhaps the most illuminating example of how governments can exploit big data to undertake mass surveillance on their own citizens and worldwide, causing extreme intrusiveness to the daily lives of ordinary people.
The US National Security Agency, together with its intelligence partners worldwide, ran programmes which collected telephone metadata from US telephone companies and monitored international internet traffic.
Users of big data may claim that privacy is a non-issue because they are working with de-identified information, that is, data stripped of the name and other personal identifiers. Such an assertion may be a fallacy.
Our online tracks are tied to smartphones or personal computers through UDIDs (unique device identifiers), IP addresses, “fingerprinting” and other means. Given how closely these personal communication devices are associated with each of us, information linked to these devices is, to all intents and purposes, linked to us as individuals.
Furthermore, big data can increase the risk of re-identification, and in some cases, inadvertently re-identify large swathes of de-identified data all at once.
In 2006, the internet giant AOL released 20 million old search queries of 658,000 subscribers for public view in connection with the company’s newly launched research site. Identification numbers were used instead of names, user IDs or IP addresses when listing the search logs.
However, within days, The New York Times, based on search queries like "60 single men", "tea for good health" and "landscapers in Lilburn, Ga", correctly identified one of the subscribers as a 62-year-old widow from Lilburn, Georgia. Her whole personal life was exposed immediately as people reviewed her search queries.
The ensuing public outcry led to a public apology from AOL and the removal of all the search logs within 10 days.
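The mechanics of such re-identification can be sketched with entirely made-up records. "De-identified" data often retains quasi-identifiers, such as postcode, birth date and sex, that can be joined against a public register to recover names; the names and values below are fictional.

```python
# Fictional "de-identified" records: names removed, quasi-identifiers kept.
deidentified_records = [
    {"zip": "30047", "birth": "1944-07-01", "sex": "F", "query": "hand tremor"},
    {"zip": "30047", "birth": "1980-03-15", "sex": "M", "query": "flu remedy"},
]

# Fictional public register (e.g. a voter roll) that still carries names.
public_register = [
    {"name": "A. Lee",  "zip": "30047", "birth": "1944-07-01", "sex": "F"},
    {"name": "B. Chan", "zip": "30339", "birth": "1980-03-15", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth", "sex")

def reidentify(records, register):
    """Join 'anonymous' records to the register on the quasi-identifier tuple."""
    index = {tuple(p[k] for k in QUASI_IDENTIFIERS): p["name"] for p in register}
    matches = []
    for r in records:
        key = tuple(r[k] for k in QUASI_IDENTIFIERS)
        if key in index:
            matches.append({**r, "name": index[key]})
    return matches

for m in reidentify(deidentified_records, public_register):
    print(f'{m["name"]} searched for: {m["query"]}')
```

Removing the name field is not the same as removing identity: the combination of a few mundane attributes is often unique enough to point at one person.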
While the intelligent use of big data holds great promise for enriching the quality of life and enhancing productivity, consumer privacy and data protection must remain a priority.
The challenge before us is how to ensure a win-win outcome by exploiting big data’s potential while addressing its downsides.
Allan Chiang is the Privacy Commissioner for Personal Data. This article is an abridged version of a blog published by the Privacy Commissioner for Personal Data on April 28 at pcpd.org.hk.