

Big Data’s Dark Side: problems and bias in large data sets


If there is one emerging technology trend that has been at the forefront of the business world’s collective mind recently, it’s Big Data (recent hype-cycle surveys notwithstanding). Thanks to the rapid rise and proliferation of data collection channels such as the mobile internet, GPS-enabled smartphones and sensor networks, it has become easier than ever to collate very large sets of data. This, in turn, has led to something known colloquially as “data fundamentalism”: put simply, the belief that analysis of big data always leads to clear, objective results. I’m sure you can see any number of problems with this mindset, but before we explore it further, let me pose a question. What causes 99% of all plane crashes? It’s not mechanical failure, not birds being sucked into the engines, and not the on-board computers crashing. 99% of plane crashes result from one simple but telling cause: human error.

It’s not just aviation that suffers from the human element, though; big data sets are, at their core, human creations, and it’s that mortal touch that gives big data its first stumbling block. In our previous blog post, ‘Big Data in 6 easy pieces’, we touched on the hidden biases inherent in big data and, while I’m not going to rehash that post in full here, I will expand on a few of the ideas I explored previously.


OK, so let’s start with an example. The City of Boston has a major problem with potholes: it has to patch around 20,000 of them a year. To help identify and deal with the problem, the city developed and launched an app called StreetBump, which was widely lauded on its release. The idea was that, using accelerometer and GPS data, the app would detect potholes as drivers hit them and transmit the data back up the ladder with a view to getting them patched. What the app failed to take into account, however, were certain cultural and demographic factors: people in lower income brackets have less access to smartphones yet are more likely to have potholes in their neighbourhood, so the very areas with the worst roads generated the least data.

It’s not just the human element that causes problems, however; even the algorithms that run big data analysis are fallible. Take, for instance, the programmes used to grade student papers. These tools rely on signals such as sentence length and complexity of language but, once students work that out, it becomes easy to fool the system by stringing together long sentences and obscure polysyllabic words rather than formulating good ideas and writing clear, coherent text. Even more powerful tools, like Google’s widely praised search algorithm, are not immune to tricks such as Google bombing or spamdexing, proving that there is always a loophole if you look hard enough.

Another big issue is spurious correlations. If you test for correlations between 100 pairs of unrelated variables at the conventional 5% significance level, you can expect around five of them to appear statistically significant purely by chance, despite having no meaningful connection. With close human supervision largely absent from big data analysis, the sheer size of these data sets can easily amplify such errors; a toy simulation of this multiple-comparisons effect is sketched below.
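As a purely illustrative sketch (not from the original post, and with all figures chosen as assumptions), the following minimal Python example generates pairs of completely independent random variables and tests each pair for correlation at the 5% level. Roughly five in every hundred unrelated pairs come out “significant” by chance alone.

```python
# Toy illustration of spurious correlations under multiple comparisons.
# Assumptions: pure random noise, 100 independent variable pairs, alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

n_pairs = 100    # number of unrelated variable pairs to test
n_samples = 50   # observations per variable
alpha = 0.05     # conventional significance threshold

false_positives = 0
for _ in range(n_pairs):
    x = rng.normal(size=n_samples)   # independent noise
    y = rng.normal(size=n_samples)   # more independent noise
    r, p_value = stats.pearsonr(x, y)
    if p_value < alpha:              # looks "statistically significant" by chance
        false_positives += 1

print(f"{false_positives} of {n_pairs} unrelated pairs looked significant")
# Expect roughly alpha * n_pairs (about 5) spurious hits.
```

The point is not the exact count but the principle: scan a big enough data set without correcting for multiple comparisons and “significant” patterns will always surface, whether or not they mean anything.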


This brings us nicely to what’s called the Echo Chamber Effect, which stems from the fact that much of the data collected comes from the web. In layman’s terms, whenever the source from which big data is collected is itself a product of big data, the opportunity for a vicious circle emerges.

Look at translation programs like Google Translate, which draws on many pairs of parallel texts in different languages, such as the same Wikipedia entry written in two languages, and uses them to discern the patterns of translation between those languages. This seems like a perfectly reasonable strategy, except that for some of the less widely used languages, many of the Wikipedia articles themselves may have been written using Google Translate. In those cases any initial errors in Google Translate infect Wikipedia, which is then fed back into Google Translate, reinforcing the error and creating a nightmare feedback loop (a toy sketch of this dynamic follows below).

I couldn’t mention Google without at least touching on its Google Flu Trends (GFT) tool. GFT was developed to track patterns of flu infection by looking at Google searches for flu and related queries, the reasoning being that there would be a higher number of such searches in infected areas. Google was initially very proud of the tool, saying it was more responsive and accurate than the tracking models used by the Centers for Disease Control and Prevention (CDC). Sounds like a good use of the 500-million-plus Google searches per day, right? Sadly, it wasn’t. Between August 2011 and September 2013, GFT overstated the prevalence of flu in 100 out of 108 weeks. During the winter of 2013 it estimated that 11% of the continental US population was infected with the flu virus, almost double the CDC’s figure of 6%. It also completely missed the non-seasonal influenza A (H1N1) pandemic, which has to count as a pretty big faux pas. This goes to show that even though a company like Google can amass a vast quantity of data, there is no assurance it will interpret that data correctly. There is also the problem that GFT was built on Google’s search algorithm, which is constantly being updated not only to produce more accurate search results but also to maximise advertising revenue.
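To make the feedback-loop dynamic concrete, here is a minimal, purely hypothetical Python sketch. The numbers and the simple linear error model are assumptions for illustration, not a description of Google’s systems: each new “generation” of a translator trains on a corpus partly made of its predecessor’s output, so its error rate settles above the level it would reach on clean, human-written text alone.

```python
# Toy model of the Translate -> Wikipedia -> Translate echo chamber.
# Hypothetical figures; a sketch of the dynamic, not any real system.

base_error = 0.02    # errors the translator would make on purely human text
contamination = 0.4  # assumed fraction of the corpus that is machine-translated

error = base_error   # generation 0: corpus is still fully human-written
for generation in range(1, 11):
    # Each new model inherits a share of the previous model's mistakes
    # through the machine-translated portion of its training corpus.
    error = base_error + contamination * error
    print(f"generation {generation}: error rate ~{error:.4f}")

# The error rate converges to base_error / (1 - contamination) = 0.0333...,
# i.e. the loop leaves the system permanently worse than its clean baseline,
# and the more contaminated the corpus, the worse the fixed point gets.
```

The takeaway is the shape of the curve rather than the figures: once a system’s output becomes part of its own input, errors stop washing out and start compounding.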


So these are all problems, but what about solutions? The good news is that as technology evolves and proliferates, the algorithms used are becoming considerably more sophisticated and better able to identify anomalies, false positives and the like. Another important lesson to take away is that big data is not the be-all and end-all; it should be used as one tool among many to produce accurate and insightful results. The hype around big data also seems to be subsiding, giving way to a more studied and intelligent approach to gathering and analysing it. We learn, as a species, from our mistakes and, despite our relatively short time with big data as a tool, we are beginning to move towards a more social-science-orientated approach to analysis that, while it makes things more complex, yields far more useful and insightful results. Big data is here to stay, there can be no doubt, and the next couple of years should provide some very interesting developments in the field of data science.

Ben Olive-Jones, M-Brain UK

www.m-brain.com 0118 956 5834 [email protected]

Follow us for more Big Data scoops @MBrainUk
