
Big Data Innovation, Annual Review 2013



ISSUE 6

1ST ANNIVERSARY
We take a look back over the past year


Welcome to the Big Data Innovation Annual Review.

We have achieved so much in the past 12 months, from the launch of the first magazine to the first printed edition and the introduction of the iPad app.

Thanks to everybody involved, from those who have contributed fantastic stories to those who have shared the magazine and made it the truly global publication that it now is.

This issue is a celebration of everything that has been done in the past 12 months and I have personally chosen two articles from each edition that I think you will all find fascinating. These are some of the best articles that we have included within the magazine and each has received significant praise.

From our humble beginnings in December last year we have grown to include some of the most influential data minds as contributors, readers and experts.

In 2014 we hope to bring you more magazines, more innovative and thought-provoking articles, and the latest big data news. In addition, we are hoping to launch a dedicated website for the magazines, allowing people to catch up with the latest content daily as well as find more in-depth analysis in the magazines.

As mentioned before, this issue is a celebration of some of the best articles from the past 12 months. I have hand-picked these based on both personal favourites and audience appreciation. That this process was so difficult is testament to the quality of writing that we have seen in the past year, a level of quality that we are dedicated to maintaining throughout 2014.

If you are interested in writing, advertising or even an advisory role within the magazine then please get in touch.

George Hill, Chief Editor


Managing Editor: George Hill

Assistant Editors: Richard Angus, Helena MacAdam

President: Josie King

Art Director: Gavin Bailey

Advertising: Hannah [email protected]

Contributors: Damien Demaj, Chris Towers, Tom Deutsch, Heather James, Claire Walmsley, Daniel Miller, Gil Press

General Enquiries: [email protected]


Contents

4 Drew Linzer lifts the lid on his famous work in predicting the Obama re-election win

7 Ashok Srivastava talks NASA’s Big Data in this interview from the first issue of the magazine

10 One of our most popular articles, Chris Towers looks at the impact of Quantum Computing on Big Data

13 The skills gap was discussed in detail by Tom Deutsch in the second issue of the magazine this year


17 Damien Demaj’s work on spatial analytics in tennis was well received when published in Issue 3

24 Gil Press wrote a fantastic piece on the history of Big Data, to much acclaim

33 Andrew Claster, the man behind Obama’s Big Data team spoke to Daniel Miller in Issue 4

37 Education was the issue when Chris Towers spoke to Gregory Shapiro-Sharp

42 Heather James spoke to the famous Stephen Wolfram in September about his work at Wolfram Alpha

45 Data Transparency was the focus of this brilliant article by Claire Walmsley in Issue 5

(The ten articles above were originally published two per issue, across Issues 1 to 5.)


An Interview with Drew Linzer: The Man Who Predicted The 2012 Election

George Hill, Chief Editor



Drew Linzer is the analyst who predicted the results of the 2012 election four months in advance. His algorithms even correctly predicted the exact number of electoral votes and the winning margin for Obama.

Drew has documented his analytical process on his blog, votamatic.org, which details the algorithms used, the choices behind them and the results.

He also appeared on multiple national and international news programs due to his results.

I caught up with him to discuss not only his analytical techniques but also his opinions on what was the biggest data-driven election ever.

George: What kind of reaction has there been to your predictions?

Drew: Most of the reaction has focused on the difference in accuracy between those of us who studied the public opinion polls, and the “gut feeling” predictions of popular pundits and commentators. On Election Day, data analysts like me, Nate Silver (New York Times FiveThirtyEight blog), Simon Jackman (Stanford University and Huffington Post), and Sam Wang (Princeton Election Consortium) all placed Obama’s reelection chances at over 90%, and correctly foresaw 332 electoral votes for Obama as the most likely outcome. Meanwhile, pundits such as Karl Rove, George Will, and Steve Forbes said Romney was going to win, in some cases easily. This has led to talk of a “victory for the quants” which I’m hopeful will carry through to future elections.

How do you evaluate the algorithm used in your predictions?

My forecasting model estimated the state vote outcomes and the final electoral vote, on every day of the campaign, starting in June. I wanted the assessment of these forecasts to be as fair and objective as possible and not leave me any wiggle room if they were wrong. So, about a month before the election, I posted on my website a set of eight evaluation criteria I would use once the results were known. As it turned out, the model worked perfectly. It predicted over the summer that Obama would win all of his 2008 states minus Indiana and North Carolina, and barely budged from that prediction even after support for Obama inched upward in September, then dipped after the first presidential debate.
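Linzer’s own model is documented on votamatic.org and is not reproduced here; as a loose, hypothetical illustration of the general approach behind such forecasts – aggregating polls into state-level win probabilities and then simulating the electoral college – a sketch in Python might look like the following. The battleground list, the probabilities and the 237 “safe” electoral votes are all invented for illustration.

```python
import random
from collections import Counter

# Hypothetical state-level win probabilities for the incumbent, as might be
# derived from aggregated polls. All numbers are invented for illustration;
# they are not Linzer's estimates.
BATTLEGROUNDS = {          # state: (electoral votes, win probability)
    "Ohio": (18, 0.75),
    "Florida": (29, 0.55),
    "Virginia": (13, 0.70),
    "Colorado": (9, 0.68),
    "Iowa": (6, 0.72),
}
SAFE_EV = 237              # electoral votes assumed safe (illustrative only)

def simulate_once(rng):
    """One simulated election: each battleground is won with its probability."""
    ev = SAFE_EV
    for votes, p in BATTLEGROUNDS.values():
        if rng.random() < p:
            ev += votes
    return ev

def forecast(n_sims=100_000, seed=0):
    """Run many simulations and summarise them as a win probability and modal EV total."""
    rng = random.Random(seed)
    outcomes = Counter(simulate_once(rng) for _ in range(n_sims))
    win_prob = sum(n for ev, n in outcomes.items() if ev >= 270) / n_sims
    most_likely_ev, _ = outcomes.most_common(1)[0]
    return win_prob, most_likely_ev

if __name__ == "__main__":
    p_win, mode_ev = forecast()
    print(f"Simulated win probability: {p_win:.1%}; most likely EV total: {mode_ev}")
```

Running many simulations yields both a win probability and a most-likely electoral-vote total, which is the form in which the 2012 forecasts discussed above were reported.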

The amount of data used throughout this campaign, both by independent analysts and campaign teams, has been huge. What implications does this have for data usage in 2016?

The 2012 campaign proved that multiple, diverse sources of quantitative information could be managed, trusted, and applied successfully towards a variety of ends. We outsiders were able to predict the election outcome far in advance. Inside the campaigns, there were enormous strides made in voter targeting, opinion tracking, fundraising, and voter turnout. Now that we know these methods can work, I think there’s no going back. I expect reporters and campaign commentators to take survey aggregation much more seriously in 2016. And although Obama and the Democrats currently appear to hold an advantage in campaign technology, I would be surprised if the Republicans didn’t quickly catch up.

Do you think that the success of this data-driven campaign means that campaign managers now need to be analysts as well as strategists?

The campaign managers may not need to be analysts themselves, but they should have a greater appreciation for how data and technology can be harnessed to their advantage. Campaigns have always used survey research to formulate strategy and measure voter sentiment. But now there are a range of other powerful tools available: social networking websites, voter databases, mobile smartphones, and email marketing, to name only a few. That is in addition to recent advances in polling methodologies and statistical opinion modeling. There is a lot of innovation happening in American campaign politics right now.

You managed to predict the election results four months beforehand. What do you think is the realistic maximum timeframe for accurately predicting a result using your analytics techniques?

Four or five months is about as far back as the science lets us go right now, and that’s even pushing it a bit. Prior to that, the polls just aren’t sufficiently informative about the eventual outcome: too many people are either undecided or haven’t started paying attention to the campaign. The historical, economic and political factors that have been shown to correlate with election outcomes also start to lose their predictive power once we get beyond the roughly 4-5 month range. Fortunately, that still gives the campaigns plenty of time to plot strategy and make decisions about how to allocate their resources.



NASA’s Big Data: An Interview With Ashok Srivastava

George Hill, Chief Editor


On a sunny September morning in Boston this year, Ashok Srivastava was waiting to stand at the podium and present to a room of 600 people at the Big Data Innovation Summit – the largest dedicated big data conference in the world.

Giving his perspectives on the growth of big data, its uses in aviation safety, and how his employer, NASA, has utilised and innovated through its use, Ashok emerged as one of the most popular speakers at the summit.

After the success of the presentation and the summit in general, I was lucky enough to sit down with Ashok to discuss the way that big data has changed within NASA and the success that it has had in the wider business community.

So why has big data come to prominence in the last three or four years?

Ashok argues that this is not a change that has taken place solely over the past three or four years. It is a reaction to society becoming more data-driven. Over the last 25 years, people have increasingly needed data either to make decisions or to back them up.

Recent advancements in technology, and in the ability of data scientists to analyze large data sets, have accelerated the speed at which this happens. With new types of databases and the ability to record and analyze data quickly, the required level of technology has now been reached.

NASA has been at the forefront of technology innovation for the past 50 years, bringing us everything from the modern computer to instant coffee. Ashok explains how NASA is still innovating today and how, given the huge amounts of data it is consuming, what is happening in big data there is going to affect businesses.

For instance, NASA is currently discussing the use of its big data algorithms and systems with companies ranging from medical specialists to CPG organizations. The work it has done with data in the past few years has laid foundations that are allowing many companies to become successful.

One of the issues really affecting companies looking to adopt big data is the current shortage of skilled big data professionals. The way to solve this, in Ashok’s opinion, is through a different approach to teaching.


Training should revolve around machine learning and optimization, allowing people to learn the “trade of big data”: understanding how systems work from the basics upwards so that they have full insight when analyzing.

Given the relative youth of big data, I wanted to know what Ashok thought would happen with big data at NASA over the next 10 years, as well as in the wider business community.

NASA in ten years will be dealing with a huge amount of data, on a scale that is currently unimaginable. This could include things like full global observations as well as universe observations, gathering and analyzing petabytes of information within seconds.

With public money being spent on these big data projects, Ashok makes it clear that the key benefit should always boil down to ultimately providing value for the public.

This is a refreshing view of NASA, which has traditionally been seen as secretive due to the highly confidential nature of its operations and the lack of public understanding.

Ashok also had some pieces of advice for people currently looking to make waves in the big data world:

“It is important to understand the business problem that is being solved”

“Making sure the technologies that are being deployed are scalable and efficient”


Quantum Computing: The Future of Big Data?

Chris Towers, Organiser, Big Data Innovation Summit


When we look back at the computers of the 1980s, with their two-tone terminal screens and command-driven interfaces, and then look at our colorful and seemingly complex laptops, we assume there must have been a huge revolution in the basic building blocks of these machines. In reality, however, they are still built on the same principles and the same technological theories.

In essence, all modern computers work by using silicon transistors on a chip. These transistors can be either on or off, dictating the function of the computer in order to perform tasks and calculations. The computers you are using today use the same technology as the first computers of the 1940s, albeit in a far more advanced state.

Whereas the transistors of the 1940s were bulky, today IBM can fit millions of transistors onto a single chip, allowing far superior processing power while still relying on the same theories as when computers were first invented. One of the ways this is possible is through the miniaturization of transistors, allowing far more to fit into a smaller space.

The issue with this, in terms of future iterations, is that a transistor can only be so small. Once transistors reach the atomic level they will cease to function, meaning that when this point is reached computers will again have to start growing in size to keep up with the demand for faster computing.

With the data revolution and the ever-increasing number of companies adopting analytics and big data driving the need for faster computers, a change will be required within a few years.

The idea that there could be a different method to computing aside from transistors was originally theorized in 1982 by Richard Feynman, who pioneered the idea of quantum computing.

So what does this have to do with big data?

The idea behind quantum computing is complex, as it relies on the highly confusing quantum theory, of which Niels Bohr famously said that “anyone who is not shocked by quantum theory has not understood it”.

Current computers work using ‘bits’, which are either on or off, and this is what makes quantum computing difficult to grasp: it works with ‘qubits’, which can be on, off or both, meaning that a qubit can essentially be in two states at the same time.

What makes this complex is that it goes beyond regular, digital thinking. It is essentially like watching somebody flip a coin and seeing it land on both heads and tails at the same time.
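To put a number on that difference: n classical bits hold exactly one of 2^n possible patterns at any moment, while the state of n qubits is described by an amplitude for every one of those 2^n patterns at once. The toy sketch below is plain Python (no quantum hardware or quantum library involved) and simply tracks those amplitudes to show how quickly the description grows; it illustrates the counting argument rather than simulating any real device.

```python
import math

def uniform_superposition(n_qubits):
    """State vector for n qubits placed in an equal superposition:
    one amplitude per classical bit pattern, 2**n in total."""
    dim = 2 ** n_qubits
    amplitude = 1.0 / math.sqrt(dim)   # equal weight on every basis state
    return [amplitude] * dim

def measurement_probabilities(state):
    """Probability of observing each classical bit pattern when measured."""
    return [abs(a) ** 2 for a in state]

# The 'coin landing on both heads and tails' picture: a single qubit whose
# measurement comes up 0 half the time and 1 half the time.
print(measurement_probabilities(uniform_superposition(1)))  # [0.5, 0.5]

# How the size of the state description grows with the number of qubits.
for n in (1, 2, 10, 20, 30):
    print(f"{n:>2} qubits -> {2 ** n:>13,} amplitudes "
          f"(a register of {n} classical bits holds just one of those patterns)")
```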

The implication for big data is that, because of these multiple simultaneous states, the information processing power is almost incomprehensible when compared to current traditional computers. In fact, experts have described the power of quantum computing as the hydrogen bomb of cyber warfare.

Although it is several years away, it is predicted that, with the current uptake of cloud usage amongst companies, within 15 years the majority of civilian companies will have access to quantum computers to deal with their ever-increasing volumes of information.

The first commercially available quantum computer has been produced and one has been sold, although they are currently sold by their maker, D-Wave, for $10 million each and housed in a 10m x 10m room.

However, despite the current issues with both size and price, Google has begun experimenting with quantum computers. After conducting tests with 20,000 images, half with cars and half without, the quantum computer could sort the photos into those including cars and those not including cars considerably faster than anything in Google’s current data centers.

This kind of real world experimentation and investment by major civilian companies will help push quantum computing into the mainstream, allowing companies working with big data to be able to access and use larger data sets quicker than they ever could before.

Although the prohibitive pricing and logistics of quantum computers currently make them unviable for regular companies, in a decade’s time the changes wrung out by Google and other early adopters may well spark a genuine revolution in computing.

A potential way to overcome this would be to access quantum computers through the cloud. This would not only drive prices down for infrequent usage but would also make them a viable option for most companies.


Bridging The Big Data Skills Gap

Tom Deutsch, Big Data Solution Architect


Big Data technologies certainly haven’t suffered from a lack of market hype over the past 18 months, and if anything the noise in the space is continuing to grow. At the same time there’s been a lot of discussion around a skills gap that stands in the way of making effective use of these new technologies. Not surprisingly, one of the most frequent questions I am asked is “how do those two come together in the real world?” So what should smart firms be aware of as they investigate how to get started using these new technologies with the people and skillsets they currently have?

It makes sense to break this question down into three skills components: administration, data manipulation and data science. The skills required in each of these areas differ, and functionally they represent different challenges. You are not likely to face the same challenges and staffing considerations in all three.

Administration – this includes the functional responsibilities for setting up, deploying and maintaining the underlying systems and architectures. I would argue that no real skills gap exists here today for Enterprises of any significant size. It may take a little bit of time to understand how some of the systems, such as Hadoop, differ in scaling horizontally and handle availability, but generally speaking setup, configuration and administration tasks are reasonably well documented, and existing server and network administrators can successfully manage them. In fact, compared to vertically scaled traditional technologies, the administration component here can be dramatically simpler than what you are used to. Keep in mind that this is a space where the hardware/software vendors are starting to compete on manageability, and an increasing number of appliance options exists. To oversimplify in the interest of space: if your team can manage a LAMP stack, you should be fine here.

Data Manipulation – here is where the fun really starts, and also where you may first encounter issues. This is when you start working with the data in new ways and, not surprisingly, it is likely to be the first place that a skills gap appears. In practice I would suggest planning for a gap here – how mild or severe it is depends upon several factors. These factors boil down to two big buckets: first, can your people manipulate the data in the new platforms, and second, do they know how to manipulate the data in valid ways.

The first issue – can you manipulate the data – often comes down to how much scripting and/or Java development experience your teams have. While the tools and abstraction options are improving rapidly across most of these technologies, there is usually no escaping having to dive into scripting or even write some Java code at some point. If your teams are already doing that, no big deal. If they aren’t already familiar with and using scripting languages, then there is some reason for pause. Similarly, while there are interface options that are increasingly SQL-like, if your teams aren’t experienced in multiple development languages you should expect them to have some kind of learning curve. Can they push through it? Almost certainly, just budget some time to allow that to happen. As noted above, this will get easier and easier over time, but do not assume that tools will prevent the need for some coding. This is where you are going to spend the bulk of your time and people, so make sure you are being realistic about your entry-point skills. Also keep in mind this isn’t the hardest part. In many cases the second challenge here is the bigger one – not how can you manipulate the data but how should you manipulate the data. The real open question is what to collect in the first place and how to actually use it in a meaningful way. That, of course, is a bigger issue, which brings us to the data science question.
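To make the “some scripting is usually unavoidable” point concrete, below is the sort of small script a team often ends up writing: a word-count mapper and reducer in the Hadoop Streaming style, where Hadoop pipes raw records through stdin/stdout scripts. This is a generic, hypothetical example rather than code from any particular project, and the file name and invocation shown are assumptions.

```python
#!/usr/bin/env python3
"""Word count in the Hadoop Streaming style (hypothetical wordcount.py).
Hadoop feeds input splits to the mapper on stdin, sorts the emitted
key<TAB>value pairs, then feeds them to the reducer on stdin."""
import sys

def mapper(lines):
    # Emit "word<TAB>1" for every word in the input.
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    # Keys arrive sorted, so all counts for a given word are contiguous.
    current, count = None, 0
    for line in lines:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Local dry run:
    #   cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)
```

Nothing here is conceptually hard, but it is exactly the kind of hands-on scripting that teams used only to point-and-click tools will need time to become comfortable with.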

Data Science – so finally to the hotly debated data scientist role. The popular press would have you believe that there is something like a ten-year shortage of people skilled in data science. At the same time, literally tens of thousands of people have completed open coursework from MIT and others on data science. Another variable is the evolution and progress of tools that make data collection and analytic routines more easily understood. So where does that put us?

First, it is important to note that there are many use cases that never get to this level, such as creating a data landing zone, data warehouse augmentation and alternative ELT approaches. No data science is needed there – and, as I’ve written elsewhere, diving directly into a data-science-driven project is a lousy idea. But what if you have a project that has a data science dependency – what should you expect?


Frankly, your experience here will vastly differ depending on the depth and robustness of your existing analytics practice. Most large Enterprises already have pockets of expertise to draw on here from their SPSS, SAS or R communities. The data sources may be new (and faster moving or bigger) but math is math, statistics are statistics. These tools increasingly work with these technologies (especially Hadoop), so in some cases they won’t even have to leave their existing environments. If you already have the skills, so far so good. If you don’t have these skills you are going to have to grow, buy or rent them. Growing is slow, buying is expensive and renting is somewhere in between. Do not expect to be successful taking people with reporting or BI backgrounds and throwing them into data science issues. If you cannot honestly say “yes, we have advanced statisticians who are flexible in their thinking and understand the business”, you are going to struggle and need to adopt a grow, buy or rent strategy. We’ll pick up effective strategies for dealing with the grow, buy or rent issue, including notions of a Center of Excellence, in future articles.


Using Spatial Analytics To Study Spatio-Temporal Patterns In Sport

Damien Demaj, Geospatial Product Engineer


Late last year I introduced ArcGIS users to sports analytics, an emerging and exciting field within the GIS industry. Using ArcGIS for sports analytics can be read here. Recently I expanded the work by using a number of spatial analysis tools in ArcGIS to study the spatial variation of serve patterns in the London Olympics Gold Medal match played between Roger Federer and Andy Murray. In this blog I present results that suggest there is potential to better understand players’ serve tendencies using spatio-temporal analysis.

The Most Important Shot in Tennis?

The serve is arguably the most important shot in tennis. The location and predictability of a player’s serve has a big influence on their overall winning serve percentage. A player who is unpredictable with their serve and can consistently place their serve wide into the service box, at the body or down the T is more likely to either win a point outright, or at least weaken their opponent’s return [1].

The results of tennis matches are often determined by a small number of important points during the game. It is common to see a player win a match having won the same number of points as his opponent. The scoring system in tennis even makes it possible for a player to win fewer points than his opponent yet win the match [2]. Winning these big points is critical to a player’s success. For the serving player the aim is to produce an ace or force their opponent into an outright error, as this could make the difference between winning and losing. It is of particular interest to coaches and players to know how successful a player’s serve is at these big points.

Geospatial Analysis

In order to demonstrate the effectiveness of geo-visualizing spatio-temporal data using GIS, we conducted a case study to determine the following: which player served with more spatio-temporal variation at important points during the match?

Figure 1: Igniting further exploration using visual analytics. Created in ArcScene, this 3D visualization depicts the effectiveness of Murray’s return in each rally and what effect it had on Federer’s second shot after his serve.

Figure 2. The K Means algorithm in the Grouping Analysis tool in ArcGIS groups features based on attributes and optional spatial temporal constraints.


Figure 3. Calculating the Euclidean distance (shortest path) between two sequential serve locations to identify spatial variation within a player’s serve pattern.

Figure 4. The importance of points in a tennis match as defined by Morris. The data for the match was classified into 3 categories as indicated by the sequential color scheme in the table (dark red, medium red and light red).

To find out where each player served during the match we plotted the x,y coordinate of each serve bounce. A total of 86 points were mapped for Murray, and 78 for Federer. Only serves that landed in were included in the analysis. Visually we could see clusters formed by wide serves, serves into the body and serves hit down the T. The K Means algorithm [3] in the Grouping Analysis tool in ArcGIS (Figure 2) enabled us to statistically replicate the characteristics of the visual clusters. It enabled us to tag each point as either a wide serve, a serve into the body or a serve down the T. The organization of the serves into each group was based on the direction of serve. Using the serve direction allowed us to know which service box the points belonged to; direction gave us an advantage over proximity, which would have grouped points in neighbouring service boxes.

To determine who changed the location of their serve the most we arranged the serve bounces into a temporal sequence by ranking the data according to the side of the net (left or right), court location (deuce or ad court), game number and point number. The sequence of bounces then allowed us to create Euclidean lines (Figure 3) between p1 (x1,y1) and p2 (x2,y2), p2 (x2,y2) and p3 (x3,y3), p3 (x3,y3) and p4 (x4,y4), etc. in each court location. Using the mean Euclidean distance between sequential serve locations it is possible to determine who was the more predictable server and who served with greater spatial variation. For example, a player who served to the same part of the court each time would exhibit a smaller mean Euclidean distance than a player who frequently changed the position of their serve. The mean Euclidean distance was calculated by adding all of the distances linking the sequence of serves in each service box and dividing by the total number of distances.
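The analysis above was carried out with the Grouping Analysis tool in ArcGIS. As a rough sketch of the same two steps outside ArcGIS, the Python below clusters serve-bounce coordinates with K-Means (scikit-learn standing in for the ArcGIS tool) and then computes the mean Euclidean distance between sequential bounces in a service box. The coordinates are invented for illustration, and the real study grouped serves by serve direction rather than by bounce location alone.

```python
import math
from sklearn.cluster import KMeans

# Hypothetical serve-bounce coordinates (metres, court reference frame) for one
# service box, listed in the temporal order they were hit -- invented values.
bounces = [(1.2, 5.9), (3.8, 6.1), (1.1, 6.0), (2.4, 5.7),
           (3.9, 6.3), (1.3, 5.8), (2.5, 6.2), (3.7, 5.9)]

# Step 1: group the serves into three clusters (wide / body / T).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(bounces)
print("Cluster label per serve:", list(kmeans.labels_))

# Step 2: mean Euclidean distance between sequential serve bounces.
# A smaller mean suggests a more predictable server; a larger mean suggests
# greater spatial variation, as described in the article.
def mean_sequential_distance(points):
    distances = [math.dist(p, q) for p, q in zip(points, points[1:])]
    return sum(distances) / len(distances)

print(f"Mean sequential distance: {mean_sequential_distance(bounces):.2f} m")
```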


To identify where a player served at key points in the match we assigned an importance value to each point based on the work of Morris [4]. The table in Figure 4 shows the importance of points to winning a game when the server has a 0.62 probability of winning a point on serve. The two most important points in tennis are 30-40 and 40-Ad, highlighted in dark red. To simplify the rankings we grouped the data into three classes, as shown in Figure 4. In order to see the relationship between outright success on serve and the important points, we mapped the distribution of successful serves and overlaid the results onto a layer containing the important points. If the player returning the serve made an error directly on their return, this was deemed an outright success for the server; an ace was also deemed an outright success.

Results

Federer’s spatial serve cluster in the ad court on the left side of the net was the most spread of all his clusters. However, he served out wide with great accuracy into the deuce court on the left side of the net, hugging the line 9 times out of 10 (Figure 5). Murray’s clusters appeared to be grouped more tightly overall in each of the service boxes. He showed a clear bias toward serving down the T in the deuce court on the right side of the net. Visually there appeared to be no other significant differences between each player’s patterns of serve.

By mapping the location of the players’ serve bounces and grouping them into spatial serve clusters we were able to quickly identify where in the service box each player was hitting their serves.

Figure 6. A comparison of spatial serve variation between each player. Federer’s mean Euclidean distance was 1.72m (5.64 ft); Murray’s was 1.45m (4.76 ft). The results suggest that Federer’s serve had greater spatial variation than Murray’s. The lines of connectivity represent the Euclidean distance (shortest path) between each sequential service bounce in each service box.

Figure 5. Mapping the spatial serve clusters using the K Means Algorithm. Serves are grouped according to the direction they were hit. The direction of each serve is indicated by the thin green trajectory lines. The direction of serve was used to statistically group similar serve locations.


Figure 7. A proportional symbol map showing the relationship of where each player served at big points during the match and their outright success at those points.

The spatial serve clusters – wide, body or T – were symbolized using a unique color, making it easier for the reader to identify each group on the map. To give the location of each serve some context we added the trajectory (direction) lines for each serve. These lines helped link where the serve was hit from to where it landed; they enhance the visual structure of each cluster and improve the visual summary of the serve patterns.

The Euclidean distance calculations showed Federer’s mean distance between sequential serve bounces was 1.72 m (5.64 ft), whereas Murray’s mean Euclidean distance was 1.45 m (4.76 ft). These results suggest that Federer’s serve had greater spatial variation than Murray’s. Visually, we could see that the network of Federer’s Euclidean lines showed a greater spread than Murray’s in each service box. Murray served with more variation than Federer in only one service box, the ad service box on the right side of the net. The directional arrows in Figure 6 allow us to visually follow the temporal sequence of serves from each player in any given service box. We have maintained the colors for each spatial serve cluster (wide, body, T) so you can see when a player served from one group into another.

At the most important points in each game (30-40 and 40-Ad), Murray served out wide, targeting Federer’s backhand 7 times out of 8 (88%). He had success doing this 38% of the time, drawing 3 outright errors from Federer. Federer mixed up the location of his 4 serves at the big points across all of the spatial serve clusters: 2 wide, 1 body and 1 T. He had success 25% of the time, drawing 1 outright error from Murray. At other, less important points Murray tended to favour going down the T, while Federer continued his trend of spreading his serve evenly across all spatial serve clusters (Figure 7).

The proportional symbols in Figure 7 indicate a level of importance for each serve. The larger circles represent the most important points in each game, the smallest circles the least important. The ticks represent the success of each serve. By overlaying the ticks on top of the graduated circles we can clearly see the relationship between success and big points on serve. The map also indicates where each player served. The results suggest that Murray served with more spatial variation across the two most important point categories, recording a mean Euclidean distance of 1.73 m (5.68 ft) to Federer’s 1.64 m (5.38 ft).

Conclusion

Successfully identifying patterns of behavior in sport is an on-going area of work [5] (see Figure 8), be that in tennis, football or basketball.


The examples in this blog show that GIS can provide an effective means to geovisualize spatio-temporal sports data in order to reveal potential new patterns within a tennis match. By incorporating space-time into our analysis we were able to focus on the relationships between events in the match, not the individual events themselves. The results of our analysis were presented using maps. These visualizations function as a convenient and comprehensive way to display the results, as well as acting as an inventory for the spatio-temporal component of the match [6].

Expanding the scope of geospatial research in tennis and other sports relies on open access to reliable spatial data. At present, such data is not publicly available from the governing bodies of tennis. An integrated approach with these organizations, players, coaches and sports scientists would allow for further validation and development of geospatial analytics for tennis. The aim of this research is to evoke a new wave of geospatial analytics in the game of tennis and across other sports, and to encourage the statistics published on tennis to become more time- and space-aware, to improve the understanding of the game for everyone.

Figure 8. The heatmap above shows Federer’s frequency of shots passing through a given point on the court. The map displays stroke paths from both ends of the court, including serves. The heat map can be used to study potential anomalies in the data that may result in further analysis.


References

[1] United States Tennis Association, “Tennis tactics, winning patterns of play”, Human Kinetics, 1st Edition, 1996.

[2] G. E. Parker, “Percentage Play in Tennis”, In Mathematics and Sports Theme Articles, http://www.mathaware.org/mam/2010/essays/

[3] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A K-Means Clustering Algorithm”, Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, No. 1, pp. 100-108, 1979.

[4] C. Morris, “The most important points in tennis”, in Optimal Strategies in Sports, vol. 5 in Studies in Management Science and Systems, North-Holland Publishing, Amsterdam, pp. 131-140, 1977.

[5] M. Lames, “Modeling the interaction in games sports – relative phase and moving correlations”, Journal of Sports Science and Medicine, vol. 5, pp. 556-560, 2006.

[6] J. Bertin, “Semiology of Graphics: Diagrams, Networks, Maps”, Esri Press, 2nd Edition, 2010.


A Short History Of Big Data

Gil Press, Big Data Expert


The story of how data became big starts many years before the current buzz around big data. Already seventy years ago we encounter the first attempts to quantify the growth rate in the volume of data, or what has popularly been known as the “information explosion” (a term first used in 1941, according to the Oxford English Dictionary). The following are the major milestones in the history of sizing data volumes, plus other “firsts” in the evolution of the idea of “big data” and observations pertaining to the data or information explosion.

1944 Fremont Rider, Wesleyan University Librarian, publishes The Scholar and the Future of the Research Library. He estimates that American university libraries were doubling in size every sixteen years. Given this growth rate, Rider speculates that the Yale Library in 2040 will have “approximately 200,000,000 volumes, which will occupy over 6,000 miles of shelves… [requiring] a cataloging staff of over six thousand persons.”

1961 Derek Price publishes Science Since Babylon, in which he charts the growth of scientific knowledge by looking at the growth in the number of scientific journals and papers. He concludes that the number of new journals has grown exponentially rather than linearly, doubling every fifteen years and increasing by a factor of ten during every half-century. Price calls this the “law of exponential increase,” explaining that “each [scientific] advance generates a new series of advances at a reasonably constant birth rate, so that the number of births is strictly proportional to the size of the population of discoveries at any given time.”

November 1967 B. A. Marron and P. A. D. de Maine publish “Automatic data compression” in the Communications of the ACM, stating that ”The ‘information explosion’ noted in recent years makes it essential that storage requirements for all information be kept to a minimum.” The paper describes “a fully automatic and rapid three-part compressor which can be used with ‘any’ body of information to greatly reduce slow external storage requirements and to increase the rate of information transmission through a computer.”

1971 Arthur Miller writes in The Assault on Privacy that “Too many information handlers seem to measure a man by the number of bits of storage capacity his dossier will occupy.”

1975 The Ministry of Posts and Telecommunications in Japan starts conducting the Information Flow Census, tracking the volume of information circulating in Japan (the idea was first suggested in a 1969 paper). The census introduces “amount of words” as the unifying unit of measurement across all media. The 1975 census already finds that information supply is increasing much faster than information consumption and in 1978 it reports that “the demand for information provided by mass media, which are one-way communication, has become stagnant and the demand for information provided by personal telecommunications media, which are characterized by two-way communications, has drastically increased…. Our society is moving toward a new stage… in which more priority is placed on segmented, more detailed information to meet individual needs, instead of conventional mass-reproduced conformed information.” [Translated in Alistair D. Duff 2000; see also Martin Hilbert 2012 (PDF)]

April 1980 I.A. Tjomsland gives a talk titled “Where Do We Go From Here?” at the Fourth IEEE Symposium on Mass Storage Systems, in which he says “Those associated with storage devices long ago realized that Parkinson’s First Law may be paraphrased to describe our industry—‘Data expands to fill the space available’…. I believe that large amounts of data are being retained because users have no way of identifying obsolete data; the penalties for storing obsolete data are less apparent than the penalties for discarding potentially useful data.”

1981 The Hungarian Central Statistics Office starts a research project to account for the country’s information industries. Including measuring information volume in bits, the research continues to this day. In 1993, Istvan Dienes, chief scientist of the Hungarian Central Statistics Office, compiles a manual for a standard system of national information accounts. [See Istvan Dienes 1994 (PDF) and Martin Hilbert 2012 (PDF)]

August 1983 Ithiel de Sola Pool publishes “Tracking the Flow of Information” in Science. Looking at growth trends in 17 major communications media from 1960 to 1977, he concludes that “words made available to Americans (over the age of 10) through these media grew at a rate of 8.9 percent per year… words actually attended to from those media grew at just 2.9 percent per year…. In the period of observation, much of the growth in the flow of information was due to the growth in broadcasting… But toward the end of that period [1977] the situation was changing: point-to-point media were growing faster than broadcasting.” Pool, Inose, Takasaki and Hurwitz follow in 1984 with Communications Flows: A Census in the United States and Japan, a book comparing the volumes of information produced in the United States and Japan.

July 1986 Hal B. Becker publishes “Can users really absorb data at today’s rates? Tomorrow’s?” in Data Communications. Becker estimates that “the recording density achieved by Gutenberg was approximately 500 symbols (characters) per cubic inch—500 times the density of [4,000 B.C. Sumerian] clay tablets. By the year 2000, semiconductor random access memory should be storing 1.25X10^11 bytes per cubic inch.”

1996 Digital storage becomes more cost-effective for storing data than paper according to R.J.T. Morris and B.J. Truskowski, in “The Evolution of Storage Systems,” IBM Systems Journal, July 1, 2003.

October 1997 Michael Cox and David Ellsworth publish “Application-controlled demand paging for out-of-core visualization” in the Proceedings of the IEEE 8th conference on Visualization. They start the article with “Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk and even remote disk. We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.” It is the first article in the ACM digital library to use the term “big data.”

1997 Michael Lesk publishes “How much information is there in the world?” Lesk concludes that “There may be a few thousand petabytes of information all told; and the production of tape and disk will reach that level by the year 2000. So in only a few years, (a) we will be able [to] save everything–no information will have to be thrown out and (b) the typical piece of information will never be looked at by a human being.”

April 1998 John R. Masey, Chief Scientist at SGI, presents at a USENIX meeting a paper titled “Big Data… and the Next Wave of Infrastress.”

October 1998 K.G. Coffman and Andrew Odlyzko publish “The Size and Growth Rate of the Internet.” They conclude that “the growth rate of traffic on the public Internet, while lower than is often cited, is still about 100% per year, much higher than for traffic on other networks. Hence, if present growth trends continue, data traffic in the U. S. will overtake voice traffic around the year 2002 and will be dominated by the Internet.” Odlyzko later established the Minnesota Internet Traffic Studies (MINTS), tracking the growth in Internet traffic from 2002 to 2009.

August 1999 Steve Bryson, David Kenwright, Michael Cox, David Ellsworth and Robert Haimes publish “Visually exploring gigabyte data sets in real time” in the Communications of the ACM. It is the first CACM article to use the term “Big Data” (the title of one of the article’s sections is “Big Data for Scientific Visualization”). The article opens with the following statement: “Very powerful computers are a blessing to many fields of inquiry. They are also a curse; fast computations spew out massive amounts of data. Where megabyte data sets were once considered large, we now find data sets from individual simulations in the 300GB range. But understanding the data resulting from high-end computations is a significant endeavor. As more than one scientist has put it, it is just plain difficult to look at all the numbers. And as Richard W. Hamming, mathematician and pioneer computer scientist, pointed out, the purpose of computing is insight, not numbers.”

October 1999 Bryson, Kenwright and Haimes join David Banks, Robert van Liere and Sam Uselton on a panel titled “Automation or interaction: what’s best for big data?” at the IEEE 1999 conference on Visualization.

October 2000 Peter Lyman and Hal R. Varian at UC Berkeley publish “How Much Information?” It is the first comprehensive study to quantify, in computer storage terms, the total amount of new and original information (not counting copies) created in the world annually and stored in four physical media: paper, film, optical (CDs and DVDs) and magnetic. The study finds that in 1999, the world produced about 1.5 exabytes of unique information, or about 250 megabytes for every man, woman and child on earth. It also finds that “a vast amount of unique information is created and stored by individuals” (what it calls the “democratization of data”) and that “not only is digital information production the largest in total, it is also the most rapidly growing.” Calling this finding “dominance of digital,” Lyman and Varian state that “even today, most textual information is ‘born digital,’ and within a few years this will be true for images as well.” A similar study conducted in 2003 by the same researchers found that the world produced about 5 exabytes of new information in 2002 and that 92% of the new information was stored on magnetic media, mostly in hard disks.

November 2000 Francis X. Diebold presents to the Eighth World Congress of the Econometric Society a paper titled “’Big Data’ Dynamic Factor Models for Macroeconomic Measurement and Forecasting (PDF),” in which he states “Recently, much good science, whether physical, biological, or social, has been forced to confront—and has often benefited from—the “Big Data” phenomenon. Big Data refers to the explosion in the quantity (and sometimes, quality) of available and potentially relevant data, largely the result of recent and unprecedented advancements in data recording and storage technology.”

February 2001 Doug Laney, an analyst with the Meta Group, publishes a research note titled “3D Data Management: Controlling Data Volume, Velocity and Variety.” A decade later, the “3Vs” have become the generally-accepted three defining dimensions of big data, although the term itself does not appear in Laney’s note.

September 2005 Tim O’Reilly publishes “What is Web 2.0” in which he asserts that “data is the next Intel inside.” O’Reilly: “As Hal Varian remarked in a personal conversation last year, ‘SQL is the new HTML.’ Database management is a core competency of Web 2.0 companies, so much so that we have sometimes referred to these applications as ‘infoware’ rather than merely software.”

March 2007 John F. Gantz, David Reinsel and other researchers at IDC release a white paper titled “The Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010 (PDF).” It is the first study to estimate and forecast the amount of digital data created and replicated each year. IDC estimates that in 2006, the world created 161 exabytes of data and forecasts that between 2006 and 2010, the information added annually to the digital universe will increase more than six fold to 988 exabytes, or doubling every 18 months. According to the 2010 (PDF) and 2012 (PDF) releases of the same study, the amount of digital data created annually surpassed this forecast, reaching 1227 exabytes in 2010 and growing to 2837 exabytes in 2012.

January 2008 Bret Swanson and George Gilder publish “Estimating the Exaflood (PDF),” in which they project that U.S. IP traffic could reach one zettabyte by 2015 and that the U.S. Internet of 2015 will be at least 50 times larger than it was in 2006.

June 2008 Cisco releases the “Cisco Visual Networking Index – Forecast and Methodology, 2007–2012 (PDF)” part of an “ongoing initiative to track and forecast the impact of visual networking applications.” It predicts that “IP traffic will nearly double every two years through 2012” and that it will reach half a zettabyte in 2012. The forecast held well, as Cisco’s latest report (May 30, 2012) estimates IP traffic in 2012 at just over half a zettabyte and notes it “has increased eightfold over the past 5 years.”

December 2008 Randal E. Bryant, Randy H. Katz and Edward D. Lazowska publish “Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society (PDF).” They write: “Just as search engines have transformed how we access information, other forms of big-data computing can and will transform the activities of companies, scientific researchers, medical practitioners and our nation’s defense and intelligence operations…. Big-data computing is perhaps the biggest innovation in computing in the last decade. We have only begun to see its potential to collect, organize and process data in all walks of life. A modest investment by the federal government could greatly accelerate its development and deployment.”

December 2009 Roger E. Bohn and James E. Short publish “How Much Information? 2009 Report on American Consumers.” The study finds that in 2008, “Americans consumed information for about 1.3 trillion hours, an average of almost 12 hours per day. Consumption totaled 3.6 Zettabytes and 10,845 trillion words, corresponding to 100,500 words and 34 gigabytes for an average person on an average day.” Bohn, Short and Chattanya Baru follow this up in January 2011 with “How Much Information? 2010 Report on Enterprise Server Information,” in which they estimate that in 2008, “the world’s servers processed 9.57 Zettabytes of information, almost 10 to the 22nd power, or ten million million gigabytes. This was 12 gigabytes of information daily for the average worker, or about 3 terabytes of information per worker per year. The world’s companies on average processed 63 terabytes of information annually.”

February 2010 Kenneth Cukier publishes in The Economist a Special Report titled “Data, data everywhere.” Writes Cukier: “…the world contains an unimaginably vast amount of digital information which is getting ever vaster more rapidly… The effect is being felt everywhere, from business to science, from governments to the arts. Scientists and computer engineers have coined a new term for the phenomenon: ‘big data.’”

February 2011 Martin Hilbert and Priscila Lopez publish “The World’s Technological Capacity to Store, Communicate and Compute Information” in Science. They estimate that the world’s information storage capacity grew at a compound annual growth rate of 25% per year between 1986 and 2007. They also estimate that in 1986, 99.2% of all storage capacity was analog, but in 2007, 94% of storage capacity was digital, a complete reversal of roles (in 2002, digital information storage surpassed non-digital for the first time).

May 2011 James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh and Angela Hung Byers of the McKinsey Global Institute publish “Big data: The next frontier for innovation, competition and productivity.” They estimate that “by 2009, nearly all sectors in the US economy had at least an average of 200 terabytes of stored data (twice the size of US retailer Wal-Mart’s data warehouse in 1999) per company with more than 1,000 employees” and that the securities and investment services sector leads in terms of stored data per firm. In total, the study estimates that 7.4 exabytes of new data were stored by enterprises and 6.8 exabytes by consumers in 2010.

April 2012 The International Journal of Communications publishes a Special Section titled “Info Capacity” on the methodologies and findings of various studies measuring the volume of information. In “Tracking the flow of information into the home (PDF),” Neuman, Park and Panek (following the methodology used by Japan’s MPT and Pool above) estimate that the total media supply to U.S. homes has risen from around 50,000 minutes per day in 1960 to close to 900,000 in 2005. Looking at the ratio of supply to demand in 2005, they estimate that people in the U.S. are “approaching a thousand minutes of mediated content available for every minute available for consumption.” In “International Production and Dissemination of Information (PDF),” Bounie and Gille (following Lyman and Varian above) estimate that the world produced 14.7 exabytes of new information in 2008, nearly triple the volume of information in 2003.

May 2012 danah boyd and Kate Crawford publish “Critical Questions for Big Data” in Information, Communications and Society. They define big data as “a cultural, technological and scholarly phenomenon that rests on the interplay of: (1) Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link and compare large data sets. (2) Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical and legal claims. (3) Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity and accuracy.”


You need a new strategy. You rally your team. Your brainstorming session begins. And the fun fizzles out. But what if you could bring your meetings to life with the power of visual storytelling?

At ImageThink, we specialize in Graphic Facilitation, a creative, grasp-at-a-glance solution to your strategy problems. Using hand-drawn images, we’ll help you draw out a roadmap for success, bringing your goals into focus, illustrating a clear vision, and simplifying team communication with the power of visuals.

Call on us for:

Productive meetings and brainstorming sessions

Visual summaries for keynotes and conferences

Team trainings and workshops

Unique social media content

Creative and engaging tradeshows

Whiteboard videos

Hand-drawn infographics

- Picturing your big ideas -

Learn more about ImageThink at www.imagethink.net
www.facebook.com/imagethinknyc
@ImageThink



Data In An Election: An Interview With Andrew Claster, Deputy Chief Analytics Officer, Obama for America

Daniel Miller, Big Data Leader


We were lucky enough to talk to Andrew Claster, Deputy Chief Analytics Officer for President Barack Obama’s 2012 re-election campaign ahead of his presentation at the Big Data Innovation Summit in Boston, September 12 & 13 2013.

Andrew Claster, Deputy Chief Analytics Officer for President Barack Obama’s 2012 re-election campaign, helped create and lead the largest, most innovative and most successful political analytics operation ever developed. Andrew previously developed microtargeting and communications strategies as Vice President at Penn, Schoen & Berland for clients including Hillary Rodham Clinton, Tony Blair, Gordon Brown, Ehud Barak, Leonel Fernandez, Verizon, Alcatel, Microsoft, BP, KPMG, TXU and the Washington Nationals baseball team. Andrew completed his undergraduate studies in political science at Yale University and his graduate training in economics at the London School of Economics.

What was the biggest challenge for the data team during the Obama re-election campaign?

It is difficult to identify just one. Here are some of the most important:

- Data Integration: We have several major database platforms – the national voter file, our proprietary email list, campaign donation history, volunteer list, field contact history, etc. How do we integrate these and use a unified dataset to inform campaign decisions?

- Online/Offline: How do we encourage online activists to take action offline and vice-versa? How do we facilitate and measure this activity?

- Models: How do we develop and validate our models about what the electorate is going to look like in November 2012?

- Communications: Our opponents and the press are continually discussing areas in which they say we are falling short. When is it in our interest to push back, when is it in our interest to let them believe their own spin, and what information are we willing to share if we do push back?

- Cost: How do we evaluate everything we do in terms of cost per vote, cost per volunteer hour or cost per staff hour?

- Prioritization: We don’t have enough resources to test everything, model everything and do everything. How do we efficiently allocate human and financial resources?

- Internal Communication, Sales and Marketing: How do we support every department within the campaign (communications, field, digital, finance, scheduling, advertising)? How do we demonstrate value? How do we build relationships? How do we ensure that data and analytics are used to inform decision-making across the campaign?

- Hiring and Training: Where and how do we recruit more than 50 highly qualified analysts, statistical modelers and engineers who are committed to re-electing Barack Obama and willing to move to Chicago for a job that ends on Election Day 2012, requires that they work more than 80 hours a week for months with no vacation in a crowded room with no windows (nicknamed ‘The Cave’), and pays less with fewer benefits than they would earn doing a similar job in the private sector?
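The data integration challenge above is, at its core, a joining problem: each system keys its records to a person, and the campaign needs one row per voter. The sketch below is a hypothetical illustration of that kind of unification using pandas; the column names and sample values are invented for the example and are not drawn from the campaign’s actual databases.

```python
import pandas as pd

# Hypothetical extracts from separate systems; real campaign data would
# come from different databases and vendors, not in-memory frames.
voter_file = pd.DataFrame({
    "person_id": [1, 2, 3],
    "state": ["OH", "IA", "CO"],
    "registered": [True, True, False],
})
email_list = pd.DataFrame({
    "person_id": [1, 3],
    "email_opens_30d": [4, 0],
})
donations = pd.DataFrame({
    "person_id": [1, 1, 2],
    "amount": [25.0, 50.0, 10.0],
})

# Aggregate donation history to one row per person before joining.
donation_totals = donations.groupby("person_id", as_index=False)["amount"].sum()

# Left-join everything onto the voter file to build a single, unified view.
unified = (
    voter_file
    .merge(email_list, on="person_id", how="left")
    .merge(donation_totals, on="person_id", how="left")
    .fillna({"email_opens_30d": 0, "amount": 0.0})
)

print(unified)
```

In practice the joins would run against full database extracts rather than toy frames, but the shape of the problem – many sources, one person-level view – is the same.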

Many working within political statistics and analytics say that the incumbent candidate always has a significant advantage in data effectiveness. Do you think this is the case?

The incumbent has many advantages including the following:

- Incumbent has data, infrastructure and experience from the previous campaign.

- Incumbent is known in advance – no primary – and can start planning and implementing general election strategy earlier.

- Incumbent is known to voters – there is less uncertainty regarding underlying data and models.

However, the incumbent may also have certain disadvantages:

- Strategy is more likely to be known to the other side because it is likely to be similar to the previous campaign.

- With a similar strategy and many of the same strategists and vested interests as the previous campaign, it could be harder to innovate.

On balance, the incumbent has an opportunity to put herself or himself in a superior position regarding data, analytics and technology. However, it is not necessarily the case that s/he will do so – the incumbent must have the will and the ability to develop and invest in this potential advantage.

When there is no incumbent and there is a competitive primary, it is the role of the national party and other affiliated groups to invest in and develop this data, analytics and technology infrastructure.

How much effect do you think data had on the election result?

The most important

determinants of the election result were:

- Having a candidate with a record of accomplishment and policy positions that are consistent with the preferences of the majority of the electorate.

- Building a national organization of supporters, volunteers and donors to register likely supporters to vote, persuade likely voters to support our candidate, turn out likely supporters and protect the ballot to ensure their vote is counted.

Data, technology and analytics made us more effective and more efficient with every one of these steps. They helped us target the right people with the right message delivered in the right medium at the right time.

We conducted several tests to measure the impact of our work on the election result, but we will not be sharing those results publicly.

As an example however, I can point out that there were times during the campaign when the press and our opponent claimed that states such as Michigan and Minnesota were highly competitive, that we were losing in Ohio, Iowa, Colorado, Virginia and Wisconsin, and that Florida was firmly in our opponent’s camp. We had internal data (and there was plenty of public data, for those who are able to analyze it properly) demonstrating that these statements were inaccurate. If we didn’t have accurate internal data, our campaign might have made multi-million dollar mistakes that could have cost us several key states and the election.

Given the reaction of the public to the NSA and PRISM data gathering techniques, what kind of effect is this likely to have on the wider data gathering activities of others working within the data science community?

Consumers are becoming more aware of what data is available and to whom. It is increasingly important for those of us in the data science community to help educate consumers about what information is available, when and how they can opt out of sharing their information and how their information is being used.

Do you think that after the success of the data teams in the previous two elections that it is no longer an advantage, but a necessity for a successful campaign?

Campaigns have always used data to make decisions, but new techniques and technology have made more data accessible and allowed it to be used in innovative ways.

Campaigns that do not invest in data, technology or analytics are missing a huge opportunity that can help them make more intelligent decisions. Furthermore, their supporters, volunteers and donors want to know that the campaign is using their contributions of time and money as efficiently and effectively as possible, and that the campaign is making smart strategic decisions using the latest techniques.


Do You Have The Spark?

If you have a new idea that you want to tell the world, contact us to contribute an article or idea: [email protected]


Gregory Piatetsky-Shapiro Talks Big Data Education

Chris Towers, Organizer, Big Data Innovation Summit


One of the aspects of big data that many in the industry are currently concerned by is the perceived skills gap. The lack of qualified and experienced data scientists has meant that many companies find themselves adrift of where they want to be in the data world. I thought I would talk to one of the most knowledgeable and influential big data leaders in the world, Gregory Piatetsky-Shapiro. After running the first ever Knowledge Discovery in Databases (KDD) workshop in 1989, he has stayed at the sharp end of analytics and big data for the past 25 years. His website and consultancy, KDnuggets, is one of the most widely read data information sources and he has worked with some of the largest companies in the world.

The first thing that I wanted to discuss with Gregory was his perception of the big data skills gap. Many have claimed that this could just be a flash in the pan and something that has been manufactured, rather than something that actually exists. Gregory references the McKinsey report of May 2011, which states: “There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”

The report predicts that this kind of skills gap will fully emerge by 2018, but Gregory believes that we are already seeing it. Using Indeed.com to look at what expertise companies are looking for, Gregory found that both MongoDB and Hadoop appear among the top 10 job trends. “Big Data is actually rising faster than any of them. This indicates that demand for Big Data skills exceeds the supply. My experience with KDnuggets jobs board confirms it - many companies are finding it hard to get enough candidates.”

There are people responding to this, however, with many universities and colleges recognising not only the shortages, but also the desire from people to learn. Companies looking to expand their data teams are also looking at both internal and external training. For instance, companies such as EMC and IBM are training their data scientists internally. Not only does this mean that they know they are getting a high quality of training, but also that the data scientists they are employing are being educated in ‘their ways’. With companies finding it hard to employ qualified candidates, training programs like this mean companies can look for great candidates and make sure they are sufficiently qualified afterwards.


The IBMs and EMCs of this world are few and far between. The money that needs to be invested in in-depth internal training is considerable, and so many companies would struggle with this proposition. So what about those other companies? How can they avoid falling through the big data skills gap? Gregory thinks that most companies have three options.

Do you need BIG data?

Most companies confuse big data with basic data analysis. At the moment, with the buzz around big data, many companies are over-investing in technology that realistically isn’t required. A company with 10,000 customers, for instance, does not necessarily need a big data solution with multiple Hadoop clusters. Gregory makes the point that on his standard laptop he would be able to process data for a large software company with 1 million customers. Companies need to ask if they really need the depth of data skills that they think.
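To put Gregory’s laptop claim in perspective, the snippet below builds one million synthetic customer records and summarises them in memory with pandas; the fields are invented for illustration, and the point is simply that data at this scale does not need a Hadoop cluster.

```python
import numpy as np
import pandas as pd

# One million synthetic customers; comfortably in-memory on a laptop.
rng = np.random.default_rng(42)
n = 1_000_000
customers = pd.DataFrame({
    "customer_id": np.arange(n),
    "segment": rng.choice(["free", "pro", "enterprise"], size=n),
    "monthly_spend": rng.gamma(shape=2.0, scale=30.0, size=n),
})

# A typical analysis: average spend and customer count per segment.
summary = customers.groupby("segment")["monthly_spend"].agg(["mean", "count"])
print(summary)  # runs in well under a second; no cluster required
```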

What if you do need it?

For large companies who may need to manage larger data sets, the reality is that it is not necessary to employ a big data expert straight from university. Gregory makes the point that somebody who is trained in MongoDB can become trained as a data scientist relatively easily. If an internal training programme is not a realistic target, then external training may become the best option. There are several companies, such as Cloudera and many others, who can train data scientists to a relatively high standard.

Gregory also mentions that one way in which several companies are learning about big data and analytics is through attending conferences. There are now hundreds of conferences a year on big data and related topics, from leaders in the field such as Innovation Enterprise to smaller conferences all around the world.

What if these are untenable?

Some big data and analytics work can be outsourced or given to consultants. This not only allows companies to free up their existing data team for specific tasks, but also means that they are not having to risk taking on a full-time employee who may not be sufficiently qualified. Here, the leading companies include IBM, Deloitte and Accenture, as well as pure-play analytics outsourcing providers like Opera Solutions and Mu Sigma.

Having discussed the big data skills gap with several people who have worked in big data for years, one of the main concerns they have is the fanfare affecting the long-term viability of the business function.


Gregory does not share this concern, but he does make it clear that we need to make sure that the buzzword ‘big data’ is separated from the technological trend. He has written in Harvard Business Review about his belief that the ‘sexy’ side of big data is being overhyped. The majority of companies who have implemented big data have done so in order to predict human behaviour, but this is not something that can be done consistently. Therefore, Gregory believes that any disillusionment with big data will not come from an inability to find the right talent, but from its build-up not living up to the reality.

On the other hand, Gregory is quick to point out that the amount of data that we are producing will continue its rapid growth for the foreseeable future. This data will still need people to manage and analyze it, and so we are going to continue to see growth even if the initial hype dies down.

We are also seeing increasing interest in countries outside the US, the current market leader. This global interest is likely to increase the big data talent pool and therefore allow for expansion. Using Google Trends, Gregory looked at the top 10 countries searching for ‘Big Data’.
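A similar by-country view can be reproduced with the unofficial pytrends package (an assumption on my part: it is a community wrapper around Google Trends rather than an official Google API, and its interface changes over time), as in this sketch:

```python
from pytrends.request import TrendReq

# Unofficial wrapper around Google Trends; the interface may change.
pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(["big data"], timeframe="2012-01-01 2013-12-31")

# Interest by country, highest first -- the kind of ranking Gregory describes.
by_country = pytrends.interest_by_region(resolution="COUNTRY")
print(by_country.sort_values("big data", ascending=False).head(10))
```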

Given the interest from elsewhere, we are going to see an increasingly globalized talent pool and potentially the migration of the big data hub from the US to Asia. Gregory also points out that, given that the top five countries do not have English as a primary language (the trend analysis was purely for English-language searches), this likely does not represent every search for big data in those countries. This interest certainly shows that the appetite for big data education exists globally, and those working in the big data educational sphere are utilising technology to increase effectiveness.

Gregory points out that many companies are using analytics within their online education to make the experience more productive for both students and teachers. Through the use of this technology, big data education is becoming more productive and also more tenable for a truly international audience.

One aspect of big data that is clear is that in order to succeed you need curiosity and passion. The other aspects of the role will always involve training, and the options and platforms for this mean that in the coming years we will see this gap closing. Gregory is a fine example of somebody who has managed not only to innovate within the industry for the past 25 years, but was also one of the first to try to share the practice widely. If those now entering the field can find even a fraction of the same passion and curiosity, the quality and breadth of education can continue to grow at the same speed as this exciting industry.


Big Data Innovation with Stephen Wolfram

Heather James, Big Data Innovation Summit Curator


At the Big Data Innovation Summit in Boston in September 2013, Stephen Wolfram took the stage to deliver a presentation that many have described as the best amongst the hundreds that took place over the two-day event.

As he discussed his use of data and the way that his Wolfram Alpha programme and Mathematica language are changing the ways that machines utilise data, the audience was enthralled.

I had initially organised to sit down with Stephen immediately following his presentation, but I was forced to wait for several hours due to the crowds surrounding him as soon as he finished. The 20 people surrounding him for an hour after his presentation were testament to Stephen’s achievements over the past 25 years. Having spoken to others around the conference, the most common adjective I heard was 'brilliant'.

During the afternoon I did manage to sit down with Stephen. What I found was a down-to-earth, eloquent man with a genuine passion for data and the way that we are using it as a society.

Stephen is the CEO and founder of Wolfram Alpha, a computational knowledge engine designed to answer questions using data rather than suggesting results like a traditional search engine such as Google or Bing. Wolfram Alpha is the product of Stephen's ultimate goal, to make all knowledge computational: being able to answer and rationalise natural language questions into data driven answers.

He describes it as 'a major democratisation of access to knowledge', allowing people the opportunity to answer questions that previously would have required a significant amount of data and expert knowledge. According to Stephen, the product is already being used everywhere from education to big business; it is a product on the up.

Many will claim that they have never used the system; however, anybody who has asked Apple's Siri system on the iPhone a question will have unwittingly experienced it. Along with Bing and Google, Wolfram Alpha powers the Siri platform, enabling users to ask questions in standard language and have them translated into data driven answers.

What Wolfram Alpha is doing differently to everybody else at the moment is taking publicly held knowledge and using it to answer questions, rather than simply showing people where to find the information. It allows users to draw on the information that others have gathered in order to find interesting and deep answers.
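For readers who want to see this computational approach first-hand, Wolfram Alpha exposes its engine through a web API. The sketch below uses the short-answers style endpoint with the requests library; the AppID is a placeholder, and you would need to register for a real key with Wolfram Alpha before the query will succeed.

```python
import requests

# Wolfram|Alpha's short-answers endpoint returns a plain-text result for a
# natural-language question. Replace APP_ID with a real key; "DEMO" is only
# a placeholder for this sketch.
APP_ID = "DEMO"
question = "population of Boston divided by population of Chicago"

resp = requests.get(
    "https://api.wolframalpha.com/v1/result",
    params={"appid": APP_ID, "i": question},
    timeout=10,
)
print(resp.text if resp.ok else f"Query failed: {resp.status_code}")
```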

Stephen has a real passion for data, not only through Wolfram Alpha and his mission to make knowledge computational, but also on a personal level. He is the human who has the record for holding the most data about himself. He has been measuring this for the past 25 years, and he can see it becoming more and more popular in wider society, with wearable personal measurement technologies becoming increasingly popular.

This change in the mindset of society as a whole, towards a more data driven and accepting outlook, is what Stephen believes to be the key component in Wolfram Alpha becoming what it now is. Stephen says that he always knew there would come a time when society had created enough data to make Wolfram Alpha viable, and that time is now.

It is testament to how far we have come as an industry that we can now power something like Wolfram Alpha with the amount of data we have recorded. It is a real milestone in the development of a data driven society.

The reason for this, according to Stephen, is that many of the key data sources haven't been around for long; things like social media and machine data have allowed this shift to occur. He only sees this trend continuing, with increasing numbers of machine driven sensors collecting data.

With the use of data at Wolfram Alpha now hitting an all-time high, I was curious about where Stephen thought big data would be in five years' time. He believes that the upward curve will only continue: personal analytics will become part of a daily routine, and this will only see the amount of data increase.

He also sees the use of science and mechanics having a profound effect on the ways in which companies utilise their data. We will see analysis looking at more than just numbers, giving these numbers meaning through scientific principles.

Overall, what I have learnt from talking to Stephen is that data is the future in more than just a business context. Software that allows people to mine data without realising they are even doing it will be important to the development of how we use information.

Wolfram Alpha is changing the data landscape and, with the passion and genius of Stephen Wolfram behind it, who knows how far it could go.


Data Transparency

Claire Walmsley, Big Data Expert


Recently, companies have gained a bad reputation for how they hold individuals' information. There have been countless data leaks, hackers exposing personal details and exploitation of individual data for criminal activities.

The world's press has had its attention drawn towards data protection and individual data collection through the NSA and GCHQ spying scandal. Society in general is becoming more aware of the power that their data holds, and this, combined with the increased media attention, has led to consumers becoming more data savvy.

Companies like Facebook and Google have made billions of dollars through their efficient use of data and are now looked at warily by many. Although major data secrecy violations are yet to occur at either organisation, the reality is that people know that data is held about them and need to trust the company that is keeping it.

So how can companies become more trustworthy with their customer data?

One of the keys to success within a customer base is trust, and the best way to gain this is through transparency. Allowing people to see what kind of information any particular company holds on them creates trust. Outlining exactly what is held on people creates an understanding of what the information is used for.

A sure-fire way to lose trust is through the 'if you don’t ask, you don’t get' approach to data collection visibility. This is the idea that, when reading complex or overly long agreements, the data protection aspects are available but not explicitly stated. In reality this is much of what has happened in several cases, with information management details being buried in the small print, so that although technically accessible they are not effectively communicated.

The best way to circumvent this is to make it clear: send an email, have a separate section or even a blog outlining how data is being used and why. It is very seldom that people are having their data used in manipulative or sinister ways; making them aware of how their data is improving their experiences will make an audience far more receptive to it being used.
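One concrete way to make that openness routine is a self-service "what we hold about you" page or endpoint. The sketch below is a hypothetical Flask service with invented field names, intended only to show the shape of such a feature rather than any particular company's implementation.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical store of per-user data categories; a real service would pull
# this from the systems that actually hold the customer's information.
USER_DATA = {
    "42": {
        "profile": {"name": "A. Customer", "email": "a.customer@example.com"},
        "preferences": {"marketing_emails": False},
        "usage": {"logins_last_30_days": 12},
        "purposes": ["order fulfilment", "service improvement"],
    }
}

@app.route("/my-data/<user_id>")
def my_data(user_id):
    """Return everything held about a user, plus why it is held."""
    record = USER_DATA.get(user_id)
    if record is None:
        return jsonify({"error": "unknown user"}), 404
    return jsonify(record)

if __name__ == "__main__":
    app.run(debug=True)
```

Pairing a page like this with a plain-language explanation of why each category is held goes a long way towards the transparency described above.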

At the moment there are ways that you can check on certain elements of how your data is being used. Using a Google account, you can see what Google has matched to you here: google.com/dashboard/

This allows you to see who Google presumes you are based on your browsing history, and what ads are therefore targeted towards you. It is often interesting to see what your actions online say about you. This level of detail is a move in the right direction for companies, but it still leaves the feeling that there isn't total transparency.

With the pressures of data protection surrounding most companies today, this kind of move would allay many of the fears that consumers currently have when their data integrity is in question.

What the industry needs today is consumer trust and transparency is one of the key components to achieving this.


SUBSCRIBE: BIT.LY/BIGDATASIGNUP

FOLLOW US: @IE_BIGDATA

2014 CALENDAR

January
Big Data Innovation Summit, January 22 & 23, Las Vegas

February
Hadoop Innovation Summit, February 19 & 20, San Diego
The Digital Oilfield Innovation Summit, February 20 & 21, Buenos Aires
Big Data & Analytics Innovation Summit, February 27 & 28, Singapore

March
Big Data Innovation Summit, March 27 & 28, Hong Kong

April
Big Data Innovation Summit, April 9 & 10, Santa Clara
Big Data Infrastructure Summit, April 9 & 10, Santa Clara
Data Visualization Summit, April 9 & 10, Santa Clara

May
Big Data Innovation Summit, May 14 & 15, London
Big Data & Analytics in Healthcare, May 14 & 15, Philadelphia
Big Data & Advanced Analytics in Government, May 21 & 22, Washington, DC
Chief Data Officer Summit, May 21 & 22, San Francisco

June
Big Data Innovation Summit, June 4 & 5, Toronto
Big Data & Analytics for Pharma, June 11 & 12, Philadelphia
Big Data & Analytics in Retail, June 18 & 19, Chicago

September
Big Data Innovation Summit, September 10 & 11, Boston
Data Visualization Summit, September 10 & 11, Boston
Big Data & Analytics Innovation Summit, September 17 & 18, Sydney

October
Big Data & Predictive Analytics Summit, October 15 & 16, London
Big Data Innovation Summit, October 30 & 31, Bangalore

November
Big Data & Analytics for Pharma, November 5 & 6, Philadelphia
Big Data & Marketing Innovation Summit, November 12 & 13, Miami
Data Science Leadership Summit, November 12 & 13, Chicago
Big Data Fest, November 27, London
Big Data Innovation Summit, November 27 & 28, Beijing

December
Big Data in Finance Summit, December 3 & 4, New York
Big Data & Analytics in Banking Summit, December 3 & 4, New York

Partnership Opportunities: Giles Godwin-Brown | [email protected] | +1 415 692 5498

Attendee Invitation: Sean Foreman | [email protected] | +1 415 692 5514
