Big Data And Its Big Problems
September 18, 2012
by Adam Frank - NPR
Imagine every thousandth blood cell in your body has a tiny radio transmitter in it. Imagine that 10 times a second that transmitter sends each cell's location to a computer storing the data. Along with position, it also sends the concentration of a list of 10 chemicals encountered at receptors distributed at 10 sites over the surface of each cell. Now imagine following all those blood cells for an hour. That makes a billion blood cells being sampled 10 times a second for 3,600 seconds. Now imagine it's your job to sort through all those numbers and extract something meaningful about the human body. That problem, should you choose to accept it, would be an example of "Big Data."
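To get a feel for the scale, here is a rough back-of-the-envelope tally in Python. Treat it as a sketch: the three position coordinates per sample and the 4 bytes per stored number are assumptions made for illustration, not part of the thought experiment, and the function name is made up.

```python
def blood_cell_data_volume(
    cells=1_000_000_000,      # a billion tagged blood cells (from the example)
    samples_per_second=10,    # each transmitter reports 10 times a second
    seconds=3_600,            # one hour of tracking
    chemicals=10,             # chemicals measured per receptor site
    sites=10,                 # receptor sites per cell
    position_coords=3,        # ASSUMPTION: position stored as x, y, z
    bytes_per_value=4,        # ASSUMPTION: each number stored in 4 bytes
):
    """Return (samples, total_values, total_bytes) for the thought experiment."""
    samples = cells * samples_per_second * seconds
    values_per_sample = position_coords + chemicals * sites
    total_values = samples * values_per_sample
    total_bytes = total_values * bytes_per_value
    return samples, total_values, total_bytes


if __name__ == "__main__":
    samples, total_values, total_bytes = blood_cell_data_volume()
    print(f"samples:      {samples:.2e}")
    print(f"numbers sent: {total_values:.2e}")
    print(f"storage:      {total_bytes / 1e15:.1f} petabytes")
```

Under those assumptions, one hour of tracking comes to about 36 trillion samples and roughly 15 petabytes of raw numbers. Change the assumed bytes per value and the total shifts, but not the order of magnitude.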
There are buzzwords in science that come and go and then there are codifications of forces, decades in the making, that truly represent turning points. Without a doubt, Big Data falls into the second category.
Ever since the invention of writing we have been fact collectors. It's only just now that we are passing through a watershed in our capacities to collect, store, manipulate and analyze information. These new capacities bring both great promise and great threats. In either case they will utterly and entirely transform human culture.
Big Data means many different things to different people. At its base, it's always about attacking big problems. To understand both the possibilities and the dangers, let's look at two examples.
In my group we use supercomputers to simulate star formation (among other things). That means we take a region of space that we want to simulate and cut it up into many small elements, the way a digital camera cuts an image up into pixels. Then we digitally solve the equations that govern fluids in each one of those "voxels" (volume elements or 3-D pixels). We do this over and over again to track the evolution of something like an interstellar cloud collapsing under its own gravity to make a star.
As computers have gotten faster and data storage capacities have grown larger, we have been able to push boundaries to make ever more highly resolved star formation simulations. We can see ever smaller and more important details in the process. A typical campaign of simulations now generates on the order of a petabyte of data. That is the equivalent of 40 million 4-drawer file cabinets stuffed with documents. That is a lot of data, representing a lot of scientific possibilities.
So what's the problem? As Big Data science increases our ability to model or simulate complex systems, these models, ironically, become as complex as the real world. But they are not the real world. Whether it's astrophysics or the economy, building a computer model still demands leaving some aspects of the problem out. More importantly, the very act of bringing the equations over to digital form means you have changed them in subtle ways and that means you are solving a slightly different problem than the real-world version.
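A toy illustration of that last point, sketched in Python. It has nothing to do with real star formation codes: the equation (simple exponential decay, dy/dt = -y) and the method (the forward Euler scheme) are stand-ins chosen because the exact answer is known, and the function name is invented for this example.

```python
import math


def euler_decay(dt, t_end=1.0):
    """Step dy/dt = -y from y(0) = 1 to t_end using forward Euler with step dt."""
    steps = round(t_end / dt)
    y = 1.0
    for _ in range(steps):
        # The continuous equation is replaced by a finite update rule.
        y += dt * (-y)
    return y


if __name__ == "__main__":
    exact = math.exp(-1.0)  # the true solution at t = 1
    for dt in (0.1, 0.01, 0.001):
        approx = euler_decay(dt)
        print(f"dt={dt:<6} computed={approx:.6f} error={abs(approx - exact):.6f}")
```

Shrinking the time step brings the computed answer closer to the true one, but the gap never quite vanishes. The computer is always solving the discretized version of the equation, not the equation itself.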
Overcoming these difficulties requires trained skepticism, sophistication and, remarkably, some level of intuition about the systems we study. Moving deeper into Big Data simulations will be an exercise in maintaining that skepticism, developing new intuitions and developing new tools to separate the chaff from the real, useful insights.
Moving beyond the question of models, there is the problem of the data in Big Data itself. Big Data implies having innumerable "sensors" of one form or another out there collecting real-time information about the real world. Like my (very) imaginary blood-cell example, it could be capturing petabytes' worth of information in vast health science studies that help scientists understand biosystems better. It could also mean following an entire nation's electric grid at high resolution in space and time to get a better understanding of how power might be distributed. The trick, of course, will be learning how to sort through or "mine" such vast collections of data through statistical or other creative computationally driven means.
So, again, where is the problem? We are! The problem is what Big Data means for us as individuals, within a societal context.
Every day we are scattering "digital breadcrumbs" into the data-verse. Credit card purchases, cell phone calls, Internet searches: Big Data means memory storage has become so cheap that all data about all those aspects of our lives can be harvested and put to use. And it's exactly the use of all that harvested data that can pose a threat to society. As Alex Pentland of MIT puts it:
"What those breadcrumbs tell is the story of your life. It tells what you've chosen to do. That's very different than what you put on Facebook. What you put on Facebook is what you would like to tell people, edited according to the standards of the day. Who you actually are is determined by where you spend time, and which things you buy. Big data is increasingly about real behavior, and by analyzing this sort of data, scientists can tell an enormous amount about you. They can tell whether you are the sort of person who will pay back loans. They can tell you if you're likely to get diabetes."
Used this way, Big Data means a society that is being monitored on both the individual and collective level in new and truly unimaginable ways. It may also allow levels of manipulation that are new and truly unimaginable.
Challenges, opportunities and big problems – that is what it means to live at the dawn of the Big Data age. And, of course, once you have entered that age there will be no going back. Thus, how we deal with the nascent Big Data issues over just the next decade or so may very well define what the next stage of culture looks like for a long, long time.
New IBM Blue Gene/P supercomputer at the Argonne Leadership Computing Facility.