Over 150 eminent leaders in the area of big data and massive scale analytics came together on Tuesday, September 20th for the IBM Research - Almaden Centennial Colloquia to discuss "Planet Scale Analytics." At this truly collaborative event, speakers including Gus Hunt, Chief Technology Officer, CIA; Peter Breunig, GM Technology & Strategy, Chevron Corporation; and Arvind Krishna, GM Information Management, IBM, convened to discuss the emerging - and explosive - wave of massive scale structured and unstructured data and the rise of analytics for actionable insight.
Leading executives from Juniper Networks, Coca-Cola, Genentech, Yahoo!, Agilent Technologies and Intuit were among the attendees who came together to discuss data challenges from a business perspective, and were intrigued to learn about the findings and potential solutions presented in the technical sessions. The academic and research communities were naturally eager to hear customer concerns and needs. Todd Myers, Chief Scientist for the National Geospatial-Intelligence Agency, attended to learn about emerging technologies designed to tackle analytics for massive data sets and commented, "IBM does a very good job of convening great speakers and thought leaders."
The three V's of big data - and the fourth
Arvind Krishna opened the morning, describing big data from the perspectives of the 451 Group, IDC and IBM - all very similar by definition - large, complex and dynamic. Krishna shared IBM's spin on the big data definition, framed as three V's: volume - data at rest, velocity - data in motion and variety - data in many forms. A fourth 'V', Arvind added, is veracity - data in doubt, used to describe 'contradictory data,' or noisy data - ultimately, unstructured data that experts are not sure how to deal with.
In a much-anticipated technical discussion about analytics solutions, Shiv Vaithyanathan, IBM Research senior manager in Intelligent Information Systems, again mentioned these four V's in his talk titled "Entity, Relationship and 360-degree View of customers." "Veracity is turning noisy data like jargon and acronyms, even wishful thinking and sarcasm into trustworthy insights," Vaithyanathan said. "We're dealing with social media data from hundreds of sources - 10,000 messages per second from over 100 million active users per source - that needs to be combined and correlated to make near real-time decisions."
Using IBM's System T and SQL language, Shiv and the database experts at Almaden have created a 360-degree customer profile built on more than 2,000 rules, analyzing an average of half a terabyte of data on any given day. Vaithyanathan explained the advantages of the analytics tool throughout his talk and via demonstration. Thanks to its linear scalability, System T keeps up with Twitter's daily feed, with no drop in quality, using less than 10% of the cores. That is a sharp decline in core usage from both state-of-the-art statistical systems and state-of-the-art open source rule-based systems, which require thousands of cores that System T does not.
Big data challenges in cities and across industries
Through a series of discussions on industry applications such as "21st Century Water Data: Needs and Availability" presented by Peter Gleick of the Pacific Institute and "Big Data in Finance: Quantitative, Qualitative and Relationship Information," by David Leinweber of Leinweber & Co., several challenges proving the need for deep analytical capabilities were shared. "We're entering a world where anyone can be a data source and upload some publicly interesting piece of data," Gleick remarked. "We can share information to an open source database, where you can add search capabilities." Adding to the notion of the changing face of data analytics, Leinweber commented, "Dealing with this is uncomfortable. We need to expect errors and strange innovations." He closed semi-jokingly with a quote by Ogden Nash: "Progress may have been all right once, but it's gone on far too long."
Chevron's Breunig presented some noteworthy statistics from the oil and energy area, notably the alarming rate of data transfer and storage that Chevron faces daily. At 20 PB a day, doubling every 2 years with a declining signal-to-noise ratio, there is not only a dynamic cluster of data that needs to be wrangled, but also the challenge of finding ways to apply that data to equally dynamic dimensions. "In the oil business, subsurface modeling has many uses, and its users have different needs," Breunig said. "This is a big data integration challenge."
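To make the growth figure concrete, the doubling rate can be written as a simple exponential. The sketch below is illustrative only: it assumes the doubling every 2 years continues at a constant rate, which the talk did not claim beyond the figure quoted, and the function name and time horizons are hypothetical.

```python
def projected_daily_volume_pb(years_from_now, start_pb=20.0, doubling_years=2.0):
    """Daily volume in PB after `years_from_now`, assuming steady doubling.

    Uses V(t) = V0 * 2 ** (t / T), where V0 is today's volume and T is the
    doubling period. The constant-rate assumption is ours, not the speaker's.
    """
    return start_pb * 2 ** (years_from_now / doubling_years)


# Illustrative horizons (hypothetical choices, not from the talk).
for years in (0, 2, 4, 10):
    print(f"after {years:>2} years: {projected_daily_volume_pb(years):g} PB/day")
```

Under that assumption, the daily volume grows 32-fold over a decade, which illustrates why a declining signal-to-noise ratio compounds the integration problem.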
Alexandre Bayen of UC Berkeley suggested that, like traffic data, water and earthquake data need to be online. He shared an interesting thought about traffic analytics: monitoring 2% of traffic in real time is sufficient to predict travel time. Additionally, Bayen proposed that tracking one year of yellow cab data in San Francisco can plot a map of the city with relative accuracy. Once multiple sources of crowdsourced data, public feeds, texts and videos are online, analytics can be applied to them in very similar ways, as long as the applications share the same basic properties: 1) a mathematical model, very important for physical phenomena, 2) data, 3) inverse modeling and data simulation resulting in estimates and eventually decisions.
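The three ingredients listed above (model, data, inverse modeling) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not Bayen's method: the road segment, the speed distribution, and all function names are hypothetical, and the "model" is simply travel time = distance / speed.

```python
import random
import statistics


def simulate_speeds(n_vehicles, mean_kmh, spread_kmh, rng):
    """Hypothetical ground truth: speeds of every vehicle on one segment."""
    return [max(1.0, rng.gauss(mean_kmh, spread_kmh)) for _ in range(n_vehicles)]


def estimate_travel_time_minutes(speeds_kmh, segment_km):
    """Infer the mean speed from observations, then apply time = distance / speed."""
    mean_speed = statistics.fmean(speeds_kmh)
    return 60.0 * segment_km / mean_speed


rng = random.Random(0)  # fixed seed so the sketch is repeatable
all_speeds = simulate_speeds(5000, mean_kmh=50.0, spread_kmh=8.0, rng=rng)

# Data: observe only 2% of the vehicles, as in the figure quoted in the talk.
probes = rng.sample(all_speeds, k=len(all_speeds) * 2 // 100)

full_estimate = estimate_travel_time_minutes(all_speeds, segment_km=10.0)
probe_estimate = estimate_travel_time_minutes(probes, segment_km=10.0)
print(f"all vehicles: {full_estimate:.2f} min, 2% sample: {probe_estimate:.2f} min")
```

In this simplified setting the estimate from the 2% sample lands close to the one computed from all vehicles, because the sample mean speed converges quickly. Real traffic estimation relies on far richer physical models and is not captured by this sketch.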
Data privacy and sharing
A close collaborator with IBM, Fran Maier of TRUSTe usually has a hard time describing her job: internet privacy. She explained in a panel on privacy that her company works with companies to ensure that their policies and practices are standard and meet whatever promise they make to the consumer. "It's not easy," Fran said, "because privacy, unlike security, doesn't have a 'bad guy'. The important elements are transparency and accountability." The cycle of delivering trust in privacy to internet consumers to allow for more interaction and sharing is important - without it, companies are unable to collect and use data sets.
Harriet Pearson, IBM's chief privacy officer, also shared some views on thought leadership from a large corporation perspective. "Companies like Facebook are experimenting in public and seeing where the norms might be," she explained. "One approach by some of the new entrants in Silicon Valley is to push the envelope, then retract; push, then retract. Alternatively, organizations set norms from the outset, or enlist the help of organizations like TRUSTe. Policy making has not accelerated that much. It puts a premium on those who can function in environments of uncertainty and have the confidence to strive forward."
Next steps
Many of the comments shared throughout the day confirmed positive impressions of IBM's data analytics capabilities and of the opportunities, both technical and geographic, to make an impact on the local "hotbed" of massive data streams. Others appreciated the opportunity to hear how other companies make use of technologies and models to analyze complicated situations. Al Leung, vice president and partner for acquisitions & logistics at IBM, described the advantages of bringing his clients to this forum to learn new techniques and explore some of IBM's offerings in a more personalized format. The conversation about massive scale analytics at IBM Research will continue locally and from a company-wide perspective, focusing on developing plans to expand collaboration efforts.
Soundbites from the event: