Monday, October 3, 2011

Best Paper Award for "An Optimal Algorithm for the Distinct Elements Problem"

Last month, the winners of IBM's 2010 Pat Goldberg Memorial Best Paper competition in computer science, electrical engineering and math were announced. IBM Research - Almaden computer scientist David Woodruff co-authored one of the winning papers, titled "An Optimal Algorithm for the Distinct Elements Problem" with Daniel M. Kane (Harvard University) and Jelani Nelson (MIT) for PODS 2010 (ACM Symposium on Principles of Database Systems).

The Professional Interest Communities at IBM Research (PICs) reviewed a total of close to 120 papers submitted by IBM Research authors and nominated 34 for best paper consideration. A worldwide Research team reviewed the nominated papers and selected four outstanding papers as the award winners.

All of the submitted papers represent IBM Researchers advancing our field and are indicative of our commitment to long-term, exploratory work that can change the way we look at the world.

Below, David explains a bit about his paper, what it might mean for the advancement of mathematical discovery and what it's like working for IBM.

Estimating the number of distinct attributes is a fundamental practical and theoretical problem in database applications dating back to the 1970s. It arises in trying to optimize a query sequence, where keeping the number of distinct elements small at intermediate stages in the sequence ensures the overall running time is low. It is also useful for comparing two data sets, e.g., how many new items did we get by putting the two datasets together? 

The techniques in this paper will be useful for improving the memory and time complexity of a number of fundamental problems in the data stream literature, related to estimating the number of distinct elements, such as estimating rarity, similarity, union sizes of databases, etc. 

They can also be used for estimating the number of distinct elements in the distributed model, sliding window model, and time-decayed model. In the distributed model, there are multiple servers, each holding a database, and we want to estimate the number of distinct elements in the union of the databases with as few communication and computational overhead as possible. In the sliding-window model, which could for instance be used on a router monitoring distinct source-destination traffic passing through it, it may be that we are only interested in the recent traffic, or we may at least want to give more weight to recent traffic. These variations correspond to the sliding-window and time-decayed models. 

My work closes a long-standing problem in the area of data streams. IBM Research has made a sustained effort to design data stream and sub-linear algorithms for a wide variety of problems, e.g., those in graph theory, machine learning, network traffic analysis, numerical linear algebra, and statistics. This work significantly bolsters that effort. 

I’m very grateful for the amazing amount of freedom and that I’ve been given at IBM, and have been able to use this to highly optimize my time and productivity. Interacting with interesting colleagues is one of the most exciting aspects of my job. It’s really enjoyable and a great learning experience.  

Research interests: data stream algorithms, communication complexity, numerical linear algebra, graph algorithms, coding theory, and cryptography 

Inspirational figures: The super theory group at IBM Research - Almaden 

When I'm not working, I'm: home remodeling, playing basketball, practicing chinese, traveling 

Favorite travel spots: Banff, Canada. Hangzhou, China. Venice, Italy. 

David Woodruff is a Research Staff Member in IBM Research - Almaden's Principles and Methodologies Group. He received a B.S. in computer sceince, B.S. in mathematics, M.Eng in computer science and Ph.D in computer science, each from MIT. David also contributes to IBM's cognitive computing initiative, SyNAPSE. 

No comments:

Post a Comment