The Professional Interest Communities at IBM Research (PICs) reviewed a total of close to 120 papers submitted by IBM Research authors and nominated 34 for best paper consideration. A worldwide Research team reviewed the nominated papers and selected four outstanding papers as the award winners.
All of the submitted papers represent IBM Researchers advancing our field and are indicative of our commitment to long-term, exploratory work that can change the way we look at the world.
Below, David explains a bit about his paper, what it might mean for the advancement of mathematical discovery and what it's like working for IBM.
Estimating
the number of distinct attributes is a fundamental practical and
theoretical problem in database applications dating back to the 1970s. It
arises in trying to optimize a query sequence, where keeping the number
of distinct elements small at intermediate stages in the sequence
ensures the overall running time is low. It is also useful for comparing
two data sets, e.g., how many new items did we get by putting the two
datasets together?
The
techniques in this paper will be useful for improving the memory and
time complexity of a number of fundamental problems in the data stream
literature, related to estimating the number of distinct elements, such
as estimating rarity, similarity, union sizes of databases, etc.
They can also be used for estimating the number of distinct elements in the distributed model, sliding window model, and time-decayed model. In the distributed model, there are multiple servers, each holding a database, and we want to estimate the number of distinct elements in the union of the databases with as few communication and computational overhead as possible. In the sliding-window model, which could for instance be used on a router monitoring distinct source-destination traffic passing through it, it may be that we are only interested in the recent traffic, or we may at least want to give more weight to recent traffic. These variations correspond to the sliding-window and time-decayed models.
They can also be used for estimating the number of distinct elements in the distributed model, sliding window model, and time-decayed model. In the distributed model, there are multiple servers, each holding a database, and we want to estimate the number of distinct elements in the union of the databases with as few communication and computational overhead as possible. In the sliding-window model, which could for instance be used on a router monitoring distinct source-destination traffic passing through it, it may be that we are only interested in the recent traffic, or we may at least want to give more weight to recent traffic. These variations correspond to the sliding-window and time-decayed models.
My work closes a long-standing problem in the area of data streams. IBM
Research has made a sustained effort to design data stream and
sub-linear algorithms for a wide variety of problems, e.g., those in
graph theory, machine learning, network traffic analysis, numerical
linear algebra, and statistics. This work significantly bolsters that
effort.
Research interests: data stream algorithms, communication complexity, numerical linear algebra, graph algorithms, coding theory, and cryptography
Inspirational figures: The super theory group at IBM Research - Almaden
When I'm not working, I'm: home remodeling, playing basketball, practicing chinese, traveling
Favorite travel spots: Banff, Canada. Hangzhou, China. Venice, Italy.
David Woodruff is a Research Staff Member in IBM Research - Almaden's Principles and Methodologies Group. He received a B.S. in computer sceince, B.S. in mathematics, M.Eng in computer science and Ph.D in computer science, each from MIT. David also contributes to IBM's cognitive computing initiative, SyNAPSE.
No comments:
Post a Comment