Small Summaries for Big Data
The massive volume of data generated in modern
applications can overwhelm our ability to conveniently transmit, store, and index it. For many scenarios, building a compact
summary of a dataset that is vastly smaller enables flexibility and efficiency in a range of queries over the data, in exchange
for some approximation. This comprehensive introduction to data summarization, aimed at practitioners and students, showcases
the algorithms, their behavior, and the mathematical underpinnings of their operation. The coverage starts with simple sums
and approximate counts, building to more advanced probabilistic structures such as the Bloom Filter, distinct value summaries,
sketches, and quantile summaries. Summaries are described for specific types of data, such as geometric data, graphs, and
vectors and matrices. The authors offer detailed descriptions of and pseudocode for key algorithms that have been incorporated
in systems from companies such as Google, Apple, Microsoft, Netflix and Twitter.
1. Introduction; 2. Summaries for
sets; 3. Summaries for multisets; 4. Summaries for ordered data; 5. Geometric summaries; 6. Graph summaries; 7. Vector, matrix
and linear algebraic summaries; 8. Summaries over distributed data; 9. Other uses of summaries; 10. Lower bounds for summaries.
A comprehensive introduction to flexible, efficient tools for describing massive data sets to improve the scalability of