Publication

Storage and Read-Optimized Data Placement Structures for High-Performance Analysis...

by Edmon Begoli, Pragneshkumar Patel, J. Blair Christian
Publication Type
Book Chapter
Publication Date
Page Number
171
Publisher Name
Institute for Operations Research and the Management Sciences (INFORMS)
Publisher Location
Catonsville, Maryland, United States of America

In this tutorial, we present state-of-the-art structures and methods for efficient data
preparation and representation for analysis. Our intent is to introduce the data science
and analytics communities to open source data placements, structures, and methods.
These practices can make the foundational processes of data preparation and access
dramatically more efficient than typical raw file or database representations, while
using less storage. To illustrate this, we introduce two highly efficient data
placement structures. We then present a tutorial, supported by step-by-step examples,
of how to create, use, and access data structured by Parquet or ORC using Apache
Spark. Finally, we illustrate the benefits of using these structures with computational
and storage volume benchmarks.