Publication

Storage and Read-Optimized Data Placement Structures for High-Performance Analysis...

by Edmon Begoli, Pragneshkumar Patel, J. Blair Christian

Publication Type

Book Chapter

Publication Date

November, 2016

Page Number

171

Publisher Name

Institute for Operations Research and the Management Sciences (INFORMS)

Publisher Location

Catonsville, Maryland, United States of America

Abstract

In this tutorial, we present state-of-the-art structures and methods for efficient data
preparation and representation for analysis. Our intent is to introduce the data science
and analytics communities to open source data placements, structures, and methods.
These practices can make the foundational processes of data preparation and access
dramatically more efficient than typical raw file or database representations, and can
use storage more conservatively. To illustrate this, we introduce two highly efficient data
placement structures. We then present a tutorial, supported by step-by-step examples,
of how to create, use, and access data structured as Parquet or ORC using Apache
Spark. Finally, we illustrate the benefits of using these structures with computational
and storage volume benchmarks.