Abstract
In this tutorial, we present state-of-the-art structures and methods for efficient data
preparation and representation for analysis. Our intent is to introduce the data science
and analytics communities to open-source data placements, structures, and methods.
These practices can make the foundational processes of data preparation and access
dramatically more efficient than typical raw file or database representations and use
less storage. To illustrate this, we introduce two highly efficient data
placement structures. We then present a tutorial, supported by step-by-step examples,
of how to create, use and access data, structured by Parquet or ORC, using Apache
Spark. Finally, we illustrate the benefits of using these structures with computational
and storage volume benchmarks.