Publication

Improving Data Availability for Better Access Performance: A Study on Caching Scientific Data on Distributed Workstations...

by Xiaosong Ma, Zhe Zhang, Sudharshan S Vazhkudai
Publication Type
Journal
Journal Name
Journal of Grid Computing
Publication Date
Page Numbers
419 to 438
Volume
7
Issue
4

Client-side data caching serves as an excellent mechanism to
store and analyze the rapidly growing amount of scientific data.
In our previous work, we built a distributed local cache from
storage contributed by unreliable desktop workstations. This design offers several
desirable properties, such as performance impedance
matching, improved space utilization, and high parallel I/O bandwidth.
Such a low-cost, best-effort cache, however, faces the vagaries of
storage node availability: donated machines may be significantly less reliable than dedicated
systems and cannot be controlled centrally.

In this paper, we address
the performance impact of data availability in the distributed scientific data
cache setting.
We then present a novel approach to storage cache management,
remote partial data recovery (RPDR).
We compare our approach to two standard techniques,
namely replication and erasure coding, both extended to the target caching
environment.
Our evaluation uses a trace-driven simulation parameterized with
benchmarking results from our distributed cache prototype.
The results with multiple real-world traces
indicate that RPDR significantly outperforms both replication and erasure coding in many
cases, and that, overall, the combination of RPDR and erasure coding yields the best
performance.
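
As a rough illustration of the two baseline techniques named above, the following sketch compares the probability that a cached dataset is readable under replication and under (n, k) erasure coding. It is not the paper's simulation model: it assumes every storage node is up independently with the same probability p, and the function names and parameter values are hypothetical, chosen only for this example.

```python
from math import comb


def replication_availability(p: float, copies: int) -> float:
    """Probability that at least one of `copies` replicas is on a live node.

    Assumption: each node is up independently with probability p.
    """
    return 1.0 - (1.0 - p) ** copies


def erasure_availability(p: float, n: int, k: int) -> float:
    """Probability that at least k of n coded fragments are on live nodes.

    Assumption: same independent-node model; any k fragments suffice
    to reconstruct the data.
    """
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    p = 0.8  # hypothetical per-node availability
    # Both configurations below use 2x the raw storage space.
    print("2 replicas:    ", replication_availability(p, copies=2))
    print("(8, 4) erasure:", erasure_availability(p, n=8, k=4))
```

Under this simplified model, replication with c copies costs c times the raw space, while (n, k) coding costs n/k times the raw space, so the two can be compared at equal storage overhead.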