Publication

Enabling Graph Appliance for Genome Assembly...

by Rina Singh, Jeffrey A Graves, Sangkeun M Lee, Sreenivas R Sukumar, Mallikarjun Shankar
Publication Type
Conference Paper
Publication Date
Page Numbers
2583 to 2590
Conference Name
IEEE Big Data Workshop on Mining Big Data to Improve Clinical Effectiveness in conjunction with IEEE Big Data
Conference Location
Santa Clara, California, United States of America
Conference Date
-

In recent years, there has been a huge growth in
the amount of genomic data available as reads generated from
various genome sequencers. The number of reads generated can
be huge, ranging from hundreds to billions of nucleotide sequences, each
varying in size. Assembling such large amounts of data is one of
the challenging computational problems for both biomedical and
data scientists. Most of the genome assemblers developed have
used de Bruijn graph techniques. A de Bruijn graph represents
a collection of read sequences by billions of vertices and edges,
which require large amounts of memory and computational
power to store and process. This is the major drawback to
de Bruijn graph assembly. Massively parallel, multithreaded,
shared-memory systems can be leveraged to overcome some of
these issues. The objective of our research is to investigate the
feasibility and scalability issues of de Bruijn graph assembly
on Cray’s Urika-GD system. Urika-GD is a high-performance
graph appliance with a large shared memory and a massively
multithreaded custom processor designed for executing SPARQL
queries over large-scale RDF data sets. However, to the best of
our knowledge, there is no research on representing a de Bruijn
graph as an RDF graph or finding Eulerian paths in RDF graphs
using SPARQL for potential genome discovery. In this paper, we
address the issues involved in representing de Bruijn graphs
as RDF graphs and propose an iterative querying approach for
finding Eulerian paths in large RDF graphs. We evaluate the
performance of our implementation on real-world Ebola genome
datasets and illustrate how genome assembly can be accomplished
with Urika-GD using iterative SPARQL queries.
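
The pipeline the abstract describes (reads, then a de Bruijn graph, then an RDF encoding, then an Eulerian path) can be illustrated with a small Python sketch. This is not the authors' implementation: the paper finds Eulerian paths with iterative SPARQL queries on Urika-GD, whereas this sketch substitutes an in-memory Hierholzer traversal. The function names, the example.org URI scheme, the `next` predicate, and the toy reads are all assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Return one (prefix, suffix) edge per distinct k-mer found in the reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    # Each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
    return sorted((kmer[:-1], kmer[1:]) for kmer in kmers)


def to_ntriples(edges, base="http://example.org/dbg/"):
    """Serialize edges as N-Triples lines. The URI scheme is hypothetical."""
    return [f"<{base}node/{u}> <{base}next> <{base}node/{v}> ." for u, v in edges]


def eulerian_path(edges):
    """Hierholzer's algorithm (a stand-in for the paper's iterative SPARQL).

    Returns a list of nodes that uses every edge exactly once, or None if
    no such path exists.
    """
    if not edges:
        return []
    out_edges = defaultdict(list)
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for u, v in edges:
        out_edges[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    # Start where out-degree exceeds in-degree by one; if the graph is
    # balanced (Eulerian cycle), any node with an outgoing edge will do.
    start = next((n for n in sorted(nodes) if out_deg[n] - in_deg[n] == 1), None)
    if start is None:
        start = min(out_deg)
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # A valid Eulerian path visits exactly len(edges) + 1 nodes.
    return path if len(path) == len(edges) + 1 else None


def assemble(path):
    """Spell the sequence: first node, then the last base of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy reads (invented for this sketch, not Ebola data).
reads = ["ATGGCGT", "GGCGTGCA"]
edges = de_bruijn_edges(reads, k=4)
triples = to_ntriples(edges)
contig = assemble(eulerian_path(edges))
```

With these two toy reads and k = 4, the graph has seven edges and `contig` is `ATGGCGTGCA`. Real read sets contain repeats and sequencing errors, so an Eulerian path is generally neither unique nor guaranteed to exist.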