Publication

Analyzing large biological datasets with association networks...

by Tatiana V Karpinets, Byung H Park, Edward C Uberbacher
Publication Type
Journal
Journal Name
Nucleic Acids Research
Publication Date
Volume
40
Issue
17

Due to advances in high-throughput biotechnologies, biological information is being collected in databases at an amazing rate, requiring novel computational approaches for timely processing of the collected data into new knowledge. In this study, we address this problem by developing a new approach for discovering modular structure, relationships and regularities in complex data. These goals are achieved by converting records of biological annotations of an object, such as an organism, gene, chemical or sequence, into networks (Anets) and rules (Arules) of the associated annotations. Anets are based on the similarity of annotation profiles of objects and can be further analyzed and visualized, providing a compact bird's-eye view of the most significant relationships in the collected data and a way to cluster and classify them. Arules are generated by ‘Apriori’, treating each record of annotations as a transaction and augmenting each annotation item with its type. Arules provide a way to validate relationships discovered by Anets, producing comprehensive statistics on frequently associated annotations and specific confident relationships among them. A combination of Anets and Arules represents condensed information on associations among the collected data, helping to discover new knowledge and generate hypotheses. As an example, we have applied the approach to analyze bacterial metadata from the Genomes OnLine Database. The analysis allowed us to produce a map of sequenced bacterial and archaeal organisms based on their genomic, metabolic and physiological characteristics, with three major clusters of metadata representing bacterial pathogens, environmental isolates and plant symbionts. A signature profile of clustered annotations of environmental bacteria, when compared with that of pathogens, linked aerobic respiration, high GC content and large genome size to the diversity of metabolic activities and physiological features of the organisms.
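
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy records, the `type=value` item encoding, the thresholds and the use of Jaccard similarity for the network are all assumptions made for illustration; the abstract does not state which similarity measure or parameters were used. The rule search is a simplified Apriori-style pass restricted to item pairs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records: organism -> {annotation type: annotation value}.
records = {
    "org1": {"oxygen": "aerobe", "habitat": "soil", "gc": "high"},
    "org2": {"oxygen": "aerobe", "habitat": "soil", "gc": "high"},
    "org3": {"oxygen": "anaerobe", "habitat": "host", "gc": "low"},
    "org4": {"oxygen": "aerobe", "habitat": "host", "gc": "low"},
}


def to_transaction(record):
    """Treat a record as a transaction; tag each item with its annotation type."""
    return frozenset(f"{kind}={value}" for kind, value in record.items())


transactions = {name: to_transaction(rec) for name, rec in records.items()}


def pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Apriori-style pass limited to pairs: only frequent single items
    are combined into candidate pairs (assumed thresholds)."""
    transactions = list(transactions)
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter(
        pair
        for t in transactions
        for pair in combinations(sorted(t & frequent), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


def jaccard(a, b):
    """Shared annotations divided by all annotations (assumed similarity measure)."""
    return len(a & b) / len(a | b)


def association_network(transactions, min_similarity=0.5):
    """Keep an edge between two objects whose annotation profiles are similar enough."""
    edges = {}
    for u, v in combinations(sorted(transactions), 2):
        similarity = jaccard(transactions[u], transactions[v])
        if similarity >= min_similarity:
            edges[(u, v)] = similarity
    return edges


rules = pair_rules(transactions.values())
network = association_network(transactions)
```

On this toy data, `rules` holds tuples of the form `(lhs, rhs, support, confidence)` and `network` maps organism pairs to their similarity. The rules can then be used to check whether annotations that co-occur within a network cluster are also frequently and confidently associated across all records.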