Publication

A Knowledge Graph Approach for the Secondary Use of Cancer Registry Data...

by S M Shamimul Hasan, Donna Rivera, Xiao-cheng Wu, James B Christian, Georgia Tourassi
Publication Type
Conference Paper
Journal Name
IEEE Journal of Biomedical and Health Informatics
Book Title
2019 IEEE-EMBS International Conferences on Biomedical and Health Informatics
Publication Date
Conference Name
2019 IEEE-EMBS International Conferences on Biomedical and Health Informatics (BHI 2019)
Conference Location
Chicago, Illinois, United States of America
Conference Sponsor
IEEE
Conference Date
-

Population-based central cancer registries collect valuable structured and unstructured data, primarily for cancer surveillance and research. These data enhance insights into clinical features associated with cancer occurrence, treatment, and outcomes, and guide interventions that reduce the cancer burden. Cancer registries primarily collect data on (1) cancer type (case or tumor); (2) patient demographics such as age, gender, and residential address at time of diagnosis; (3) planned first course of treatment; and (4) date of last contact, vital status, and cause of death. Cancer registry data is dynamic, structured data that is extracted from many unstructured sources, such as electronic healthcare records, and consolidated for reporting and other purposes. Although existing advanced analytic tools such as SEER*Stat can build SAS queries, we explore an innovative knowledge graph approach to organizing cancer registry data for advanced analytics and visualization, which has unique advantages over existing tools. This knowledge graph approach semantically enriches the data and readily enables linkage with third-party data, which can better explain variation in outcomes. We have developed a prototype knowledge graph based on data from the Louisiana Tumor Registry and other publicly available datasets, including Behavioral Risk Factor Surveillance System, Clinical Trials, DBpedia, GeoNames, Rural-Urban Continuum Codes, and Semantic MEDLINE. The Resource Description Framework (RDF) data model was selected to represent our knowledge graph, which contains more than 25 billion triples and is ~4TB in storage size. To exhibit the benefits of the knowledge graph approach, we used scenario-specific queries that find the relationships between cancer treatment sequences and outcomes.
To illustrate its ease of use in iterative analysis, the knowledge graph was linked to external datasets to perform complex queries across multiple datasets. In addition, we used the knowledge graph to identify data discrepancies and to handle schema changes. Finally, we visualized the knowledge graph to discover data patterns. Our results demonstrate that this graph-based solution enables cancer researchers to execute complex queries and more easily perform iterative analyses to improve understanding of cancer registry data. In the future, we would like to use high-performance computing (HPC) resources to generate hypotheses with clinical potential from our knowledge graph more quickly.
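
The abstract describes representing registry records as RDF triples and querying across linked datasets. The following is only a minimal sketch of that general idea, not the authors' implementation: it uses a plain Python list instead of an RDF store, and every identifier, predicate name (for example `reg:treatmentSequence`, `rucc:code`), and value is hypothetical and invented for illustration. It shows how a treatment-sequence-and-outcome query might be joined to a county-level external dataset through a shared node.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Toy triples. All identifiers and values are invented for illustration;
# none of them come from a real registry or from the paper.
TRIPLES: List[Triple] = [
    ("case:1", "reg:treatmentSequence", "seq:SurgeryThenRadiation"),
    ("case:1", "reg:vitalStatus", "status:Alive"),
    ("case:1", "reg:countyAtDiagnosis", "county:A"),
    ("case:2", "reg:treatmentSequence", "seq:RadiationThenSurgery"),
    ("case:2", "reg:vitalStatus", "status:Deceased"),
    ("case:2", "reg:countyAtDiagnosis", "county:B"),
    # Hypothetical external, county-level triples linked via the county node.
    ("county:A", "rucc:code", "1"),
    ("county:B", "rucc:code", "9"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def objects(triples: List[Triple], s: str, p: str) -> List[str]:
    """Return all objects for a given subject and predicate."""
    return [obj for _, _, obj in match(triples, s=s, p=p)]


def treatment_outcomes_with_rucc(triples: List[Triple]) -> List[Tuple[str, str, str, str]]:
    """Join each case's treatment sequence and vital status with the
    rural-urban code of its county (a join across two 'datasets')."""
    rows = []
    for case, _, seq in match(triples, p="reg:treatmentSequence"):
        for status in objects(triples, case, "reg:vitalStatus"):
            for county in objects(triples, case, "reg:countyAtDiagnosis"):
                for code in objects(triples, county, "rucc:code"):
                    rows.append((case, seq, status, code))
    return sorted(rows)


rows = treatment_outcomes_with_rucc(TRIPLES)
for row in rows:
    print(row)
```

In a real RDF store the same join would typically be written as a SPARQL query. The nested loops here only mimic how shared nodes (the county identifiers) let separately sourced triples be traversed together.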