Publication

Clustering High-dimensional Toxicogenomics Data with Rare Signals...

by Guojing Cong, Scott Auerbach
Publication Type
Conference Paper
Book Title
2023 IEEE International Conference on Data Mining Workshops (ICDMW)
Publication Date
Page Numbers
1 to 7
Publisher Location
New Jersey, United States of America
Conference Name
2023 IEEE International Conference on Data Mining Workshops (ICDMW)
Conference Location
Shanghai, China
Conference Sponsor
Various
Conference Date
-

Toxicogenomics studies gene and protein activity in response to drug treatments or toxic exposures. Because drugs and genes are numerous, toxicogenomics data are naturally high-dimensional, with dimension sizes reaching the millions. In addition, the distribution of toxicogenomics data is often skewed, and the data contain rare but important signals that represent a cell’s or organism’s response to toxicity. The combination of high dimensionality and an extremely skewed distribution makes clustering analysis very challenging. We present our study of clustering toxicogenomics data using classical approaches such as principal component analysis as well as deep learning approaches such as autoencoders. Our experiments show that these approaches fail to preserve rare signals and fail to produce high-quality clusters. We then explore augmenting matrix factorization with deep learning techniques such as attention mechanisms to produce latent representations for clustering. Our technique preserves rare signals after dimensionality reduction better than prior approaches. Furthermore, we combine our augmented matrix factorization with a mechanism similar to an autoencoder to balance separable clusters against low regeneration error. Our experiments demonstrate better clustering with our proposed approach.
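As an illustration only, the classical baseline mentioned in the abstract (principal component analysis for dimensionality reduction) can be sketched as follows. This is not the authors' code or data: the synthetic matrix, the choice of five "rare" rows, the shift of 6.0, and the helper names `make_skewed_data` and `pca_reduce` are all assumptions made for this example. The sketch computes a per-sample reconstruction error so that a reader can inspect how well a low-dimensional projection retains the rare rows compared with the rest.

```python
import numpy as np


def make_skewed_data(n_samples=300, n_features=50, n_rare=5, seed=0):
    """Synthetic stand-in for skewed data (hypothetical, not real toxicogenomics data)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0, size=(n_samples, n_features))
    rare_rows = np.arange(n_rare)
    # Assumed rare signal: a strong shift in the first three features of a few rows.
    X[rare_rows, :3] += 6.0
    return X, rare_rows


def pca_reduce(X, k):
    """Project rows of X onto the top-k principal components using an SVD."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    Z = Xc @ components.T            # latent representation, shape (n_samples, k)
    X_hat = Z @ components + mean    # reconstruction in the original space
    return Z, X_hat


X, rare_rows = make_skewed_data()
Z, X_hat = pca_reduce(X, k=2)

# Mean squared reconstruction error per sample.
row_err = np.mean((X - X_hat) ** 2, axis=1)
common_rows = np.setdiff1d(np.arange(X.shape[0]), rare_rows)
print("mean error on rare rows:  ", row_err[rare_rows].mean())
print("mean error on common rows:", row_err[common_rows].mean())
```

The latent matrix `Z` is what a downstream clustering algorithm would consume. Comparing the two printed errors for different values of `k` is one simple way to probe whether rare rows are represented as faithfully as common ones; the abstract's own evaluation may use different criteria.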