Abstract
Toxicogenomics studies gene and protein activity in response to drug treatments or toxic exposures. Because both drugs and genes are numerous, toxicogenomics data are naturally high dimensional, with dimensionality reaching into the millions. In addition, the distribution of toxicogenomics data is often highly skewed, and the data contain rare but important signals that represent a cell's or organism's response to toxicity. The combination of high dimensionality and extreme skew makes clustering analysis especially challenging. We present a study of clustering toxicogenomics data using classical approaches such as principal component analysis as well as deep learning approaches such as autoencoders. Our experiments show that these approaches fail both to preserve rare signals and to produce high-quality clusters. We then explore augmenting matrix factorization with deep learning techniques such as attention mechanisms to produce latent representations for clustering; this technique preserves rare signals after dimensionality reduction better than prior approaches. Furthermore, we combine our augmented matrix factorization with an autoencoder-like mechanism to balance cluster separability against reconstruction error. Our experiments demonstrate that the proposed approach yields better clusters.