Comparison of missing data handling methods for variant pathogenicity predictors
Jan 09, 2026

Modern clinical genetic tests use next-generation sequencing (NGS) to comprehensively analyze genetic variants from patients. From millions of variants, the clinically relevant ones that match the patient’s phenotype must be identified accurately and rapidly. Because manual evaluation cannot meet the speed and volume requirements of clinical genetic testing, automated solutions are needed. Various machine learning (ML), artificial intelligence (AI), and in silico variant pathogenicity predictors have been developed to address this challenge. These solutions rely on comprehensive data and struggle with sparse genetic annotations. Careful treatment of missing data is therefore necessary, and the methods selected can have a large impact on accuracy, reliability, speed, and computational cost.

Mikko Särkkä and co-authors presented an open-source framework called AMISS that can be used to evaluate the performance of different methods for handling missing genetic variant data in the context of variant pathogenicity prediction. Using AMISS, they evaluated 14 methods for handling missing values. The performance of these methods varied substantially in terms of precision, computational cost, and other attributes.

The study concluded that sophisticated missing data methods are unnecessary for treating missing values when building variant pathogenicity metapredictors. Instead, simple unconditional imputation methods, and even zero imputation, give higher performance and save significant computational time, leading to considerable cost savings if adopted. This highlights the conceptual separation between missing data methods for prediction and imputation for statistical inference; the latter requires carefully constructed techniques to reach correct conclusions.
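To make "simple unconditional imputation" concrete, the following is a minimal sketch in plain Python, not code from AMISS. It assumes missing annotations are encoded as `None`; the function names (`fit_mean_imputer`, `fit_zero_imputer`, `impute`) and the toy feature values are hypothetical. Each column receives a single fill value that does not depend on the other features of the variant, and that value is learned from the training rows only before being applied to new rows.

```python
from statistics import mean

MISSING = None  # assumption: missing annotation values are encoded as None


def fit_mean_imputer(train_rows):
    """Learn one fill value per column (the column mean) from training rows only."""
    n_cols = len(train_rows[0])
    fills = []
    for j in range(n_cols):
        observed = [row[j] for row in train_rows if row[j] is not MISSING]
        # Fall back to 0.0 if a column has no observed values at all.
        fills.append(mean(observed) if observed else 0.0)
    return fills


def fit_zero_imputer(train_rows):
    """Zero imputation: every column is filled with 0.0, nothing is estimated."""
    return [0.0] * len(train_rows[0])


def impute(rows, fills):
    """Replace each missing entry with the fill value of its column."""
    return [
        [fills[j] if value is MISSING else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy annotation matrix: rows are variants, columns are features.
train = [
    [1.0, None, 12.0],
    [0.0, 0.25, None],
    [None, 0.75, 8.0],
]
test = [[None, None, 10.0]]

mean_fills = fit_mean_imputer(train)
zero_fills = fit_zero_imputer(train)

test_mean = impute(test, mean_fills)
test_zero = impute(test, zero_fills)
```

Both imputers cost a single pass over the training data, whereas conditional or model-based methods must estimate relationships between features before any value can be filled in, which is where the additional computational time comes from.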

Särkkä M, Myöhänen S, Marinov K, Saarinen I, Lahti L, Fortino V, Paananen J. Comparison of missing data handling methods for variant pathogenicity predictors. NAR Genom Bioinform. 2025;7(4):lqaf133. doi:10.1093/nargab/lqaf133

Last modified: January 09, 2026