Evaluating machine learning approaches for host prediction using H3 influenza genomic data

Background: H3 influenza A viruses (IAV) have been shown to frequently cross the species barrier, which can be an important factor in sustained transmission and spread. Machine learning methods have been widely explored for host prediction of IAV using genomic data; however, such studies often use data from only one of the eight IAV segments, or use all available IAV data to predict only broad categories of hosts.

Objective: The objective of this study was to train machine learning models on H3 IAV sequence data from all eight genome segments to predict distinct hosts, and to validate model performance.

Methods: For each of the eight IAV genome segments, models were trained on both k-mer and amino acid property features using machine learning algorithms that included random forest and XGBoost. Models were then validated on a test dataset through analysis of the class predicted probabilities and subsequently used to investigate between-species transmission patterns in case studies including canine H3N8, swine H3N2 2010.2, and duck H3 sequences.
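
As an illustration of the k-mer features mentioned in the Methods, the following is a minimal sketch of how a sequence could be turned into a fixed-length k-mer frequency vector. It is not the study's pipeline: the abstract does not state the value of k or whether k-mers were taken over nucleotides or amino acids, so k = 3, the nucleotide alphabet, the function names `kmer_counts` and `kmer_vector`, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3, alphabet="ACGT"):
    """Count overlapping k-mers, skipping windows with characters outside the alphabet."""
    sequence = sequence.upper()
    allowed = set(alphabet)
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        # Ambiguous bases such as N are skipped (an assumption, not the study's rule).
        if set(window) <= allowed:
            counts[window] += 1
    return counts


def kmer_vector(sequence, k=3, alphabet="ACGT"):
    """Return a frequency-normalised feature vector over all possible k-mers."""
    counts = kmer_counts(sequence, k, alphabet)
    total = sum(counts.values())
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Hypothetical toy sequence, not taken from the study's data.
toy_sequence = "ATGAAGACTATCATTGCTTTGAGC"
features = kmer_vector(toy_sequence, k=3)
```

A vector of this kind, computed per segment, could then be passed to a classifier such as random forest or XGBoost; with the nucleotide alphabet and k = 3 it has 4^3 = 64 entries.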

Results: Models demonstrated strong performance in host prediction across all eight segments on the test dataset, with overall accuracies ranging from 0.995 to 0.997 and κ (kappa) values ranging from 0.984 to 0.990. Misclassified test dataset sequences with high predicted probabilities (> 90%) were checked against the available literature and were frequently associated with between-species transmission events. Between-species transmission patterns in the class predicted probabilities for the case studies were also consistent with the literature, for both correctly and incorrectly classified sequences.
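
The two summary metrics reported above, and the screening of confidently misclassified sequences, can be sketched as follows. This is a hedged illustration rather than the study's code: κ is computed here as unweighted Cohen's kappa, (observed agreement − chance agreement) / (1 − chance agreement), and the host labels, probabilities, and function names `accuracy_and_kappa` and `confident_misclassifications` are hypothetical. Only the 0.9 probability threshold comes from the text.

```python
from collections import Counter


def accuracy_and_kappa(true_labels, predicted_labels):
    """Return overall accuracy and unweighted Cohen's kappa."""
    n = len(true_labels)
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    true_freq = Counter(true_labels)
    pred_freq = Counter(predicted_labels)
    # Chance agreement from the marginal class frequencies.
    expected = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined with a single class
    return observed, (observed - expected) / (1 - expected)


def confident_misclassifications(true_labels, class_probabilities, threshold=0.9):
    """List sequences whose wrong predicted class has probability above the threshold."""
    flagged = []
    for index, (truth, probs) in enumerate(zip(true_labels, class_probabilities)):
        predicted = max(probs, key=probs.get)
        if predicted != truth and probs[predicted] > threshold:
            flagged.append((index, truth, predicted, probs[predicted]))
    return flagged


# Hypothetical labels and probabilities, for illustration only.
true_hosts = ["avian", "avian", "swine", "swine", "human", "canine"]
probabilities = [
    {"avian": 0.97, "swine": 0.01, "human": 0.01, "canine": 0.01},
    {"avian": 0.03, "swine": 0.95, "human": 0.01, "canine": 0.01},
    {"avian": 0.02, "swine": 0.96, "human": 0.01, "canine": 0.01},
    {"avian": 0.03, "swine": 0.40, "human": 0.55, "canine": 0.02},
    {"avian": 0.00, "swine": 0.01, "human": 0.99, "canine": 0.00},
    {"avian": 0.01, "swine": 0.00, "human": 0.01, "canine": 0.98},
]
predicted_hosts = [max(p, key=p.get) for p in probabilities]

accuracy, kappa = accuracy_and_kappa(true_hosts, predicted_hosts)
flagged = confident_misclassifications(true_hosts, probabilities)
```

In this toy example two of six sequences are misclassified, but only the one predicted as swine with probability 0.95 exceeds the threshold. Sequences flagged this way are the kind that would be followed up in the literature as candidate between-species transmission events.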

Conclusions: These models allow rapid and accurate host prediction for H3 IAV datasets from any of the eight IAV segments and provide a solid framework for identifying variants with higher-than-typical between-species transmission potential. However, results obtained on selected case studies suggest that further improvements to the training and validation processes should be considered.