Irwin Jungreis

Research Scientist, Massachusetts Institute of Technology

SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes

Despite its clinical importance, the SARS-CoV-2 gene set remains unresolved, hindering dissection of COVID-19 biology. We use comparative genomics to provide a high-confidence protein-coding gene set, characterize protein-level and nucleotide-level evolutionary constraint, and prioritize functional mutations from the COVID-19 pandemic. We select 44 Sarbecovirus genomes at evolutionary distances ideally suited for protein-coding and non-coding element identification, create whole-genome alignments, and quantify protein-coding evolutionary signatures and overlapping constraint. We find strong protein-coding signatures for named genes and ORFs 3a, 6, 7a, 7b, 8, 9b, and also ORF3c, a novel alternate-frame gene. By contrast, ORF10 and overlapping ORFs 2b, 3d, 3d-2, 3b, and 9c lack protein-coding signatures or convincing experimental evidence of protein-coding function. We show that no other conserved protein-coding genes remain to be discovered. Mutation analysis suggests ORF8 contributes to fitness within an individual but not to person-to-person transmission. Cross-strain and within-strain evolutionary pressures largely agree, except for fewer-than-expected mutations in nsp3 and S1, and more-than-expected mutations in nucleocapsid. We examine the evolutionary histories of residues disrupted by the spike-protein mutations of concern D614G, N501Y, E484K, and K417N/T to find clues about their biology, and catalog co-inherited mutations that disrupt otherwise-perfectly-conserved residues and are likely to have functional consequences. Previously reported RNA-modification sites show no enrichment for conservation.