Evolutionary study of covid-19 strains and their close relatives.

Introduction

Among infectious diseases, the cold and the flu are the most familiar. Behind these medical terms, however, hides a wide variety of pathogens: viruses, bacteria, or both. Coronaviruses are a group of viruses that cause both benign and severe respiratory infections. Four genera are recognized: Alpha-, Beta-, Gamma- and Deltacoronavirus. The genus Betacoronavirus contains the causative agents of SARS (Severe Acute Respiratory Syndrome) [1]. SARS is a recent human disease, first identified in 2002 during the first SARS outbreak [2]. SARS coronaviruses have a wide range of known hosts: rodents, birds, bats, etc. Several studies have demonstrated that human SARS coronaviruses have an animal origin [1]. But unlike flu viruses, which come from domesticated animals [3] [4] [5] (which explains their annual recurrence), SARS coronaviruses come from wild animals, which implies less contact with humans and therefore less frequent emergence.

When a virus mutates and becomes able to infect new hosts, the host-parasite relationship is destabilized, which can lead to the appearance of aggressive strains. The consequences can be multiple, such as higher pathogenicity or a disproportionate host immune response. However, the greater the imbalance between host and parasite, the greater the risk that the virus exhausts its new ecological niche. One of the best-adapted human viruses is the Epstein-Barr virus, which infects 95% of the adult human population, usually without symptoms [6].

In this work, we wanted to confirm the origin of covid-19 (SARS coronavirus 2) and the links between covid-19 strains. We used a genetic approach, performing a phylogenetic analysis.

Results

The genome (complete genetic information) of a coronavirus is composed of two main regions. The first corresponds to the 1ab polyprotein, which carries the functions of genome replication. The second encodes the structural proteins (the spike (S), membrane (M), envelope (E), and nucleocapsid (N) proteins), which build the virion particle (figure 1). The crown-like appearance of this particle, reminiscent of the solar corona, gives coronaviruses their name.

SARS genome
Figure 1: Diagram of SARS genomes. 1ab: 1ab polyprotein, S: spike protein, M: membrane protein, E: envelope protein, N: nucleocapsid protein.

To study the evolutionary history of covid-19, we decided to focus on the 1ab polyprotein, whose evolutionary rate is lower than that of the second region and which is more congruent with the evolutionary history of the species. Structural proteins accumulate mutations that reflect the host-parasite relationship (immune escape, host adhesion) more than the evolutionary forces that drive the species. This noise can disrupt the phylogenetic tree and introduce biases that lead to misinterpretation. Moreover, using the whole genome sequence would not allow us to include a large panel of closely related species while keeping a good level of completeness. Finally, the 1ab polyprotein spans about 20 kb of a total genome size of 30 kb. On the other hand, we did not wish to restrict our analysis to the coding part of the RNA polymerase alone, for fear of limiting the phylogenetic signal and of splitting up a unit that is transcribed as a whole.

The sequence data search revealed 8 groups of 1ab polyprotein genes in the Coronavirinae subfamily. The group containing the covid-19 sequences was selected and a phylogenetic reconstruction was performed (figure 2). Five clusters are identified and well supported, with bootstrap scores of 100:

phylogeny betacoronavirus
Figure 2: Betacoronavirus phylogeny based on the 1ab polyprotein gene. 20 103 sites were used to build the phylogenetic tree. On each leaf, the accession number of the unique sequence and the organism names of all its duplicate sequences are indicated. The covid-19 cluster is shown in red, pangolin coronavirus in blue, bat SARS coronavirus in green and yellow, and previous human SARS coronavirus in pink.

Covid-19 sequences do not cluster with sequences from previous human SARS outbreaks, confirming a new emergence from strains infecting other animals. Among the virus genomes sequenced by the scientific community, those isolated from feces of Rhinolophus affinis in 2013 in China are the closest relatives. Pangolin coronavirus sequences are genetically too distant to suggest a transmission from pangolins to humans. Nevertheless, the phylogenetic analysis shows two main genetic groups: on one hand covid-19 and the pangolin coronavirus, on the other hand the two bat coronavirus clusters and the previous human SARS coronavirus cluster.

Figure 3 zooms in on the covid-19 cluster. Unfortunately, the phylogenetic tree is poorly supported, with very low bootstrap scores. The phylogenetic signal carried by the 1ab polyprotein gene is not sufficient to distinguish the covid-19 strains. Nevertheless, some strains sharing the same sequence of this gene were found in different countries:

The most frequently resequenced sequence of the 1ab polyprotein gene (MT259248, 182x) is found only in the USA.

phylogeny covid19
Figure 2: Covid-19 phylogeny based on the 1ab polyprotein gene. 20 103 sites were used to build the phylogenetic tree. On each leaf, the accession number of the unique sequence, the number of duplicate sequences, the first sampling date and the country are indicated. Samples from America are shown in red, Asia in blue, Europe in green, Middle East in yellow, Africa in violet and Oceania in pink. Blue triangles indicate the oldest samples.

A phylogenomic analysis was performed to increase the robustness of the previous tree. Unfortunately, phylogenetic distances between strains are particularly short. Bootstrap support is better than with the 1ab polyprotein gene alone, but still too low to allow the detection of specific clusters and of the relationships between them. A synteny analysis was also carried out, without conclusive results (only 1 block detected).

phylogenomic covid19
Figure 2: Covid-19 phylogenomics. 29 478 sites were used to build a phylogenetic tree based on the covid-19 genome. On each leaf, the accession number of the unique sequence, the number of duplicate sequences, the first sampling date and the country are indicated. Samples from America are shown in red, Asia in blue, Europe in green, Middle East in yellow, Africa in violet and Oceania in pink. Blue triangles indicate the oldest samples. Bootstrap scores above 70 are shown on the nodes.
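
To give a sense of what "short phylogenetic distances" between strains means, the sketch below computes a simple p-distance (the proportion of differing sites) between two aligned sequences. It is an illustration only: the function name `p_distance` is hypothetical, and the distances in this study come from the substitution model used for tree reconstruction, not from this raw proportion.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') or an ambiguous base ('N')
    are skipped. Hypothetical helper, for illustration only.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue  # ignore sites that cannot be compared
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared
```

As a purely hypothetical order of magnitude, two genomes that differ at 3 of 29 478 aligned sites would have a p-distance of roughly 0.0001, which leaves very little signal for resolving clusters.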

*updates*

The 1ab polyprotein gene phylogeny of previous human SARS coronavirus (figure 3) is not congruent with the host species phylogeny based on the RAG2 gene (figure 4). The bat SARS coronavirus clusters indicate several tropism changes or a broad ability to infect different bat species. In the species phylogeny, Rhinolophus ferrumequinum sequences form a separate cluster supported by a bootstrap score of 97%. However, the SARS phylogeny fails to group the sequences from this host species together. Previous SARS epidemics seem to be linked to infections from Rhinolophus ferrumequinum, unlike covid-19. SARS sequences from Rhinolophus sinicus show the same pattern, indicating cross-infection within the genus Rhinolophus or even with the genus Aselliscus.

phylogeny SARS 1
Figure 3: Previous human SARS coronavirus phylogeny based on the 1ab polyprotein gene. 20 903 sites were used to build the phylogenetic tree. On each leaf, the accession number and the host organism names of all duplicate sequences are indicated. Samples from primate hosts are shown in red, civet in green, Rhinolophus in blue and Aselliscus in purple. The 4 bat SARS coronavirus clusters are marked with symbols: rectangle, circle, diamond and triangle. Bootstrap scores above 80 are shown on the nodes.
Species phylogeny
Figure 4: Species phylogeny based on the RAG2 (Recombination Activating Gene 2) gene. 420 sites were used to build the phylogenetic tree. On each leaf, the accession number and the organism names of all duplicate sequences are indicated. Samples from primate hosts are shown in red, civet in green, Rhinolophus in blue, Mustela in gold, Manis in pink and Aselliscus in purple. Bootstrap scores above 80 are shown on the nodes.

*updates*

The SNP (single nucleotide polymorphism) analysis of covid-19 strains revealed 18 SNPs in the genome, 13 of them located in CDS. Among these, 7 SNPs (in the 1ab polyprotein and orf3a genes) cause strong changes in the protein sequences, substituting an amino acid with one that has different physico-chemical properties (figure 5).

SNP covid19
Figure 5: SNP analysis of the covid-19 genome.
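
The SNP analysis described above can be illustrated with a minimal sketch: scan the columns of a nucleotide alignment for positions where more than one base occurs, then flag amino acid substitutions that cross physico-chemical classes. The function names, the toy input and the four-class grouping (nonpolar, polar, acidic, basic) are assumptions chosen for illustration; the text does not say which tool or classification was actually used.

```python
from collections import Counter

# Coarse textbook grouping of amino acids (an assumption, one of many possible).
AA_CLASSES = {
    "nonpolar": set("GAVLIMFWP"),
    "polar": set("STCYNQ"),
    "acidic": set("DE"),
    "basic": set("KRH"),
}


def find_snps(alignment: dict[str, str]) -> list[tuple[int, dict[str, int]]]:
    """Return (1-based position, base counts) for every polymorphic column."""
    sequences = [s.upper() for s in alignment.values()]
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same aligned length")
    snps = []
    for pos in range(length):
        # Count only unambiguous bases; gaps and N are ignored.
        counts = Counter(s[pos] for s in sequences if s[pos] in "ACGT")
        if len(counts) > 1:
            snps.append((pos + 1, dict(counts)))
    return snps


def aa_class(residue: str) -> str:
    """Name of the physico-chemical class of a one-letter amino acid code."""
    for name, members in AA_CLASSES.items():
        if residue.upper() in members:
            return name
    raise ValueError(f"unknown amino acid: {residue!r}")


def is_radical_substitution(aa_from: str, aa_to: str) -> bool:
    """True when the substitution changes the physico-chemical class."""
    return aa_class(aa_from) != aa_class(aa_to)
```

For example, `find_snps({"s1": "ACGTAC", "s2": "ACTTAC"})` reports position 3 as polymorphic, and `is_radical_substitution("D", "K")` flags an acidic-to-basic change.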

Conclusion

Our phylogenetic analysis based on the 1ab polyprotein gene confirms the emergence of covid-19 from strains infecting Rhinolophus affinis, a Chinese bat, and refutes an origin from the pangolin coronavirus. This gene cannot be used for genotyping covid-19. Nevertheless, the phylogenetic analysis reveals a few genotypes harbouring the same polyprotein. The whole-genome approach was not able to reveal clusters of strains. Evolutionary events in human covid-19 are too recent to allow efficient and robust genotyping.

*updates*

Previous human SARS infections are linked to transmission from the species Rhinolophus ferrumequinum. The study of the other bat SARS coronaviruses suggests cross-infections of closely related coronavirus strains among bat species. In-depth knowledge of these bat species and their coronaviruses is necessary to better understand and anticipate the future risk of emergence of SARS strains infecting humans.

Methods

Sequences were retrieved using ACNUC [7] and the taxonomic id of the Coronavirinae subfamily (2 296 sequences for 2 309 coding sequences). Raw data were filtered with the BiomandaData pipeline to remove incomplete and low-quality sequences. Coding sequences (CDS) were sorted with the BiomandaTools TRI_CDS pipeline into 446 groups of closely related gene sequences. Sequences were then dereplicated.
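
Dereplication, the last step above, collapses identical sequences into one representative while keeping track of the records it stands for (this is what allows a tree leaf to list a unique sequence together with its duplicates). A minimal sketch, assuming sequences are held in a dictionary keyed by identifier; the function name `dereplicate` is hypothetical and this is not the BiomandaTools implementation.

```python
from collections import defaultdict


def dereplicate(records: dict[str, str]) -> dict[str, list[str]]:
    """Group identical sequences.

    Returns a mapping from the first-seen identifier of each distinct
    sequence to the list of all identifiers sharing that sequence.
    Comparison is case-insensitive.
    """
    ids_by_sequence: dict[str, list[str]] = defaultdict(list)
    for seq_id, sequence in records.items():
        ids_by_sequence[sequence.upper()].append(seq_id)
    # The first identifier in each group serves as the representative.
    return {ids[0]: ids for ids in ids_by_sequence.values()}
```

With toy input `{"a": "ACGT", "b": "acgt", "c": "ACGA"}`, records `a` and `b` collapse into one group represented by `a`, and `c` stays alone.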

CDS from the group containing the covid-19 sequences were aligned using Clustal Omega [8] (version 1.2.1) and Seaview [9] (version 4.6.2). Sites were selected manually and a phylogenetic tree was reconstructed with the PhyML [10] algorithm (version 3.1, parameters: GTR, Bootstrap 100, NNI, BioNJ). Dendrograms were annotated using Treedyn [11] (version 198.3).
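
The "Bootstrap 100" parameter refers to non-parametric bootstrapping: alignment columns are resampled with replacement to build 100 pseudo-alignments, a tree is inferred from each, and the support of a node is the percentage of replicate trees that contain it. The sketch below illustrates only the resampling step, under the assumption that the alignment is a dictionary of equal-length strings; the function names and the `seed` parameter are hypothetical, and PhyML performs this step internally.

```python
import random


def bootstrap_alignment(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    if not alignment:
        raise ValueError("alignment is empty")
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()
    # The same sampled columns are used for every sequence, so each column
    # of the replicate is a whole column of the original alignment.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_replicates(
    alignment: dict[str, str], n_replicates: int = 100, seed: int = 0
) -> list[dict[str, str]]:
    """Build n_replicates pseudo-alignments from one input alignment."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

Each replicate has the same number of sites as the original alignment; when the strains differ at only a handful of sites, many replicates miss those sites, which is one reason bootstrap scores stay low.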