---
title: "Characterizing viral quasispecies"
author: "Guerrero-Murillo, M and Gregori, J"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
bibliography: diversity.bib
output: 
    BiocStyle::html_document:
        number_sections: yes
        toc_float: true
vignette: >
    %\VignetteIndexEntry{QSutils-Diversity}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Quasispecies complexity by biodiversity indices

RNA and DNA viruses that replicate by low fidelity polymerases generate a viral
quasispecies, a collection of closely related viral genomes.

The high replication error rate that generates this quasispecies is due to a 
lack of proofreading mechanisms. This feature, in addition to the high 
replication rate of some viruses, leads to generation of many mutations 
in each cycle of viral replication.

The complexity of the quasispecies contributes greatly to the adaptive 
potential of the virus. The complexity of a viral quasispecies can be 
defined as an intrinsic property that quantifies the diversity and frequency of
haplotypes, independently of the population size that contains them.
[@Gregori2016] [@Gregori2014]

Quasispecies complexity can explain or predict viral behavior, so it has an 
obvious interest for clinical purposes. We are often interested in comparing 
the viral diversity indices between sequential samples from a single patient 
or between samples from groups of patients. These comparisons can provide
information on the patient’s clinical progression or the appropriateness of
a given treatment. 

QSUtils is a package intended for use with quasispecies amplicon NGS data, but
it could also be useful for analyzing 16S/18S ribosomal-based metagenomics or 
tumor genetic diversity by amplicons.

In this tutorial we illustrate the use of functions contained in the package 
to compute diversity indices, which can be classified into incidence-based, 
abundance-based, and functional indices.

#Install package and load data

First, the package should be installed and loaded.

```{r install,message=FALSE}
BiocManager::install("QSutils")
library(QSutils)
```

Then, the dataset to work with the diversity indices should be loaded. This 
vignette deals with the different ways to load data:
[Simulation vignette for package users](QSutils-Simulation.html) is also 
available.


```{r load-ex}
filepath<-system.file("extdata","ToyData_10_50_1000.fna", package="QSutils")
lst <- ReadAmplSeqs(filepath,type="DNA")
lst
```


# Incidence-based diversity indices or richness indices

Incidence-based diversity indices correspond to the number of observed 
entities, regardless of their abundances. The main incidence indices are the 
number of haplotypes, the number of polymorphic sites, and the number of 
mutations. These indices can be computed as follows:

+ number of haplotypes

    ```{r numbhaplo}
    length(lst$hseqs)
    ```

    In this fasta file, 10 haplotypes are reported. This can be calculated with
    the length of the DNAStringSet object.

+ The number of polymorphic sites

    ```{r segsites}
    SegSites(lst$hseqs)
    ```

    In this example, there are 14 polymorphic sites; that is, mutated positions
    with respect to the dominant haplotype.

+ Number of mutations

    ```{r nummutations}
    TotalMutations(lst$hseqs)
    ```

    Returns the total number of mutations observed in the alignment. 

# Abundance-based diversity indices

Abundance-based diversity indices take into account the entities present and 
their relative abundance in the population.

+ Shannon entropy

    ```{r shannon}
    Shannon(lst$nr)
    NormShannon(lst$nr)
    ```

    Shannon entropy is a measure of uncertainty. This index varies from 0, when 
    there is a single haplotype, to the logarithm of the number of haplotypes. 
    It represents the uncertainty in assigning a random genome to the 
    corresponding haplotype in the population studied. This index is commonly 
    used in the normalized form, although the non-normalized form is 
    recommended. The normalized form constitutes a measure of evenness rather 
    than diversity.
    
    The variance of Shannon entropy and normalized Shannon entropy can be 
    computed with:

    ```{r shannon var}
    ShannonVar(lst$nr)
    NormShannonVar(lst$nr)
    ```

+ Gini-Simpson

    ```{r ginnisimpsion}
    GiniSimpson(lst$nr)
    GiniSimpsonMVUE(lst$nr)
    ```
    
    The Gini-Simpson index expresses the probability that two randomly sampled 
    molecules from the viral population correspond to different haplotypes. 
    It has a range of variation from 0, when there is a single haplotype, to 1
    (asymptotically), when there are an infinity of equally abundant 
    haplotypes. Interpretation of this index is clearer and more intuitive than
    that of Shannon entropy, and it is scarcely sensitive to rare haplotypes, 
    as it gives higher weight to the more abundant variants.
    
    The variance of the Gini-Simpson index is obtained with:

    ```{r ginivar}
    GiniSimpsonVar(lst$nr)
    ```
    
+ Hill numbers

    Although Shannon entropy, the Gini-Simpson index, and many other reported
    indices are considered measures of diversity, their units do not allow an 
    easily interpreted and intuitive measure of true diversity. They are not 
    linear with respect to the addition of new haplotypes, and they tend to 
    saturate: the larger the number of haplotypes, the less sensitive they are 
    to frequency changes. In contrast, Hill numbers are a generalization of 
    diversity measures in units of equally abundant species.

    ```{r hill}
    Hill(lst$nr,q=1)
    ```

    Hill function computes the Hill number for a given order, _q_, but in 
    QSutils there is a wrapper function that computes a profile of Hill numbers
    for different _q_ values, so that one can see what happens when _q_ 
    increases in a plot.

    ```{r hillplot, fig.cap="Hill numbers profile", eval=FALSE}
    HillProfile(lst$nr)
    plot(HillProfile(lst$nr),type ="b", main="Hill numbers obtained
        with HillProfile")
    ```
    
    ```{r hillprofile,echo=FALSE}
    HillProfile(lst$nr)
    ```
    
```{r plothill, fig.cap="Hill numbers profile", echo=FALSE}
plot(HillProfile(lst$nr),type ="b", main="Hill numbers obtained
    with HillProfile")
```

As the order of the Hill number increases, the measure of diversity becomes
less sensitive to rare haplotypes. Note that the Hill number of order _q_=0 
is simply the number of haplotypes. For q=1 the Hill number is undefined, 
but as approaches 1 it tends to the exponential of Shannon entropy. For
_q=2_ the Hill number corresponds to the inverse of the Simpson index and
for _q_=`Inf` it is the inverse of the relative abundance of the rarest 
haplotype.
    
+ Rényi entropy

    ```{r renyi}
    Renyi(lst$nr,q=3)
    ```

    Rényi entropy is a log transformation of the Hill numbers.    
    
    ```{r reniyprofile,fig.cap="Rényi entropy profile",eval=FALSE}
    RenyiProfile(lst$nr)
    plot(RenyiProfile(lst$nr),type ="b", main="Rényi entropy obtained with
        RenyiProfile")
    ```

    ```{r plotreniy,echo=FALSE}
    RenyiProfile(lst$nr)
    ```

```{r reniyplot,fig.cap="Rényi entropy profile",echo=FALSE}
plot(RenyiProfile(lst$nr),type ="b", main="Rényi entropy obtained with
    RenyiProfile")
```
    
`RenyProfile`, a wrapper function of `Renyi`, computes the Rényi entropy for
a set of predefined _q_ to compute a plot to see the behavior of this index 
as _q_ increases.
    
+ Havrda-Charvat estimator

    ```{r hcq}
    HCq(lst$nr, q= 4)
    ```
    
    Havrda-Charvat entropy, a measure of the diversity of each population, is a
    generalization of Shannon entropy.
    
    ```{r hcqvar}
    HCqVar(lst$nr, q= 4)
    ```
    
    The asymptotic variance for a given exponent of this estimator can be 
    computed with `HCqVar`. 
    
    ```{r hcprofile,fig.cap="Havrda-Charvat entropy profile",eval=FALSE}
    HCqProfile(lst$nr)
    plot(HCqProfile(lst$nr),type ="b", main="Havrda-Charvat entropy obtained 
        with HCqProfile")
    ```

    ```{r hcqplot,echo=FALSE}
    HCqProfile(lst$nr)
    ```
    
```{r plothcq,fig.cap="Havrda-Charvat entropy profile",echo=FALSE}
plot(HCqProfile(lst$nr),type ="b", main="Havrda-Charvat entropy obtained 
    with HCqProfile")
```
    
The `HCq` function computes Havrda-Charvat entropy for a given order, _q_, 
and HCqProfile is a wrapper function that computes a profile of 
Havrda-Charvat entropy for different _q_ values.
    
# Functional diversity

Going a step further, functional diversity takes into account the differences 
between haplotypes in the quasispecies by considering their genetic distances.

## Incidence-based functional diversity indices

This set of functions only considers the distances between haplotypes.

```{r dist ,message=FALSE,fig.cap="Correlation among haplotype distances"}
dst <- DNA.dist(lst$hseqs,model="raw")
library(psych)
D <- as.matrix(dst)
rownames(D) <- sapply(rownames(D),function(str) strsplit(str,
                split="\\|")[[1]][1])
colnames(D) <- rownames(D)
D
```

+ Average mutation frequency by entity

    ```{r mfe}
    MutationFreq(dst) 
    ```

    The mean mutation frequency by entity, _Mfe_, measures the fraction of 
    nucleotides in the haplotype alignment that differ from the dominant 
    haplotype in the quasispecies 

+ FAD

    ```{r fad}
    FAD(dst)
    ```

    The functional attribute diversity, _FAD_, is the sum of pairwise distances
    between haplotypes in the alignment.

+ Nucleotide diversity by entity

    ```{r pi_e}
    NucleotideDiversity(dst)  
    ```

    The nucleotide diversity by entity, \(\pi_e\), is a transformation of `FAD`
    and it expresses the average difference between haplotypes in the 
    alignment.


## Abundance-based functional diversity indices

+ Average mutation frequency by molecule

    ```{r mfm}
    nm <- nmismatch(pairwiseAlignment(lst$hseqs,lst$hseqs[1]))
    MutationFreq(nm=nm,nr=lst$nr,len=width(lst$hseqs)[1])
    ```

    The proportion of different nucleotides at the molecular level, _Mfm_,
    measures the fraction of nucleotides in the population that differ from 
    the dominant haplotype in the quasispecies.
    
    ```{r mfvar}
    MutationFreqVar(nm,lst$nr,len=width(lst$hseqs)[1])
    ```
    
    The variance of _Mfm_ is calculated with the `MutationFreqVar` function.
    
+ Nucleotide diversity 

    ```{r pi}
    NucleotideDiversity(dst,lst$nr)
    ```

    The nucleotide diversity, \(\pi_m\), which is related to Rao entropy in 
    ecology, corresponds to the mean genetic distance between molecules in
    the population.

+ Rao entropy

    ```{r rao}
    RaoPow(dst,4,lst$nr)
    ```

    To obtain the Rao entropy for a given order of _q_, the RaoPow function is
    used.

    ```{r raovar}
    RaoVar(dst,lst$nr)
    ```

    The variance of \(\pi_m\) is calculated with the `RaoVar` function.

    ```{r raoprofile,fig.cap="Havrda-Charvat entropy profile",eval=FALSE}
    RaoPowProfile(dst,lst$nr)
    plot(RaoPowProfile(dst,lst$nr),type ="b", main="Rao entropy obtained 
        with RaoPowProfile")
    ```

    ```{r raoplot,echo=FALSE}
    RaoPowProfile(dst,lst$nr)
    ```
    
```{r plotrao,fig.cap="Rao entropy profile",echo=FALSE}
plot(RaoPowProfile(dst,lst$nr),type ="b", main="Rao entropy obtained 
        with RaoPowProfile")
```

# Sample size and bias

The diversity we measure in  viral quasispecies samples is necessarily tied to 
the sample size. The larger the coverage, the larger the number of haplotypes,
polymorphic sites, and mutations we can find. The same is true for Shannon 
entropy and other diversity indices.
Because of the variability in all experimental procedures, the best pooling of
samples never results in perfectly balanced coverage, which complicates the 
comparison of diversity indices biased by sample size. Hence, a sample size 
correction method is required to obtain fair inferences. The literature 
regarding ecology shows several correction methods, which unfortunately, are 
not applicable to NGS because of one differential trait. With NGS, sooner or 
later we are obliged to eliminate all haplotypes present in frequencies below 
a threshold, considered the noise level of the technique. This lower end of 
frequencies contains the information required by most established corrections.


The correction we found that works best in this situation consists in 
determining the minimum coverage of all samples to be compared and down 
sampling the other samples to this minimum, simply by rescaling. All haplotypes
with resulting frequencies below a given threshold are then removed with high
confidence, in a step we have dubbed fringe trimming. The $DSFT$ function is 
used to correct the quasispecies composition to a given sample size. It takes
four parameters: $nr$, the vector of haplotype frequencies in the original 
sample; $size$, the total number of reads in the new sample prior to abundance 
filtering; $p.cut$, the abundance threshold in the filter; and $conf$, the 
level of confidence in trimming haplotypes, which defaults to 0.95. $DSFT$ 
returns a vector of logicals with FALSE for all haplotypes to be removed by the
correction.

The next code shows the impact of each data treatment step when determining 
Shannon entropy. First, we simulate the frequencies of haplotypes in a 
quasispecies population. Then we take repeated samples from this population, 
2000 and 5000 reads in size. We then compute Shannon entropy of the raw 
samples,and the samples after noise filtering at an abundance of 0.1%, and 
after the DSFT corrections at size 2000, and filtering at 0.2% with 95% 
confidence.

```{r downsampling,fig.cap="Diversity index variations due to sample size"}
set.seed(123)
n <- 2000
y <- geom.series(n,0.8)+geom.series(n,0.00025)
nr.pop <- round(y*1e7)

thr <- 0.1
sz1 <- 5000
sz2 <- 10000

qs.sample <- function(nr.pop,sz1,sz2){ 
    div <- numeric(6)
    nr.sz1 <- table(sample(length(nr.pop),size=sz1,replace=TRUE,p=nr.pop))
    div[1] <- Shannon(nr.sz1)
    nr.sz1 <- nr.sz1[nr.sz1>=sz1*thr/100]
    div[3] <- Shannon(nr.sz1)
    div[5] <- Shannon(nr.sz1[DSFT(nr.sz1,sz1)])
    
    nr.sz2 <- table(sample(length(nr.pop),size=sz2,replace=TRUE,p=nr.pop))
    div[2] <- Shannon(nr.sz2)
    nr.sz2 <- nr.sz2[nr.sz2>=sz2*thr/100]
    div[4] <- Shannon(nr.sz2)
    div[6] <- Shannon(nr.sz2[DSFT(nr.sz2,sz1)])
    div
}

hpl.sim <- replicate(2000,qs.sample(nr.pop,sz1,sz2))
nms <- paste(c(sz1,sz2))

par(mfrow=c(1,3))

boxplot(t(hpl.sim[1:2,]),names=nms,col="lavender",las=2,
        ylab="Shannon entropy",main="raw")
boxplot(t(hpl.sim[3:4,]),names=nms,col="lavender",las=2,
        ylab="Shannon entropy",main="filt")
boxplot(t(hpl.sim[5:6,]),names=nms,col="lavender",las=2,
        ylab="Shannon entropy",main="DSFT")
```

As is evident, the distribution of Shannon entropy values in raw samples of 
10000 reads shows higher values than those in samples of 5000 reads, even 
though the samples come from the same viral population.

After removing all haplotypes with frequencies below 0.1%, the effect is 
reversed: Shannon entropies are much lower for the larger samples. Correction
by DSFT brings the two distributions to comparable levels.

This behavior is typical of samples with long queues of very low fitness and
defective haplotypes, possibly worsened by technical errors in sample 
preparation and sequencing.

## The load of rare haplotypes 

We found that the fraction of reads belonging to haplotypes below 1% in 
frequency, the rare haplotype load (RHL), is a robust index of quasispecies
diversity, especially suitable for detecting mutagenic processes. This index
is also robust and does not require sample size corrections, provided that 
coverage is high enough. The RHL computation is done without a previous 
abundance filter.

Let’s generate a random quasispecies with an RHL of about 25%:

```{r ex-dsft}
set.seed(123)
n <- 2000
y <- geom.series(n,0.8)+geom.series(n,0.0002)
nr.pop <- round(y*1e7)
rare <- nr.pop/sum(nr.pop) < 0.01
RHL <- sum(nr.pop[rare])/sum(nr.pop)
round(RHL,4)
```

Repeating the sampling process seen before for Shannon entropy, we obtain 
median RHL values that are very close to the population value for both sample
sizes.

The population value is shown in the next figure as a dash-dot horizontal line.

```{r ex2-dsft}
thr <- 0.1
sz1 <- 5000
sz2 <- 10000
qs.sample <- function(nr.pop,sz1,sz2){
    div <- numeric(2)
    nr.sz1 <- table(sample(length(nr.pop),size=sz1,replace=TRUE,p=nr.pop))
    rare <- nr.sz1/sum(nr.sz1) < 0.01
    div[1] <- sum(nr.sz1[rare])/sum(nr.sz1)
    nr.sz2 <- table(sample(length(nr.pop),size=sz2,replace=TRUE,p=nr.pop))
    rare <- nr.sz1/sum(nr.sz2) < 0.01
    div[2] <- sum(nr.sz2[rare])/sum(nr.sz2)
    div
}

hpl.sim <- replicate(2000,qs.sample(nr.pop,sz1,sz2))
nms <- paste(c(sz1,sz2))
boxplot(t(hpl.sim),names=nms,col="lavender",las=2,ylab="RHL")
abline(h=RHL,lty=4,col="navy")
```

***

```{r,echo=FALSE}
sessionInfo()
```

#References