--- title: 'Shazam: Quantification of selection pressure' author: "Namita Gupta & Jason Anthony Vander Heiden & Julian Q. Zhou" date: '`r Sys.Date()`' output: pdf_document: dev: pdf fig_height: 4 fig_width: 7.5 highlight: pygments toc: yes html_document: fig_height: 4 fig_width: 7.5 highlight: pygments theme: readable toc: yes geometry: margin=1in fontsize: 11pt vignette: > %\VignetteIndexEntry{Selection quantification} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc} --- BASELINe quantifies selection pressure by calculating the posterior probability density function (PDF) based on observed mutations compared to expected mutation rates derived from an underlying SHM targeting model. Selection is quantified via the following steps: 1. Calculate the selection scores for individual sequences. 2. Group by relevant fields for comparison and convolve individual selection PDFs. 4. Plot and compare selection scores of different groups of sequences. ## Example data A small example AIRR Rearrangement database is included in the `alakazam` package. The example dataset consists of a subset of Ig sequencing data from an influenza vaccination study (Laserson and Vigneault et al., PNAS, 2014). The data include sequences from multiple time-points before and after the subject received an influenza vaccination. Quantifying selection requires the following fields (columns) to be present in the table: * `sequence_id` * `sequence_alignment` * `germline_alignment_d_mask` ```{r, eval=TRUE, warning=FALSE, message=FALSE} # Import required packages library(alakazam) library(shazam) # Load and subset example data (for faster demonstration) data(ExampleDb, package="alakazam") ExampleDb <- subset(ExampleDb, c_call %in% c("IGHA", "IGHG")) ``` ## Preprocessing Before starting the selection analysis, data need to be prepared in one of two ways: 1. Constructing clonal consensus sequences. 1. Incorporating lineage information. ### Constructing clonal consensus sequences Individual sequences within clonal groups are not, strictly speaking, independent events and it is generally appropriate to only analyze selection pressures on an effective sequence for each clonal group. The `collapseClones` function provides one strategy for generating an effective sequences for each clone. It reduces the input database to one row per clone and appends `clonal_sequence` and `clonal_germline` columns which contain the consensus sequences for each clone. ```{r, eval=TRUE, warning=FALSE, results="hide"} # Collapse clonal groups into single sequences clones <- collapseClones(ExampleDb, cloneColumn="clone_id", sequenceColumn="sequence_alignment", germlineColumn="germline_alignment_d_mask", regionDefinition=IMGT_V, method="thresholdedFreq", minimumFrequency=0.6, includeAmbiguous=FALSE, breakTiesStochastic=FALSE, nproc=1) ``` ### Incorporating lineage information For each clone, lineage information can be incorporated following these steps: ```{r eval=F, warning=F, results="hide"} # Subset to sequences with clone_id=3170 db_3170 <- subset(ExampleDb, clone_id == 3170) dim(db_3170) colnames(db_3170) # Generate a ChangeoClone object for lineage construction clone_3170 <- makeChangeoClone(db_3170, seq="sequence_alignment", germ="germline_alignment") # Run the lineage reconstruction dnapars_exec <- "/usr/local/bin/dnapars" graph_3170 <- buildPhylipLineage(clone_3170, dnapars_exec, rm_temp=TRUE) # Generating a data.frame from the lineage tree graph object, # and merge it with clone data.frame graph_3170_df <- makeGraphDf(graph_3170, clone_3170) dim(graph_3170_df) colnames(graph_3170_df) ``` `makeGraphDf` creates a `data.frame` with the column `parent_sequence`, which can be used to analyze mutations for each sequence relative to their `parent_sequence`. ## Calculate selection PDFs for individual sequences Selection scores are calculated with the `calcBaseline` function. This can be performed with a single call to `calcBaseline`, which performs all required steps. Alternatively, one can perform each step separately for greater control over the analysis parameters. ### Calculating selection in multiple steps Following construction of an effective sequence for each clone, the observed and expected mutation counts are calculated for each sequence in the `clonal_sequence` column relative to the `clonal_germline`. `observedMutations` is used to calculate the number of observed mutations and `expectedMutations` calculates the expected frequency of mutations. The underlying targeting model for calculating expectations can be specified using the `targetingModel` parameter. In the example below, the default `HH_S5F` is used. Column names for sequence and germline sequence may also be passed in as parameters if they differ from the Change-O defaults. Mutations are counted by these functions separately for complementarity determining (CDR) and framework (FWR) regions. The `regionDefinition` argument defines whether these regions are handled separately, and where the boundaries lie. There are several built-in region definitions in the `shazam` package, both dependent upon the V segment being IMGT-gapped: * `IMGT_V`: All regions in the V segment, excluding CDR3, grouped as either CDR or FWR. * `IMGT_V_BY_REGIONS`: The CDR1, CDR2, FWR1, FWR and FWR3 regions in the V segment (no CDR3) treated as individual regions. * `IMGT_VDJ`: All regions, including CDR3 and FWR4, grouped as either CDR or FWR. This `RegionDefinition` is initially empty, and one is created on the fly for each set of clonally related sequences. * `IMGT_VDJ_BY_REGIONS`: CDR1, CDR2, CDR3, FWR1, FWR, FWR3 and FWR4 regions treated as individual regions. This `RegionDefinition` is initially empty, and one is created on the fly for each set of clonally related sequences. Users may define other region sets and boundaries by creating a custom `RegionDefinition` object. ```{r, eval=TRUE, warning=FALSE, results="hide"} # Count observed mutations and append mu_count columns to the output observed <- observedMutations(clones, sequenceColumn="clonal_sequence", germlineColumn="clonal_germline", regionDefinition=IMGT_V, nproc=1) # Count expected mutations and append mu_expected columns to the output expected <- expectedMutations(observed, sequenceColumn="clonal_sequence", germlineColumn="clonal_germline", targetingModel=HH_S5F, regionDefinition=IMGT_V, nproc=1) ``` The counts of observed and expected mutations can be combined to test for selection using `calcBaseline`. The statistical framework used to test for selection based on mutation counts can be specified using the `testStatistic` parameter. ```{r, eval=TRUE, warning=FALSE, results="hide"} # Calculate selection scores using the output from expectedMutations baseline <- calcBaseline(expected, testStatistic="focused", regionDefinition=IMGT_V, nproc=1) ``` ### Calculating selection in one step It is not required for `observedMutation` and `expectedMutations` to be run prior to `calcBaseline`. If the output of these two steps does not appear in the input data.frame, then `calcBaseline` will call the appropriate functions prior to calculating selection scores. ```{r, eval=FALSE, warning=FALSE, results="hide"} # Calculate selection scores from scratch baseline <- calcBaseline(clones, testStatistic="focused", regionDefinition=IMGT_V, nproc=1) ``` ### Using alternative mutation definitions and models The default behavior of `observedMutations` and `expectedMutations`, and by extension `calcBaseline`, is to define a replacement mutation in the usual way - any change in the amino acid of a codon is considered a replacement mutation. However, these functions have a `mutationDefinition` argument which allows these definitions to be changed by providing a `MutationDefinition` object that contains alternative replacement and silent criteria. `shazam` provides the following built-in `MutationDefinition` objects: * `CHARGE_MUTATIONS`: Amino acid mutations are defined by changes in side chain charge class. * `HYDROPATHY_MUTATIONS`: Amino acid mutations are defined by changes in side chain hydrophobicity class. * `POLARITY_MUTATIONS`: Amino acid mutations are defined by changes in side chain polarity class. * `VOLUME_MUTATIONS`: Amino acid mutations are defined by changes in side chain volume class. The default behavior of `expectedMutations` is to use the human 5-mer mutation model, `HH_S5F`. Alternative SHM targeting models can be provided using the `targetingModel` argument. ```{r, eval=FALSE, warning=FALSE, results="hide"} # Calculate selection on charge class with the mouse 5-mer model baseline_mk_rs5nf <- calcBaseline(clones, testStatistic="focused", regionDefinition=IMGT_V, targetingModel=MK_RS5NF, mutationDefinition=CHARGE_MUTATIONS, nproc=1) ``` ## Group and convolve individual selection distributions To compare the selection scores of groups of sequences, the sequences must be convolved into a single PDF representing each group. In the example dataset, the `sample_id` field corresponds to samples taken at different time points before and after an influenza vaccination and the `c_call` field specifies the isotype of the sequence. The `groupBaseline` function convolves the BASELINe PDFs of individual sequences/clones to get a combined PDF. The field(s) by which to group the sequences are specified with the `groupBy` parameter. The `groupBaseline` function automatically calls `summarizeBaseline` to generate summary statistics based on the requested groupings, and populates the `stats` slot of the input `Baseline` object with the number of sequences with observed mutations for each region, mean selection scores, 95% confidence intervals, and p-values with positive signs indicating the presence of positive selection and/or p-values with negative signs indicating the presence of negative selection. The magnitudes of the p-values (without the signs) should be interpreted as analogous to a t-test. ### Grouping by a single annotation The following example generates a single selection PDF for each unique annotation in the `sample_id` column. ```{r, eval=TRUE, warning=FALSE, results="hide"} # Combine selection scores by time-point grouped_1 <- groupBaseline(baseline, groupBy="sample_id") ``` ### Subsetting and grouping by multiple annotations Grouping by multiple annotations follows the sample procedure as a single annotation by simply adding columns to the `groupBy` argument. Subsetting the data can be performed before or after generating selection PDFs via `calcBaseline`. However, note that subsetting may impact the clonal representative sequences generated by `collapseClones`. In the following example, subsetting precedes the collapsing of clonal groups. ```{r, eval=TRUE, warning=FALSE, results="hide"} # Subset the original data to switched isotypes db_sub <- subset(ExampleDb, c_call %in% c("IGHM", "IGHG")) # Collapse clonal groups into single sequence clones_sub <- collapseClones(db_sub, cloneColumn="clone_id", sequenceColumn="sequence_alignment", germlineColumn="germline_alignment_d_mask", regionDefinition=IMGT_V, method="thresholdedFreq", minimumFrequency=0.6, includeAmbiguous=FALSE, breakTiesStochastic=FALSE, nproc=1) # Calculate selection scores from scratch baseline_sub <- calcBaseline(clones_sub, testStatistic="focused", regionDefinition=IMGT_V, nproc=1) # Combine selection scores by time-point and isotype grouped_2 <- groupBaseline(baseline_sub, groupBy=c("sample_id", "c_call")) ``` ### Convolving variables at multiple levels To make selection comparisons using two levels of variables, you would need two iterations of groupings, where the first iteration of `groupBaseline` groups on both variables, and the second iteration groups on the "outer" variable. For example, if a data set has both case and control subjects annotated in `status` and `subject` columns, then generating convolved PDFs for each status would be performed as: ```{r, eval=FALSE, warning=FALSE, results="hide"} # First group by subject and status subject_grouped <- groupBaseline(baseline, groupBy=c("status", "subject")) # Then group the output by status status_grouped <- groupBaseline(subject_grouped, groupBy="status") ``` ### Testing the difference in selection PDFs between groups The `testBaseline` function will perform significance testing between two grouped BASELINe PDFs, by region, and return a data.frame with the following information: * `region`: The sequence region, such as `cdr` and `fwr`. * `test`: The name of the two groups compared. * `pvalue`: Two-sided p-value for the comparison. * `fdr`: FDR corrected p-value. ```{r, eval=TRUE} testBaseline(grouped_1, groupBy="sample_id") ``` ## Plot and compare selection scores for groups `plotBaselineSummary` plots the mean and confidence interval of selection scores for the given groups. The `idColumn` argument specifies the field that contains identifiers of the groups of sequences. If there is a secondary field by which the sequences are grouped, this can be specified using the `groupColumn`. This secondary grouping can have a user-defined color palette passed into `groupColors` or can be separated into facets by setting the `facetBy="group"`. The `subsetRegions` argument can be used to visualize selection of specific regions. Several examples utilizing these different parameters are provided below. ```{r, eval=TRUE, warning=FALSE} # Set sample and isotype colors sample_colors <- c("-1h"="seagreen", "+7d"="steelblue") isotype_colors <- c("IGHM"="darkorchid", "IGHD"="firebrick", "IGHG"="seagreen", "IGHA"="steelblue") # Plot mean and confidence interval by time-point plotBaselineSummary(grouped_1, "sample_id") # Plot selection scores by time-point and isotype for only CDR plotBaselineSummary(grouped_2, "sample_id", "c_call", groupColors=isotype_colors, subsetRegions="cdr") # Group by CDR/FWR and facet by isotype plotBaselineSummary(grouped_2, "sample_id", "c_call", facetBy="group") ``` `plotBaselineDensity` plots the full `Baseline` PDF of selection scores for the given groups. The parameters are the same as those for `plotBaselineSummary`. However, rather than plotting the mean and confidence interval, the full density function is shown. ```{r, eval=TRUE, warning=FALSE} # Plot selection PDFs for a subset of the data plotBaselineDensity(grouped_2, "c_call", groupColumn="sample_id", colorElement="group", colorValues=sample_colors, sigmaLimits=c(-1, 1)) ``` ## Editing a field in a Baseline object If for any reason you need to edit the existing values in a field in a `Baseline` object, you can do so via `editBaseline`. In the following example, we remove results related to IGHA in the relevant fields from `grouped_2`. When the input data is large and it takes a long time for `calcBaseline` to run, `editBaseline` could become useful when, for instance, you would like to exclude a certain sample or isotype, but would rather not re-run `calcBaseline` after removing that sample or isotype from the original input data. ```{r, eval=FALSE, warning=FALSE, results="hide"} # Get indices of rows corresponding to IGHA in the field "db" # These are the same indices also in the matrices in the fields "numbOfSeqs", # "binomK", "binomN", "binomP", and "pdfs" # In this example, there is one row of IGHA for each sample dbIgMIndex <- which(grouped_2@db[["c_call"]] == "IGHG") grouped_2 <- editBaseline(grouped_2, "db", grouped_2@db[-dbIgMIndex, ]) grouped_2 <- editBaseline(grouped_2, "numbOfSeqs", grouped_2@numbOfSeqs[-dbIgMIndex, ]) grouped_2 <- editBaseline(grouped_2, "binomK", grouped_2@binomK[-dbIgMIndex, ]) grouped_2 <- editBaseline(grouped_2, "binomN", grouped_2@binomN[-dbIgMIndex, ]) grouped_2 <- editBaseline(grouped_2, "binomP", grouped_2@binomP[-dbIgMIndex, ]) grouped_2 <- editBaseline(grouped_2, "pdfs", lapply(grouped_2@pdfs, function(pdfs) {pdfs[-dbIgMIndex, ]} )) # The indices corresponding to IGHA are slightly different in the field "stats" # In this example, there is one row of IGHA for each sample and for each region grouped_2 <- editBaseline(grouped_2, "stats", grouped_2@stats[grouped_2@stats[["c_call"]] != "IGHA", ]) ```