BASELINe quantifies selection pressure by calculating the posterior probability density function (PDF) based on observed mutations compared to expected mutation rates derived from an underlying SHM targeting model. Selection is quantified via the following steps:
A small example AIRR Rearrangement database is included in the
alakazam
package. The example dataset consists of a subset
of Ig sequencing data from an influenza vaccination study (Laserson and
Vigneault et al., PNAS, 2014). The data include sequences from multiple
time-points before and after the subject received an influenza
vaccination. Quantifying selection requires the following fields
(columns) to be present in the table:
sequence_id
sequence_alignment
germline_alignment_d_mask
Before starting the selection analysis, data need to be prepared in one of two ways:
Constructing clonal consensus sequences.
Incorporating lineage information.
Individual sequences within clonal groups are not, strictly speaking,
independent events and it is generally appropriate to only analyze
selection pressures on an effective sequence for each clonal group. The
collapseClones
function provides one strategy for
generating an effective sequences for each clone. It reduces the input
database to one row per clone and appends clonal_sequence
and clonal_germline
columns which contain the consensus
sequences for each clone.
# Collapse clonal groups into single sequences
clones <- collapseClones(ExampleDb, cloneColumn="clone_id",
sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
regionDefinition=IMGT_V,
method="thresholdedFreq", minimumFrequency=0.6,
includeAmbiguous=FALSE, breakTiesStochastic=FALSE,
nproc=1)
For each clone, lineage information can be incorporated following these steps:
# Subset to sequences with clone_id=3170
db_3170 <- subset(ExampleDb, clone_id == 3170)
dim(db_3170)
colnames(db_3170)
# Generate a ChangeoClone object for lineage construction
clone_3170 <- makeChangeoClone(db_3170, seq="sequence_alignment", germ="germline_alignment")
# Run the lineage reconstruction
dnapars_exec <- "/usr/local/bin/dnapars"
graph_3170 <- buildPhylipLineage(clone_3170, dnapars_exec, rm_temp=TRUE)
# Generating a data.frame from the lineage tree graph object,
# and merge it with clone data.frame
graph_3170_df <- makeGraphDf(graph_3170, clone_3170)
dim(graph_3170_df)
colnames(graph_3170_df)
makeGraphDf
creates a data.frame
with the
column parent_sequence
, which can be used to analyze
mutations for each sequence relative to their
parent_sequence
.
Selection scores are calculated with the calcBaseline
function. This can be performed with a single call to
calcBaseline
, which performs all required steps.
Alternatively, one can perform each step separately for greater control
over the analysis parameters.
Following construction of an effective sequence for each clone, the
observed and expected mutation counts are calculated for each sequence
in the clonal_sequence
column relative to the
clonal_germline
. observedMutations
is used to
calculate the number of observed mutations and
expectedMutations
calculates the expected frequency of
mutations. The underlying targeting model for calculating expectations
can be specified using the targetingModel
parameter. In the
example below, the default HH_S5F
is used. Column names for
sequence and germline sequence may also be passed in as parameters if
they differ from the Change-O defaults.
Mutations are counted by these functions separately for
complementarity determining (CDR) and framework (FWR) regions. The
regionDefinition
argument defines whether these regions are
handled separately, and where the boundaries lie. There are several
built-in region definitions in the shazam
package, both
dependent upon the V segment being IMGT-gapped:
IMGT_V
: All regions in the V segment, excluding CDR3,
grouped as either CDR or FWR.IMGT_V_BY_REGIONS
: The CDR1, CDR2, FWR1, FWR and FWR3
regions in the V segment (no CDR3) treated as individual regions.IMGT_VDJ
: All regions, including CDR3 and FWR4, grouped
as either CDR or FWR. This RegionDefinition
is initially
empty, and one is created on the fly for each set of clonally related
sequences.IMGT_VDJ_BY_REGIONS
: CDR1, CDR2, CDR3, FWR1, FWR, FWR3
and FWR4 regions treated as individual regions. This
RegionDefinition
is initially empty, and one is created on
the fly for each set of clonally related sequences.Users may define other region sets and boundaries by creating a
custom RegionDefinition
object.
# Count observed mutations and append mu_count columns to the output
observed <- observedMutations(clones,
sequenceColumn="clonal_sequence",
germlineColumn="clonal_germline",
regionDefinition=IMGT_V, nproc=1)
# Count expected mutations and append mu_expected columns to the output
expected <- expectedMutations(observed,
sequenceColumn="clonal_sequence",
germlineColumn="clonal_germline",
targetingModel=HH_S5F,
regionDefinition=IMGT_V, nproc=1)
The counts of observed and expected mutations can be combined to test
for selection using calcBaseline
. The statistical framework
used to test for selection based on mutation counts can be specified
using the testStatistic
parameter.
# Calculate selection scores using the output from expectedMutations
baseline <- calcBaseline(expected, testStatistic="focused",
regionDefinition=IMGT_V, nproc=1)
## calcBaseline will use existing observed and expected mutations, in the fields: mu_count_cdr_r, mu_count_cdr_s, mu_count_fwr_r, mu_count_fwr_s and mu_expected_cdr_r, mu_expected_cdr_s, mu_expected_fwr_r, mu_expected_fwr_s
It is not required for observedMutation
and
expectedMutations
to be run prior to
calcBaseline
. If the output of these two steps does not
appear in the input data.frame, then calcBaseline
will call
the appropriate functions prior to calculating selection scores.
The default behavior of observedMutations
and
expectedMutations
, and by extension
calcBaseline
, is to define a replacement mutation in the
usual way - any change in the amino acid of a codon is considered a
replacement mutation. However, these functions have a
mutationDefinition
argument which allows these definitions
to be changed by providing a MutationDefinition
object that
contains alternative replacement and silent criteria.
shazam
provides the following built-in
MutationDefinition
objects:
CHARGE_MUTATIONS
: Amino acid mutations are defined by
changes in side chain charge class.HYDROPATHY_MUTATIONS
: Amino acid mutations are defined
by changes in side chain hydrophobicity class.POLARITY_MUTATIONS
: Amino acid mutations are defined by
changes in side chain polarity class.VOLUME_MUTATIONS
: Amino acid mutations are defined by
changes in side chain volume class.The default behavior of expectedMutations
is to use the
human 5-mer mutation model, HH_S5F
. Alternative SHM
targeting models can be provided using the targetingModel
argument.
To compare the selection scores of groups of sequences, the sequences
must be convolved into a single PDF representing each group. In the
example dataset, the sample_id
field corresponds to samples
taken at different time points before and after an influenza vaccination
and the c_call
field specifies the isotype of the sequence.
The groupBaseline
function convolves the BASELINe PDFs of
individual sequences/clones to get a combined PDF. The field(s) by which
to group the sequences are specified with the groupBy
parameter.
The groupBaseline
function automatically calls
summarizeBaseline
to generate summary statistics based on
the requested groupings, and populates the stats
slot of
the input Baseline
object with the number of sequences with
observed mutations for each region, mean selection scores, 95%
confidence intervals, and p-values with positive signs indicating the
presence of positive selection and/or p-values with negative signs
indicating the presence of negative selection. The magnitudes of the
p-values (without the signs) should be interpreted as analogous to a
t-test.
The following example generates a single selection PDF for each
unique annotation in the sample_id
column.
Grouping by multiple annotations follows the sample procedure as a
single annotation by simply adding columns to the groupBy
argument. Subsetting the data can be performed before or after
generating selection PDFs via calcBaseline
. However, note
that subsetting may impact the clonal representative sequences generated
by collapseClones
. In the following example, subsetting
precedes the collapsing of clonal groups.
# Subset the original data to switched isotypes
db_sub <- subset(ExampleDb, c_call %in% c("IGHM", "IGHG"))
# Collapse clonal groups into single sequence
clones_sub <- collapseClones(db_sub, cloneColumn="clone_id",
sequenceColumn="sequence_alignment",
germlineColumn="germline_alignment_d_mask",
regionDefinition=IMGT_V,
method="thresholdedFreq", minimumFrequency=0.6,
includeAmbiguous=FALSE, breakTiesStochastic=FALSE,
nproc=1)
# Calculate selection scores from scratch
baseline_sub <- calcBaseline(clones_sub, testStatistic="focused",
regionDefinition=IMGT_V, nproc=1)
## calcBaseline will calculate observed and expected mutations for clonal_sequence using clonal_germline as a reference.
To make selection comparisons using two levels of variables, you
would need two iterations of groupings, where the first iteration of
groupBaseline
groups on both variables, and the second
iteration groups on the “outer” variable. For example, if a data set has
both case and control subjects annotated in status
and
subject
columns, then generating convolved PDFs for each
status would be performed as:
The testBaseline
function will perform significance
testing between two grouped BASELINe PDFs, by region, and return a
data.frame with the following information:
region
: The sequence region, such as cdr
and fwr
.test
: The name of the two groups compared.pvalue
: Two-sided p-value for the comparison.fdr
: FDR corrected p-value.## region test pvalue fdr
## 1 cdr -1h != +7d 0.04494357 0.08988715
## 2 fwr -1h != +7d 0.49922881 0.49922881
plotBaselineSummary
plots the mean and confidence
interval of selection scores for the given groups. The
idColumn
argument specifies the field that contains
identifiers of the groups of sequences. If there is a secondary field by
which the sequences are grouped, this can be specified using the
groupColumn
. This secondary grouping can have a
user-defined color palette passed into groupColors
or can
be separated into facets by setting the facetBy="group"
.
The subsetRegions
argument can be used to visualize
selection of specific regions. Several examples utilizing these
different parameters are provided below.
# Set sample and isotype colors
sample_colors <- c("-1h"="seagreen", "+7d"="steelblue")
isotype_colors <- c("IGHM"="darkorchid", "IGHD"="firebrick",
"IGHG"="seagreen", "IGHA"="steelblue")
# Plot mean and confidence interval by time-point
plotBaselineSummary(grouped_1, "sample_id")
# Plot selection scores by time-point and isotype for only CDR
plotBaselineSummary(grouped_2, "sample_id", "c_call", groupColors=isotype_colors,
subsetRegions="cdr")
# Group by CDR/FWR and facet by isotype
plotBaselineSummary(grouped_2, "sample_id", "c_call", facetBy="group")
plotBaselineDensity
plots the full Baseline
PDF of selection scores for the given groups. The parameters are the
same as those for plotBaselineSummary
. However, rather than
plotting the mean and confidence interval, the full density function is
shown.
# Plot selection PDFs for a subset of the data
plotBaselineDensity(grouped_2, "c_call", groupColumn="sample_id", colorElement="group",
colorValues=sample_colors, sigmaLimits=c(-1, 1))
If for any reason you need to edit the existing values in a field in
a Baseline
object, you can do so via
editBaseline
. In the following example, we remove results
related to IGHA in the relevant fields from grouped_2
. When
the input data is large and it takes a long time for
calcBaseline
to run, editBaseline
could become
useful when, for instance, you would like to exclude a certain sample or
isotype, but would rather not re-run calcBaseline
after
removing that sample or isotype from the original input data.
# Get indices of rows corresponding to IGHA in the field "db"
# These are the same indices also in the matrices in the fields "numbOfSeqs",
# "binomK", "binomN", "binomP", and "pdfs"
# In this example, there is one row of IGHA for each sample
dbIgMIndex <- which(grouped_2@db[["c_call"]] == "IGHG")
grouped_2 <- editBaseline(grouped_2, "db", grouped_2@db[-dbIgMIndex, ])
grouped_2 <- editBaseline(grouped_2, "numbOfSeqs", grouped_2@numbOfSeqs[-dbIgMIndex, ])
grouped_2 <- editBaseline(grouped_2, "binomK", grouped_2@binomK[-dbIgMIndex, ])
grouped_2 <- editBaseline(grouped_2, "binomN", grouped_2@binomN[-dbIgMIndex, ])
grouped_2 <- editBaseline(grouped_2, "binomP", grouped_2@binomP[-dbIgMIndex, ])
grouped_2 <- editBaseline(grouped_2, "pdfs",
lapply(grouped_2@pdfs, function(pdfs) {pdfs[-dbIgMIndex, ]} ))
# The indices corresponding to IGHA are slightly different in the field "stats"
# In this example, there is one row of IGHA for each sample and for each region
grouped_2 <- editBaseline(grouped_2, "stats",
grouped_2@stats[grouped_2@stats[["c_call"]] != "IGHA", ])