Bulik-Sullivan EC, Roy S, Elliott RJ, Kassam Z, Lichtman SN, Carroll IM, Gulati AS. Intestinal microbial and metabolic alterations following successful fecal microbiota transplant for D-lactic acidosis. 2018.
This is the analysis and graphics creation pipeline used in our article “Intestinal microbial and metabolic alterations following successful fecal microbiota transplant for D-lactic acidosis.” Our pipeline is based directly on the one developed by Callahan et al. in “DADA2: High resolution sample inference from Illumina amplicon data” (2016; PMID: 27214047).
We generated Illumina MiSeq 250 bp paired-end, single-indexed reads of the 16S rRNA gene V4 region (available from the National Center for Biotechnology Information’s Sequence Read Archive (SRA) under accession number SRP129013) from four fecal samples, three from the patient being treated and one from the OpenBiome FMT preparation. Before entering the pipeline described here, primer content was removed using cutadapt version 1.10 (Marcel Martin, DOI: http://dx.doi.org/10.14806/ej.17.1.200). These demultiplexed, primer-free files were cleaned of contaminating human sequences during the process of being uploaded to the SRA. We used the FASTQ files now downloadable from the SRA to generate Figures 1A and B in our manuscript. Figures 2A, B, and C were created with stool pH and stool/plasma lactic acid metabolite data, available as tables throughout this document.
Command to trim primers:
cutadapt -g GAGTGCCAGCMGCCGCGGTAA -G ACGGACTACDBGGGTWTCTAAT \
-o /F_output_fp/Forward_trimmed.fastq.gz \
-p /R_output_fp/Reverse_trimmed.fastq.gz \
/F_input_fp/Forward.fastq.gz \
/R_input_fp/Reverse.fastq.gz
Identity | Barcode | Original read count | Number of reads removed by SRA |
---|---|---|---|
DLA_17pre | ATTCTCTCACGT | 7143 | 1 |
DLA_8post | GTCGAATTTGCG | 42600 | 17 |
DLA_38post | GTGGTCATCGTA | 26442 | 14 |
DLA_OBdonor | CTGAAGGGCGAA | 111344 | 1 |
library(dada2); packageVersion("dada2")
## [1] '1.6.0'
library(phyloseq); packageVersion("phyloseq")
## [1] '1.22.3'
library(ggplot2); packageVersion("ggplot2")
## [1] '2.2.1'
library(cowplot); packageVersion("cowplot")
## [1] '0.9.2'
library(plyr); packageVersion("plyr")
## [1] '1.8.4'
library(readr); packageVersion("readr")
## [1] '1.1.1'
library(scales); packageVersion("scales")
## [1] '0.5.0'
Set working directory where fastq files are located:
Path <- "/Users/EBS/Desktop/advance" # Alter filepath to wherever you've placed the files from the SRA
List files in that directory:
list.files(Path)
## [1] "DLA_17pre_F_cleaned.fastq" "DLA_17pre_R_cleaned.fastq"
## [3] "DLA_38post_F_cleaned.fastq" "DLA_38post_R_cleaned.fastq"
## [5] "DLA_8post_F_cleaned.fastq" "DLA_8post_R_cleaned.fastq"
## [7] "DLA_OBdonor_F_cleaned.fastq" "DLA_OBdonor_R_cleaned.fastq"
There should be eight files, four forward and four reverse. Sort forward (F) and reverse (R) files so they’re in the same order:
Fseqs <- sort(list.files(Path, pattern="_F_cleaned.fastq"))
Rseqs <- sort(list.files(Path, pattern="_R_cleaned.fastq"))
Fseqs
## [1] "DLA_17pre_F_cleaned.fastq" "DLA_38post_F_cleaned.fastq"
## [3] "DLA_8post_F_cleaned.fastq" "DLA_OBdonor_F_cleaned.fastq"
Rseqs
## [1] "DLA_17pre_R_cleaned.fastq" "DLA_38post_R_cleaned.fastq"
## [3] "DLA_8post_R_cleaned.fastq" "DLA_OBdonor_R_cleaned.fastq"
Extract sample names; filenames have format SAMPLENAME_F_cleaned.fastq (only need to do this step on Fs because names are the same for Rs):
samp.names <- sapply(strsplit(Fseqs, "_F_cleaned"), `[`, 1)
samp.names
## [1] "DLA_17pre" "DLA_38post" "DLA_8post" "DLA_OBdonor"
Specify full path to Fseqs and Rseqs:
Fseqs <- file.path(Path, Fseqs)
Rseqs <- file.path(Path, Rseqs)
Fseqs
## [1] "/Users/EBS/Desktop/advance/DLA_17pre_F_cleaned.fastq"
## [2] "/Users/EBS/Desktop/advance/DLA_38post_F_cleaned.fastq"
## [3] "/Users/EBS/Desktop/advance/DLA_8post_F_cleaned.fastq"
## [4] "/Users/EBS/Desktop/advance/DLA_OBdonor_F_cleaned.fastq"
Rseqs
## [1] "/Users/EBS/Desktop/advance/DLA_17pre_R_cleaned.fastq"
## [2] "/Users/EBS/Desktop/advance/DLA_38post_R_cleaned.fastq"
## [3] "/Users/EBS/Desktop/advance/DLA_8post_R_cleaned.fastq"
## [4] "/Users/EBS/Desktop/advance/DLA_OBdonor_R_cleaned.fastq"
Visualize forward read data quality:
plotQualityProfile(Fseqs[1:4])
We’ll trim forward reads at cycle 220 (so remove last 30 nucleotides). Same for reverse reads:
plotQualityProfile(Rseqs[1:4])
The quality for the reverse reads is much poorer, as expected, so these we’ll trim at cycle 160.
Filtered <- file.path(Path, "filtered")
Create new names for filtered files (paste “_F_filt.fastq.gz" on the ends – replace F with R for reverse files):
FilteredFseqs <- file.path(Filtered, paste0(samp.names, "_F_filt.fastq.gz"))
FilteredRseqs <- file.path(Filtered, paste0(samp.names, "_R_filt.fastq.gz"))
FilteredFseqs
## [1] "/Users/EBS/Desktop/advance/filtered/DLA_17pre_F_filt.fastq.gz"
## [2] "/Users/EBS/Desktop/advance/filtered/DLA_38post_F_filt.fastq.gz"
## [3] "/Users/EBS/Desktop/advance/filtered/DLA_8post_F_filt.fastq.gz"
## [4] "/Users/EBS/Desktop/advance/filtered/DLA_OBdonor_F_filt.fastq.gz"
FilteredRseqs
## [1] "/Users/EBS/Desktop/advance/filtered/DLA_17pre_R_filt.fastq.gz"
## [2] "/Users/EBS/Desktop/advance/filtered/DLA_38post_R_filt.fastq.gz"
## [3] "/Users/EBS/Desktop/advance/filtered/DLA_8post_R_filt.fastq.gz"
## [4] "/Users/EBS/Desktop/advance/filtered/DLA_OBdonor_R_filt.fastq.gz"
Filtering parameters: maxN=0; truncQ=2; rm.phix=TRUE; maxEE=2. This run contained 15% phiX, which at this point has not yet been removed from the sequences:
FiltOutput <- filterAndTrim(Fseqs, FilteredFseqs, Rseqs, FilteredRseqs, truncLen=c(220,160), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)
## Creating output directory: /Users/EBS/Desktop/advance/filtered
head(FiltOutput)
## reads.in reads.out
## DLA_17pre_F_cleaned.fastq 7142 2084
## DLA_38post_F_cleaned.fastq 26428 7541
## DLA_8post_F_cleaned.fastq 42583 12270
## DLA_OBdonor_F_cleaned.fastq 111343 36497
Visualize quality after trimming:
plotQualityProfile(FilteredFseqs[1:4])
plotQualityProfile(FilteredRseqs[1:4])
Learn error rates:
errFseqs <- learnErrors(FilteredFseqs, multithread=TRUE)
## Initializing error rates to maximum possible estimate.
## Sample 1 - 2084 reads in 580 unique sequences.
## Sample 2 - 7541 reads in 1787 unique sequences.
## Sample 3 - 12270 reads in 2575 unique sequences.
## Sample 4 - 36497 reads in 8865 unique sequences.
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## Convergence after 4 rounds.
## Total reads used: 58392
errRseqs <- learnErrors(FilteredRseqs, multithread=TRUE)
## Initializing error rates to maximum possible estimate.
## Sample 1 - 2084 reads in 441 unique sequences.
## Sample 2 - 7541 reads in 1194 unique sequences.
## Sample 3 - 12270 reads in 1869 unique sequences.
## Sample 4 - 36497 reads in 7724 unique sequences.
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## Convergence after 4 rounds.
## Total reads used: 58392
Visualize estimated error rates:
plotErrors(errFseqs, nominalQ=TRUE)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
plotErrors(errRseqs, nominalQ=TRUE)
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
Dereplicate:
derepFseqs <- derepFastq(FilteredFseqs, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_17pre_F_filt.fastq.gz
## Encountered 580 unique sequences from 2084 total sequences read.
## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_38post_F_filt.fastq.gz
## Encountered 1787 unique sequences from 7541 total sequences read.
## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_8post_F_filt.fastq.gz
## Encountered 2575 unique sequences from 12270 total sequences read.
## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_OBdonor_F_filt.fastq.gz
## Encountered 8865 unique sequences from 36497 total sequences read.
derepRseqs <- derepFastq(FilteredRseqs, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_17pre_R_filt.fastq.gz
## Encountered 441 unique sequences from 2084 total sequences read.
## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_38post_R_filt.fastq.gz
## Encountered 1194 unique sequences from 7541 total sequences read.
## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_8post_R_filt.fastq.gz
## Encountered 1869 unique sequences from 12270 total sequences read.
## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_OBdonor_R_filt.fastq.gz
## Encountered 7724 unique sequences from 36497 total sequences read.
Rename the dereplicated output with sample names:
names(derepFseqs) <- samp.names
names(derepRseqs) <- samp.names
Sample inference:
dadaFseqs <- dada(derepFseqs, err=errFseqs, multithread=TRUE)
## Sample 1 - 2084 reads in 580 unique sequences.
## Sample 2 - 7541 reads in 1787 unique sequences.
## Sample 3 - 12270 reads in 2575 unique sequences.
## Sample 4 - 36497 reads in 8865 unique sequences.
dadaRseqs <- dada(derepRseqs, err=errRseqs, multithread=TRUE)
## Sample 1 - 2084 reads in 441 unique sequences.
## Sample 2 - 7541 reads in 1194 unique sequences.
## Sample 3 - 12270 reads in 1869 unique sequences.
## Sample 4 - 36497 reads in 7724 unique sequences.
Merge pairs:
merged <- mergePairs(dadaFseqs, derepFseqs, dadaRseqs, derepRseqs, verbose=TRUE)
## 2027 paired-reads (in 11 unique pairings) successfully merged out of 2084 (in 23 pairings) input.
## 7348 paired-reads (in 38 unique pairings) successfully merged out of 7541 (in 91 pairings) input.
## 11987 paired-reads (in 47 unique pairings) successfully merged out of 12270 (in 144 pairings) input.
## 34113 paired-reads (in 179 unique pairings) successfully merged out of 36497 (in 1165 pairings) input.
Create sequence table:
sequence.table <- makeSequenceTable(merged)
## The sequences being tabled vary in length.
dim(sequence.table)
## [1] 4 233
table(nchar(getSequences(sequence.table)))
##
## 221 252 253 254 277
## 1 29 201 1 1
Remove chimeras:
sequence.table.nochim <- removeBimeraDenovo(sequence.table, method="consensus", multithread=TRUE, verbose=TRUE)
## Identified 85 bimeras out of 233 input sequences.
dim(sequence.table.nochim)
## [1] 4 148
Check reasonableness of loss in the chimera removal process:
sum(sequence.table.nochim)/sum(sequence.table)
## [1] 0.9422983
As a sanity check, track reads counts through pipeline:
getN2 <- function(x) sum(getUniques(x))
track.seqs <- cbind(FiltOutput, sapply(dadaFseqs, getN2), sapply(merged, getN2), rowSums(sequence.table), rowSums(sequence.table.nochim))
colnames(track.seqs) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
rownames(track.seqs) <- samp.names
track.seqs
## input filtered denoised merged tabled nonchim
## DLA_17pre 7142 2084 2084 2027 2027 2018
## DLA_38post 26428 7541 7541 7348 7348 6631
## DLA_8post 42583 12270 12270 11987 11987 11073
## DLA_OBdonor 111343 36497 36497 34113 34113 32552
Assign taxonomy using Silva databse v. 128 (training data available here: https://zenodo.org/record/824551#.WWE6MdPyvzU)
dlac.taxa <- assignTaxonomy(sequence.table.nochim, "~/Desktop/DLAC/silva_nr_v128_train_set.fa.gz", multithread=TRUE)
unname(head(dlac.taxa))
## [,1] [,2] [,3] [,4]
## [1,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales"
## [2,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales"
## [3,] "Bacteria" "Firmicutes" "Bacilli" "Lactobacillales"
## [4,] "Bacteria" "Actinobacteria" "Actinobacteria" "Bifidobacteriales"
## [5,] "Bacteria" "Firmicutes" "Clostridia" "Clostridiales"
## [6,] "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacteriales"
## [,5] [,6]
## [1,] "Lactobacillaceae" "Lactobacillus"
## [2,] "Lactobacillaceae" "Lactobacillus"
## [3,] "Lactobacillaceae" "Lactobacillus"
## [4,] "Bifidobacteriaceae" "Bifidobacterium"
## [5,] "Lachnospiraceae" "Blautia"
## [6,] "Enterobacteriaceae" "Escherichia/Shigella"
Add species assignments:
dlac.taxa.species <- addSpecies(dlac.taxa, "~/Desktop/DLAC/silva_species_assignment_v128.fa.gz", verbose=TRUE)
## 49 out of 148 were assigned to the species level.
## Of which 42 had genera consistent with the input table.
Import metadata:
dla_metadata <- read.table("~/Desktop/DLAC/dla_metadata.txt", header=TRUE, sep="\t", row.names=1)
Sample | Barcode | BarcodeSequence | SRA |
---|---|---|---|
DLA_17pre | 200 | ATTCTCTCACGT | SRR6467880 |
DLA_8post | 192 | GTCGAATTTGCG | SRR6467881 |
DLA_38post | 194 | GTGGTCATCGTA | SRR6467882 |
DLA_OBdonor | 195 | CTGAAGGGCGAA | SRR6467883 |
Create phyloseq object:
dlac.ps <- phyloseq(otu_table(sequence.table.nochim, taxa_are_rows=FALSE), sample_data(dla_metadata), tax_table(dlac.taxa.species))
dlac.ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 148 taxa and 4 samples ]
## sample_data() Sample Data: [ 4 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 148 taxa by 7 taxonomic ranks ]
Function to get decimal points in y-axis tick marks:
decimals <- function(x) sprintf("%.2f", x)
Rearrange order of samples to be chronological:
positions <- c("DLA_17pre", "DLA_8post", "DLA_38post", "DLA_OBdonor")
Plot Shannon diversity. This will throw an error due to the absence of singletons, but that isn’t actually too much of an issue when using Shannon diversity, in which low-abundance taxa impact the output less than more-abundant taxa. See here and Haegeman et al., 2013 (“Robust estimation of microbial diversity in theory and in practice”, The ISME Journal) for further discussion of this topic.
Fig1B.shannon <- plot_richness(dlac.ps, x="Identity", measures=c("Shannon")) + theme_bw() + theme(axis.title.x=element_blank(), plot.title=element_text(hjust=-0.07)) + ylab("Shannon index") + scale_x_discrete(labels=c("DLA_17pre"="2 weeks\npre-FMT", "DLA_8post"="1 week\npost-FMT", "DLA_38post"="5 weeks\npost-FMT", "DLA_OBdonor"="FMT\nDonor"), limits=positions) + scale_y_continuous(limits=c(0, 3.85), labels=decimals) + geom_vline(xintercept=1.5, col='red', lwd=0.5, linetype="dashed")
## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
Fig1B.shannon
This is our figure 1B.
dlac.genera.top30 <- names(sort(taxa_sums(dlac.ps), decreasing=TRUE))[1:30]
dlac.ps.top30 <- transform_sample_counts(dlac.ps, function(OTU) OTU/sum(OTU))
dlac.ps.top30 <- prune_taxa(dlac.genera.top30, dlac.ps.top30)
Create taxonomy bar chart using these top 30 genera (from DADA2 sequence variants):
dlac.genera <- plot_bar(dlac.ps.top30, x="Identity", fill="Genus") + theme_bw() + theme(plot.title=element_text(hjust=-0.12)) + scale_fill_manual(values=c("#FFC0CB","#FF1493","#C71585","#FFA07A","#FF4500","#FFA500","#FFD700","#FFFF00","#F0E68C","#9370DB","#9932CC","#E6E6FA","#ADFF2F","#9ACD32","#228B22","#008080","#40E0D0","#00BFFF","#483D8B","#0000CD","#191970","#C0C0C0","#696969","#2C3E50")) + scale_x_discrete(name="Sample", limits=positions) + geom_vline(xintercept=1.5, col='red', lwd=0.5, linetype="dashed")
dlac.genera
This graph is suboptimal. While it gives a decent overview of the genus-level taxonomic breakdown in these samples, we’re losing species-level information that we do have thanks to DADA2, as well as information about any sequence variants that could only be resolved to the family level (those are deposited into the white “NA” bin). We performed the steps below to get a non-default graph that is more clear and informative. The problem is that DADA2 and Silva give us a moderate amount of species-level taxonomic resolution for the sequence variants in this dataset, but they don’t achieve species-level resolution for all sequence variants (i.e. can only assign taxonomy to genus or family level). Attempting to graph only species-level taxonomy therefore creates a plot like the one above that has a handful of sequence variants identified to the species-level, and another handful that are depicted as “NA” (those that could only be resolved to genus or family level). We want to incorporate the species-level taxonomic classification that DADA2 affords us for some of these sequence variants, but also display as much taxonomic information as we can about other sequence variants that could not be identified all the way to the species level.
Melt top 30 ps object:
dla30 <- psmelt(dlac.ps.top30)
dim(dla30)
## [1] 120 14
We will add two new columns, which will eventually become the values we use for labeling our graph. This will be either:
Family name followed by an “(f)” for sequence variants that could only be resolved to family level (Family (f)),
Genus name followed by a “(g)” for sequence variants that could only be resolved to genus level (Genus (g)), or
Full binomial nomenclature for sequence variants that were resolved all the way to species level (i.e. Genus species).
All sequence variants in this top 30 dataset were resolved at least to family level, so we don’t have to perform this same task for any higher taxonomic ranks.
Add new columns:
dla30$Name1 <- NA
dla30$Name2 <- NA
dla30$Family<-as.character(dla30$Family)
dla30$Genus<-as.character(dla30$Genus)
dla30$Species<-as.character(dla30$Species)
str(dla30)
## 'data.frame': 120 obs. of 16 variables:
## $ OTU : chr "TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACT"| __truncated__ "TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTGATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACT"| __truncated__ "TACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTGCTTAGGTCTGATGTGAAAGCCTTCGGCTTAACCGAAGAAGTGCATCGGAAACC"| __truncated__ "TACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTGCTTAGGTCTGATGTGAAAGCCTTCGGCTTAACCGAAGAAGTGCATCGGAAACC"| __truncated__ ...
## $ Sample : chr "DLA_38post" "DLA_38post" "DLA_38post" "DLA_8post" ...
## $ Abundance : num 0.276 0.245 0.212 0.206 0.203 ...
## $ Identity : Factor w/ 4 levels "DLA_17pre","DLA_38post",..: 2 2 2 3 3 3 1 1 3 1 ...
## $ Barcode : int 194 194 194 192 192 192 200 200 192 200 ...
## $ BarcodeSequence: Factor w/ 4 levels "ATTCTCTCACGT",..: 4 4 4 3 3 3 1 1 3 1 ...
## $ SRA : Factor w/ 4 levels "SRR6467880","SRR6467881",..: 3 3 3 2 2 2 1 1 2 1 ...
## $ Kingdom : Factor w/ 1 level "Bacteria": 1 1 1 1 1 1 1 1 1 1 ...
## $ Phylum : Factor w/ 4 levels "Actinobacteria",..: 3 3 3 3 4 3 3 3 3 3 ...
## $ Class : Factor w/ 6 levels "Actinobacteria",..: 2 2 2 2 5 2 2 2 2 2 ...
## $ Order : Factor w/ 6 levels "Bacteroidales",..: 5 5 5 5 4 5 5 5 5 5 ...
## $ Family : chr "Lactobacillaceae" "Lactobacillaceae" "Lactobacillaceae" "Lactobacillaceae" ...
## $ Genus : chr "Lactobacillus" "Lactobacillus" "Lactobacillus" "Lactobacillus" ...
## $ Species : chr NA "kitasatonis" NA NA ...
## $ Name1 : logi NA NA NA NA NA NA ...
## $ Name2 : logi NA NA NA NA NA NA ...
If Genus = NA and Species = NA, then name = “Family (f)”:
dla30$Name1 <- ifelse(is.na(dla30$Genus) & is.na(dla30$Species), dla30$Family, dla30$Name1)
dla30$Name2 <- ifelse(is.na(dla30$Genus) & is.na(dla30$Species), "(f)", dla30$Name2)
If Genus exists and Species = NA, then name = “Genus (g)”:
dla30$Name1 <- ifelse(!is.na(dla30$Genus) & is.na(dla30$Species), dla30$Genus, dla30$Name1)
dla30$Name2 <- ifelse(!is.na(dla30$Genus) & is.na(dla30$Species), "(g)", dla30$Name2)
If Genus and Species exist, then name = “Genus species”:
dla30$Name1 <- ifelse(!is.na(dla30$Genus) & !is.na(dla30$Species), dla30$Genus, dla30$Name1)
dla30$Name2 <- ifelse(!is.na(dla30$Genus) & !is.na(dla30$Species), dla30$Species, dla30$Name2)
Combine Name1 and Name2 columns:
dla30$Combined <- paste(dla30$Name1, dla30$Name2, sep=' ')
count(dla30, "Combined")
## Combined freq
## 1 Alistipes putredinis 4
## 2 Anaerostipes hadrus 4
## 3 Bacteroides (g) 4
## 4 Bacteroides caccae 4
## 5 Bacteroides stercoris 4
## 6 Bacteroides vulgatus 4
## 7 Bifidobacterium (g) 4
## 8 Blautia (g) 8
## 9 Blautia stercoris 4
## 10 Dialister invisus 4
## 11 Dorea formicigenerans 4
## 12 Dorea longicatena 4
## 13 Escherichia/Shigella (g) 4
## 14 Faecalibacterium (g) 4
## 15 Faecalibacterium prausnitzii 8
## 16 Lachnospiraceae (f) 16
## 17 Lactobacillus (g) 8
## 18 Lactobacillus kitasatonis 4
## 19 Roseburia (g) 4
## 20 Ruminiclostridium_5 (g) 4
## 21 Ruminococcaceae (f) 4
## 22 Streptococcus (g) 4
## 23 Subdoligranulum (g) 4
## 24 Veillonella (g) 4
The fact that there are only 24 distinct names here even though we’re working with 30 sequence variants is acceptable. This happened because some sequence variants could only be resolved to family or genus level, and more than one sequence variant got the same not-to-species-level taxonomic assignment. For example, there are four distinct sequence variants in these top 30 that could only be resolved to “Lachnospiraceae (f)”, so now 16 things (4 sequence variants x 4 samples) have the name “Lachnospiraceae (f)”. We will lose some information about the number of sequence variants that certain taxonomic ranks represent when we graph this, but we decided this was the clearest way to display our results.
Graph Figure 1A:
Fig1A.taxonomy <- ggplot(dla30, aes(Identity, Abundance)) + geom_bar(stat="identity", width=0.6, aes(fill=Combined)) + theme_bw() + theme(legend.text=element_text(face="italic"), axis.title.x=element_blank()) + scale_x_discrete(labels=c("DLA_17pre"="2 weeks\npre-FMT", "DLA_8post"="1 week\npost-FMT", "DLA_38post"="5 weeks\npost-FMT", "DLA_OBdonor"="FMT\nDonor"), limits=positions) + geom_vline(xintercept=1.5, col='red', lwd=0.5, linetype="dashed") + ylab("Relative Abundance") + scale_fill_manual(values=c("#FFC0CB","#FF1493","#C71585","#FFA07A","#FF4500","#FFA500","#FFD700","#FFFF00","#F0E68C","#9370DB","#9932CC","#E6E6FA","#ADFF2F","#9ACD32","#228B22","#008080","#40E0D0","#00BFFF","#483D8B","#0000CD","#191970","#C0C0C0","#696969","#2C3E50"), name="Taxon") + guides(fill=guide_legend(ncol=1))
Fig1A.taxonomy
This is what we wanted. As noted above, we lost some detail since all sequence variants that could only be resolved to a family or genus level are now merged together (i.e. represented by one color), but we captured as much of the taxonomic assignments as we can. The bars don’t reach all the way to 1.00 because this only represents the top 30 most abundant taxa.
Combine the Shannon diversity plot and the taxonomy plot into a two-panel figure with cowplot and save:
two.panel<- plot_grid(Fig1A.taxonomy, Fig1B.shannon, labels=c("A","B"), ncol=1, align='v', axis='r', rel_heights=c(2.75,1))
two.panel
ggsave("DLA.Fig1.tiff", width=8, height=9, units="in", dpi=300)
plasma_metab_wks <- read_delim("~/Desktop/DLAC/plasma_metab_wks.txt", "\t", escape_double=FALSE, trim_ws=TRUE)
Episode | rel_wks | plasma_lactate |
---|---|---|
A | -23.6 | 4.06 |
B | -16.4 | 8.6 |
C | 2.4 | 1.13 |
Fig2a.wks <- ggplot(plasma_metab_wks, aes(Episode, plasma_lactate)) + geom_col(fill="#696969", width=0.6) + theme_bw() + ylab("Plasma D-lactate concentration (mmol/L)") + geom_vline(xintercept=2.5, col='red', lwd=0.5, linetype="dashed") + scale_x_discrete(labels=c("A"="Second D-LA\nhospitalization", "B"="Fifth D-LA\nhospitalization", "C"="2 weeks\npost-FMT")) + theme(axis.title.x=element_blank())
Fig2a.wks
stool_metab_wks <- read_delim("~/Desktop/DLAC/stool_metab_wks.txt", "\t", escape_double=FALSE)
Weeks | stool_lactate | isomer |
---|---|---|
-2.4 | 0.3938 | D |
1.1 | 1.2524 | D |
5.4 | 1.3675 | D |
Donor | 0.0788184 | D |
-2.4 | 0.1493 | L |
1.1 | 0.1018 | L |
5.4 | 0.0902 | L |
Donor | 0.0562368 | L |
col <- c("#696969", "#F5F5F5")
Fig2b.wks <- ggplot(stool_metab_wks, aes(x=Weeks, y=stool_lactate, fill=isomer)) + geom_bar(stat="identity", position="dodge", width=0.6, colour="#696969") + theme_bw() + theme(legend.justification=c(1.05, 1.05), legend.position=c(1,1)) + scale_fill_manual(values=col) + ylab("Fecal D- and L-lactic acid (g/L)") + guides(fill=guide_legend(title="Lactic acid\nenantiomer")) + scale_x_discrete(labels=c("-2.4"="2 weeks\npre-FMT", "1.1"="1 week\npost-FMT", "5.4"="5 weeks\npost-FMT", "Donor"="FMT Donor")) + geom_vline(xintercept=1.5, col='red', lwd=0.5, linetype="dashed") + theme(axis.title.x=element_blank())
Fig2b.wks
stool_pH_wks <- read_delim("~/Desktop/DLAC/stool_pH_wks.txt", "\t", escape_double=FALSE, trim_ws=TRUE)
wks | stool_pH |
---|---|
-2.4 | 6.125 |
1.1 | 5.75 |
5.4 | 4.5 |
Donor | 7.5 |
Function to get decimal points in y-axis tick marks:
decimals <- function(x) sprintf("%.1f", x)
Graph:
Fig2c.wks <- ggplot(stool_pH_wks, aes(wks, stool_pH)) + geom_bar(stat="identity", fill="#696969", width=0.6) + theme_bw() + ylab("Fecal pH") + scale_x_discrete(labels=c("-2.4"="2 weeks\npre-FMT", "1.1"="1 week\npost-FMT", "5.4"="5 weeks\npost-FMT", "Donor"="FMT Donor")) + geom_vline(xintercept=1.5, col='red', lwd=0.5, linetype="dashed") + scale_y_continuous(labels=decimals) + theme(axis.title.x=element_blank())
Fig2c.wks
Create nested multipanel figure using cowplot. We’ll keep the one plasma panel on its own (Figure 2A) and the two stool figures together, since they’re from the exact same samples (Figures 2B and 2C). Right column (stool figures):
right.col <- plot_grid(Fig2b.wks, Fig2c.wks, labels=c('B', 'C'), ncol=1, align='h', axis='r')
right.col
Full multipanel figure:
tri.panel<- plot_grid(Fig2a.wks, right.col, labels=c("A",""), ncol=2, align='v', axis='r', rel_widths=c(1, 1.3))
tri.panel
Save:
ggsave("DLA.Fig2.tiff", width=8, height=8, units="in", dpi=300)
Session info:
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_0.5.0 readr_1.1.1 plyr_1.8.4 cowplot_0.9.2
## [5] ggplot2_2.2.1 phyloseq_1.22.3 dada2_1.6.0 Rcpp_0.12.14
##
## loaded via a namespace (and not attached):
## [1] ape_5.0 lattice_0.20-35
## [3] Rsamtools_1.30.0 Biostrings_2.46.0
## [5] rprojroot_1.3-2 digest_0.6.14
## [7] foreach_1.4.4 R6_2.2.2
## [9] GenomeInfoDb_1.14.0 backports_1.1.2
## [11] ShortRead_1.36.0 stats4_3.4.3
## [13] evaluate_0.10.1 pillar_1.1.0
## [15] zlibbioc_1.24.0 rlang_0.1.6
## [17] lazyeval_0.2.1 data.table_1.10.4-3
## [19] vegan_2.4-5 S4Vectors_0.16.0
## [21] Matrix_1.2-12 rmarkdown_1.8
## [23] labeling_0.3 splines_3.4.3
## [25] BiocParallel_1.12.0 stringr_1.2.0
## [27] igraph_1.1.2 RCurl_1.95-4.10
## [29] munsell_0.4.3 DelayedArray_0.4.1
## [31] compiler_3.4.3 pkgconfig_2.0.1
## [33] BiocGenerics_0.24.0 multtest_2.34.0
## [35] mgcv_1.8-22 htmltools_0.3.6
## [37] biomformat_1.6.0 SummarizedExperiment_1.8.1
## [39] tibble_1.4.1 GenomeInfoDbData_1.0.0
## [41] IRanges_2.12.0 codetools_0.2-15
## [43] matrixStats_0.52.2 permute_0.9-4
## [45] GenomicAlignments_1.14.1 MASS_7.3-47
## [47] bitops_1.0-6 grid_3.4.3
## [49] nlme_3.1-131 jsonlite_1.5
## [51] gtable_0.2.0 magrittr_1.5
## [53] RcppParallel_4.3.20 stringi_1.1.6
## [55] XVector_0.18.0 hwriter_1.3.2
## [57] reshape2_1.4.3 latticeExtra_0.6-28
## [59] RColorBrewer_1.1-2 iterators_1.0.9
## [61] tools_3.4.3 ade4_1.7-10
## [63] Biobase_2.38.0 hms_0.4.0
## [65] parallel_3.4.3 survival_2.41-3
## [67] colorspace_1.3-2 rhdf5_2.22.0
## [69] cluster_2.0.6 GenomicRanges_1.30.1
## [71] knitr_1.18