FMT for D-lactic acidosis

Emily Bulik-Sullivan

March 05 2018

University of North Carolina at Chapel Hill

Bulik-Sullivan EC, Roy S, Elliott RJ, Kassam Z, Lichtman SN, Carroll IM, Gulati AS. Intestinal microbial and metabolic alterations following successful fecal microbiota transplant for D-lactic acidosis. 2018.

This is the analysis and graphics creation pipeline used in our article “Intestinal microbial and metabolic alterations following successful fecal microbiota transplant for D-lactic acidosis.” Our pipeline is based directly on the one developed by Callahan et al. in “DADA2: High resolution sample inference from Illumina amplicon data” (2016; PMID: 27214047).

We generated Illumina MiSeq 250 bp paired-end, single-indexed reads of the 16S rRNA gene V4 region (available from the National Center for Biotechnology Information’s Sequence Read Archive (SRA) under accession number SRP129013) from four fecal samples, three from the patient being treated and one from the OpenBiome FMT preparation. Before entering the pipeline described here, primer content was removed using cutadapt version 1.10 (Marcel Martin, DOI: http://dx.doi.org/10.14806/ej.17.1.200). These demultiplexed, primer-free files were cleaned of contaminating human sequences during the process of being uploaded to the SRA. We used the FASTQ files now downloadable from the SRA to generate Figures 1A and B in our manuscript. Figures 2A, B, and C were created with stool pH and stool/plasma lactic acid metabolite data, available as tables throughout this document.

Primer trimming with cutadapt (v. 1.10) – performed before uploading to SRA

Command to trim primers:

cutadapt -g GAGTGCCAGCMGCCGCGGTAA -G ACGGACTACDBGGGTWTCTAAT \
-o /F_output_fp/Forward_trimmed.fastq.gz \
-p /R_output_fp/Reverse_trimmed.fastq.gz \
/F_input_fp/Forward.fastq.gz \
/R_input_fp/Reverse.fastq.gz

Details of demultiplexed files:

Identity	Barcode	Original read count	Number of reads removed by SRA
DLA_17pre	ATTCTCTCACGT	7143	1
DLA_8post	GTCGAATTTGCG	42600	17
DLA_38post	GTGGTCATCGTA	26442	14
DLA_OBdonor	CTGAAGGGCGAA	111344	1

Load packages

library(dada2); packageVersion("dada2")

## [1] '1.6.0'

library(phyloseq); packageVersion("phyloseq")

## [1] '1.22.3'

library(ggplot2); packageVersion("ggplot2")

## [1] '2.2.1'

library(cowplot); packageVersion("cowplot")

## [1] '0.9.2'

library(plyr); packageVersion("plyr")

## [1] '1.8.4'

library(readr); packageVersion("readr")

## [1] '1.1.1'

library(scales); packageVersion("scales")

## [1] '0.5.0'

Preliminary file manipulation

Set working directory where fastq files are located:

Path <- "/Users/EBS/Desktop/advance" # Alter filepath to wherever you've placed the files from the SRA

List files in that directory:

list.files(Path)

## [1] "DLA_17pre_F_cleaned.fastq"   "DLA_17pre_R_cleaned.fastq"  
## [3] "DLA_38post_F_cleaned.fastq"  "DLA_38post_R_cleaned.fastq" 
## [5] "DLA_8post_F_cleaned.fastq"   "DLA_8post_R_cleaned.fastq"  
## [7] "DLA_OBdonor_F_cleaned.fastq" "DLA_OBdonor_R_cleaned.fastq"

There should be eight files, four forward and four reverse. Sort forward (F) and reverse (R) files so they’re in the same order:

Fseqs <- sort(list.files(Path, pattern="_F_cleaned.fastq"))
Rseqs <- sort(list.files(Path, pattern="_R_cleaned.fastq"))
Fseqs

## [1] "DLA_17pre_F_cleaned.fastq"   "DLA_38post_F_cleaned.fastq" 
## [3] "DLA_8post_F_cleaned.fastq"   "DLA_OBdonor_F_cleaned.fastq"

Rseqs

## [1] "DLA_17pre_R_cleaned.fastq"   "DLA_38post_R_cleaned.fastq" 
## [3] "DLA_8post_R_cleaned.fastq"   "DLA_OBdonor_R_cleaned.fastq"

Extract sample names; filenames have format SAMPLENAME_F_cleaned.fastq (only need to do this step on Fs because names are the same for Rs):

samp.names <- sapply(strsplit(Fseqs, "_F_cleaned"), `[`, 1)
samp.names

## [1] "DLA_17pre"   "DLA_38post"  "DLA_8post"   "DLA_OBdonor"

Specify full path to Fseqs and Rseqs:

Fseqs <- file.path(Path, Fseqs)
Rseqs <- file.path(Path, Rseqs)
Fseqs

## [1] "/Users/EBS/Desktop/advance/DLA_17pre_F_cleaned.fastq"  
## [2] "/Users/EBS/Desktop/advance/DLA_38post_F_cleaned.fastq" 
## [3] "/Users/EBS/Desktop/advance/DLA_8post_F_cleaned.fastq"  
## [4] "/Users/EBS/Desktop/advance/DLA_OBdonor_F_cleaned.fastq"

Rseqs

## [1] "/Users/EBS/Desktop/advance/DLA_17pre_R_cleaned.fastq"  
## [2] "/Users/EBS/Desktop/advance/DLA_38post_R_cleaned.fastq" 
## [3] "/Users/EBS/Desktop/advance/DLA_8post_R_cleaned.fastq"  
## [4] "/Users/EBS/Desktop/advance/DLA_OBdonor_R_cleaned.fastq"

Initial data quality assessment

Visualize forward read data quality:

plotQualityProfile(Fseqs[1:4])

We’ll trim forward reads at cycle 220 (so remove last 30 nucleotides). Same for reverse reads:

plotQualityProfile(Rseqs[1:4])

The quality for the reverse reads is much poorer, as expected, so these we’ll trim at cycle 160.

Filter, trim, and relocate data

Filtered <- file.path(Path, "filtered")

Create new names for filtered files (paste “_F_filt.fastq.gz" on the ends – replace F with R for reverse files):

FilteredFseqs <- file.path(Filtered, paste0(samp.names, "_F_filt.fastq.gz"))
FilteredRseqs <- file.path(Filtered, paste0(samp.names, "_R_filt.fastq.gz"))
FilteredFseqs

## [1] "/Users/EBS/Desktop/advance/filtered/DLA_17pre_F_filt.fastq.gz"  
## [2] "/Users/EBS/Desktop/advance/filtered/DLA_38post_F_filt.fastq.gz" 
## [3] "/Users/EBS/Desktop/advance/filtered/DLA_8post_F_filt.fastq.gz"  
## [4] "/Users/EBS/Desktop/advance/filtered/DLA_OBdonor_F_filt.fastq.gz"

FilteredRseqs

## [1] "/Users/EBS/Desktop/advance/filtered/DLA_17pre_R_filt.fastq.gz"  
## [2] "/Users/EBS/Desktop/advance/filtered/DLA_38post_R_filt.fastq.gz" 
## [3] "/Users/EBS/Desktop/advance/filtered/DLA_8post_R_filt.fastq.gz"  
## [4] "/Users/EBS/Desktop/advance/filtered/DLA_OBdonor_R_filt.fastq.gz"

Filtering parameters: maxN=0; truncQ=2; rm.phix=TRUE; maxEE=2. This run contained 15% phiX, which at this point has not yet been removed from the sequences:

FiltOutput <- filterAndTrim(Fseqs, FilteredFseqs, Rseqs, FilteredRseqs, truncLen=c(220,160), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

## Creating output directory: /Users/EBS/Desktop/advance/filtered

head(FiltOutput)

##                             reads.in reads.out
## DLA_17pre_F_cleaned.fastq       7142      2084
## DLA_38post_F_cleaned.fastq     26428      7541
## DLA_8post_F_cleaned.fastq      42583     12270
## DLA_OBdonor_F_cleaned.fastq   111343     36497

Visualize quality after trimming:

plotQualityProfile(FilteredFseqs[1:4])

plotQualityProfile(FilteredRseqs[1:4])

DADA2 pipeline

Learn error rates:

errFseqs <- learnErrors(FilteredFseqs, multithread=TRUE)

## Initializing error rates to maximum possible estimate.
## Sample 1 - 2084 reads in 580 unique sequences.
## Sample 2 - 7541 reads in 1787 unique sequences.
## Sample 3 - 12270 reads in 2575 unique sequences.
## Sample 4 - 36497 reads in 8865 unique sequences.
##    selfConsist step 2 
##    selfConsist step 3 
##    selfConsist step 4 
## Convergence after  4  rounds.
## Total reads used:  58392

errRseqs <- learnErrors(FilteredRseqs, multithread=TRUE)

## Initializing error rates to maximum possible estimate.
## Sample 1 - 2084 reads in 441 unique sequences.
## Sample 2 - 7541 reads in 1194 unique sequences.
## Sample 3 - 12270 reads in 1869 unique sequences.
## Sample 4 - 36497 reads in 7724 unique sequences.
##    selfConsist step 2 
##    selfConsist step 3 
##    selfConsist step 4 
## Convergence after  4  rounds.
## Total reads used:  58392

Visualize estimated error rates:

plotErrors(errFseqs, nominalQ=TRUE)

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis

plotErrors(errRseqs, nominalQ=TRUE)

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Transformation introduced infinite values in continuous y-axis

Dereplicate:

derepFseqs <- derepFastq(FilteredFseqs, verbose=TRUE)

## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_17pre_F_filt.fastq.gz

## Encountered 580 unique sequences from 2084 total sequences read.

## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_38post_F_filt.fastq.gz

## Encountered 1787 unique sequences from 7541 total sequences read.

## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_8post_F_filt.fastq.gz

## Encountered 2575 unique sequences from 12270 total sequences read.

## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_OBdonor_F_filt.fastq.gz

## Encountered 8865 unique sequences from 36497 total sequences read.

derepRseqs <- derepFastq(FilteredRseqs, verbose=TRUE)

## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_17pre_R_filt.fastq.gz

## Encountered 441 unique sequences from 2084 total sequences read.

## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_38post_R_filt.fastq.gz

## Encountered 1194 unique sequences from 7541 total sequences read.

## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_8post_R_filt.fastq.gz

## Encountered 1869 unique sequences from 12270 total sequences read.

## Dereplicating sequence entries in Fastq file: /Users/EBS/Desktop/advance/filtered/DLA_OBdonor_R_filt.fastq.gz

## Encountered 7724 unique sequences from 36497 total sequences read.

Rename the dereplicated output with sample names:

names(derepFseqs) <- samp.names
names(derepRseqs) <- samp.names

Sample inference:

dadaFseqs <- dada(derepFseqs, err=errFseqs, multithread=TRUE)

## Sample 1 - 2084 reads in 580 unique sequences.
## Sample 2 - 7541 reads in 1787 unique sequences.
## Sample 3 - 12270 reads in 2575 unique sequences.
## Sample 4 - 36497 reads in 8865 unique sequences.

dadaRseqs <- dada(derepRseqs, err=errRseqs, multithread=TRUE)

## Sample 1 - 2084 reads in 441 unique sequences.
## Sample 2 - 7541 reads in 1194 unique sequences.
## Sample 3 - 12270 reads in 1869 unique sequences.
## Sample 4 - 36497 reads in 7724 unique sequences.

Merge pairs:

merged <- mergePairs(dadaFseqs, derepFseqs, dadaRseqs, derepRseqs, verbose=TRUE)

## 2027 paired-reads (in 11 unique pairings) successfully merged out of 2084 (in 23 pairings) input.

## 7348 paired-reads (in 38 unique pairings) successfully merged out of 7541 (in 91 pairings) input.

## 11987 paired-reads (in 47 unique pairings) successfully merged out of 12270 (in 144 pairings) input.

## 34113 paired-reads (in 179 unique pairings) successfully merged out of 36497 (in 1165 pairings) input.

Create sequence table:

sequence.table <- makeSequenceTable(merged)

## The sequences being tabled vary in length.

dim(sequence.table)

## [1]   4 233

table(nchar(getSequences(sequence.table)))

## 
## 221 252 253 254 277 
##   1  29 201   1   1

Remove chimeras:

sequence.table.nochim <- removeBimeraDenovo(sequence.table, method="consensus", multithread=TRUE, verbose=TRUE)

## Identified 85 bimeras out of 233 input sequences.

dim(sequence.table.nochim)

## [1]   4 148

Check reasonableness of loss in the chimera removal process:

sum(sequence.table.nochim)/sum(sequence.table)

## [1] 0.9422983

As a sanity check, track reads counts through pipeline:

getN2 <- function(x) sum(getUniques(x))
track.seqs <- cbind(FiltOutput, sapply(dadaFseqs, getN2), sapply(merged, getN2), rowSums(sequence.table), rowSums(sequence.table.nochim))
colnames(track.seqs) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
rownames(track.seqs) <- samp.names
track.seqs

##              input filtered denoised merged tabled nonchim
## DLA_17pre     7142     2084     2084   2027   2027    2018
## DLA_38post   26428     7541     7541   7348   7348    6631
## DLA_8post    42583    12270    12270  11987  11987   11073
## DLA_OBdonor 111343    36497    36497  34113  34113   32552

Assign taxonomy using Silva databse v. 128 (training data available here: https://zenodo.org/record/824551#.WWE6MdPyvzU)

dlac.taxa <- assignTaxonomy(sequence.table.nochim, "~/Desktop/DLAC/silva_nr_v128_train_set.fa.gz", multithread=TRUE)
unname(head(dlac.taxa))

##      [,1]       [,2]             [,3]                  [,4]               
## [1,] "Bacteria" "Firmicutes"     "Bacilli"             "Lactobacillales"  
## [2,] "Bacteria" "Firmicutes"     "Bacilli"             "Lactobacillales"  
## [3,] "Bacteria" "Firmicutes"     "Bacilli"             "Lactobacillales"  
## [4,] "Bacteria" "Actinobacteria" "Actinobacteria"      "Bifidobacteriales"
## [5,] "Bacteria" "Firmicutes"     "Clostridia"          "Clostridiales"    
## [6,] "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacteriales"
##      [,5]                 [,6]                  
## [1,] "Lactobacillaceae"   "Lactobacillus"       
## [2,] "Lactobacillaceae"   "Lactobacillus"       
## [3,] "Lactobacillaceae"   "Lactobacillus"       
## [4,] "Bifidobacteriaceae" "Bifidobacterium"     
## [5,] "Lachnospiraceae"    "Blautia"             
## [6,] "Enterobacteriaceae" "Escherichia/Shigella"

Add species assignments:

dlac.taxa.species <- addSpecies(dlac.taxa, "~/Desktop/DLAC/silva_species_assignment_v128.fa.gz", verbose=TRUE)

## 49 out of 148 were assigned to the species level.
## Of which 42 had genera consistent with the input table.

Import metadata:

dla_metadata <- read.table("~/Desktop/DLAC/dla_metadata.txt", header=TRUE, sep="\t", row.names=1)

Sample	Barcode	BarcodeSequence	SRA
DLA_17pre	200	ATTCTCTCACGT	SRR6467880
DLA_8post	192	GTCGAATTTGCG	SRR6467881
DLA_38post	194	GTGGTCATCGTA	SRR6467882
DLA_OBdonor	195	CTGAAGGGCGAA	SRR6467883

Phyloseq

Create phyloseq object:

dlac.ps <- phyloseq(otu_table(sequence.table.nochim, taxa_are_rows=FALSE), sample_data(dla_metadata), tax_table(dlac.taxa.species))
dlac.ps

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 148 taxa and 4 samples ]
## sample_data() Sample Data:       [ 4 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 148 taxa by 7 taxonomic ranks ]

Figure 1: Diversity and taxonomy

Plot alpha diversity with phyloseq

Function to get decimal points in y-axis tick marks:

decimals <- function(x) sprintf("%.2f", x)

Rearrange order of samples to be chronological:

positions <- c("DLA_17pre", "DLA_8post", "DLA_38post", "DLA_OBdonor")

Plot Shannon diversity. This will throw an error due to the absence of singletons, but that isn’t actually too much of an issue when using Shannon diversity, in which low-abundance taxa impact the output less than more-abundant taxa. See here and Haegeman et al., 2013 (“Robust estimation of microbial diversity in theory and in practice”, The ISME Journal) for further discussion of this topic.

Fig1B.shannon <- plot_richness(dlac.ps, x="Identity", measures=c("Shannon")) + theme_bw() + theme(axis.title.x=element_blank(), plot.title=element_text(hjust=-0.07)) + ylab("Shannon index") + scale_x_discrete(labels=c("DLA_17pre"="2 weeks\npre-FMT", "DLA_8post"="1 week\npost-FMT", "DLA_38post"="5 weeks\npost-FMT", "DLA_OBdonor"="FMT\nDonor"), limits=positions) + scale_y_continuous(limits=c(0, 3.85), labels=decimals) + geom_vline(xintercept=1.5, col='red', lwd=0.5, linetype="dashed")

## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

Fig1B.shannon

This is our figure 1B.

Plot taxonomy bar chart – top 30 genera

dlac.genera.top30 <- names(sort(taxa_sums(dlac.ps), decreasing=TRUE))[1:30]
dlac.ps.top30 <- transform_sample_counts(dlac.ps, function(OTU) OTU/sum(OTU))
dlac.ps.top30 <- prune_taxa(dlac.genera.top30, dlac.ps.top30)

Create taxonomy bar chart using these top 30 genera (from DADA2 sequence variants):

dlac.genera <- plot_bar(dlac.ps.top30, x="Identity", fill="Genus") + theme_bw() + theme(plot.title=element_text(hjust=-0.12)) + scale_fill_manual(values=c("#FFC0CB","#FF1493","#C71585","#FFA07A","#FF4500","#FFA500","#FFD700","#FFFF00","#F0E68C","#9370DB","#9932CC","#E6E6FA","#ADFF2F","#9ACD32","#228B22","#008080","#40E0D0","#00BFFF","#483D8B","#0000CD","#191970","#C0C0C0","#696969","#2C3E50")) + scale_x_discrete(name="Sample", limits=positions) + geom_vline(xintercept=1.5, col='red', lwd=0.5, linetype="dashed")
dlac.genera

This graph is suboptimal. While it gives a decent overview of the genus-level taxonomic breakdown in these samples, we’re losing species-level information that we do have thanks to DADA2, as well as information about any sequence variants that could only be resolved to the family level (those are deposited into the white “NA” bin). We performed the steps below to get a non-default graph that is more clear and informative. The problem is that DADA2 and Silva give us a moderate amount of species-level taxonomic resolution for the sequence variants in this dataset, but they don’t achieve species-level resolution for all sequence variants (i.e. can only assign taxonomy to genus or family level). Attempting to graph only species-level taxonomy therefore creates a plot like the one above that has a handful of sequence variants identified to the species-level, and another handful that are depicted as “NA” (those that could only be resolved to genus or family level). We want to incorporate the species-level taxonomic classification that DADA2 affords us for some of these sequence variants, but also display as much taxonomic information as we can about other sequence variants that could not be identified all the way to the species level.

Melt top 30 ps object:

dla30 <- psmelt(dlac.ps.top30)
dim(dla30)

## [1] 120  14

We will add two new columns, which will eventually become the values we use for labeling our graph. This will be either:

Family name followed by an “(f)” for sequence variants that could only be resolved to family level (Family (f)),
Genus name followed by a “(g)” for sequence variants that could only be resolved to genus level (Genus (g)), or
Full binomial nomenclature for sequence variants that were resolved all the way to species level (i.e. Genus species).

All sequence variants in this top 30 dataset were resolved at least to family level, so we don’t have to perform this same task for any higher taxonomic ranks.

Add new columns:

dla30$Name1 <- NA
dla30$Name2 <- NA
dla30$Family<-as.character(dla30$Family)
dla30$Genus<-as.character(dla30$Genus)
dla30$Species<-as.character(dla30$Species)
str(dla30)

## 'data.frame':    120 obs. of  16 variables:
##  $ OTU            : chr  "TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTCAATAAGTCTGATGTGAAAGCCTTCGGCTCAACCGGAGAATTGCATCAGAAACT"| __truncated__ "TACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGAAAAATAAGTCTGATGTGAAAGCCCTCGGCTTAACCGAGGAACTGCATCGGAAACT"| __truncated__ "TACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTGCTTAGGTCTGATGTGAAAGCCTTCGGCTTAACCGAAGAAGTGCATCGGAAACC"| __truncated__ "TACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTGCTTAGGTCTGATGTGAAAGCCTTCGGCTTAACCGAAGAAGTGCATCGGAAACC"| __truncated__ ...
##  $ Sample         : chr  "DLA_38post" "DLA_38post" "DLA_38post" "DLA_8post" ...
##  $ Abundance      : num  0.276 0.245 0.212 0.206 0.203 ...
##  $ Identity       : Factor w/ 4 levels "DLA_17pre","DLA_38post",..: 2 2 2 3 3 3 1 1 3 1 ...
##  $ Barcode        : int  194 194 194 192 192 192 200 200 192 200 ...
##  $ BarcodeSequence: Factor w/ 4 levels "ATTCTCTCACGT",..: 4 4 4 3 3 3 1 1 3 1 ...
##  $ SRA            : Factor w/ 4 levels "SRR6467880","SRR6467881",..: 3 3 3 2 2 2 1 1 2 1 ...
##  $ Kingdom        : Factor w/ 1 level "Bacteria": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Phylum         : Factor w/ 4 levels "Actinobacteria",..: 3 3 3 3 4 3 3 3 3 3 ...
##  $ Class          : Factor w/ 6 levels "Actinobacteria",..: 2 2 2 2 5 2 2 2 2 2 ...
##  $ Order          : Factor w/ 6 levels "Bacteroidales",..: 5 5 5 5 4 5 5 5 5 5 ...
##  $ Family         : chr  "Lactobacillaceae" "Lactobacillaceae" "Lactobacillaceae" "Lactobacillaceae" ...
##  $ Genus          : chr  "Lactobacillus" "Lactobacillus" "Lactobacillus" "Lactobacillus" ...
##  $ Species        : chr  NA "kitasatonis" NA NA ...
##  $ Name1          : logi  NA NA NA NA NA NA ...
##  $ Name2          : logi  NA NA NA NA NA NA ...

If Genus = NA and Species = NA, then name = “Family (f)”:

dla30$Name1 <- ifelse(is.na(dla30$Genus) & is.na(dla30$Species), dla30$Family, dla30$Name1)
dla30$Name2 <- ifelse(is.na(dla30$Genus) & is.na(dla30$Species), "(f)", dla30$Name2)

If Genus exists and Species = NA, then name = “Genus (g)”:

dla30$Name1 <- ifelse(!is.na(dla30$Genus) & is.na(dla30$Species), dla30$Genus, dla30$Name1)
dla30$Name2 <- ifelse(!is.na(dla30$Genus) & is.na(dla30$Species), "(g)", dla30$Name2)

If Genus and Species exist, then name = “Genus species”:

dla30$Name1 <- ifelse(!is.na(dla30$Genus) & !is.na(dla30$Species), dla30$Genus, dla30$Name1)
dla30$Name2 <- ifelse(!is.na(dla30$Genus) & !is.na(dla30$Species), dla30$Species, dla30$Name2)

Combine Name1 and Name2 columns:

dla30$Combined <- paste(dla30$Name1, dla30$Name2, sep=' ')

count(dla30, "Combined")

##                        Combined freq
## 1          Alistipes putredinis    4
## 2           Anaerostipes hadrus    4
## 3               Bacteroides (g)    4
## 4            Bacteroides caccae    4
## 5         Bacteroides stercoris    4
## 6          Bacteroides vulgatus    4
## 7           Bifidobacterium (g)    4
## 8                   Blautia (g)    8
## 9             Blautia stercoris    4
## 10            Dialister invisus    4
## 11        Dorea formicigenerans    4
## 12            Dorea longicatena    4
## 13     Escherichia/Shigella (g)    4
## 14         Faecalibacterium (g)    4
## 15 Faecalibacterium prausnitzii    8
## 16          Lachnospiraceae (f)   16
## 17            Lactobacillus (g)    8
## 18    Lactobacillus kitasatonis    4
## 19                Roseburia (g)    4
## 20      Ruminiclostridium_5 (g)    4
## 21          Ruminococcaceae (f)    4
## 22            Streptococcus (g)    4
## 23          Subdoligranulum (g)    4
## 24              Veillonella (g)    4

The fact that there are only 24 distinct names here even though we’re working with 30 sequence variants is acceptable. This happened because some sequence variants could only be resolved to family or genus level, and more than one sequence variant got the same not-to-species-level taxonomic assignment. For example, there are four distinct sequence variants in these top 30 that could only be resolved to “Lachnospiraceae (f)”, so now 16 things (4 sequence variants x 4 samples) have the name “Lachnospiraceae (f)”. We will lose some information about the number of sequence variants that certain taxonomic ranks represent when we graph this, but we decided this was the clearest way to display our results.

Graph Figure 1A:

Fig1A.taxonomy <- ggplot(dla30, aes(Identity, Abundance)) + geom_bar(stat="identity", width=0.6, aes(fill=Combined)) + theme_bw() + theme(legend.text=element_text(face="italic"), axis.title.x=element_blank()) + scale_x_discrete(labels=c("DLA_17pre"="2 weeks\npre-FMT", "DLA_8post"="1 week\npost-FMT", "DLA_38post"="5 weeks\npost-FMT", "DLA_OBdonor"="FMT\nDonor"), limits=positions) + geom_vline(xintercept=1.5, col='red', lwd=0.5, linetype="dashed") + ylab("Relative Abundance") + scale_fill_manual(values=c("#FFC0CB","#FF1493","#C71585","#FFA07A","#FF4500","#FFA500","#FFD700","#FFFF00","#F0E68C","#9370DB","#9932CC","#E6E6FA","#ADFF2F","#9ACD32","#228B22","#008080","#40E0D0","#00BFFF","#483D8B","#0000CD","#191970","#C0C0C0","#696969","#2C3E50"), name="Taxon") + guides(fill=guide_legend(ncol=1))
Fig1A.taxonomy

This is what we wanted. As noted above, we lost some detail since all sequence variants that could only be resolved to a family or genus level are now merged together (i.e. represented by one color), but we captured as much of the taxonomic assignments as we can. The bars don’t reach all the way to 1.00 because this only represents the top 30 most abundant taxa.

Combine the Shannon diversity plot and the taxonomy plot into a two-panel figure with cowplot and save:

two.panel<- plot_grid(Fig1A.taxonomy, Fig1B.shannon, labels=c("A","B"), ncol=1, align='v', axis='r', rel_heights=c(2.75,1))
two.panel

ggsave("DLA.Fig1.tiff", width=8, height=9, units="in", dpi=300)

Figure 2: Lactic acid concentrations and pH

Plasma D-lactate measurements (Figure 2A):

plasma_metab_wks <- read_delim("~/Desktop/DLAC/plasma_metab_wks.txt", "\t", escape_double=FALSE, trim_ws=TRUE)

Episode	rel_wks	plasma_lactate
A	-23.6	4.06
B	-16.4	8.6
C	2.4	1.13

Fig2a.wks <- ggplot(plasma_metab_wks, aes(Episode, plasma_lactate)) + geom_col(fill="#696969", width=0.6) + theme_bw() + ylab("Plasma D-lactate concentration (mmol/L)") + geom_vline(xintercept=2.5, col='red', lwd=0.5, linetype="dashed") + scale_x_discrete(labels=c("A"="Second D-LA\nhospitalization", "B"="Fifth D-LA\nhospitalization", "C"="2 weeks\npost-FMT")) + theme(axis.title.x=element_blank())
Fig2a.wks

Fecal D- and L-lactic acid concentration (Figure 2B):

stool_metab_wks <- read_delim("~/Desktop/DLAC/stool_metab_wks.txt", "\t", escape_double=FALSE)

Weeks	stool_lactate	isomer
-2.4	0.3938	D
1.1	1.2524	D
5.4	1.3675	D
Donor	0.0788184	D
-2.4	0.1493	L
1.1	0.1018	L
5.4	0.0902	L
Donor	0.0562368	L

col <- c("#696969", "#F5F5F5")
Fig2b.wks <- ggplot(stool_metab_wks, aes(x=Weeks, y=stool_lactate, fill=isomer)) + geom_bar(stat="identity", position="dodge", width=0.6, colour="#696969") + theme_bw() + theme(legend.justification=c(1.05, 1.05), legend.position=c(1,1)) + scale_fill_manual(values=col) + ylab("Fecal D- and L-lactic acid (g/L)") + guides(fill=guide_legend(title="Lactic acid\nenantiomer")) + scale_x_discrete(labels=c("-2.4"="2 weeks\npre-FMT", "1.1"="1 week\npost-FMT", "5.4"="5 weeks\npost-FMT", "Donor"="FMT Donor")) + geom_vline(xintercept=1.5, col='red', lwd=0.5, linetype="dashed") + theme(axis.title.x=element_blank())
Fig2b.wks

Fecal pH (Figure 2C):

stool_pH_wks <- read_delim("~/Desktop/DLAC/stool_pH_wks.txt", "\t", escape_double=FALSE, trim_ws=TRUE)

wks	stool_pH
-2.4	6.125
1.1	5.75
5.4	4.5
Donor	7.5

Function to get decimal points in y-axis tick marks:

decimals <- function(x) sprintf("%.1f", x)

Graph:

Fig2c.wks <- ggplot(stool_pH_wks, aes(wks, stool_pH)) + geom_bar(stat="identity", fill="#696969", width=0.6) + theme_bw() + ylab("Fecal pH") + scale_x_discrete(labels=c("-2.4"="2 weeks\npre-FMT", "1.1"="1 week\npost-FMT", "5.4"="5 weeks\npost-FMT", "Donor"="FMT Donor")) + geom_vline(xintercept=1.5, col='red', lwd=0.5, linetype="dashed") + scale_y_continuous(labels=decimals) + theme(axis.title.x=element_blank())
Fig2c.wks

Create nested multipanel figure using cowplot. We’ll keep the one plasma panel on its own (Figure 2A) and the two stool figures together, since they’re from the exact same samples (Figures 2B and 2C). Right column (stool figures):

right.col <- plot_grid(Fig2b.wks, Fig2c.wks, labels=c('B', 'C'), ncol=1, align='h', axis='r')
right.col

Full multipanel figure:

tri.panel<- plot_grid(Fig2a.wks, right.col, labels=c("A",""), ncol=2, align='v', axis='r', rel_widths=c(1, 1.3))
tri.panel

Save:

ggsave("DLA.Fig2.tiff", width=8, height=8, units="in", dpi=300)

References:

Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods. 2016;13(7):581-3. doi: 10.1038/nmeth.3869
Haegeman B, Hamelin J, Moriarty J, Neal P, Dushoff J, Weitz JS. Robust estimation of microbial diversity in theory and practice. The ISME J. 2013:7(6):1092-101. doi: 10.1038/ismej.2013.10
Martin, Marcel. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17(1):10-12. doi:http://dx.doi.org/10.14806/ej.17.1.200.
McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One. 2013;8(4);e61217. doi: 10.1371/journal.pone.0061217

Session info:

sessionInfo()

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] scales_0.5.0    readr_1.1.1     plyr_1.8.4      cowplot_0.9.2  
## [5] ggplot2_2.2.1   phyloseq_1.22.3 dada2_1.6.0     Rcpp_0.12.14   
## 
## loaded via a namespace (and not attached):
##  [1] ape_5.0                    lattice_0.20-35           
##  [3] Rsamtools_1.30.0           Biostrings_2.46.0         
##  [5] rprojroot_1.3-2            digest_0.6.14             
##  [7] foreach_1.4.4              R6_2.2.2                  
##  [9] GenomeInfoDb_1.14.0        backports_1.1.2           
## [11] ShortRead_1.36.0           stats4_3.4.3              
## [13] evaluate_0.10.1            pillar_1.1.0              
## [15] zlibbioc_1.24.0            rlang_0.1.6               
## [17] lazyeval_0.2.1             data.table_1.10.4-3       
## [19] vegan_2.4-5                S4Vectors_0.16.0          
## [21] Matrix_1.2-12              rmarkdown_1.8             
## [23] labeling_0.3               splines_3.4.3             
## [25] BiocParallel_1.12.0        stringr_1.2.0             
## [27] igraph_1.1.2               RCurl_1.95-4.10           
## [29] munsell_0.4.3              DelayedArray_0.4.1        
## [31] compiler_3.4.3             pkgconfig_2.0.1           
## [33] BiocGenerics_0.24.0        multtest_2.34.0           
## [35] mgcv_1.8-22                htmltools_0.3.6           
## [37] biomformat_1.6.0           SummarizedExperiment_1.8.1
## [39] tibble_1.4.1               GenomeInfoDbData_1.0.0    
## [41] IRanges_2.12.0             codetools_0.2-15          
## [43] matrixStats_0.52.2         permute_0.9-4             
## [45] GenomicAlignments_1.14.1   MASS_7.3-47               
## [47] bitops_1.0-6               grid_3.4.3                
## [49] nlme_3.1-131               jsonlite_1.5              
## [51] gtable_0.2.0               magrittr_1.5              
## [53] RcppParallel_4.3.20        stringi_1.1.6             
## [55] XVector_0.18.0             hwriter_1.3.2             
## [57] reshape2_1.4.3             latticeExtra_0.6-28       
## [59] RColorBrewer_1.1-2         iterators_1.0.9           
## [61] tools_3.4.3                ade4_1.7-10               
## [63] Biobase_2.38.0             hms_0.4.0                 
## [65] parallel_3.4.3             survival_2.41-3           
## [67] colorspace_1.3-2           rhdf5_2.22.0              
## [69] cluster_2.0.6              GenomicRanges_1.30.1      
## [71] knitr_1.18