(Refer back to the RNA-seq data analysis lesson).

## Recount

Take a look at the preprint: Collado-Torres, Leonardo, et al. “recount: A large-scale resource of analysis-ready RNA-seq expression data.” bioRxiv (2016): 068478. It’s available at http://biorxiv.org/content/early/2016/08/08/068478.

Here’s the abstract:

recount is a resource of processed and summarized expression data spanning nearly 60,000 human RNA-seq samples from the Sequence Read Archive (SRA). The associated recount Bioconductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of meta/phenotype data, the expression levels of genes and their underlying exons and splice junctions, and corresponding genomic annotation. We also provide data summarization types for quantifying novel transcribed sequence including base-resolution coverage and potentially unannotated splice junctions. We present workflows illustrating how to use recount to perform differential expression analysis including meta-analysis, annotation-free base-level analysis, and replication of smaller studies using data from larger studies. recount provides a valuable and user-friendly resource of processed RNA-seq datasets to draw additional biological insights from existing public data. The resource is available at https://jhubiostatistics.shinyapps.io/recount/.

This preprint describes an incredible resource. The authors here have downloaded and pre-processed several thousand RNA-seq studies from the raw data available in the SRA. They make this data available through a web application1 at where you can browse, search, and sort the studies they’ve processed, and it gives you direct links to download analysis-ready pre-processed count data and metadata. As the abstract mentions, you can access recount at jhubiostatistics.shinyapps.io/recount/.

## Get some data

In the accession search box, search for SRP026387. This has some pre-processed data from a study looking at the Wnt/Beta-catenin pathway and how that’s affected by androgen deprivation therapy in treating prostate cancer. You can learn more about the data and the study by clicking the accession number hyperlink.

Androgen ablation therapy (AAT) is standard treatment for locally-advanced/metastatic prostate cancer (PCa). Many patients develop castration-resistance (CRPCa) after ~2-3 years, with a poor prognosis. The molecular mechanisms underlying CRPCa progression are unclear. mRNA-Seq was performed on tumours from 7 patients with locally-advanced/metastatic PCa before and ~22 weeks after AAT initiation. Differentially regulated genes were identified in treatment pairs. Overall design: Tumour biopsies from 7 patients were taken before and after AAT treatment.

You could get the gene-level counts and metadata by clicking the appropriate links. I downloaded the data and and cleaned up the metadata and gene identifiers a bit, and made that data available on the Data page. The files are called SRP026387_scaledcounts.csv and SRP026387_metadata.csv.

Go to the Data page, download these data files, and load them into R. As we did in the original lesson) under the importing data heading, you can load the readr and dplyr, and libraries and load them with read_csv(), but before you can import them into a DESeqDataSet you’ll have to convert them back to a regular data frame. You could alternatively use base R’s read.csv() function and not have to worry about it.

# Load the count data and metadata. Here, the "data" folder
# is relative to my current working directory.
metadata <- read.csv("data/SRP026387_metadata.csv")

Take a look at the metadata:

metadata
##           id replicate prepost
## 1  SRR923920        R1     Pre
## 2  SRR923921        R2     Pre
## 3  SRR923923        R4     Pre
## 4  SRR923924        R5     Pre
## 5  SRR923925        R6     Pre
## 6  SRR923927        R1    Post
## 7  SRR923926        R7     Pre
## 8  SRR923929        R3    Post
## 9  SRR923928        R2    Post
## 10 SRR923930        R4    Post
## 11 SRR923931        R5    Post
## 12 SRR923933        R7    Post
## 13 SRR923932        R6    Post

And the first few rows of the count data:

head(mycounts)
##           ensgene SRR923920 SRR923921 SRR923923 SRR923924 SRR923925
## 1 ENSG00000000003       159      1508      2062      1369      1625
## 2 ENSG00000000005         0         0        17         7         0
## 3 ENSG00000000419       416       312       505       404       609
## 4 ENSG00000000457       529       507       718       851       687
## 5 ENSG00000000460       214       197       330       279       249
## 6 ENSG00000000938       170       278       100       142       174
##   SRR923927 SRR923926 SRR923929 SRR923928 SRR923930 SRR923931 SRR923933
## 1       401      2024      1075       719      1104      1441      1125
## 2         0         0         7         8         0         6         6
## 3       376       754       387       324       454       652       640
## 4       722       599       705       689       855      1016       883
## 5       213       217       281       220       184       379       383
## 6       537       214       206       255       110       253       147
##   SRR923932
## 1      1376
## 2         2
## 3       620
## 4       671
## 5       269
## 6       185

Finally, load the annotation data (best to do this with dplyr’s read_csv() because it’s pretty big).

library(readr)
library(dplyr)
anno
## # A tibble: 66,531 x 9
##            ensgene entrez   symbol   chr     start       end strand
##              <chr>  <int>    <chr> <chr>     <int>     <int>  <int>
##  1 ENSG00000000003   7105   TSPAN6     X 100627109 100639991     -1
##  2 ENSG00000000005  64102     TNMD     X 100584802 100599885      1
##  3 ENSG00000000419   8813     DPM1    20  50934867  50958555     -1
##  4 ENSG00000000457  57147    SCYL3     1 169849631 169894267     -1
##  5 ENSG00000000460  55732 C1orf112     1 169662007 169854080      1
##  6 ENSG00000000938   2268      FGR     1  27612064  27635277     -1
##  7 ENSG00000000971   3075      CFH     1 196651878 196747504      1
##  8 ENSG00000001036   2519    FUCA2     6 143494811 143511690     -1
##  9 ENSG00000001084   2729     GCLC     6  53497341  53616970     -1
## 10 ENSG00000001167   4800     NFYA     6  41072945  41099976      1
## # ... with 66,521 more rows, and 2 more variables: biotype <chr>,
## #   description <chr>

## What genes are DE?

1. After you’ve loaded the data, load the DESeq2 library and import your data into a DESeqDataSet using the tidy=TRUE argument, with the design formula being the prepost variable from the metadata.
2. Run the DESeq pipeline.
3. Get the results for the pre vs post analysis.2
4. Arrange these results by p-value, and join the results to the annotation data (hint: %>% inner_join(anno, by=c("row"="ensgene"))).
5. Print out the top 10 differentially expressed genes, their fold changes, p-values, adjusted p-values, annotated with the gene symbol and the description from the annotation data.3

Here’s what your results should look like. I’m showing actual values for the first few genes so you can check to see if your results are close. Don’t worry about rounding errors.

## # A tibble: 10 x 6
##                row log2FoldChange   pvalue     padj     symbol description
##              <chr>          <chr>    <chr>    <chr>      <chr>       <chr>
##  1 ENSG00000151503           3.46 3.78e-37 1.03e-32     NCAPD3           ?
##  2 ENSG00000249599           3.62 1.41e-29 1.93e-25 BMPR1B-AS1           ?
##  3 ENSG00000228278           7.48 4.04e-28 3.68e-24       ORM2           ?
##  4               ?              ?        ?        ?          ?           ?
##  5               ?              ?        ?        ?          ?           ?
##  6               ?              ?        ?        ?          ?           ?
##  7               ?              ?        ?        ?          ?           ?
##  8               ?              ?        ?        ?          ?           ?
##  9               ?              ?        ?        ?          ?           ?
## 10               ?              ?        ?        ?          ?           ?

1. Incidentally, the web application itself was built with R using the Shiny framework (https://shiny.rstudio.com/). Pete teaches a workshop on this!

2. Note that without further specifying to the results() function, the default is to treat the alphabetically first condition as the control, and the alphabetically second condition as the experimental group. Therefore, the differential expression values will be comparing “Pre” versus “Post”, since “Post” comes first.

3. Note that you may need to explicitly call select() from the dplyr library using dplyr::select(), because loading DESeq2 will mask dplyr’s select if you loaded dplyr first.