General Information

Lead Instructor: Stephen Turner, PhD

Co-instructor: VP Nagraj

Spring 2017 Module S1
Feb 13 - Mar 27, 2017
2:00pm - 5:00pm

Where: BIMS Education Center (McKim Hall)

Course Schedule

Week 1: Intro to R

This novice-level introduction is directed toward life scientists with little to no experience with statistical computing or bioinformatics. This interactive session will introduce the R statistical computing environment. The first part of this workshop will demonstrate very basic functionality in R, including functions, vectors, creating variables, getting help, filtering, data frames, plotting, and reading/writing files.

Week 2: Advanced Data Manipulation with R

Data analysis involves a large amount of janitor work – munging and cleaning data to facilitate downstream data analysis. This session assumes a basic familiarity with R and covers tools and techniques for advanced data manipulation. It will cover data cleaning and “tidy data,” and will introduce R packages that enable data manipulation, analysis, and visualization using split-apply-combine strategies. Upon completing this lesson, students will be able to use the dplyr package in R to effectively manipulate and conditionally compute summary statistics over subsets of a “big” dataset containing many observations.

Week 3: Advanced Data Visualization with R and ggplot2

This session will cover fundamental concepts for creating effective data visualization and will introduce tools and techniques for visualizing large, high-dimensional data using R. We will review fundamental concepts for visually displaying quantitative information, such as using series of small multiples, avoiding “chart-junk,” and maximizing the data-ink ratio. After briefly covering data visualization using base R graphics, we will introduce the ggplot2 package for advanced high-dimensional visualization. We will cover the grammar of graphics (geoms, aesthetics, stats, and faceting), and using ggplot2 to create plots layer-by-layer. Upon completing this lesson, students will be able to use R to explore a high-dimensional dataset by faceting and scaling arbitrarily complex plots in small multiples.

Week 4: Reproducible Research & Dynamic Documents

Contemporary life sciences research is plagued by reproducibility issues. This session covers some of the barriers to reproducible research and how to start to address some of those problems during the data management and analysis phases of the research life cycle. In this session we will cover using R and dynamic document generation with RMarkdown and RStudio to weave together reporting text with executable R code to automatically generate reports in the form of PDF, Word, or HTML documents.

Week 5: Essential Statistics

This session will provide hands-on instruction and exercises covering basic statistical analysis in R. This will cover descriptive statistics, t-tests, linear models, chi-square, clustering, dimensionality reduction, and resampling strategies. We will also cover methods for “tidying” model results for downstream visualization and summarization.

Week 6: Survival Analysis

This session will provide hands-on instruction and exercises covering survival analysis using R. The data for parts of this session will come from The Cancer Genome Atlas (TCGA), where we will also cover programmatic access to TCGA through Bioconductor.

Week 7: Introduction to RNA-seq Data Analysis

This session focuses on analyzing real data from a biological application: analyzing RNA-seq data for differentially expressed genes. It provides an introduction to RNA-seq data analysis, which involves reading in count data from an RNA-seq experiment, exploring the data using base R functions, and then analyzing it with the DESeq2 Bioconductor package. The session will conclude with downstream pathway analysis and exploring the biological and functional context of the results.


What’s this class all about?

This class introduces methods, tools, and software for reproducibly managing, manipulating, analyzing, and visualizing large-scale biomedical data. Specifically, the course introduces the R statistical computing environment and packages for manipulating and visualizing high-dimensional data, covers strategies for reproducible research, essential statistics and survival analysis, and culminates with analysis of data from a real RNA-seq experiment using R and Bioconductor packages.

What are the pre-requisites?

There are none! This class doesn’t assume any knowledge of programming or using a command-line interface, but if you’ve ever had any experience here, the content won’t come as so much of a shock. But don’t panic. Command-line interfaces and programming languages like R are incredibly powerful and will be utterly transformative for your research. There’s a learning curve, and it’s near-vertical in the beginning, but it’s surmountable and the payoff is worth it! Some general knowledge of statistics and study design is helpful, but isn’t strictly required.

Can I audit?

Yes! However, you will be expected to attend every class meeting, participate in coding exercises during class, and complete any and all assignments, just as if you were taking the course for credit.

Please email Stephen Turner if you’d like to audit. Instructions for signing up to audit will be forthcoming.

Where do I get additional help?

Glad you asked! See here.

Do I need a laptop?

YES. You must have access to a computer on which you can install software. The class will include some lecture and discussion, but will be primarily live coding. You must bring your laptop to every class meeting. Bring your charging cable also.

Software requirements

All the software we’re using in class is open-source and freely available online. This setup must be completed prior to class, as we will not have time for troubleshooting software installation issues during class. See the setup instructions, and follow all instructions under the major headings for:

You’ll need to download all the data. As described on the setup page, navigate to the data page and download all the relevant datasets, saving them to a folder that’s easy to find.