There are a number of JavaScript libraries available that leverage web functionality to produce compelling, dynamic data visualizations. This workshop will introduce several of these tools, including D3, HighCharts and Leaflet, and will demonstrate how to implement them in the context of the R programming language. No JavaScript experience is required. Some familiarity with R will be helpful.

Make sure you complete the setup here prior to the class.

Introduction

This tutorial will introduce several JavaScript libraries that are capable of producing interactive data visualizations. The goal is to demonstrate how to implement some of these tools in the context of the R programming language.

R

R is an open-source, statistical computing language. One of its strengths is its ability to produce clean, high dimensional data visualizations. The development of ggplot2, which is among the language’s most downloaded add-on packages, has strengthened R’s position as a “gold-standard” data visualization tool1.

JavaScript

JavaScript is a programming language that is widely used in web development. It may be worth stopping and pointing out that (despite its name) JavaScript bears no relation to Java2. Java is a compiled langauge, while JavaScript (like R) is a scripting language that can be saved as a text file and run via an interpreter. Its name (among other features …) has been a source to confusion about what JavaScript is and how to use it3. For our purposes, we will focus on JavaScript as means for data visualization. There are number of libraries that can produce powerful, interactive plots. Below are a few examples of available frameworks:

  • d3.js
  • c3.js
  • vis.js
  • plotly.js
  • sigma.js
  • dygraphs.js

htmlwidgets

R and JavaScript are both capable of data preparation and visualization on their own. But together they can make an efficient and impressive pipeline. The principal bridge between these two tools is htmlwidgets4. This package provides scaffolding to build connectors (or “bindings”) from JavaScript to R.

The workflow is as follows:

  1. Data are read into R
  2. Data are processed (and potentially manipulated) by R
  3. Data are converted to JavaScript Object Notation (JSON) format
  4. Data are bound to JavaScript
  5. Data are processed (and potentially manipulated) by JavaScript
  6. Data are mapped to plotting features and rendered

nb Many JavaScript / R bindings developed with htmlwidgets support piping with magrittr, and the examples that follow will make extensive use of the %>% syntax. For more information on the pipe, refer to the dplyr lesson material.

library(magrittr)

Examples

d3heatmap

D3 (Data Driven Documents) is one of the most well-known JavaScript visualization libraries. In use since 2011, D3 is a staple of many interactive graphs featured on media outlets like the New York Times. D3 can produce everything from choropleths5 to scatter plots to dygraphs to network visualizations6 … and beyond. So far, individual D3 plotting methods have been distributed across multiple R packages. For this tutorial we’ll focus on single D3 feature: interactive heatmapping7.

The data of interest below is the gapminder data set, which includes life expectancy, GDP per capita, and population, every five years, from 1952 to 2007 for 142 countries as a data frame8. One question that this data could address is the “similarity” of countries. By aggregating on country, one calculate historical summary statistics about each nation and then visualize these side-by-side. But since there are 142 countries, this may be easier to visualize on a subset. For our purposes, we’ll look at current UN Security Council member nations9 and their similarity based on life expectancy and GDP per capita. The code below reads in the raw data, identifies security council countries, filters the data to only include these countries, aggregates and calculates the mean and standard deviation of each variable of interest.

nb Russia is a member of the UN Security Council but does not appear in the gapminder data set

library(d3heatmap)
library(dplyr)
library(readr)

# read data from github repository using readr read_csv() function
gapminder <- read_csv("https://raw.githubusercontent.com/bioconnector/workshops/master/data/gapminder.csv")

# create a list of countries to filter *in*
countrylist <- c("China","France","United Kingdom", "United States", "Egypt", "Japan", "Senegal", "Ukraine", "Uruguay", "Venezuela", "Spain", "Malaysia", "New Zealand", "Angola")

# subset the original data and calculate summary statistics of interest
gmun <- 
    gapminder %>%
    filter(country %in% countrylist) %>%
    group_by(country) %>%
    summarise(
        meanlifexp = mean(lifeExp),
        sdlifeexp = sd(lifeExp),
        meangdppercap = mean(gdpPercap),
        sdgdppercap = sd(gdpPercap)) 
gmun
country meanlifexp sdlifeexp meangdppercap sdgdppercap
Angola 37.9 4.00 3607 1166
China 61.8 10.25 1488 1371
Egypt 56.2 10.06 3074 1405
France 74.3 4.30 18834 7903
Japan 74.8 6.50 17751 10132
Malaysia 64.3 8.61 5406 3750
New Zealand 74.0 3.56 17263 4409
Senegal 50.6 9.14 1533 105
Spain 74.2 5.16 14030 8047
United Kingdom 73.9 3.38 19380 7388
United States 73.5 3.34 26261 9695
Uruguay 70.8 3.34 7100 1612
Venezuela 66.6 6.11 10089 1476

d3heatmap() is essentially an interactive extension of base R’s heatmap(), and also accepts a matrix as its primary argument. In this case we’ll use the dist() function to create distance matrix between each country using the continuous variables calculated above.

gmdistmat <- dist(select(gmun, -country))

With the matrix prepared, we can create the heatmap. The first argument (x) is the object to be plotted. The appearance can be further customized with arguments like labRow (row labels), labCol (column labels), dendrogram (clustering visualization), etc.

The final product will include interactive elements like popus on hover and click-drag ‘brushing’ (click and drag to zoom). Some of these behaviors can be customized with further arguments to d3heatmap().

d3heatmap(
    # calculate distance matrix
    gmdistmat,
    # name the rows (top to bottom)
    labRow = gmun$country, 
    # name the columns (left to right)
    labCol = gmun$country, 
    # turn off dendrogram
    dendrogram = 'none')

Leaflet

Leaflet is an open-source JavaScript library that is widely used in interactive GIS (mapping) applications. The R package supporting Leaflet is written and maintained by RStudio, and ports over the library’s (very extensive) list of features10. Leaflet enables developers to create maps that include markers (i.e. data points), popup text, custom zoom levels, a variety of tiles (i.e. types of maps), polygons (i.e. custom boundaries), panning, and more.

The example below will use a list of academic research institutions in the United States. This data set has been previously geocoded with the ggmap package to calculate latitude and longitude.

The code below reads in the data and makes sure we will we be using fully geocoded data, without any NAs in latitude or longitude fields.

# read in institution data from github using readr read_csv() function
inst <- read_csv("https://raw.githubusercontent.com/vpnagraj/grupo/master/institutions.csv")

# remove missing data
inst <- inst[complete.cases(inst),]

The first step is to invoke a “base map” with the leaflet() function. As mentioned above, this package uses the magrittr piping, which allows you to chain together this base invocation with a command to add “tiles” to the map. Leaflet tiles are essentially specifications for the map “style” to be used. Below is the default base map.

library(leaflet)
leaflet() %>%
addTiles()

In order to customize the appearance of this map, we can add special “provider” tiles11. These are third-party map styles that have been made available to use via Leaflet. For example, you could use a National Geographic style map like the one below:

leaflet() %>%
addProviderTiles("Esri.NatGeoWorldMap")

Below is an example of the black-and-white verison of the OpenStreetMap provider tile, along with a new feature: markers.

leaflet() %>%
addProviderTiles("OpenStreetMap.BlackAndWhite") %>%
addMarkers(lng=inst$Longitude, lat=inst$Latitude)

Markers essentially represent points on the map, and must be applied via latitude and longitude coordinates. In this example, each set of coordinates maps the location of a major American research university.

The shape and style of markers can be further customized. addCircleMarkers() will use circles to represent points on the map, rather than the default icons. Within this function, one can specify other style options, like color, as well as data to use in the visualization as a “popup” text. The map below includes all institutions as red circles, with the names and city/state locations encoded in the popup text, which appears on click.

leaflet() %>%
addProviderTiles("OpenStreetMap.BlackAndWhite") %>%
addCircleMarkers(lng=inst$Longitude, lat=inst$Latitude, popup = paste0(inst$Institution, "<br>", inst$Location), col = "firebrick")

The example above is relatively simple, but it demonstrates the main workflow for creating an interactive Leaflet map in R. It’s worth noting a few other features:

  • setView(): focus the map on a specific point or zoom level
  • addPolygons: add custom boundaries onto the map as specified by something like a shapefile … this is necessary for making choropleth maps
  • addLegend(): add and customize a legend for the map

visNetwork

Network visualization refers to the graphical representation of a collection of nodes (entities) and edges (connections between those entities). There are a number of JavaScript libraries that are capable of producing these kinds of plots, and vis.js12 is one of the most robust. The associated R package for this library is visNetwork13.

Data for network plots must be prepared with nodes and edges in mind. visNetwork in particular requires that these two should be separted into distinct data frames: one for the nodes (with an id column as well as any relevant attributes) and one for the edges (with a from and to as well as any associated attributes).

For the example that follows we’ll use data describing the flow of patients from an outpatient orthopaedic clinic visit to surgery14. This dataset is available on Github, and must be converted from JSON format to a data frame object using the fromJSON() function from the jsonlite package.

library(visNetwork)
library(jsonlite)

patients <- fromJSON("https://raw.githubusercontent.com/micahstubbs/sankey-datasets/master/patient-flow-bos-2016/graph.json")

patientnodes <- patients$nodes
patientlinks <- patients$links

names(patientnodes) <- c("label","id")
names(patientlinks) <- c("from", "to", "value")
patientnodes
label id
All referred patients 0
First consult outpatient clinic 1
No OR-receipt 2
OR-receipt 3
No surgery 4
Start surgery 5
Emergency 6
No emergency 7
patientlinks
from to value
0 1 1.00
1 2 0.64
1 3 0.36
3 4 0.33
3 5 0.67
5 6 0.16
5 7 0.84

The first invocation of the network visualization will simply call the node and edge data frames with visNetwork(). Because the columns are named appropriately (in the node data frame: label and id … in the edge data frame: from, to and value) the function knows how to map the attributes onto the plot. The nodes are labeled appropriately and connected via definitions of edges, which are in turn represented as specific widths based on their values.

visNetwork(patientnodes, patientlinks) 

To configure the appearance of the network’s nodes and edges you can include calls to visNodes() and visEdges(), respectively. The code below customizes the shape, color and highlight behavior. Passing parameters to customize the plot at this granular of a level will likely require reference to the vis.js API documentation, and arguments may have t be formatted as lists … or even lists of lists. The visLayout() fixes the position of the nodes via a random seed. Alternatively, the nodes will appear in different positions each time the plot is generated, which can be particularly problematic for large networks.

visNetwork(patientnodes, patientlinks) %>%
    visNodes(shape = "triangle", 
        color = list(background = "lightblue", 
            border = "darkblue",
            highlight = "firebrick")) %>%
    # fix the layout to fall in the same position
    visLayout(randomSeed = 1999)

The base layout above is helpful, but an even better depiction of the patient traffic might be a hierarchical layout, which can represent flow in a “directed” network. To convert the existing visualization to an alternate layout, pass it into an additional visHierarchicalLayout() function.

visNetwork(patientnodes,patientlinks) %>%
        visNodes(
            shape = "triangle", 
            color = list(background = "lightblue", 
            border = "darkblue",
            highlight = "firebrick"),
            shadow = list(enabled = TRUE, size = 10)) %>%
        visHierarchicalLayout(sortMethod = "directed") %>%
        visLayout(randomSeed = 1999)

Highcharter

Like D3, Highcharts is a flexible JavaScript visualization tool that can produce a variety of different kinds of plots: line, spline, area, areaspline, column, bar, pie, scatter, angular gauges, arearange, areasplinerange, columnrange and polar chart15. Unlike D3, the implementation of the library through the highcharter package condenses many of these types of plots into a single set of functions16.

Note that Highcharts is free for non-commerical use, but must be licensed for commercial applications17.

For the example that follows we’ll look at different methods for creating interactive scatterplots. The data are records describing movies in the Internet Movie Database (IMDB) as collected and made available via Hadley Wickham’s ggplot2movies package18.

First we need to load the packages and read the data directly from the package’s development repository on Github. In this case, we’ll assume we’re interested in a particular subset of the data: movies directed by Stanley Kubrick.

library(highcharter)
library(dplyr)

movies <- read_csv("https://raw.githubusercontent.com/hadley/ggplot2movies/master/data-raw/movies.csv")


kubrickdat <- data.frame(
    title = c("Fear and Desire","Killer's Kiss","Killing, The","Paths of Glory","Spartacus","Lolita","Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb", "2001: A Space Odyssey","Clockwork Orange, A","Barry Lyndon","Shining, The", "Full Metal Jacket","Eyes Wide Shut"),
    year = c(1953,1955,1956,1957,1960,1962,1964,1968,1971,1975,1980,1987,1999)
)

kubrick <-
    movies %>%
    filter(title %in% kubrickdat$title & year %in% kubrickdat$year) %>%
    select(title, year, budget, rating, length) %>%
    arrange(desc(year))
kubrick
title year budget rating length
Eyes Wide Shut 1999 65000000 7.0 159
Full Metal Jacket 1987 30000000 8.2 116
Shining, The 1980 19000000 8.3 146
Barry Lyndon 1975 11000000 8.0 184
Clockwork Orange, A 1971 2200000 8.3 137
2001: A Space Odyssey 1968 10500000 8.3 156
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb 1964 1800000 8.7 93
Lolita 1962 2000000 7.6 152
Spartacus 1960 12000000 8.0 198
Paths of Glory 1957 935000 8.6 87
Killing, The 1956 320000 8.1 85
Killer’s Kiss 1955 75000 6.6 67
Fear and Desire 1953 20000 6.3 68

highcharter provides a function for “quick and dirty” access to the Highchart library. hchart() will create a plot in a single function. For those familiar with ggplot2, this is similar to qplot(). The plot type can be specified within the function.

In our example, we’re looking at records for Stanley Kubrick movies from IMDB. If we pass a single continuous variable (say, ‘budget’) into hchart(), the function will infer that we want a histogram.

hchart(kubrick$budget)

Much like the base plot() function, hchart() can produce a variety of different kinds of plots depending on the data inputs and plot type specification. The following will produce a scatterplot of year versus rating for the Kubrick movies.

hchart(kubrick, "scatter",  hcaes(x = year, y = rating))

Both of the plots above are interactive in the sense that they display information about a point (or bin) on hover. While hchart() provides a quick way to implement this interactivity, to fully take advantage of Highhcarts features it may be necessary to use a different approach.

The highchart() function provides a “canvas” from which you can invoke specific plot types, and can accept further customization as arguments. So the code below will create a “highcart” and then (using the magrittr piping syntax) pass that object along to another function that creates a scatter plot. This method allows for more customization of the scatter plot than hchart(), as you can easily specifiy attributes to be mapped on onto the size (z) and label.

highchart() %>%
    hc_add_series_scatter(x = kubrick$year, y = kubrick$rating, z = kubrick$budget, label = kubrick$title)

The plot above provides much more information and interactivity than the previous visualizations. However, it can still be improved.

For one thing, the plot should have a title. We can give it one with hc_title(). But another issue is that the individual movie titles aren’t all visible. It appears that overlapping labels are hidden. The solution to this issue is relatively obscure, but illustrates an important feature of using JavaScript libraries in R. Like many clients, the highcharter package makes it possible to pass arbitrary agruments to the original Highcarts API. Unfortunately, these options aren’t documented within the package. But by looking at the API documentation19, it’s clear that you can control (and disable) the behavior in question. The key is that these options must be passed into the highcharter() function via the hc_opts argument, and must be defined as a named list that walks down through the HTML container for the visualization. This workflow can be tedious … but very useful. Plots can be fully customized based on these parameters.

# specify options
kubrickopts <- list(plotOptions = list(series = list(dataLabels = list(allowOverlap = TRUE))))

# create plot and pass in options to disable behavior
highchart(hc_opts = kubrickopts) %>%
    hc_add_series_scatter(x = kubrick$year, y = kubrick$rating, z = kubrick$budget, label = kubrick$title) %>%
    hc_title(text = "Stanley Kubrick IMDB Ratings By Year") 

Shiny

One of the principal advantages of JavaScript visualizations is that they are ready to insert directly into web applications. They can potentially be embedded in HTML content, and included in a variety of frameworks. One such framework, specific to R development, is Shiny20.

Created and maintained by RStudio, Shiny is a web application framework developing interactive, web-based tools with R. JavaScript / R bindings developed with htmlwidgets include functions (render*() for server.R and *Output() for ui.R) to create and update these visualizations in a Shiny app.

Future Features

The htmlwidgets package significantly cuts down on the complexity of integrating JavaScript libraries into R21. As mentioned above, this package creates boilerplate outlines and provides a helpful structure for writing the pieces of R and JavaScript code that will communicate with one another. The end result of the binding is a package that includes functions to be called to create the visualization of interest. The relative ease of this workflow has resulted in a flood of new JavaScript / R visualization tools in the past year22 … and there will likely be many more released in the future. Another new tool in development is crosstalk, which aims to facilitate interaction across JavaScript frameworks via R23.

References