Formatting data as a list is sometimes necessary. However, retrieving this kind of non-tabular information for analysis can be challenging. This workshop will introduce students to the motivations and techniques for storing and parsing list objects in R. Some familiarity with R will be helpful.

## Introduction

Compared to the data frame, vector and matrix, the list is under-represented in many introductory R tutorials. This likely has less to do with the relative importance of lists, and more to do with their potential complexity. However, an understanding of how to create, curate and manipulate objects of this type can prove immensely useful.

The list is one of the most versatile data types in R thanks to its ability to accommodate heterogeneous elements. A single list can contain multiple elements, regardless of their types or whether these elements contain further nested data. So you can have a list of a list of a list of a list of a list …

Garrett Grolemund and Hadley Wickham’s R For Data Science includes a section on lists. They use a helpful simile for the list as a shaker filled with packets of pepper¹. To retrieve individual “grains” of pepper, you’d have to first access the shaker … then the packet inside the shaker … then the pepper inside the packet.

Still confused? Here’s another way of thinking about it: the list is like a movie. Each movie has a cast, crew, budget, script, etc. These elements may have different dimensions (more cast members than crew) and be of different types (budget is a number, script is a series of characters), yet they are all part of the same movie.
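The movie analogy can be written down directly as a list. The sketch below is illustrative only: the title, names and budget are made-up placeholder values, not real data.

```r
# A hypothetical movie stored as a list; all values are placeholders.
movie <- list(
  title  = "Example Film",
  cast   = c("Actor A", "Actor B", "Actor C"),  # character vector, length 3
  crew   = c("Director", "Editor"),             # character vector, length 2
  budget = 1500000,                             # a single number
  script = list(                                # a nested list
    act1 = "Opening scene ...",
    act2 = "Closing scene ..."
  )
)

# str() gives a compact overview of the whole structure
str(movie)

# Elements differ in both type and length
class(movie$budget)   # "numeric"
class(movie$script)   # "list"
length(movie$cast)    # 3
length(movie$crew)    # 2
```

Everything still travels together in one object, which is the point of the analogy.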

We’ll use a brief review of R basics as a vehicle to get started with lists.

## R Basics

To do anything interesting in R, you must assign values (or expressions that produce values) to objects. The syntax for assignment is the name of the object followed by a <- operator and the expression to be evaluated.

x <- 3
y <- 2 + x

Although the two are mostly equivalent, the <- should be used in place of the = to improve code legibility and reduce potential mistakes … we’ll see why this is important when we start creating “named” lists.
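One place where the two operators are not interchangeable is inside a function call such as list(). The following small sketch (the object names are arbitrary) shows the difference:

```r
# Inside list(), `=` names an element ...
named <- list(wins = 17)
names(named)     # "wins"

# ... whereas `<-` assigns a variable called `wins` as a side effect
# and leaves the list element unnamed.
unnamed <- list(wins <- 17)
names(unnamed)   # NULL: the element has no name
wins             # 17: a new variable was created
```

Keeping <- for assignment and = for naming arguments makes it easier to see which of the two is intended.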

Every object has a class, which can be accessed using the class() function. Certain functions are specific to a given class. Other functions can behave differently depending on the class of the input. The “list” class is what we are interested in for this tutorial.
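As a rough illustration of class-dependent behaviour, the sketch below applies sum() to a numeric vector and to a list holding the same values. The object names are arbitrary, and tryCatch() is used only so that the error does not stop the script.

```r
nums <- c(17, 14, 14)
lst  <- list(17, 14, 14)

class(nums)  # "numeric"
class(lst)   # "list"

# sum() accepts a numeric vector ...
sum(nums)    # 45

# ... but raises an error for a list, so we catch it here
result <- tryCatch(sum(lst), error = function(e) "sum() failed on a list")
result
```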

One of the most fundamental types of objects is the vector. A vector is a series of elements from 1 to n. Each element can be accessed by an identifier (“index”) using square brackets, e.g. y[1]. We will make extensive use of a modified version of this syntax in order to manipulate list items.
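A short sketch of vector indexing, using an arbitrary example vector:

```r
wins <- c(17, 14, 14, 12, 11)

wins[1]        # first element: 17
wins[2:3]      # second and third elements: 14 14
wins[-1]       # everything except the first element
length(wins)   # n, the number of elements: 5
```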

## Creating Lists

The most direct way to create a list is with the list() function.

slamwins <- list(17,14,14,12,11)

To confirm that the object we’ve created is indeed a “list” we can use class() as described above.

class(slamwins)
## [1] "list"

OK. Let’s see what a list looks like as printed output …

slamwins
## [[1]]
## [1] 17
##
## [[2]]
## [1] 14
##
## [[3]]
## [1] 14
##
## [[4]]
## [1] 12
##
## [[5]]
## [1] 11

## Indexing Lists

The printed output above isn’t pretty, but it does include some hints as to how we can isolate specific elements of the list. In this case there are double square brackets (e.g. [[1]]) as well as single square brackets (e.g. [1]). As with vectors, data frames and matrices, the bracket notation is used for indexing. However, a list can have multiple levels of indices. The value in the double brackets represents the number of the parent element in the list. The value in the single brackets represents the number of the element within that parent element. We can chain this notation together to access granular parts of our list.

slamwins[[2]][1]
## [1] 14
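The difference between single and double brackets is easiest to see by checking the class of what each returns. A minimal sketch, using a fresh copy of the same values under an arbitrary name:

```r
wins <- list(17, 14, 14, 12, 11)

# Single brackets return a sub-list (still a list)
one_bracket <- wins[2]
class(one_bracket)     # "list"

# Double brackets return the element itself
two_brackets <- wins[[2]]
class(two_brackets)    # "numeric"

# Single brackets can select several elements at once
length(wins[c(1, 3)])  # 2
```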

If we’d prefer a more explicit way to access elements of a list, we can give them names. The names() function lets you assign a character vector, of the same length as the list, whose values become the names of the corresponding elements.

names(slamwins) <- c("Federer", "Sampras", "Nadal", "Djokovic", "Borg")
slamwins
## $Federer
## [1] 17
##
## $Sampras
## [1] 14
##
## $Nadal
## [1] 14
##
## $Djokovic
## [1] 12
##
## $Borg
## [1] 11

Another way to set names is to do so while creating the list.

slamwins <- list(Federer = 17, Sampras = 14, Nadal = 14, Djokovic = 12, Borg = 11)
slamwins
## $Federer
## [1] 17
##
## $Sampras
## [1] 14
##
## $Nadal
## [1] 14
##
## $Djokovic
## [1] 12
##
## $Borg
## [1] 11

With our list named, we can now use the $ operator to extract specific values by key.

slamwins$Federer
## [1] 17
# how many more titles does Federer have than Borg?
slamwins$Federer - slamwins$Borg
## [1] 6
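The $ operator is a shorthand for double brackets with a name. The sketch below, on a shortened copy of the list under an arbitrary name, shows that the two agree, and one case where only the bracket form works:

```r
wins <- list(Federer = 17, Sampras = 14, Borg = 11)

# Equivalent ways to extract one element by name
wins$Federer
wins[["Federer"]]

# [[ ]] accepts a name stored in a variable; $ does not
player <- "Borg"
wins[[player]]   # 11
wins$player      # NULL, because no element is literally named "player"
```

The bracket form is therefore the one to reach for inside loops and functions, where the name is usually held in a variable.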

The example above could be considered a minimal viable list … there’s a single level of named elements, which just as easily could have been stored as a vector. Let’s add another layer of data nested into our list object.

slamwins <-
  list(
    Federer =
      list(
        AUS = 4,
        FR = 1,
        WIM = 7,
        US = 5),
    Sampras =
      list(
        AUS = 2,
        FR = 0,
        WIM = 7,
        US = 5),
    Nadal =
      list(
        AUS = 1,
        FR = 9,
        WIM = 2,
        US = 2),
    Djokovic =
      list(
        AUS = 6,
        FR = 1,
        WIM = 3,
        US = 2),
    Borg =
      list(
        AUS = 0,
        FR = 6,
        WIM = 5,
        US = 0)
  )

In this case we have created a named list of 5 named lists, each of which has 4 named values.
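Values in a nested list are reached by chaining one accessor per level. A minimal sketch on a trimmed-down, two-player copy of the list (stored under a different name so that it does not overwrite slamwins):

```r
# A trimmed-down version of the nested list, with two players only
wins <- list(
  Federer = list(AUS = 4, FR = 1, WIM = 7, US = 5),
  Borg    = list(AUS = 0, FR = 6, WIM = 5, US = 0)
)

wins$Federer$WIM          # 7
wins[["Borg"]][["FR"]]    # 6
wins[[2]][[3]]            # by position: Borg's third value (WIM), 5

names(wins)               # top-level names
names(wins$Federer)       # names inside one player
```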

But wait … we’re missing something … we have the number of slam wins by event but what about the total number of wins per player?

## Editing Lists

One way to solve the problem we’re encountering would be to use the indexing syntax discussed earlier to match our “totals” with the appropriate list item. That would basically amount to using a for loop:

totals <- c(17, 14, 14, 12, 11)

# add the known total to each player's list, matching by position
for (i in seq_along(slamwins)) {
  slamwins[[i]]$Total <- totals[i]
}

slamwins
## $Federer
## $Federer$AUS
## [1] 4
##
## $Federer$FR
## [1] 1
##
## $Federer$WIM
## [1] 7
##
## $Federer$US
## [1] 5
##
## $Federer$Total
## [1] 17
##
##
## $Sampras
## $Sampras$AUS
## [1] 2
##
## $Sampras$FR
## [1] 0
##
## $Sampras$WIM
## [1] 7
##
## $Sampras$US
## [1] 5
##
## $Sampras$Total
## [1] 14
##
##
## $Nadal
## $Nadal$AUS
## [1] 1
##
## $Nadal$FR
## [1] 9
##
## $Nadal$WIM
## [1] 2
##
## $Nadal$US
## [1] 2
##
## $Nadal$Total
## [1] 14
##
##
## $Djokovic
## $Djokovic$AUS
## [1] 6
##
## $Djokovic$FR
## [1] 1
##
## $Djokovic$WIM
## [1] 3
##
## $Djokovic$US
## [1] 2
##
## $Djokovic$Total
## [1] 12
##
##
## $Borg
## $Borg$AUS
## [1] 0
##
## $Borg$FR
## [1] 6
##
## $Borg$WIM
## [1] 5
##
## $Borg$US
## [1] 0
##
## $Borg$Total
## [1] 11

There are a couple of potential issues with this code. The main thing is that we need to know what the totals are ahead of time. It would be a lot better to calculate those dynamically in case our underlying data changes … or in case we’re performing a calculation that’s not as simple as a sum. Another problem with this approach is that it’s implemented with a for loop, which is a construct that works when programming R but can be problematic².

Enter the “apply” functions …

For this lesson, the two most relevant members of this family of functions are lapply() and sapply(), both of which allow you to pass other functions to each element of a list.
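Before applying them to the tennis data, here is a minimal sketch of the two functions on a small made-up list. The names and values are arbitrary.

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# lapply() returns a list with one result per element
as_list <- lapply(scores, sum)
class(as_list)   # "list"

# sapply() simplifies the same results to a named numeric vector
as_vector <- sapply(scores, sum)
as_vector
```

Both calls apply sum() to each element in turn; they differ only in the shape of what comes back.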

Before we start working with these functions, we need to restore our list to the state it was in before we ran the loop to add the sums for each element. Assigning an element as NULL effectively deletes that element from the list.

# remove the Total element from each player's list
for (i in seq_along(slamwins)) {
  slamwins[[i]]$Total <- NULL
}

And because we have nested data (lists within lists within lists …) we also need to understand how to use unlist() in order to apply our functions appropriately. unlist() simply returns a “flat” version of all of the elements in the list as a vector. You can specify whether this is recursive (i.e. flattens all lists of lists) and whether to retain or discard any names you have for your list.

In this context, we’ll use unlist() in conjunction with lapply() to reduce the complexity of our original list.

lapply(slamwins, unlist)
## $Federer
## AUS  FR WIM  US
##   4   1   7   5
##
## $Sampras
## AUS  FR WIM  US
##   2   0   7   5
##
## $Nadal
## AUS  FR WIM  US
##   1   9   2   2
##
## $Djokovic
## AUS  FR WIM  US
##   6   1   3   2
##
## $Borg
## AUS  FR WIM  US
##   0   6   5   0
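The recursive and naming options of unlist() can be seen on a small made-up nested list. This is a sketch only; the element names are arbitrary.

```r
nested <- list(a = 1, b = list(c = 2, d = list(e = 3)))

# Default: flatten every level and build names from the path
flat <- unlist(nested)
names(flat)   # "a" "b.c" "b.d.e"

# recursive = FALSE flattens only the top level, so the result is still a list
one_level <- unlist(nested, recursive = FALSE)
class(one_level)

# use.names = FALSE drops the names
unlist(nested, use.names = FALSE)
```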

The lapply() function will go to each element in the highest level of the list, and perform an arbitrary action. In this case, we’ve “unlisted” each of the player lists in our slamwins object. It is important to understand that lapply() always returns a list. So essentially we’ve just created another list, which we could then use within another lapply() call.

lapply(lapply(slamwins, unlist), sum)
## $Federer
## [1] 17
##
## $Sampras
## [1] 14
##
## $Nadal
## [1] 14
##
## $Djokovic
## [1] 12
##
## $Borg
## [1] 11

Now that we’ve figured out how to calculate the values we’re interested in, we just need to append them to the original list. One of the keys here is appreciating that lapply() can take any function (including one that we write … an “anonymous function”³) and use that operation on each element in the list.

Another point worth noting is that the c() function works on lists. Most introductory R tutorials include examples of using c() to create a vector, and it works very similarly for lists. Essentially it appends either a single item or a list of items onto the list.

slamwins <- lapply(lapply(slamwins, unlist), function(x) c(x, Total = sum(x)))
slamwins
## $Federer
##   AUS    FR   WIM    US Total
##     4     1     7     5    17
##
## $Sampras
##   AUS    FR   WIM    US Total
##     2     0     7     5    14
##
## $Nadal
##   AUS    FR   WIM    US Total
##     1     9     2     2    14
##
## $Djokovic
##   AUS    FR   WIM    US Total
##     6     1     3     2    12
##
## $Borg
##   AUS    FR   WIM    US Total
##     0     6     5     0    11
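The two ideas used above, c() on lists and anonymous functions, can also be tried in isolation. A minimal sketch with a separate, flat list (the "add one" function is arbitrary and only there to show the mechanics):

```r
# c() appends to a list just as it does to a vector
players <- list(Federer = 17, Sampras = 14)
players <- c(players, Nadal = 14)
names(players)

# c() can also join two lists
more <- list(Djokovic = 12, Borg = 11)
all_players <- c(players, more)
length(all_players)   # 5

# An anonymous function passed to lapply(): add one to each value
plus_one <- lapply(all_players, function(x) x + 1)
plus_one$Borg         # 12
```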

## Converting Lists

Using the subsetting and manipulation features above we can perform a wide variety of manipulations on our list object. But ultimately (especially if you’re familiar with the “Tidyverse” approach to using R) it may be helpful to cast list data in a tabular format … a data frame.

as.data.frame(slamwins)
##       Federer Sampras Nadal Djokovic Borg
## AUS         4       2     1        6    0
## FR          1       0     9        1    6
## WIM         7       7     2        3    5
## US          5       5     2        2    0
## Total      17      14    14       12   11
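Another route to a table is row-wise binding with do.call() and rbind(), which puts each list element in its own row. The idiom can look opaque at first, so here is a sketch on a two-player toy list (a cut-down, hypothetical stand-in for slamwins):

```r
# Toy data: two named vectors of equal length stored in a list
wins <- list(
  Federer = c(AUS = 4, FR = 1),
  Borg    = c(AUS = 0, FR = 6)
)

# do.call() calls rbind() with the list elements as its arguments,
# so this is the same as rbind(Federer = wins$Federer, Borg = wins$Borg)
by_row <- do.call(rbind, wins)
by_row
dim(by_row)        # 2 rows (players) by 2 columns (events)
rownames(by_row)   # "Federer" "Borg"
```

The result is a matrix whose row names come from the list names, which is why the player names can later be recovered with row.names().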

Binding the elements row-wise with do.call() and rbind() gives one row per player instead:

datmat <- do.call(rbind, slamwins)
datdf <- as.data.frame(datmat, row.names = FALSE)
datdf$player <- row.names(datmat)
datdf
##   AUS FR WIM US Total   player
## 1   4  1   7  5    17  Federer
## 2   2  0   7  5    14  Sampras
## 3   1  9   2  2    14    Nadal
## 4   6  1   3  2    12 Djokovic
## 5   0  6   5  0    11     Borg

## Lists “In The Wild”

The above is a contrived example. In practice, you’re much more likely to encounter lists written by other people (or applications) than to code out a list of your own.

The example data we’ll use will be pulled from an Application Programming Interface (API) for the github.com website⁴. Like many other web APIs, the data comes out in JavaScript Object Notation (JSON). JSON is a format for storing and transmitting “semi-structured” data⁵. Keys and values are paired together to facilitate parsing⁶. When read into R, JSON is interpreted as a list.

### Example

GitHub is a platform for sharing, storing and managing code. Projects can be defined in a “repository” structure. The example that follows will look at repositories for a single user: Hadley Wickham.

To read the data into R, we can use the fromJSON() function from the jsonlite package. For this example, we can pull each page of results (in this case, we know a priori that there are two pages) and make sure to pass the simplifyVector = FALSE argument after the URL.

library(jsonlite)
had1 <- fromJSON("https://api.github.com/users/hadley/repos?page=1&per_page=100", simplifyVector = FALSE)
had2 <- fromJSON("https://api.github.com/users/hadley/repos?page=2&per_page=100", simplifyVector = FALSE)

The data are stored in two separate lists, so we need to combine them with the c() function. Since the original objects are no longer necessary (and may be large), it’s probably a good idea to remove them.

had <- c(had1, had2)
rm(had1, had2)

The first item of interest is to know how many elements are in this list:

length(had)
## [1] 200

It’s also helpful to take a peek at the data structure:

had[[1]]
## $id
## [1] 40423928
##
## $name
## [1] "15-state-of-the-union"
##
## $full_name
##
## $owner
## $owner$login
## [1] "hadley"
##
## $owner$id
## [1] 4196
##
## $owner$avatar_url
## [1] "https://avatars3.githubusercontent.com/u/4196?v=4"
##
## $owner$gravatar_id
## [1] ""
##
## $owner$url
## [1] "https://api.github.com/users/hadley"
##
## $owner$html_url
## [1] "https://github.com/hadley"
##
## $owner$followers_url
## [1] "https://api.github.com/users/hadley/followers"
##
## $owner$following_url
## [1] "https://api.github.com/users/hadley/following{/other_user}"
##
## $owner$gists_url
## [1] "https://api.github.com/users/hadley/gists{/gist_id}"
##
## $owner$starred_url
## [1] "https://api.github.com/users/hadley/starred{/owner}{/repo}"
##
## $owner$subscriptions_url
## [1] "https://api.github.com/users/hadley/subscriptions"
##
## $owner$organizations_url
## [1] "https://api.github.com/users/hadley/orgs"
##
## $owner$repos_url
## [1] "https://api.github.com/users/hadley/repos"
##
## $owner$events_url
## [1] "https://api.github.com/users/hadley/events{/privacy}"
##
## $owner$received_events_url
## [1] "https://api.github.com/users/hadley/received_events"
##
## $owner$type
## [1] "User"
##
## $owner$site_admin
## [1] FALSE
##
##
## $private
## [1] FALSE
##
## $html_url
## [1] "https://github.com/hadley/15-state-of-the-union"
##
## $description
## NULL
##
## $fork
## [1] FALSE
##
## $url
##
## $forks_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/forks"
##
## $keys_url
##
## $collaborators_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/collaborators{/collaborator}"
##
## $teams_url
##
## $hooks_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/hooks"
##
## $issue_events_url
##
## $events_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/events"
##
## $assignees_url
##
## $branches_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/branches{/branch}"
##
## $tags_url
##
## $blobs_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/git/blobs{/sha}"
##
## $git_tags_url
##
## $git_refs_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/git/refs{/sha}"
##
## $trees_url
##
## $statuses_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/statuses/{sha}"
##
## $languages_url
##
## $stargazers_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/stargazers"
##
## $contributors_url
##
## $subscribers_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/subscribers"
##
## $subscription_url
##
## $commits_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/commits{/sha}"
##
## $git_commits_url
##
## $comments_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/comments{/number}"
##
## $issue_comment_url
##
## $contents_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/contents/{+path}"
##
## $compare_url
##
## $merges_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/merges"
##
## $archive_url
##
## $downloads_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/downloads"
##
## $issues_url
##
## $pulls_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/pulls{/number}"
##
## $milestones_url
##
## $notifications_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/notifications{?since,all,participating}"
##
## $labels_url
##
## $releases_url
## [1] "https://api.github.com/repos/hadley/15-state-of-the-union/releases{/id}"
##
## $deployments_url
##
## $created_at
## [1] "2015-08-09T03:22:26Z"
##
## $updated_at
## [1] "2017-05-03T02:53:29Z"
##
## $pushed_at
## [1] "2015-08-10T20:29:10Z"
##
## $git_url
##
## $ssh_url
## [1] "git@github.com:hadley/15-state-of-the-union.git"
##
## $clone_url
##
## $svn_url
## [1] "https://github.com/hadley/15-state-of-the-union"
##
## $homepage
## NULL
##
## $size
## [1] 4519
##
## $stargazers_count
## [1] 24
##
## $watchers_count
## [1] 24
##
## $language
## [1] "R"
##
## $has_issues
## [1] TRUE
##
## $has_projects
## [1] TRUE
##
## $has_downloads
## [1] TRUE
##
## $has_wiki
## [1] TRUE
##
## $has_pages
## [1] FALSE
##
## $forks_count
## [1] 7
##
## $mirror_url
## NULL
##
## $open_issues_count
## [1] 0
##
## $forks
## [1] 7
##
## $open_issues
## [1] 0
##
## $watchers
## [1] 24
##
## $default_branch
## [1] "master"

Some of the elements and sub-elements of this particular list are nested (lists of lists) … but overall this data is formatted in a friendly, parseable format. Each parent element has the same number of children, which are named and defined as “key : value” pairs.

So if we wanted to extract a specific child element from one of its parents, we could use something like the following:

had[[5]]$language
## [1] "Python"

We mentioned sapply() above, and now we can put it into action. This function will be useful in extracting the same child elements from different parents. To do so, we’ll need to define an anonymous function to apply across the list. Note that sapply() is similar to lapply() but, where possible, simplifies its result to a vector, matrix or array rather than returning a list.

sapply(had, function(x) x$watchers)
##   [1]   24   13    0 1030    5   82   59   10    4  260    5   13    6    1
##  [15]    0    8   12    1    0   15    6    6    5   35  117   21   22    4
##  [29]   44  146    2    0   45    7 1545    5    3   17    1    0    1   91
##  [43]    6  128    1    4   11    3    6   16    3   61    3    3    3   14
##  [57]    8  407    5    4   72    7   26    3    8    9    0    6   28    3
##  [71]    5    2    2    0   11    2    1   12    8   79    0    3    4   92
##  [85]    3    0   32    6   23   10    5    2   11    8    0   41   14  287
##  [99]   10    3    9   28   41    3   39    4    0  260  469    4   64    6
## [113]   22   30    9  126   58    6    3   98   31  183   55    2    6    2
## [127]  178  906    8    6    4    1    1    0    5    1    0    4    1   30
## [141]    1   21    3   14    0  159    5    6   10    0    1    1   13    4
## [155]   22   28    1   10    7  846    2    0    8   90   91   31    0    3
## [169]   22    0   14   26    5    1    4    7    3    5  128    1    2    6
## [183]  195    7    2    1   23   19    1   27    5    8    1    1    0    4
## [197]    0    8    6    0

We’ve successfully extracted the child element of interest from each of the parent elements in the list. However, this vector could be hard to interpret since the elements are divorced from the larger context. One solution might be to assign names to the original list, which will give sapply() a named vector output.

names(had) <- sapply(had, function(x) x$name)
sapply(had, function(x) x$watchers)
##  15-state-of-the-union      15-student-papers               500lines
##                     24                     13                      0
##                   1030                      5                     82
##              babynames         beautiful-data                  bench
##                     59                     10                      4
##                 bigvis         bigvis-infovis         boxplots-paper
##                    260                      5                     13
##                  broom                builder             cellranger
##                      6                      1                      0
##              classifly             clusterfly       cocktail-balance
##                      8                     12                      1
##                      0                     15                      6
##          cran-packages              cranatics             crantastic
##                      6                      5                     35
##        data-baby-names          data-counties      data-fuel-economy
##                    117                     21                     22
##               data-gbd    data-housing-crisis            data-movies
##                      4                     44                    146
##            data-stride               datafest                decumar
##                      2                      0                     45
##             densityvis               devtools           directlabels
##                      7                   1545                      5
##              distpower                 docker                   docs
##                      3                     17                      1
##                      0                      1                     91
##                eggnogr                    emo              example-r
##                      6                    128                      1
##              extrafont              fec-dplyr        fivethirtyeight
##                      4                     11                      3
##                fortify            fueleconomy                gdtools
##                      6                     16                      3
##                   gg2v             ggenealogy                  ggmap
##                     61                      3                      3
##                 ggplot                ggplot1        ggplot2-bayarea
##                      3                     14                      8
##           ggplot2-book           ggplot2-docs          ggplot2movies
##                    407                      5                      4
##                 ggstat               ggthemes                 gtable
##                     72                      7                     26
##                      3                      8                      9
##              hclpicker      healthyr_preamble                  helpr
##                      0                      6                     28
##     herndon-ash-pollin               hflights      highlighting-kate
##                      3                      5                      2
##                httpbin                 httpuv                  ideas
##                      2                      0                     11
##              imvisoned                 kmeans                   l1tf
##                      2                      1                     12
##                 layers               lazyeval                leaflet
##                      8                     79                      0
##          leaflet-shiny                legends               lineprof
##                      3                      4                     92
##                 linval                   lme4                 lobstr
##                      3                      0                     32
##               localmds                 lvplot           lvplot-paper
##                      6                     23                     10
##                      5                      2                     11
##                      8                      0                     41
##                 mturkr             multidplyr                 mutatr
##                     14                    287                     10
##              mutatrGui            nasaweather                  neiss
##                      3                      9                     28
##           nycflights13               olctools            oldbookdown
##                     41                      3                     39
##                packman               PivotalR                pkgdown
##                      4                      0                    260
##                   plyr              pop-flows                 precis
##                    469                      4                     64
##          prodplotpaper           productplots                  profr
##                      6                     22                     30
##                  proto                   pryr               purrrlyr
##                      9                    126                     58
##          qtpaint-demos      r-devel-san-clang            r-internals
##                      6                      3                     98
##            r-on-github                 r-pkgs               r-python
##                     31                    183                     55
##               r-source               r-travis                 r-yaml
##                      2                      6                      2
##                   r2d3                   r4ds    ranking-correlation
##                    178                    906                      8
##              rastermap                rblocks                Rcereal
##                      6                      4                      1
##                   Rcpp           rcpp-gallery           RcppDateTime
##                      1                      0                      5
##           rcpplonglong           RcppProgress            rcrunchbase
##                      1                      0                      4
##                      1                     30                      1
##                recipes    redesigned-barnacle                 remake
##                     21                      3                     14
##                 reprex                reshape                  rfmt2
##                      0                    159                      5
##               rifftron                    rio riotworkshop.github.io
##                      6                     10                      0
##                  rJava              rmarkdown                 rminds
##                      1                      1                     13
##               roxygen2               roxygen3                 rsmith
##                      4                     22                     28
##                RSQLite                 rtweet                    rv2
##                      1                     10                      7
##                  rvest              rworldmap                   rydn
##                    846                      2                      0
##            scagnostics                 scales                 secure
##                      8                     90                     91
##              sfhousing                    sfr                  shiny
##                     31                      0                      3
##           shinySignals               simpleS4               sinartra
##                     22                      0                     14
##                  sloop             spatialVis               sqlutils
##                     26                      5                      1
##       stat405-practice      stat405-resources  STAT545-UBC.github.io
##                      4                      7                      3
##             stationaRy                 strict              strptimer
##                      5                    128                      1
##                syuzhet              tanglekit              tidy-data
##                      2                      6                    195
##                toc-vis               unittest          USAboundaries
##                      7                      2                      1
##          usdanutrients                  vctrs                   vega
##                     23                     19                      1
##                vis-eda          vis-migration                   vita
##                     27                      5                      8
##                      1                      1                      0
##                 weeder         weight-and-see            wesanderson
##                      4                      0                      8
##                whisker               wishlist
##                      6                      0
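The same naming pattern can be tried offline on a tiny stand-in for the repository list. Everything in the sketch below (the repos object, its field names and its values) is made up for illustration. It also shows what sapply() does when results cannot be simplified, and vapply() as a stricter base R alternative.

```r
# A hypothetical miniature of the repository list; values are made up
repos <- list(
  list(name = "alpha", watchers = 3, topics = c("r", "data")),
  list(name = "beta",  watchers = 0, topics = character(0))
)
names(repos) <- sapply(repos, function(x) x$name)

# Equal-length results simplify to a named vector
watchers <- sapply(repos, function(x) x$watchers)
watchers

# Results of differing lengths cannot be simplified, so sapply() returns a list
topics <- sapply(repos, function(x) x$topics)
class(topics)   # "list"

# vapply() states the expected type and length, and errors if they differ
checked <- vapply(repos, function(x) x$watchers, numeric(1))
checked
```

When the shape of the output matters downstream, vapply() makes that expectation explicit rather than leaving it to simplification.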

### Exercise

• How many times are these repositories forked on average?
• Try reading the data from Github again. Make sure to use the simplifyVector = TRUE argument instead. What happened?

## Other Methods

There are many, many ways to work with lists. What follows is a very brief nod to a few features from packages that help address list complexity.

#### rlist

rlist includes a set of very useful tools for list manipulation⁷.

Some highlights:

• list.map()
• list.sort()
• list.filter()
• list.group()
• list.table()

library(rlist)
list.table(had, fork)

#### purrr

purrr provides functional programming tools that work well with lists.

Some highlights:

• map(): allows functions to be passed to each element of the list (roughly analogous to sapply() or lapply())
• flatten(): simplifies a list to a vector (roughly analogous to unlist())
• transpose(): turns a list inside out (transpose() then transpose() will revert the list back to original state)