## Precision Medicine Bioinformatics

Introduction to bioinformatics for DNA and RNA sequence analysis

# Data munging / cleaning

Adapted from the NCEAS dataviz module and R for Data Science

# Introduction

## Learning outcomes

Students should

• Learn what it means to examine and clean their data
• Learn how to reshape and restructure their data with base R and dplyr functions
• Pull out informative subsets of a dataset to make an attractive figure

## Lesson

Let’s look at an example dataset (one we’ve made) and go over some ways we might QA/QC this dataset.

It is an arbitrary temperature study with six sites. Within each site, there are 6 plots and we took 10 samples of temperature (degrees C) per plot.

### Step one: Import and prepare

fundata <- read.csv("http://genomedata.org/seq-tec-workshop/misc/fundata.csv")
head(fundata) # Look at the data

##   siteplot       date temp_c
## 1      A.1 03-21-2017  31.50
## 2      B.2 04-21-2017  33.63
## 3      C.3 05-21-2017  32.56
## 4      D.4 06-21-2017  31.22
## 5      E.5 07-21-2017  25.44
## 6      F.6 08-21-2017  33.55

str(fundata)

## 'data.frame':    355 obs. of  3 variables:
##  $siteplot: chr "A.1" "B.2" "C.3" "D.4" ... ##$ date    : chr  "03-21-2017" "04-21-2017" "05-21-2017" "06-21-2017" ...
##  $temp_c : num 31.5 33.6 32.6 31.2 25.4 ...  summary(fundata)  ## siteplot date temp_c ## Length:355 Length:355 Min. :-9999.00 ## Class :character Class :character 1st Qu.: 26.75 ## Mode :character Mode :character Median : 30.20 ## Mean : -111.25 ## 3rd Qu.: 33.36 ## Max. : 180.00 ## NA's :1  What’s our assessment of this dataset from the above commands? • The site and plot codes are smushed together in one column. • The dates look a bit funky. What format is that? • The range on the temperatures surely can’t be right #### Split siteplot into two columns Here’s some example code showing how to generate a dataframe with ugly smashed together variables. and then how to split it out: library(tidyverse) lettersdf <- data.frame(letters = paste(LETTERS, rev(LETTERS), sep = ".")) head(lettersdf)  ## letters ## 1 A.Z ## 2 B.Y ## 3 C.X ## 4 D.W ## 5 E.V ## 6 F.U  separate(lettersdf,letters, c("letter_one", "letter_two"), sep = "\\.")  ## letter_one letter_two ## 1 A Z ## 2 B Y ## 3 C X ## 4 D W ## 5 E V ## 6 F U ## 7 G T ## 8 H S ## 9 I R ## 10 J Q ## 11 K P ## 12 L O ## 13 M N ## 14 N M ## 15 O L ## 16 P K ## 17 Q J ## 18 R I ## 19 S H ## 20 T G ## 21 U F ## 22 V E ## 23 W D ## 24 X C ## 25 Y B ## 26 Z A  Exercise: Using your fundata object,split the siteplot column into two columns called site and plot: #### Convert the dates to real R dates These dates use a format that isn’t wrong, but isn’t ideal for plotting and such. What happens when we try to sort them? head(sort(fundata$date))

## [1] "03-21-2016" "03-21-2017" "03-21-2017" "03-21-2017"
## [5] "03-21-2017" "03-21-2017"


Ugh - these are characters, and so it’s not sorting how we might expect.

When you have dates in R, it’s usually best to convert them to a Date object. The function we’ll use is as.Date. Here’s an example of using it on a few dates:

datestrings <- c("2000-08-03", "2017-02-20", "1980-04-27")
class(datestrings)

## [1] "character"

mydate <- as.Date(datestrings, format = "%Y-%m-%d")
mydate

## [1] "2000-08-03" "2017-02-20" "1980-04-27"

class(mydate)

## [1] "Date"


Exercise: Convert the date column in fundata to from a character vector to a Date vector with as.Date():

### Step two: Checking assumptions in the data

#### site and plot columns

Let’s start by looking at the site column for potential issues.

The table function is a great way to tally the occurrences of each of the unique values in a vector:

table(c("A", "A", "B", "C"))

##
## A B C
## 2 1 1

fish <- c("gag grouper", "striped bass", "red drum", "gag grouper")
table(fish)

## fish
##  gag grouper     red drum striped bass
##            2            1            1


To ahead and apply table to the site column. Based on our experimental design, we expect to see 60 observations from each site. Do we?

We can see we’re missing some observations from A, C, and E. Depending on our needs, we may need to go back to our field notes to find out what happened.

#### temp_c column

Get some basic stats about the temp_c column with summary:

summary(fundata$temp_c)  ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## -9999.00 26.75 30.20 -111.25 33.36 180.00 1  Or look directly at the range of values with range: range(fundata$temp_c)

## [1] NA NA


Plotting your data is always a good idea and we can often find new insights or issues this way. For univariate data, a box plot is a great way to look at the distribution of values:

boxplot(rnorm(100, 50, 5))


Exercise Use the boxplot() function to make a boxplot of the temperature values

Those outlier values can’t be right! Before we move on, let’s fix the -9999 and 180 observations by removing them.

Exercise: Remove the rows with temperatures of -9999 and 180:

Oh no, not only do we have weird data, we also have missing data:

any(is.na(fundata$temp_c))  ## [1] TRUE  “NA” is its own type in R. is.na(2)  ## [1] FALSE  is.na(NA)  ## [1] TRUE  is.na(c(1, NA, 3))  ## [1] FALSE TRUE FALSE  fish <- c("gag grouper", NA, "red drum", NA) fish  ## [1] "gag grouper" NA "red drum" NA  # Filter NAs in a character vector fish[!is.na(fish)]  ## [1] "gag grouper" "red drum"  # Remember we can subset the rows in a data.frame like: fundata[c(1, 2, 3),] # First, second, third rows  ## siteplot date temp_c ## 1 A.1 03-21-2017 31.50 ## 2 B.2 04-21-2017 33.63 ## 3 C.3 05-21-2017 32.56  #or fundata[which(fundata$site == "A"),] # Just the rows with site == "A"

## [1] siteplot date     temp_c
## <0 rows> (or 0-length row.names)

# Write an expression using is.na to subset fundata and save the result
# e.g. fundata$site <-  Your experimental design may dictate the way you handle missing data, but today, we’re just going to remove those rows. Exercise Write an expression using is.na to subset fundata, removing rows with missing temperature values and save the result ### Manipulating data with dplyr The dplyr package contains lots of useful functions for manipulating data. Let’s try it with the starwars data set. As a first step, let’s look at what’s in this: starwars  ## # A tibble: 87 × 14 ## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex ## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> ## 1 Luke Skyw… 172 77 blond fair blue 19 male ## 2 C-3PO 167 75 <NA> gold yellow 112 none ## 3 R2-D2 96 32 <NA> white,… red 33 none ## 4 Darth Vad… 202 136 none white yellow 41.9 male ## 5 Leia Orga… 150 49 brown light brown 19 fema… ## 6 Owen Lars 178 120 brown,… light blue 52 male ## 7 Beru Whit… 165 75 brown light blue 47 fema… ## 8 R5-D4 97 32 <NA> white,… red NA none ## 9 Biggs Dar… 183 84 black light brown 24 male ## 10 Obi-Wan K… 182 77 auburn… fair blue-g… 57 male ## # … with 77 more rows, 6 more variables: gender <chr>, ## # homeworld <chr>, species <chr>, films <list>, ## # vehicles <list>, starships <list>, and abbreviated variable ## # names ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year ## # ℹ Use print(n = ...) to see more rows, and colnames() to see all variable names  Notice that this output doesn’t look exactly like our fundata data frame. It’s stored in something called a “tibble”. The details are largely beyond this course, but thinking of it as a “fancy data frame” is pretty much right on. We can use base R functions format for interfacing with this data: You can access a single element: starwars[3,2]  ## # A tibble: 1 × 1 ## height ## <int> ## 1 96  or filter the data to only characters that have light skin and brown eyes: starwars[starwars$skin_color == "light" & starwars\$eye_color == "brown", ]

## # A tibble: 7 × 14
##   name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex
##   <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr>
## 1 Leia Organa    150    49 brown   light   brown        19 fema…
## 2 Biggs Dark…    183    84 black   light   brown        24 male
## 3 Cordé          157    NA brown   light   brown        NA fema…
## 4 Dormé          165    NA brown   light   brown        NA fema…
## 5 Raymus Ant…    188    79 brown   light   brown        NA male
## 6 Poe Dameron     NA    NA brown   light   brown        NA male
## 7 Padmé Amid…    165    45 brown   light   brown        46 fema…
## # … with 6 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year
## # ℹ Use colnames() to see all variable names


### Filter rows with filter()

dplyr gives us some new fancy tools to slice and dice this data any way we want, and also introduces a new way to move data around with pipes.

To achieve the same filter as above, we can use this command:

starwars %>% filter(skin_color == "light", eye_color == "brown")

## # A tibble: 7 × 14
##   name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex
##   <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr>
## 1 Leia Organa    150    49 brown   light   brown        19 fema…
## 2 Biggs Dark…    183    84 black   light   brown        24 male
## 3 Cordé          157    NA brown   light   brown        NA fema…
## 4 Dormé          165    NA brown   light   brown        NA fema…
## 5 Raymus Ant…    188    79 brown   light   brown        NA male
## 6 Poe Dameron     NA    NA brown   light   brown        NA male
## 7 Padmé Amid…    165    45 brown   light   brown        46 fema…
## # … with 6 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names
## #   ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year
## # ℹ Use colnames() to see all variable names


These work the same as unix pipes - data flows from left to right. We took the data contained in starwars and piped into the filter function and told it which columns to operate on. This is in some ways a little more intuitive and easier to type than the above

### Arrange rows with arrange()

We can also reorder or sort data:

starwars %>% arrange(height, mass)

## # A tibble: 87 × 14
##    name       height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex
##    <chr>       <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr>
##  1 Yoda           66    17 white   green   brown       896 male
##  2 Ratts Tye…     79    15 none    grey, … unknown      NA male
##  3 Wicket Sy…     88    20 brown   brown   brown         8 male
##  4 Dud Bolt       94    45 none    blue, … yellow       NA male
##  5 R2-D2          96    32 <NA>    white,… red          33 none
##  6 R4-P17         96    NA none    silver… red, b…      NA none
##  7 R5-D4          97    32 <NA>    white,… red          NA none
##  8 Sebulba       112    40 none    grey, … orange       NA male
##  9 Gasgano       122    NA none    white,… black        NA male
## 10 Watto         137    NA black   blue, … yellow       NA male
## # … with 77 more rows, 6 more variables: gender <chr>,
## #   homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>, and abbreviated variable
## #   names ¹​hair_color, ²​skin_color, ³​eye_color, ⁴​birth_year
## # ℹ Use print(n = ...) to see more rows, and colnames() to see all variable names


If you provide more than one column name, each additional column will be used to break ties in the values of the preceding column.

### Select columns with select()

This is kind of a big dataset with lots of columns, maybe we only really care about a few of them. Let’s grab just those:

starwars %>% select(hair_color, skin_color, eye_color)

## # A tibble: 87 × 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>
##  1 blond         fair        blue
##  2 <NA>          gold        yellow
##  3 <NA>          white, blue red
##  4 none          white       yellow
##  5 brown         light       brown
##  6 brown, grey   light       blue
##  7 brown         light       blue
##  8 <NA>          white, red  red
##  9 black         light       brown
## 10 auburn, white fair        blue-gray
## # … with 77 more rows
## # ℹ Use print(n = ...) to see more rows


If you want to save the output of these filtering operations for use in later code, instead of just sending it to the screen, add a variable name:

mydata = starwars %>% select(hair_color, skin_color, eye_color)


Finally, we can chain together powerful expressions to group and summarize data:

starwars %>%
group_by(species, sex) %>%
select(height, mass) %>%
summarize(
height = mean(height, na.rm = TRUE),
mass = mean(mass, na.rm = TRUE)
)

## Adding missing grouping variables: species, sex
## summarise() has grouped output by 'species'. You can override
## using the .groups argument.

## # A tibble: 41 × 4
## # Groups:   species [38]
##    species   sex    height  mass
##    <chr>     <chr>   <dbl> <dbl>
##  1 Aleena    male      79   15
##  2 Besalisk  male     198  102
##  3 Cerean    male     198   82
##  4 Chagrian  male     196  NaN
##  5 Clawdite  female   168   55
##  6 Droid     none     131.  69.8
##  7 Dug       male     112   40
##  8 Ewok      male      88   20
##  9 Geonosian male     183   80
## 10 Gungan    male     209.  74
## # … with 31 more rows
## # ℹ Use print(n = ...) to see more rows


It takes time to really know these tools intuitively, but this gives you a flavor of what you can do with it. There are many additional useful functions like mutate(), gather() and spread() that are worth exploring, but all of them should be used for the same purpose - to get your data cleaned up and into a format that is useful for statistics or plotting.

Exercise: Use ggplot to make a beautiful figure demonstrating the relationship between height, mass, and sex in these fictional starwars species. Remember axis labels, legends, themes, and other things that we talked about in the previous lesson!