Data cleaning – worked example

We will practice reading and cleaning data using the published race data for the California “10 20” running event (10 miles, 20 bands). These data were scraped from the official race website, then fixed up a bit and saved as a tab delimited text file. This text file can be found here.

We will use the various tidyverse packages to carry out this cleaning.

library(tidyverse)

Read file with readr

First, let’s read the data in using readr::read_tsv, creating a data-frame saved as cal1020. “tsv” generally refers to “tab separated values,” which is what we have.

cal1020 = readr::read_tsv('http://vulstats.ucsd.edu/data/cal10202.txt')

The readr functions (such as read_csv, read_tsv, etc.) are very similar to the base R functions (read.csv), but they have more sensible defaults, and also save the data as a “tibble,” which works just like a data frame, but has a few added bells and whistles.

First glimpse at the data.

What’s in this data frame? We can check with str(), or a slightly neater display using the dplyr::glimpse() function.

glimpse(cal1020)

## Rows: 3,252
## Columns: 22
## $ time.gun         <time> 00:48:00, 00:48:05, 00:49:30, 00:51:03, 00:51:24, 00:53:26, 00:53:43, 00:54:09, 00:54:49, 00:58:49, 00:55…
## $ time.chip        <time> 00:48:00, 00:48:05, 00:49:30, 00:51:02, 00:51:23, 00:53:26, 00:53:42, 00:54:01, 00:54:49, 00:55:18, 00:55…
## $ bib              <dbl> 1205, 9, 13, 15, 1303, 1213, 3, 1055, 12, 1351, 1054, 1216, 1352, 1218, 6, 1220, 1304, 18, 1353, 1305, 101…
## $ name.first       <chr> "Jordan", "Macdonard", "Sergio", "Jamesom", "Darren", "Okwaro", "Steven", "Edwin", "Lindsey", "Derek", "Ra…
## $ name.last        <chr> "Chipangama", "Ondara", "Reyes", "Mora", "Brown", "Raura", "Underwood", "Figueroa", "Scherf", "Bradley", "…
## $ City             <chr> "Flagstaff", "Grand Prairie", "Palmdale", "Arroyo Grande", "Solana Beach", "Oceanside", "Encinitas", "Comm…
## $ State            <chr> "AZ", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "NY", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "AZ", "?",…
## $ Division         <chr> "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile M …
## $ Class.Position   <dbl> 1, 2, 3, 4, 5, 1, 1, 1, 1, 1, 1, 2, 3, 2, 2, 4, 1, 3, 2, 3, 1, 4, 1, 2, 1, 2, 5, 1, 1, 1, 5, 3, 1, 3, 1, 2…
## $ Overall.Place    <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 8, 9, 2, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2…
## $ Age              <chr> "25", "29", "32", "30", "28", "39", "26", "42", "27", "33", "60", "34", "33", "39", "26", "32", "41", "24"…
## $ Zip              <chr> "86004", "75054", "93551", "93420", "92075", "92057", "92024", "90040", "12440", "92024", "91016", "92078"…
## $ Sex.place        <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 1, 8, 2, 9, 10, 11, 2, 12, 13, 3, 14, 15, 16, 4, 17, 18, 19, 20, 5, 6, 7, 8, 21, 2…
## $ Div.tot          <dbl> 3239, 3239, 3239, 3239, 3239, 3239, 3239, 13, 3239, 3239, 13, 3239, 3239, 3239, 3239, 3239, 3239, 3239, 32…
## $ Sex.tot          <dbl> 1179, 1179, 1179, 1179, 1179, 1179, 1179, 11, 2060, 1179, 11, 1179, 1179, 1179, 2060, 1179, 1179, 2060, 11…
## $ AG.tot           <dbl> 5, 5, 5, 5, 5, 118, 85, 1, 5, 139, 1, 139, 139, 118, 5, 139, 3, 5, 3, 3, 167, 5, 178, 167, 191, 191, 5, 81…
## $ `5.25.mile.rank` <chr> "3", "2", "4", "6", "5", "7", "8", "1", "9", "10", "INC", "11", "13", "15", "12", "16", "19", "14", "18", …
## $ `5.25.mile.time` <chr> "25:12:00", "25:12:00", "25:31:00", "26:47:00", "26:46:00", "27:41:00", "27:54:00", "29:10:00", "28:41:00"…
## $ `5.25.mile.pace` <chr> "4:48/M", "4:48/M", "4:52/M", "5:06/M", "5:06/M", "5:16/M", "5:19/M", "5:33/M", "5:28/M", "5:28/M", NA, "5…
## $ `10.mile.rank`   <chr> "1", "2", "3", "4", "5", "6", "7", "1", "8", "9", "13", "10", "12", "15", "16", "13", "11", "25", "18", "1…
## $ `10.mile.time`   <chr> "22:47", "22:52", "23:58", "24:14:00", "24:36:00", "25:44:00", "25:47:00", "24:50:00", "26:07:00", "26:35:…
## $ `10.mile.pace`   <chr> "2:17/M", "2:17/M", "2:24/M", "2:25/M", "2:28/M", "2:34/M", "2:35/M", "2:29/M", "2:37/M", "2:40/M", "5:32/…

From this we learn lots of things, the most immediately relevant are:

the names of the variables (columns); we could have gotten just this part by running names(cal1020))
cal1020 is a data frame with 22 variables (columns) and 3252 observations (rows); we could have gotten just this part by running dim(cal1020)
all the variables are saved as either a “chr” (a character string), an “int” (an integer), or a “time” variable. (if we had used the base R read.tsv, many of the strings would have been converted to factors, and the variables recognized as “time” would have been treated as strings as well).

Before we work on fixing this, let’s look at some of the rows in the data. Looking at 3252 observations at once is silly, so let’s just look at the first 10 using the head() command. Because there are 22 variables, the printout doesn’t fit. In R markdown we get a fancy scrollable table, but in the console you would just get a truncated list (this behavior differs slightly between data frames and “tibbles,” with tibbles yielding more legible console output).

head(cal1020, n = 10)

## # A tibble: 10 x 22
##    time.gun time.chip   bib name.first name.last  City    State Division   Class.Position Overall.Place Age   Zip   Sex.place Div.tot
##    <time>   <time>    <dbl> <chr>      <chr>      <chr>   <chr> <chr>               <dbl>         <dbl> <chr> <chr>     <dbl>   <dbl>
##  1 48'00"   48'00"     1205 Jordan     Chipangama Flagst… AZ    10 Mile O…              1             1 25    86004         1    3239
##  2 48'05"   48'05"        9 Macdonard  Ondara     Grand … TX    10 Mile O…              2             2 29    75054         2    3239
##  3 49'30"   49'30"       13 Sergio     Reyes      Palmda… CA    10 Mile O…              3             3 32    93551         3    3239
##  4 51'03"   51'02"       15 Jamesom    Mora       Arroyo… CA    10 Mile O…              4             4 30    93420         4    3239
##  5 51'24"   51'23"     1303 Darren     Brown      Solana… CA    10 Mile O…              5             5 28    92075         5    3239
##  6 53'26"   53'26"     1213 Okwaro     Raura      Oceans… CA    10 Mile M…              1             6 39    92057         6    3239
##  7 53'43"   53'42"        3 Steven     Underwood  Encini… CA    10 Mile M…              1             7 26    92024         7    3239
##  8 54'09"   54'01"     1055 Edwin      Figueroa   Commer… CA    Wheelchai…              1             1 42    90040         1      13
##  9 54'49"   54'49"       12 Lindsey    Scherf     High F… NY    10 Mile O…              1             8 27    12440         1    3239
## 10 58'49"   55'18"     1351 Derek      Bradley    Encini… CA    10 Mile M…              1             9 33    92024         8    3239
## # … with 8 more variables: Sex.tot <dbl>, AG.tot <dbl>, 5.25.mile.rank <chr>, 5.25.mile.time <chr>, 5.25.mile.pace <chr>,
## #   10.mile.rank <chr>, 10.mile.time <chr>, 10.mile.pace <chr>

What we want to do

rename some poorly named columns
convert variables saved as time objects into numbers (in seconds)
extract the runner’s sex from the “Division” string, or other clues
extract the runner’s “corral” number from the bib number
fix the age column, which, for some reason is a “character,” which means that some entries could not be parsed as numbers.
get time out of the somewhat mangled 5.25 and 10.mile columns (the pace columns are really mangled for 10 miles, so we will recalculate them ourselves).
get rid of unnecessary columns.
calculate any other convenience variables

rename columns

To rename columns, we will use the dplyr rename function.

cal1020 <- rename(cal1020, 
                  time.mile.5.25 = `5.25.mile.time`,
                  time.mile.10 = `10.mile.time`)

There’s lots more columns to rename, but let’s worry about those later, after we drop all the columns we won’t use. I wanted to rename these now, to avoid dealing working with the backticks when converting these two columns into seconds.

convert time columns into seconds.

The time.gun and time.chip columns seem to have been correctly parsed into time objects, so we can just cast them as.numeric to get the number of seconds they measure.

We will use the dplyr::mutate function to change these columns (if we used a new column name, this command would just make a new column, but since those names exist, it assigns new values to them).

cal1020 <- mutate(cal1020, 
       time.gun = as.numeric(time.gun),
       time.chip = as.numeric(time.chip))

The times recorded part of the way through the race are kind of messed up, so they could not be parsed into time objects automatically. Let’s try to figure out exactly how they are messed up.

Three asides here:

we use the dplyr %>% pipe operator to pass the output of one command as the first input into the next command. so the command above is equivalent to glimpse(select(cal1020, starts_with("time.mile"))), but is much easier to read.
we use the dplyr select() function to choose a subset of columns, by name.
we use the dplyr starts_with() command to identify column names that start with “time.mile”

So, this allows us to just look at those two columns of interest.

cal1020 %>% select(starts_with("time.mile")) %>% glimpse()

## Rows: 3,252
## Columns: 2
## $ time.mile.5.25 <chr> "25:12:00", "25:12:00", "25:31:00", "26:47:00", "26:46:00", "27:41:00", "27:54:00", "29:10:00", "28:41:00", …
## $ time.mile.10   <chr> "22:47", "22:52", "23:58", "24:14:00", "24:36:00", "25:44:00", "25:47:00", "24:50:00", "26:07:00", "26:35:00…

So, the 5.25 mile time measurement seems to have recorded minutes in the first part, then seconds, then zeros. the 10 mile time measurement seems to have done something similar, but dropped the zeros for some cases.

However, these are just the first few people we are looking at, if we look at the tail() of the data, we see that this coding seems to change for people who took more than an hour!

cal1020 %>% select(starts_with("time.mile")) %>% tail()

## # A tibble: 6 x 2
##   time.mile.5.25 time.mile.10
##   <chr>          <chr>       
## 1 1:31:59        1:36:12     
## 2 ?              3:08:42     
## 3 1:46:27        1:30:38     
## 4 1:34:21        1:55:48     
## 5 1:34:21        1:55:54     
## 6 ?              3:30:18

So, to these strings are sometimes “?” (missing data), sometimes in a “mm:ss:00” format, sometimes in a “mm:ss” format, and sometimes in a “hh:mm:ss” format. To convert these into seconds, we will need to detect whether the string is missing, which format is being used, and do the appropriate calculation. We do this in a later section, as it is a bit more complicated than everything else we are undertaking…

Extract sex from division, and other clues

We do not yet have an explicit coding of runner sex, but the division column gives us a pretty big hint. Let’s look at the unique division names:

unique(cal1020$Division)

##  [1] "10 Mile Overall"    "10 Mile M 35-39"    "10 Mile M 25-29"    "Wheelchair Top Fin" "10 Mile M 30-34"    "10 Mile Masters"   
##  [7] "10 Mile M 50-54"    "10 Mile M 45-49"    "10 Mile M 40-44"    "10 Mile F 20-24"    "10 Mile F 35-39"    "10 Mile F 30-34"   
## [13] "10 Mile F 25-29"    "10 Mile M 15-19"    "10 Mile M 55-59"    "10 Mile F 15-19"    "10 Mile F 40-44"    "10 Mile F 50-54"   
## [19] "10 Mile M 20-24"    "10 Mile F 45-49"    "10 Mile M 12-14"    "10 Mile M 60-64"    "10 Mile F 55-59"    "10 Mile M 65-69"   
## [25] "10 Mile F 60-64"    "Wheelchair M 1-99"  "10 Mile M 0- 0"     "10 Mile M 70-74"    "Wheelchair F 1-99"  "10 Mile F 65-69"   
## [31] "10 Mile F 12-14"    "10 Mile M 75-79"    "10 Mile F 70-74"    "10 Mile F 75-79"    "10 Mile M 80-99"

It seems that if the Division name includes " M " (with spaces!), we are talking about a male. If it includes " F ", we are talking about a female. We can use the stringr::str_detect command to do this. We will assign these two values independently, to make sure we didn’t overlook something, before merging them into one column.

cal1020 <- cal1020 %>%
  mutate(is.male = stringr::str_detect(cal1020$Division, " M "),
         is.female = stringr::str_detect(cal1020$Division, " F "),
         sex = case_when(
           is.male & is.female ~ "both",
           is.male ~ "male",
           is.female ~ "female", 
           TRUE ~ "neither"))

Now let’s see how many of each category we have. In theory, we should have only “male” and “female” labels (not because of some claim about transgender folks in the world at large, but because that’s how the division sex categorization works in this data set). We use the dplyr::count(X) command, which is a shortcut for group_by(X) %>% summarize(n=n()).

cal1020 %>% count(sex)

## # A tibble: 3 x 2
##   sex         n
##   <chr>   <int>
## 1 female   2054
## 2 male     1180
## 3 neither    18

Ok, so we don’t have any folks who were categorized as both male and female, but we have 18 folks who were not categorized as either based on division name. Who are they, and can we figure out their sex?

cal1020 %>% 
  filter(sex == 'neither') %>% 
  select(Overall.Place, name.first, Division, Sex.tot) %>%
  head(18)

## # A tibble: 18 x 4
##    Overall.Place name.first Division           Sex.tot
##            <dbl> <chr>      <chr>                <dbl>
##  1             1 Jordan     10 Mile Overall       1179
##  2             2 Macdonard  10 Mile Overall       1179
##  3             3 Sergio     10 Mile Overall       1179
##  4             4 Jamesom    10 Mile Overall       1179
##  5             5 Darren     10 Mile Overall       1179
##  6             1 Edwin      Wheelchair Top Fin      11
##  7             8 Lindsey    10 Mile Overall       2060
##  8             2 Ralph      Wheelchair Top Fin      11
##  9            13 Natasha    10 Mile Overall       2060
## 10            15 Jeff       10 Mile Masters       1179
## 11            16 Sarah      10 Mile Overall       2060
## 12            17 Brian      10 Mile Masters       1179
## 13            18 Robert     10 Mile Masters       1179
## 14            20 Ariana     10 Mile Overall       2060
## 15            25 Tori       10 Mile Overall       2060
## 16            40 Deeann     10 Mile Masters       2060
## 17            43 Theresa    10 Mile Masters       2060
## 18            46 Celestine  10 Mile Masters       2060

These people all have really high placements, and Division names that reflect that they were in a special Division (professionals, masters, etc). We can try to guess their sex based on their name, but we can also notice that the “Sex.tot” column seems to be correlated with sex. It should, since it should tell us exactly how many competitors there were of the same sex as a particular runner. Let’s check:

cal1020 %>% count(sex,Sex.tot)

## # A tibble: 8 x 3
##   sex     Sex.tot     n
##   <chr>     <dbl> <int>
## 1 female        2     2
## 2 female     2060  2052
## 3 male          5     5
## 4 male         11     9
## 5 male       1179  1166
## 6 neither      11     2
## 7 neither    1179     8
## 8 neither    2060     8

So, those that were labeled as female (based on division name), have either 2 or 2060 as Sex.tot, and males have either 5, 11, or 1179. This is a bit weird. Let’s try to figure out what these different Sex.tot groups mean.

cal1020 %>% 
  filter(Sex.tot %in% c(2, 5, 11)) %>% 
  select(Sex.tot, name.first, Division, sex) %>%
  arrange(Sex.tot) %>%
  head(20)

## # A tibble: 18 x 4
##    Sex.tot name.first  Division           sex    
##      <dbl> <chr>       <chr>              <chr>  
##  1       2 Lauren      Wheelchair F 1-99  female 
##  2       2 Karen       Wheelchair F 1-99  female 
##  3       5 Unknown     10 Mile M 0- 0     male   
##  4       5 Unknown     10 Mile M 0- 0     male   
##  5       5 Unknown     10 Mile M 0- 0     male   
##  6       5 Unknown     10 Mile M 0- 0     male   
##  7       5 Unknown     10 Mile M 0- 0     male   
##  8      11 Edwin       Wheelchair Top Fin neither
##  9      11 Ralph       Wheelchair Top Fin neither
## 10      11 N. Mauricio Wheelchair M 1-99  male   
## 11      11 Chuck       Wheelchair M 1-99  male   
## 12      11 Laird       Wheelchair M 1-99  male   
## 13      11 Riley       Wheelchair M 1-99  male   
## 14      11 James       Wheelchair M 1-99  male   
## 15      11 David       Wheelchair M 1-99  male   
## 16      11 Jake        Wheelchair M 1-99  male   
## 17      11 Randy       Wheelchair M 1-99  male   
## 18      11 Douglas     Wheelchair M 1-99  male

Ok, so Sex.tot=2 are the female Wheelchair racers, Sex.tot=11 are the male Wheelchair racers, Sex.tot=5 are some class of weird male runners who are missing a bunch of information, sex.tot=1179 is just male runners, and Sex.tot 2060 is female runners. Let’s assign a new sex column based on Sex.tot, and compare the two, to make sure we didn’t mess anything up.

cal1020 <- cal1020 %>%
  mutate(sex.2 = case_when(
Sex.tot %in% c(2, 2060) ~ 'female',
Sex.tot %in% c(11, 5, 1179) ~ 'male',
TRUE ~ 'neither'))

cal1020 %>% count(sex, sex.2, match=sex==sex.2)

## # A tibble: 4 x 4
##   sex     sex.2  match     n
##   <chr>   <chr>  <lgl> <int>
## 1 female  female TRUE   2054
## 2 male    male   TRUE   1180
## 3 neither female FALSE     8
## 4 neither male   FALSE    10

So: we get exactly the same categorization from Sex.tot, except we can also correctly classify the folks who could not be identified from the Division name. Great, let’s reassign the sex.2 column to sex. While we are at it, let’s make a column that tells us whether someone was a wheelchair racer.

cal1020 <- cal1020 %>%
  mutate(sex = sex.2,
         wheelchair = stringr::str_detect(Division, "Wheelchair"))

Get corral from bib number

We are told that bib numbers correspond to starting corral, such that numbers 0-999 are in the first corral, 1000-1999 are in the second, etc. Let’s make a new column indicating corral number:

cal1020 <- mutate(cal1020, corral = as.integer(floor(cal1020$bib/1000)))

fix the age column (which is a string for some reason)

Let’s see what’s happening with the age column, because it is a character, that means that at least some elements could not be interpreted as a number. If we force the age column to be a number, they will show up as NAs. So let’s see what information we would be throwing out by converting to NA:

cal1020 %>% filter(is.na(as.numeric(Age))) %>% select(Age, name.first)

## # A tibble: 5 x 2
##   Age   name.first
##   <chr> <chr>     
## 1 ?     Unknown   
## 2 ?     Unknown   
## 3 ?     Unknown   
## 4 ?     Unknown   
## 5 ?     Unknown

Ah, so the only problem are the missing values, so we can just convert age to a number, since we want the “?” cells to be NA.

cal1020 <- mutate(cal1020, Age = as.numeric(Age))

Get time out of the mangled 5.25 and 10 mile columns.

cal1020 %>% select(starts_with("time.mile")) %>% tail()

## # A tibble: 6 x 2
##   time.mile.5.25 time.mile.10
##   <chr>          <chr>       
## 1 1:31:59        1:36:12     
## 2 ?              3:08:42     
## 3 1:46:27        1:30:38     
## 4 1:34:21        1:55:48     
## 5 1:34:21        1:55:54     
## 6 ?              3:30:18

Remember, these strings are in a weird format: either “?” (missing data), “mm:ss:00” , “mm:ss,” or “hh:mm:ss.” We need to detect the format, and do the appropriate calculation. The function below does exactly that.

parseWeirdTime <- function(string){
  if(string == "?"){
    return(NA)
  } else {
    sub.strings <- as.numeric(stringr::str_split(string, ":")[[1]])
    if((length(sub.strings) == 2) || (sub.strings[3]==0 & sub.strings[1]>15)){
      # we are either in "mm:ss" or "mm:ss:00" format
      hours = 0
      minutes <- sub.strings[1]
      seconds <- sub.strings[2]
    } else if(sub.strings[1] < 5) {
      # we are in "hh:mm:ss" format
      hours <- sub.strings[1]
      minutes <- sub.strings[2]
      seconds <- sub.strings[3]
    } else {
      # if we see a number like 07:34:00, 
      # it doesn't make sense as either hh:mm:ss (too much time)
      # nor as mm:ss (too short for running ~5 miles)
      print(paste0("string format doesn't make sense: ", string))
      return(NA)
    }
    return(hours*60*60 + minutes*60 + seconds)
  }
}
parseWeirdTime("?")

## [1] NA

parseWeirdTime("30:44:00")

## [1] 1844

parseWeirdTime("30:44")

## [1] 1844

parseWeirdTime("2:30:44")

## [1] 9044

parseWeirdTime("14:30:00")

## [1] "string format doesn't make sense: 14:30:00"

## [1] NA

How does this work?
1. it takes a string, and checks if it is equal to “?” if so, it returns NA, which is a special R data type that means “not available,” and indicates that the data are missing.
2. stringr::str_split(string, ":") splits strings into vectors of strings using “:” as a separator.
3. however, that function returns a list of vectors, and we just want the first element of that list with [[1]] (since our function is written to operate on just one string at a time)
4. as.numeric converts the individual strings to numbers.
5. we check if those numbers make sense as either “mm:ss” or as “hh:mm:ss,” and assign hours/minutes/second accordingly.
6. we then calculate seconds from the hours, minutes, seconds, and return it.

Now we want to run this function on each element of the “time.mile” columns. However, this function is not “vectorized,” so we have to apply it to each element in isolation. the various “map” and “apply” functions are very well suited for this task. we use purr::map_dbl, to apply our function to a particular column, and return a list of numbers.

cal1020 <- mutate(cal1020, 
         time.mile.5.25 = map_dbl(cal1020$time.mile.5.25, parseWeirdTime),
         time.mile.10 = map_dbl(cal1020$time.mile.10, parseWeirdTime))
glimpse(cal1020)

## Rows: 3,252
## Columns: 28
## $ time.gun         <dbl> 2880, 2885, 2970, 3063, 3084, 3206, 3223, 3249, 3289, 3529, 3323, 3575, 3600, 3625, 3421, 3648, 3659, 3485…
## $ time.chip        <dbl> 2880, 2885, 2970, 3062, 3083, 3206, 3222, 3241, 3289, 3318, 3320, 3363, 3388, 3413, 3421, 3435, 3445, 3485…
## $ bib              <dbl> 1205, 9, 13, 15, 1303, 1213, 3, 1055, 12, 1351, 1054, 1216, 1352, 1218, 6, 1220, 1304, 18, 1353, 1305, 101…
## $ name.first       <chr> "Jordan", "Macdonard", "Sergio", "Jamesom", "Darren", "Okwaro", "Steven", "Edwin", "Lindsey", "Derek", "Ra…
## $ name.last        <chr> "Chipangama", "Ondara", "Reyes", "Mora", "Brown", "Raura", "Underwood", "Figueroa", "Scherf", "Bradley", "…
## $ City             <chr> "Flagstaff", "Grand Prairie", "Palmdale", "Arroyo Grande", "Solana Beach", "Oceanside", "Encinitas", "Comm…
## $ State            <chr> "AZ", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "NY", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "AZ", "?",…
## $ Division         <chr> "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile M …
## $ Class.Position   <dbl> 1, 2, 3, 4, 5, 1, 1, 1, 1, 1, 1, 2, 3, 2, 2, 4, 1, 3, 2, 3, 1, 4, 1, 2, 1, 2, 5, 1, 1, 1, 5, 3, 1, 3, 1, 2…
## $ Overall.Place    <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 8, 9, 2, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2…
## $ Age              <dbl> 25, 29, 32, 30, 28, 39, 26, 42, 27, 33, 60, 34, 33, 39, 26, 32, 41, 24, 42, 48, 51, 33, 46, 50, 44, 44, 26…
## $ Zip              <chr> "86004", "75054", "93551", "93420", "92075", "92057", "92024", "90040", "12440", "92024", "91016", "92078"…
## $ Sex.place        <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 1, 8, 2, 9, 10, 11, 2, 12, 13, 3, 14, 15, 16, 4, 17, 18, 19, 20, 5, 6, 7, 8, 21, 2…
## $ Div.tot          <dbl> 3239, 3239, 3239, 3239, 3239, 3239, 3239, 13, 3239, 3239, 13, 3239, 3239, 3239, 3239, 3239, 3239, 3239, 32…
## $ Sex.tot          <dbl> 1179, 1179, 1179, 1179, 1179, 1179, 1179, 11, 2060, 1179, 11, 1179, 1179, 1179, 2060, 1179, 1179, 2060, 11…
## $ AG.tot           <dbl> 5, 5, 5, 5, 5, 118, 85, 1, 5, 139, 1, 139, 139, 118, 5, 139, 3, 5, 3, 3, 167, 5, 178, 167, 191, 191, 5, 81…
## $ `5.25.mile.rank` <chr> "3", "2", "4", "6", "5", "7", "8", "1", "9", "10", "INC", "11", "13", "15", "12", "16", "19", "14", "18", …
## $ time.mile.5.25   <dbl> 1512, 1512, 1531, 1607, 1606, 1661, 1674, 1750, 1721, 1722, NA, 1752, 1754, 1760, 1753, 1790, 1816, 1758, …
## $ `5.25.mile.pace` <chr> "4:48/M", "4:48/M", "4:52/M", "5:06/M", "5:06/M", "5:16/M", "5:19/M", "5:33/M", "5:28/M", "5:28/M", NA, "5…
## $ `10.mile.rank`   <chr> "1", "2", "3", "4", "5", "6", "7", "1", "8", "9", "13", "10", "12", "15", "16", "13", "11", "25", "18", "1…
## $ time.mile.10     <dbl> 1367, 1372, 1438, 1454, 1476, 1544, 1547, 1490, 1567, 1595, 3320, 1610, 1633, 1652, 1667, 1644, 1628, 1726…
## $ `10.mile.pace`   <chr> "2:17/M", "2:17/M", "2:24/M", "2:25/M", "2:28/M", "2:34/M", "2:35/M", "2:29/M", "2:37/M", "2:40/M", "5:32/…
## $ is.male          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, F…
## $ is.female        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ sex              <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", …
## $ sex.2            <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", …
## $ wheelchair       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ corral           <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Let’s now check if this gave us sensible results. In theory, time for the first 5.25 miles, and the time between mile 5.25 and mile 10, should add up to the total time (time.chip; modulo some rounding error). If they do not, something is wrong.

off.by.2sec = abs((cal1020$time.mile.5.25 + cal1020$time.mile.10) - cal1020$time.chip) > 2
sum(is.na(off.by.2sec)) # number of rows with missing data which precluded calculation

## [1] 66

sum(!(off.by.2sec), na.rm=T)     # number of rows where we were not off by more than 2 seconds

## [1] 3186

sum((off.by.2sec), na.rm=T)     # number of rows where we were off by more than 2 seconds

## [1] 0

So it looks like everything worked out!

Get rid of some unnecessary columns

So, here’s where we are now:

glimpse(cal1020)

## Rows: 3,252
## Columns: 28
## $ time.gun         <dbl> 2880, 2885, 2970, 3063, 3084, 3206, 3223, 3249, 3289, 3529, 3323, 3575, 3600, 3625, 3421, 3648, 3659, 3485…
## $ time.chip        <dbl> 2880, 2885, 2970, 3062, 3083, 3206, 3222, 3241, 3289, 3318, 3320, 3363, 3388, 3413, 3421, 3435, 3445, 3485…
## $ bib              <dbl> 1205, 9, 13, 15, 1303, 1213, 3, 1055, 12, 1351, 1054, 1216, 1352, 1218, 6, 1220, 1304, 18, 1353, 1305, 101…
## $ name.first       <chr> "Jordan", "Macdonard", "Sergio", "Jamesom", "Darren", "Okwaro", "Steven", "Edwin", "Lindsey", "Derek", "Ra…
## $ name.last        <chr> "Chipangama", "Ondara", "Reyes", "Mora", "Brown", "Raura", "Underwood", "Figueroa", "Scherf", "Bradley", "…
## $ City             <chr> "Flagstaff", "Grand Prairie", "Palmdale", "Arroyo Grande", "Solana Beach", "Oceanside", "Encinitas", "Comm…
## $ State            <chr> "AZ", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "NY", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "AZ", "?",…
## $ Division         <chr> "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile M …
## $ Class.Position   <dbl> 1, 2, 3, 4, 5, 1, 1, 1, 1, 1, 1, 2, 3, 2, 2, 4, 1, 3, 2, 3, 1, 4, 1, 2, 1, 2, 5, 1, 1, 1, 5, 3, 1, 3, 1, 2…
## $ Overall.Place    <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 8, 9, 2, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2…
## $ Age              <dbl> 25, 29, 32, 30, 28, 39, 26, 42, 27, 33, 60, 34, 33, 39, 26, 32, 41, 24, 42, 48, 51, 33, 46, 50, 44, 44, 26…
## $ Zip              <chr> "86004", "75054", "93551", "93420", "92075", "92057", "92024", "90040", "12440", "92024", "91016", "92078"…
## $ Sex.place        <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 1, 8, 2, 9, 10, 11, 2, 12, 13, 3, 14, 15, 16, 4, 17, 18, 19, 20, 5, 6, 7, 8, 21, 2…
## $ Div.tot          <dbl> 3239, 3239, 3239, 3239, 3239, 3239, 3239, 13, 3239, 3239, 13, 3239, 3239, 3239, 3239, 3239, 3239, 3239, 32…
## $ Sex.tot          <dbl> 1179, 1179, 1179, 1179, 1179, 1179, 1179, 11, 2060, 1179, 11, 1179, 1179, 1179, 2060, 1179, 1179, 2060, 11…
## $ AG.tot           <dbl> 5, 5, 5, 5, 5, 118, 85, 1, 5, 139, 1, 139, 139, 118, 5, 139, 3, 5, 3, 3, 167, 5, 178, 167, 191, 191, 5, 81…
## $ `5.25.mile.rank` <chr> "3", "2", "4", "6", "5", "7", "8", "1", "9", "10", "INC", "11", "13", "15", "12", "16", "19", "14", "18", …
## $ time.mile.5.25   <dbl> 1512, 1512, 1531, 1607, 1606, 1661, 1674, 1750, 1721, 1722, NA, 1752, 1754, 1760, 1753, 1790, 1816, 1758, …
## $ `5.25.mile.pace` <chr> "4:48/M", "4:48/M", "4:52/M", "5:06/M", "5:06/M", "5:16/M", "5:19/M", "5:33/M", "5:28/M", "5:28/M", NA, "5…
## $ `10.mile.rank`   <chr> "1", "2", "3", "4", "5", "6", "7", "1", "8", "9", "13", "10", "12", "15", "16", "13", "11", "25", "18", "1…
## $ time.mile.10     <dbl> 1367, 1372, 1438, 1454, 1476, 1544, 1547, 1490, 1567, 1595, 3320, 1610, 1633, 1652, 1667, 1644, 1628, 1726…
## $ `10.mile.pace`   <chr> "2:17/M", "2:17/M", "2:24/M", "2:25/M", "2:28/M", "2:34/M", "2:35/M", "2:29/M", "2:37/M", "2:40/M", "5:32/…
## $ is.male          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, F…
## $ is.female        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ sex              <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", …
## $ sex.2            <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", …
## $ wheelchair       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ corral           <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Let’s get rid of the redundant columns, to make this more manageable. Here are the columns I don’t think we need:

drop <- c('time.gun',
          'bib',
          'Class.Position',
          'Overall.Place',
          'Sex.place',
          'Div.tot',
          'Sex.tot',
          'AG.tot',
          '5.25.mile.rank',
          '5.25.mile.pace',
          '10.mile.rank',
          '10.mile.pace',
          'is.male',
          'is.female',
          'sex.2')
cal1020 <- cal1020 %>% select(-one_of(drop))

Let’s also convert all the names to lower case.

names(cal1020) <- tolower(names(cal1020))

glimpse(cal1020)

## Rows: 3,252
## Columns: 13
## $ time.chip      <dbl> 2880, 2885, 2970, 3062, 3083, 3206, 3222, 3241, 3289, 3318, 3320, 3363, 3388, 3413, 3421, 3435, 3445, 3485, …
## $ name.first     <chr> "Jordan", "Macdonard", "Sergio", "Jamesom", "Darren", "Okwaro", "Steven", "Edwin", "Lindsey", "Derek", "Ralp…
## $ name.last      <chr> "Chipangama", "Ondara", "Reyes", "Mora", "Brown", "Raura", "Underwood", "Figueroa", "Scherf", "Bradley", "Pi…
## $ city           <chr> "Flagstaff", "Grand Prairie", "Palmdale", "Arroyo Grande", "Solana Beach", "Oceanside", "Encinitas", "Commer…
## $ state          <chr> "AZ", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "NY", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "AZ", "?", "…
## $ division       <chr> "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile M 35…
## $ age            <dbl> 25, 29, 32, 30, 28, 39, 26, 42, 27, 33, 60, 34, 33, 39, 26, 32, 41, 24, 42, 48, 51, 33, 46, 50, 44, 44, 26, …
## $ zip            <chr> "86004", "75054", "93551", "93420", "92075", "92057", "92024", "90040", "12440", "92024", "91016", "92078", …
## $ time.mile.5.25 <dbl> 1512, 1512, 1531, 1607, 1606, 1661, 1674, 1750, 1721, 1722, NA, 1752, 1754, 1760, 1753, 1790, 1816, 1758, 18…
## $ time.mile.10   <dbl> 1367, 1372, 1438, 1454, 1476, 1544, 1547, 1490, 1567, 1595, 3320, 1610, 1633, 1652, 1667, 1644, 1628, 1726, …
## $ sex            <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", "m…
## $ wheelchair     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ corral         <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

Getting new columns from simple transformations

Let’s specify that time is in seconds in the column name, and calculate the pace and speed for the total race, and each half.

min/mile from time in seconds We can get total pace over the 10 miles via division:

cal1020 <- cal1020 %>% 
  mutate(time.sec = time.chip,
                   speed.mph = 10/(time.sec/60/60),
                   pace.min = (time.sec/60)/10,
                   speed.mph.first = 5.25/(time.mile.5.25/60/60),
                   speed.mph.second= (10-5.25)/(time.mile.10/60/60),
                   pace.min.first = (time.mile.5.25/60)/5.25,
                   pace.min.second = (time.mile.10/60)/(10-5.25)) %>%
  select(-time.chip)
                   
glimpse(cal1020)

## Rows: 3,252
## Columns: 19
## $ name.first       <chr> "Jordan", "Macdonard", "Sergio", "Jamesom", "Darren", "Okwaro", "Steven", "Edwin", "Lindsey", "Derek", "Ra…
## $ name.last        <chr> "Chipangama", "Ondara", "Reyes", "Mora", "Brown", "Raura", "Underwood", "Figueroa", "Scherf", "Bradley", "…
## $ city             <chr> "Flagstaff", "Grand Prairie", "Palmdale", "Arroyo Grande", "Solana Beach", "Oceanside", "Encinitas", "Comm…
## $ state            <chr> "AZ", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "NY", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "AZ", "?",…
## $ division         <chr> "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile M …
## $ age              <dbl> 25, 29, 32, 30, 28, 39, 26, 42, 27, 33, 60, 34, 33, 39, 26, 32, 41, 24, 42, 48, 51, 33, 46, 50, 44, 44, 26…
## $ zip              <chr> "86004", "75054", "93551", "93420", "92075", "92057", "92024", "90040", "12440", "92024", "91016", "92078"…
## $ time.mile.5.25   <dbl> 1512, 1512, 1531, 1607, 1606, 1661, 1674, 1750, 1721, 1722, NA, 1752, 1754, 1760, 1753, 1790, 1816, 1758, …
## $ time.mile.10     <dbl> 1367, 1372, 1438, 1454, 1476, 1544, 1547, 1490, 1567, 1595, 3320, 1610, 1633, 1652, 1667, 1644, 1628, 1726…
## $ sex              <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", …
## $ wheelchair       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ corral           <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ time.sec         <dbl> 2880, 2885, 2970, 3062, 3083, 3206, 3222, 3241, 3289, 3318, 3320, 3363, 3388, 3413, 3421, 3435, 3445, 3485…
## $ speed.mph        <dbl> 12.500000, 12.478336, 12.121212, 11.757022, 11.676938, 11.228946, 11.173184, 11.107683, 10.945576, 10.8499…
## $ pace.min         <dbl> 4.800000, 4.808333, 4.950000, 5.103333, 5.138333, 5.343333, 5.370000, 5.401667, 5.481667, 5.530000, 5.5333…
## $ speed.mph.first  <dbl> 12.500000, 12.500000, 12.344873, 11.761045, 11.768369, 11.378688, 11.290323, 10.800000, 10.981987, 10.9756…
## $ speed.mph.second <dbl> 12.509144, 12.463557, 11.891516, 11.760660, 11.585366, 11.075130, 11.053652, 11.476510, 10.912572, 10.7210…
## $ pace.min.first   <dbl> 4.800000, 4.800000, 4.860317, 5.101587, 5.098413, 5.273016, 5.314286, 5.555556, 5.463492, 5.466667, NA, 5.…
## $ pace.min.second  <dbl> 4.796491, 4.814035, 5.045614, 5.101754, 5.178947, 5.417544, 5.428070, 5.228070, 5.498246, 5.596491, 11.649…

Fixing states.

There are a few puzzling things remaining; the states look weird.

sort(unique(cal1020$state))

##  [1] "?"          "AB"         "AL"         "AP"         "ARIZONA"    "AZ"         "B.C.N"      "BC"         "CA"        
## [10] "CALIFORNIA" "CO"         "CT"         "DC"         "EUR"        "FL"         "FLORIDA"    "GA"         "HI"        
## [19] "ID"         "IL"         "KY"         "MA"         "MD"         "ME"         "MEXICO"     "MI"         "MN"        
## [28] "MO"         "MT"         "ND"         "NE"         "NEBRASKA"   "NEVADA"     "NEW JERSEY" "NJ"         "NL"        
## [37] "NV"         "NY"         "OH"         "OK"         "ON"         "OR"         "SD"         "TX"         "UT"        
## [46] "UTAH"       "VA"         "VIRGINIA"   "WA"         "WASHINGTON" "WI"         "WV"

We can clearly have some redundant coding, and a few mysterious things like “AP” and “B.C.N.” Let’s make a dictionary to remap these values. BCN is likely Baja – let’s call it MEXICO. BC is likely british columbia – call it CANADA. AB is probably alberta, ON is ontario, NL is newfoundland/labrador (zip code makes sense), “AP” is us military in the pacific… let’s call that NA.

state.rename = c("ARIZONA" = "AZ",
                 "CALIFORNIA" = "CA",
                 "FLORIDA" = "FL",
                 "NEBRASKA" = "NE",
                 "NEVADA" = "NV",
                 "NEW JERSEY" = "NJ",
                 "VIRGINIA" = "VA",
                 "WASHINGTON" = "WA",  # could be DC, but zipcode indicates WA state.
                 "B.C.N" = "MEXICO",
                 "BC" = "CANADA",
                 "AB" = "CANADA",
                 "NL" = "CANADA",
                 "ON" = "CANADA",
                 "AP" = NA,
                 "UTAH" = "UT",
                 "EUR" = "EUROPE",
                 "?" = NA)

cal1020 <- cal1020 %>% 
  mutate(state = ifelse(state %in% names(state.rename),
                        state.rename[state],
                        state))

sort(unique(cal1020$state))

##  [1] "AL"     "AZ"     "CA"     "CANADA" "CO"     "CT"     "DC"     "EUROPE" "FL"     "GA"     "HI"     "ID"     "IL"     "KY"    
## [15] "MA"     "MD"     "ME"     "MEXICO" "MI"     "MN"     "MO"     "MT"     "ND"     "NE"     "NJ"     "NV"     "NY"     "OH"    
## [29] "OK"     "OR"     "SD"     "TX"     "UT"     "VA"     "WA"     "WI"     "WV"

Well, we fixed most of them. Perhaps a more thorough check of zipcode-state consistency would make sense if we really cared about state of origin.

Save the cleaned data.

So here’s what we have at the end:

glimpse(cal1020)

## Rows: 3,252
## Columns: 19
## $ name.first       <chr> "Jordan", "Macdonard", "Sergio", "Jamesom", "Darren", "Okwaro", "Steven", "Edwin", "Lindsey", "Derek", "Ra…
## $ name.last        <chr> "Chipangama", "Ondara", "Reyes", "Mora", "Brown", "Raura", "Underwood", "Figueroa", "Scherf", "Bradley", "…
## $ city             <chr> "Flagstaff", "Grand Prairie", "Palmdale", "Arroyo Grande", "Solana Beach", "Oceanside", "Encinitas", "Comm…
## $ state            <chr> "AZ", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "NY", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "AZ", NA, …
## $ division         <chr> "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile M …
## $ age              <dbl> 25, 29, 32, 30, 28, 39, 26, 42, 27, 33, 60, 34, 33, 39, 26, 32, 41, 24, 42, 48, 51, 33, 46, 50, 44, 44, 26…
## $ zip              <chr> "86004", "75054", "93551", "93420", "92075", "92057", "92024", "90040", "12440", "92024", "91016", "92078"…
## $ time.mile.5.25   <dbl> 1512, 1512, 1531, 1607, 1606, 1661, 1674, 1750, 1721, 1722, NA, 1752, 1754, 1760, 1753, 1790, 1816, 1758, …
## $ time.mile.10     <dbl> 1367, 1372, 1438, 1454, 1476, 1544, 1547, 1490, 1567, 1595, 3320, 1610, 1633, 1652, 1667, 1644, 1628, 1726…
## $ sex              <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", …
## $ wheelchair       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ corral           <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ time.sec         <dbl> 2880, 2885, 2970, 3062, 3083, 3206, 3222, 3241, 3289, 3318, 3320, 3363, 3388, 3413, 3421, 3435, 3445, 3485…
## $ speed.mph        <dbl> 12.500000, 12.478336, 12.121212, 11.757022, 11.676938, 11.228946, 11.173184, 11.107683, 10.945576, 10.8499…
## $ pace.min         <dbl> 4.800000, 4.808333, 4.950000, 5.103333, 5.138333, 5.343333, 5.370000, 5.401667, 5.481667, 5.530000, 5.5333…
## $ speed.mph.first  <dbl> 12.500000, 12.500000, 12.344873, 11.761045, 11.768369, 11.378688, 11.290323, 10.800000, 10.981987, 10.9756…
## $ speed.mph.second <dbl> 12.509144, 12.463557, 11.891516, 11.760660, 11.585366, 11.075130, 11.053652, 11.476510, 10.912572, 10.7210…
## $ pace.min.first   <dbl> 4.800000, 4.800000, 4.860317, 5.101587, 5.098413, 5.273016, 5.314286, 5.555556, 5.463492, 5.466667, NA, 5.…
## $ pace.min.second  <dbl> 4.796491, 4.814035, 5.045614, 5.101754, 5.178947, 5.417544, 5.428070, 5.228070, 5.498246, 5.596491, 11.649…

Note: we save this both as a csv file (to be easy to read by all other software), and as an ‘RData’ file, which is only useful in R, but preserves all the “data types” we defined.

write.csv(cal1020, file="cal1020.cleaned.csv")
save(list = "cal1020", file="cal1020.cleaned.Rdata")