Data cleaning – worked example
We will practice reading and cleaning data using the published race data for the California “10 20” running event (10 miles, 20 bands). These data were scraped from the official race website, then fixed up a bit and saved as a tab delimited text file. This text file can be found here.
We will use the various tidyverse packages to carry out this cleaning.
library(tidyverse)
Read file with readr
First, let’s read the data in using readr::read_tsv
, creating a data-frame saved as cal1020
. “tsv” generally refers to “tab separated values,” which is what we have.
= readr::read_tsv('http://vulstats.ucsd.edu/data/cal10202.txt') cal1020
The readr
functions (such as read_csv, read_tsv, etc.) are very similar to the base R functions (read.csv), but they have more sensible defaults, and also save the data as a “tibble,” which works just like a data frame, but has a few added bells and whistles.
First glimpse at the data.
What’s in this data frame? We can check with str()
, or a slightly neater display using the dplyr::glimpse()
function.
glimpse(cal1020)
## Rows: 3,252
## Columns: 22
## $ time.gun <time> 00:48:00, 00:48:05, 00:49:30, 00:51:03, 00:51:24, 00:53:26, 00:53:43, 00:54:09, 00:54:49, 00:58:49, 00:55…
## $ time.chip <time> 00:48:00, 00:48:05, 00:49:30, 00:51:02, 00:51:23, 00:53:26, 00:53:42, 00:54:01, 00:54:49, 00:55:18, 00:55…
## $ bib <dbl> 1205, 9, 13, 15, 1303, 1213, 3, 1055, 12, 1351, 1054, 1216, 1352, 1218, 6, 1220, 1304, 18, 1353, 1305, 101…
## $ name.first <chr> "Jordan", "Macdonard", "Sergio", "Jamesom", "Darren", "Okwaro", "Steven", "Edwin", "Lindsey", "Derek", "Ra…
## $ name.last <chr> "Chipangama", "Ondara", "Reyes", "Mora", "Brown", "Raura", "Underwood", "Figueroa", "Scherf", "Bradley", "…
## $ City <chr> "Flagstaff", "Grand Prairie", "Palmdale", "Arroyo Grande", "Solana Beach", "Oceanside", "Encinitas", "Comm…
## $ State <chr> "AZ", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "NY", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "AZ", "?",…
## $ Division <chr> "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile M …
## $ Class.Position <dbl> 1, 2, 3, 4, 5, 1, 1, 1, 1, 1, 1, 2, 3, 2, 2, 4, 1, 3, 2, 3, 1, 4, 1, 2, 1, 2, 5, 1, 1, 1, 5, 3, 1, 3, 1, 2…
## $ Overall.Place <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 8, 9, 2, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2…
## $ Age <chr> "25", "29", "32", "30", "28", "39", "26", "42", "27", "33", "60", "34", "33", "39", "26", "32", "41", "24"…
## $ Zip <chr> "86004", "75054", "93551", "93420", "92075", "92057", "92024", "90040", "12440", "92024", "91016", "92078"…
## $ Sex.place <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 1, 8, 2, 9, 10, 11, 2, 12, 13, 3, 14, 15, 16, 4, 17, 18, 19, 20, 5, 6, 7, 8, 21, 2…
## $ Div.tot <dbl> 3239, 3239, 3239, 3239, 3239, 3239, 3239, 13, 3239, 3239, 13, 3239, 3239, 3239, 3239, 3239, 3239, 3239, 32…
## $ Sex.tot <dbl> 1179, 1179, 1179, 1179, 1179, 1179, 1179, 11, 2060, 1179, 11, 1179, 1179, 1179, 2060, 1179, 1179, 2060, 11…
## $ AG.tot <dbl> 5, 5, 5, 5, 5, 118, 85, 1, 5, 139, 1, 139, 139, 118, 5, 139, 3, 5, 3, 3, 167, 5, 178, 167, 191, 191, 5, 81…
## $ `5.25.mile.rank` <chr> "3", "2", "4", "6", "5", "7", "8", "1", "9", "10", "INC", "11", "13", "15", "12", "16", "19", "14", "18", …
## $ `5.25.mile.time` <chr> "25:12:00", "25:12:00", "25:31:00", "26:47:00", "26:46:00", "27:41:00", "27:54:00", "29:10:00", "28:41:00"…
## $ `5.25.mile.pace` <chr> "4:48/M", "4:48/M", "4:52/M", "5:06/M", "5:06/M", "5:16/M", "5:19/M", "5:33/M", "5:28/M", "5:28/M", NA, "5…
## $ `10.mile.rank` <chr> "1", "2", "3", "4", "5", "6", "7", "1", "8", "9", "13", "10", "12", "15", "16", "13", "11", "25", "18", "1…
## $ `10.mile.time` <chr> "22:47", "22:52", "23:58", "24:14:00", "24:36:00", "25:44:00", "25:47:00", "24:50:00", "26:07:00", "26:35:…
## $ `10.mile.pace` <chr> "2:17/M", "2:17/M", "2:24/M", "2:25/M", "2:28/M", "2:34/M", "2:35/M", "2:29/M", "2:37/M", "2:40/M", "5:32/…
From this we learn lots of things, the most immediately relevant are:
the names of the variables (columns); we could have gotten just this part by running
names(cal1020)
)cal1020 is a data frame with 22 variables (columns) and 3252 observations (rows); we could have gotten just this part by running
dim(cal1020)
all the variables are saved as either a “chr” (a character string), an “int” (an integer), or a “time” variable. (if we had used the base R read.tsv, many of the strings would have been converted to factors, and the variables recognized as “time” would have been treated as strings as well).
Before we work on fixing this, let’s look at some of the rows in the data. Looking at 3252 observations at once is silly, so let’s just look at the first 10 using the head()
command. Because there are 22 variables, the printout doesn’t fit. In R markdown we get a fancy scrollable table, but in the console you would just get a truncated list (this behavior differs slightly between data frames and “tibbles,” with tibbles yielding more legible console output).
head(cal1020, n = 10)
## # A tibble: 10 x 22
## time.gun time.chip bib name.first name.last City State Division Class.Position Overall.Place Age Zip Sex.place Div.tot
## <time> <time> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 48'00" 48'00" 1205 Jordan Chipangama Flagst… AZ 10 Mile O… 1 1 25 86004 1 3239
## 2 48'05" 48'05" 9 Macdonard Ondara Grand … TX 10 Mile O… 2 2 29 75054 2 3239
## 3 49'30" 49'30" 13 Sergio Reyes Palmda… CA 10 Mile O… 3 3 32 93551 3 3239
## 4 51'03" 51'02" 15 Jamesom Mora Arroyo… CA 10 Mile O… 4 4 30 93420 4 3239
## 5 51'24" 51'23" 1303 Darren Brown Solana… CA 10 Mile O… 5 5 28 92075 5 3239
## 6 53'26" 53'26" 1213 Okwaro Raura Oceans… CA 10 Mile M… 1 6 39 92057 6 3239
## 7 53'43" 53'42" 3 Steven Underwood Encini… CA 10 Mile M… 1 7 26 92024 7 3239
## 8 54'09" 54'01" 1055 Edwin Figueroa Commer… CA Wheelchai… 1 1 42 90040 1 13
## 9 54'49" 54'49" 12 Lindsey Scherf High F… NY 10 Mile O… 1 8 27 12440 1 3239
## 10 58'49" 55'18" 1351 Derek Bradley Encini… CA 10 Mile M… 1 9 33 92024 8 3239
## # … with 8 more variables: Sex.tot <dbl>, AG.tot <dbl>, 5.25.mile.rank <chr>, 5.25.mile.time <chr>, 5.25.mile.pace <chr>,
## # 10.mile.rank <chr>, 10.mile.time <chr>, 10.mile.pace <chr>
What we want to do
rename some poorly named columns
convert variables saved as time objects into numbers (in seconds)
extract the runner’s sex from the “Division” string, or other clues
extract the runner’s “corral” number from the bib number
fix the age column, which, for some reason is a “character,” which means that some entries could not be parsed as numbers.
get time out of the somewhat mangled 5.25 and 10.mile columns (the pace columns are really mangled for 10 miles, so we will recalculate them ourselves).
get rid of unnecessary columns.
calculate any other convenience variables
rename columns
To rename columns, we will use the dplyr rename
function.
<- rename(cal1020,
cal1020 time.mile.5.25 = `5.25.mile.time`,
time.mile.10 = `10.mile.time`)
There’s lots more columns to rename, but let’s worry about those later, after we drop all the columns we won’t use. I wanted to rename these now, to avoid dealing working with the backticks when converting these two columns into seconds.
convert time columns into seconds.
The time.gun and time.chip columns seem to have been correctly parsed into time objects, so we can just cast them as.numeric to get the number of seconds they measure.
We will use the dplyr::mutate
function to change these columns (if we used a new column name, this command would just make a new column, but since those names exist, it assigns new values to them).
<- mutate(cal1020,
cal1020 time.gun = as.numeric(time.gun),
time.chip = as.numeric(time.chip))
The times recorded part of the way through the race are kind of messed up, so they could not be parsed into time objects automatically. Let’s try to figure out exactly how they are messed up.
Three asides here:
we use the dplyr
%>%
pipe operator to pass the output of one command as the first input into the next command. so the command above is equivalent toglimpse(select(cal1020, starts_with("time.mile")))
, but is much easier to read.we use the dplyr
select()
function to choose a subset of columns, by name.we use the dplyr
starts_with()
command to identify column names that start with “time.mile”
So, this allows us to just look at those two columns of interest.
%>% select(starts_with("time.mile")) %>% glimpse() cal1020
## Rows: 3,252
## Columns: 2
## $ time.mile.5.25 <chr> "25:12:00", "25:12:00", "25:31:00", "26:47:00", "26:46:00", "27:41:00", "27:54:00", "29:10:00", "28:41:00", …
## $ time.mile.10 <chr> "22:47", "22:52", "23:58", "24:14:00", "24:36:00", "25:44:00", "25:47:00", "24:50:00", "26:07:00", "26:35:00…
So, the 5.25 mile time measurement seems to have recorded minutes in the first part, then seconds, then zeros. the 10 mile time measurement seems to have done something similar, but dropped the zeros for some cases.
However, these are just the first few people we are looking at, if we look at the tail()
of the data, we see that this coding seems to change for people who took more than an hour!
%>% select(starts_with("time.mile")) %>% tail() cal1020
## # A tibble: 6 x 2
## time.mile.5.25 time.mile.10
## <chr> <chr>
## 1 1:31:59 1:36:12
## 2 ? 3:08:42
## 3 1:46:27 1:30:38
## 4 1:34:21 1:55:48
## 5 1:34:21 1:55:54
## 6 ? 3:30:18
So, to these strings are sometimes “?” (missing data), sometimes in a “mm:ss:00” format, sometimes in a “mm:ss” format, and sometimes in a “hh:mm:ss” format. To convert these into seconds, we will need to detect whether the string is missing, which format is being used, and do the appropriate calculation. We do this in a later section, as it is a bit more complicated than everything else we are undertaking…
Extract sex from division, and other clues
We do not yet have an explicit coding of runner sex, but the division column gives us a pretty big hint. Let’s look at the unique division names:
unique(cal1020$Division)
## [1] "10 Mile Overall" "10 Mile M 35-39" "10 Mile M 25-29" "Wheelchair Top Fin" "10 Mile M 30-34" "10 Mile Masters"
## [7] "10 Mile M 50-54" "10 Mile M 45-49" "10 Mile M 40-44" "10 Mile F 20-24" "10 Mile F 35-39" "10 Mile F 30-34"
## [13] "10 Mile F 25-29" "10 Mile M 15-19" "10 Mile M 55-59" "10 Mile F 15-19" "10 Mile F 40-44" "10 Mile F 50-54"
## [19] "10 Mile M 20-24" "10 Mile F 45-49" "10 Mile M 12-14" "10 Mile M 60-64" "10 Mile F 55-59" "10 Mile M 65-69"
## [25] "10 Mile F 60-64" "Wheelchair M 1-99" "10 Mile M 0- 0" "10 Mile M 70-74" "Wheelchair F 1-99" "10 Mile F 65-69"
## [31] "10 Mile F 12-14" "10 Mile M 75-79" "10 Mile F 70-74" "10 Mile F 75-79" "10 Mile M 80-99"
It seems that if the Division name includes " M " (with spaces!), we are talking about a male. If it includes " F ", we are talking about a female. We can use the stringr::str_detect
command to do this. We will assign these two values independently, to make sure we didn’t overlook something, before merging them into one column.
<- cal1020 %>%
cal1020 mutate(is.male = stringr::str_detect(cal1020$Division, " M "),
is.female = stringr::str_detect(cal1020$Division, " F "),
sex = case_when(
& is.female ~ "both",
is.male ~ "male",
is.male ~ "female",
is.female TRUE ~ "neither"))
Now let’s see how many of each category we have. In theory, we should have only “male” and “female” labels (not because of some claim about transgender folks in the world at large, but because that’s how the division sex categorization works in this data set). We use the dplyr::count(X)
command, which is a shortcut for group_by(X) %>% summarize(n=n())
.
%>% count(sex) cal1020
## # A tibble: 3 x 2
## sex n
## <chr> <int>
## 1 female 2054
## 2 male 1180
## 3 neither 18
Ok, so we don’t have any folks who were categorized as both male and female, but we have 18 folks who were not categorized as either based on division name. Who are they, and can we figure out their sex?
%>%
cal1020 filter(sex == 'neither') %>%
select(Overall.Place, name.first, Division, Sex.tot) %>%
head(18)
## # A tibble: 18 x 4
## Overall.Place name.first Division Sex.tot
## <dbl> <chr> <chr> <dbl>
## 1 1 Jordan 10 Mile Overall 1179
## 2 2 Macdonard 10 Mile Overall 1179
## 3 3 Sergio 10 Mile Overall 1179
## 4 4 Jamesom 10 Mile Overall 1179
## 5 5 Darren 10 Mile Overall 1179
## 6 1 Edwin Wheelchair Top Fin 11
## 7 8 Lindsey 10 Mile Overall 2060
## 8 2 Ralph Wheelchair Top Fin 11
## 9 13 Natasha 10 Mile Overall 2060
## 10 15 Jeff 10 Mile Masters 1179
## 11 16 Sarah 10 Mile Overall 2060
## 12 17 Brian 10 Mile Masters 1179
## 13 18 Robert 10 Mile Masters 1179
## 14 20 Ariana 10 Mile Overall 2060
## 15 25 Tori 10 Mile Overall 2060
## 16 40 Deeann 10 Mile Masters 2060
## 17 43 Theresa 10 Mile Masters 2060
## 18 46 Celestine 10 Mile Masters 2060
These people all have really high placements, and Division names that reflect that they were in a special Division (professionals, masters, etc). We can try to guess their sex based on their name, but we can also notice that the “Sex.tot” column seems to be correlated with sex. It should, since it should tell us exactly how many competitors there were of the same sex as a particular runner. Let’s check:
%>% count(sex,Sex.tot) cal1020
## # A tibble: 8 x 3
## sex Sex.tot n
## <chr> <dbl> <int>
## 1 female 2 2
## 2 female 2060 2052
## 3 male 5 5
## 4 male 11 9
## 5 male 1179 1166
## 6 neither 11 2
## 7 neither 1179 8
## 8 neither 2060 8
So, those that were labeled as female (based on division name), have either 2 or 2060 as Sex.tot, and males have either 5, 11, or 1179. This is a bit weird. Let’s try to figure out what these different Sex.tot groups mean.
%>%
cal1020 filter(Sex.tot %in% c(2, 5, 11)) %>%
select(Sex.tot, name.first, Division, sex) %>%
arrange(Sex.tot) %>%
head(20)
## # A tibble: 18 x 4
## Sex.tot name.first Division sex
## <dbl> <chr> <chr> <chr>
## 1 2 Lauren Wheelchair F 1-99 female
## 2 2 Karen Wheelchair F 1-99 female
## 3 5 Unknown 10 Mile M 0- 0 male
## 4 5 Unknown 10 Mile M 0- 0 male
## 5 5 Unknown 10 Mile M 0- 0 male
## 6 5 Unknown 10 Mile M 0- 0 male
## 7 5 Unknown 10 Mile M 0- 0 male
## 8 11 Edwin Wheelchair Top Fin neither
## 9 11 Ralph Wheelchair Top Fin neither
## 10 11 N. Mauricio Wheelchair M 1-99 male
## 11 11 Chuck Wheelchair M 1-99 male
## 12 11 Laird Wheelchair M 1-99 male
## 13 11 Riley Wheelchair M 1-99 male
## 14 11 James Wheelchair M 1-99 male
## 15 11 David Wheelchair M 1-99 male
## 16 11 Jake Wheelchair M 1-99 male
## 17 11 Randy Wheelchair M 1-99 male
## 18 11 Douglas Wheelchair M 1-99 male
Ok, so Sex.tot=2 are the female Wheelchair racers, Sex.tot=11 are the male Wheelchair racers, Sex.tot=5 are some class of weird male runners who are missing a bunch of information, sex.tot=1179 is just male runners, and Sex.tot 2060 is female runners. Let’s assign a new sex column based on Sex.tot, and compare the two, to make sure we didn’t mess anything up.
<- cal1020 %>%
cal1020 mutate(sex.2 = case_when(
%in% c(2, 2060) ~ 'female',
Sex.tot %in% c(11, 5, 1179) ~ 'male',
Sex.tot TRUE ~ 'neither'))
%>% count(sex, sex.2, match=sex==sex.2) cal1020
## # A tibble: 4 x 4
## sex sex.2 match n
## <chr> <chr> <lgl> <int>
## 1 female female TRUE 2054
## 2 male male TRUE 1180
## 3 neither female FALSE 8
## 4 neither male FALSE 10
So: we get exactly the same categorization from Sex.tot, except we can also correctly classify the folks who could not be identified from the Division name. Great, let’s reassign the sex.2 column to sex. While we are at it, let’s make a column that tells us whether someone was a wheelchair racer.
<- cal1020 %>%
cal1020 mutate(sex = sex.2,
wheelchair = stringr::str_detect(Division, "Wheelchair"))
Get corral from bib number
We are told that bib numbers correspond to starting corral, such that numbers 0-999 are in the first corral, 1000-1999 are in the second, etc. Let’s make a new column indicating corral number:
<- mutate(cal1020, corral = as.integer(floor(cal1020$bib/1000))) cal1020
fix the age column (which is a string for some reason)
Let’s see what’s happening with the age column, because it is a character, that means that at least some elements could not be interpreted as a number. If we force the age column to be a number, they will show up as NAs. So let’s see what information we would be throwing out by converting to NA:
%>% filter(is.na(as.numeric(Age))) %>% select(Age, name.first) cal1020
## # A tibble: 5 x 2
## Age name.first
## <chr> <chr>
## 1 ? Unknown
## 2 ? Unknown
## 3 ? Unknown
## 4 ? Unknown
## 5 ? Unknown
Ah, so the only problem are the missing values, so we can just convert age to a number, since we want the “?” cells to be NA.
<- mutate(cal1020, Age = as.numeric(Age)) cal1020
Get time out of the mangled 5.25 and 10 mile columns.
%>% select(starts_with("time.mile")) %>% tail() cal1020
## # A tibble: 6 x 2
## time.mile.5.25 time.mile.10
## <chr> <chr>
## 1 1:31:59 1:36:12
## 2 ? 3:08:42
## 3 1:46:27 1:30:38
## 4 1:34:21 1:55:48
## 5 1:34:21 1:55:54
## 6 ? 3:30:18
Remember, these strings are in a weird format: either “?” (missing data), “mm:ss:00” , “mm:ss,” or “hh:mm:ss.” We need to detect the format, and do the appropriate calculation. The function below does exactly that.
<- function(string){
parseWeirdTime if(string == "?"){
return(NA)
else {
} <- as.numeric(stringr::str_split(string, ":")[[1]])
sub.strings if((length(sub.strings) == 2) || (sub.strings[3]==0 & sub.strings[1]>15)){
# we are either in "mm:ss" or "mm:ss:00" format
= 0
hours <- sub.strings[1]
minutes <- sub.strings[2]
seconds else if(sub.strings[1] < 5) {
} # we are in "hh:mm:ss" format
<- sub.strings[1]
hours <- sub.strings[2]
minutes <- sub.strings[3]
seconds else {
} # if we see a number like 07:34:00,
# it doesn't make sense as either hh:mm:ss (too much time)
# nor as mm:ss (too short for running ~5 miles)
print(paste0("string format doesn't make sense: ", string))
return(NA)
}return(hours*60*60 + minutes*60 + seconds)
}
}parseWeirdTime("?")
## [1] NA
parseWeirdTime("30:44:00")
## [1] 1844
parseWeirdTime("30:44")
## [1] 1844
parseWeirdTime("2:30:44")
## [1] 9044
parseWeirdTime("14:30:00")
## [1] "string format doesn't make sense: 14:30:00"
## [1] NA
How does this work?
1. it takes a string, and checks if it is equal to “?” if so, it returns NA
, which is a special R data type that means “not available,” and indicates that the data are missing.
2. stringr::str_split(string, ":")
splits strings into vectors of strings using “:” as a separator.
3. however, that function returns a list of vectors, and we just want the first element of that list with [[1]]
(since our function is written to operate on just one string at a time)
4. as.numeric
converts the individual strings to numbers.
5. we check if those numbers make sense as either “mm:ss” or as “hh:mm:ss,” and assign hours/minutes/second accordingly.
6. we then calculate seconds from the hours, minutes, seconds, and return it.
Now we want to run this function on each element of the “time.mile” columns. However, this function is not “vectorized,” so we have to apply it to each element in isolation. the various “map” and “apply” functions are very well suited for this task. we use purr::map_dbl, to apply our function to a particular column, and return a list of numbers.
<- mutate(cal1020,
cal1020 time.mile.5.25 = map_dbl(cal1020$time.mile.5.25, parseWeirdTime),
time.mile.10 = map_dbl(cal1020$time.mile.10, parseWeirdTime))
glimpse(cal1020)
## Rows: 3,252
## Columns: 28
## $ time.gun <dbl> 2880, 2885, 2970, 3063, 3084, 3206, 3223, 3249, 3289, 3529, 3323, 3575, 3600, 3625, 3421, 3648, 3659, 3485…
## $ time.chip <dbl> 2880, 2885, 2970, 3062, 3083, 3206, 3222, 3241, 3289, 3318, 3320, 3363, 3388, 3413, 3421, 3435, 3445, 3485…
## $ bib <dbl> 1205, 9, 13, 15, 1303, 1213, 3, 1055, 12, 1351, 1054, 1216, 1352, 1218, 6, 1220, 1304, 18, 1353, 1305, 101…
## $ name.first <chr> "Jordan", "Macdonard", "Sergio", "Jamesom", "Darren", "Okwaro", "Steven", "Edwin", "Lindsey", "Derek", "Ra…
## $ name.last <chr> "Chipangama", "Ondara", "Reyes", "Mora", "Brown", "Raura", "Underwood", "Figueroa", "Scherf", "Bradley", "…
## $ City <chr> "Flagstaff", "Grand Prairie", "Palmdale", "Arroyo Grande", "Solana Beach", "Oceanside", "Encinitas", "Comm…
## $ State <chr> "AZ", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "NY", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "AZ", "?",…
## $ Division <chr> "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile M …
## $ Class.Position <dbl> 1, 2, 3, 4, 5, 1, 1, 1, 1, 1, 1, 2, 3, 2, 2, 4, 1, 3, 2, 3, 1, 4, 1, 2, 1, 2, 5, 1, 1, 1, 5, 3, 1, 3, 1, 2…
## $ Overall.Place <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 8, 9, 2, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2…
## $ Age <dbl> 25, 29, 32, 30, 28, 39, 26, 42, 27, 33, 60, 34, 33, 39, 26, 32, 41, 24, 42, 48, 51, 33, 46, 50, 44, 44, 26…
## $ Zip <chr> "86004", "75054", "93551", "93420", "92075", "92057", "92024", "90040", "12440", "92024", "91016", "92078"…
## $ Sex.place <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 1, 8, 2, 9, 10, 11, 2, 12, 13, 3, 14, 15, 16, 4, 17, 18, 19, 20, 5, 6, 7, 8, 21, 2…
## $ Div.tot <dbl> 3239, 3239, 3239, 3239, 3239, 3239, 3239, 13, 3239, 3239, 13, 3239, 3239, 3239, 3239, 3239, 3239, 3239, 32…
## $ Sex.tot <dbl> 1179, 1179, 1179, 1179, 1179, 1179, 1179, 11, 2060, 1179, 11, 1179, 1179, 1179, 2060, 1179, 1179, 2060, 11…
## $ AG.tot <dbl> 5, 5, 5, 5, 5, 118, 85, 1, 5, 139, 1, 139, 139, 118, 5, 139, 3, 5, 3, 3, 167, 5, 178, 167, 191, 191, 5, 81…
## $ `5.25.mile.rank` <chr> "3", "2", "4", "6", "5", "7", "8", "1", "9", "10", "INC", "11", "13", "15", "12", "16", "19", "14", "18", …
## $ time.mile.5.25 <dbl> 1512, 1512, 1531, 1607, 1606, 1661, 1674, 1750, 1721, 1722, NA, 1752, 1754, 1760, 1753, 1790, 1816, 1758, …
## $ `5.25.mile.pace` <chr> "4:48/M", "4:48/M", "4:52/M", "5:06/M", "5:06/M", "5:16/M", "5:19/M", "5:33/M", "5:28/M", "5:28/M", NA, "5…
## $ `10.mile.rank` <chr> "1", "2", "3", "4", "5", "6", "7", "1", "8", "9", "13", "10", "12", "15", "16", "13", "11", "25", "18", "1…
## $ time.mile.10 <dbl> 1367, 1372, 1438, 1454, 1476, 1544, 1547, 1490, 1567, 1595, 3320, 1610, 1633, 1652, 1667, 1644, 1628, 1726…
## $ `10.mile.pace` <chr> "2:17/M", "2:17/M", "2:24/M", "2:25/M", "2:28/M", "2:34/M", "2:35/M", "2:29/M", "2:37/M", "2:40/M", "5:32/…
## $ is.male <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, F…
## $ is.female <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ sex <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", …
## $ sex.2 <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", …
## $ wheelchair <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ corral <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
Let’s now check if this gave us sensible results. In theory, time for the first 5.25 miles, and the time between mile 5.25 and mile 10, should add up to the total time (time.chip; modulo some rounding error). If they do not, something is wrong.
.2sec = abs((cal1020$time.mile.5.25 + cal1020$time.mile.10) - cal1020$time.chip) > 2
off.bysum(is.na(off.by.2sec)) # number of rows with missing data which precluded calculation
## [1] 66
sum(!(off.by.2sec), na.rm=T) # number of rows where we were not off by more than 2 seconds
## [1] 3186
sum((off.by.2sec), na.rm=T) # number of rows where we were off by more than 2 seconds
## [1] 0
So it looks like everything worked out!
Get rid of some unnecessary columns
So, here’s where we are now:
glimpse(cal1020)
## Rows: 3,252
## Columns: 28
## $ time.gun <dbl> 2880, 2885, 2970, 3063, 3084, 3206, 3223, 3249, 3289, 3529, 3323, 3575, 3600, 3625, 3421, 3648, 3659, 3485…
## $ time.chip <dbl> 2880, 2885, 2970, 3062, 3083, 3206, 3222, 3241, 3289, 3318, 3320, 3363, 3388, 3413, 3421, 3435, 3445, 3485…
## $ bib <dbl> 1205, 9, 13, 15, 1303, 1213, 3, 1055, 12, 1351, 1054, 1216, 1352, 1218, 6, 1220, 1304, 18, 1353, 1305, 101…
## $ name.first <chr> "Jordan", "Macdonard", "Sergio", "Jamesom", "Darren", "Okwaro", "Steven", "Edwin", "Lindsey", "Derek", "Ra…
## $ name.last <chr> "Chipangama", "Ondara", "Reyes", "Mora", "Brown", "Raura", "Underwood", "Figueroa", "Scherf", "Bradley", "…
## $ City <chr> "Flagstaff", "Grand Prairie", "Palmdale", "Arroyo Grande", "Solana Beach", "Oceanside", "Encinitas", "Comm…
## $ State <chr> "AZ", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "NY", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "AZ", "?",…
## $ Division <chr> "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile M …
## $ Class.Position <dbl> 1, 2, 3, 4, 5, 1, 1, 1, 1, 1, 1, 2, 3, 2, 2, 4, 1, 3, 2, 3, 1, 4, 1, 2, 1, 2, 5, 1, 1, 1, 5, 3, 1, 3, 1, 2…
## $ Overall.Place <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 8, 9, 2, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2…
## $ Age <dbl> 25, 29, 32, 30, 28, 39, 26, 42, 27, 33, 60, 34, 33, 39, 26, 32, 41, 24, 42, 48, 51, 33, 46, 50, 44, 44, 26…
## $ Zip <chr> "86004", "75054", "93551", "93420", "92075", "92057", "92024", "90040", "12440", "92024", "91016", "92078"…
## $ Sex.place <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 1, 8, 2, 9, 10, 11, 2, 12, 13, 3, 14, 15, 16, 4, 17, 18, 19, 20, 5, 6, 7, 8, 21, 2…
## $ Div.tot <dbl> 3239, 3239, 3239, 3239, 3239, 3239, 3239, 13, 3239, 3239, 13, 3239, 3239, 3239, 3239, 3239, 3239, 3239, 32…
## $ Sex.tot <dbl> 1179, 1179, 1179, 1179, 1179, 1179, 1179, 11, 2060, 1179, 11, 1179, 1179, 1179, 2060, 1179, 1179, 2060, 11…
## $ AG.tot <dbl> 5, 5, 5, 5, 5, 118, 85, 1, 5, 139, 1, 139, 139, 118, 5, 139, 3, 5, 3, 3, 167, 5, 178, 167, 191, 191, 5, 81…
## $ `5.25.mile.rank` <chr> "3", "2", "4", "6", "5", "7", "8", "1", "9", "10", "INC", "11", "13", "15", "12", "16", "19", "14", "18", …
## $ time.mile.5.25 <dbl> 1512, 1512, 1531, 1607, 1606, 1661, 1674, 1750, 1721, 1722, NA, 1752, 1754, 1760, 1753, 1790, 1816, 1758, …
## $ `5.25.mile.pace` <chr> "4:48/M", "4:48/M", "4:52/M", "5:06/M", "5:06/M", "5:16/M", "5:19/M", "5:33/M", "5:28/M", "5:28/M", NA, "5…
## $ `10.mile.rank` <chr> "1", "2", "3", "4", "5", "6", "7", "1", "8", "9", "13", "10", "12", "15", "16", "13", "11", "25", "18", "1…
## $ time.mile.10 <dbl> 1367, 1372, 1438, 1454, 1476, 1544, 1547, 1490, 1567, 1595, 3320, 1610, 1633, 1652, 1667, 1644, 1628, 1726…
## $ `10.mile.pace` <chr> "2:17/M", "2:17/M", "2:24/M", "2:25/M", "2:28/M", "2:34/M", "2:35/M", "2:29/M", "2:37/M", "2:40/M", "5:32/…
## $ is.male <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, F…
## $ is.female <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ sex <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", …
## $ sex.2 <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", …
## $ wheelchair <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ corral <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
Let’s get rid of the redundant columns, to make this more manageable. Here are the columns I don’t think we need:
<- c('time.gun',
drop 'bib',
'Class.Position',
'Overall.Place',
'Sex.place',
'Div.tot',
'Sex.tot',
'AG.tot',
'5.25.mile.rank',
'5.25.mile.pace',
'10.mile.rank',
'10.mile.pace',
'is.male',
'is.female',
'sex.2')
<- cal1020 %>% select(-one_of(drop)) cal1020
Let’s also convert all the names to lower case.
names(cal1020) <- tolower(names(cal1020))
glimpse(cal1020)
## Rows: 3,252
## Columns: 13
## $ time.chip <dbl> 2880, 2885, 2970, 3062, 3083, 3206, 3222, 3241, 3289, 3318, 3320, 3363, 3388, 3413, 3421, 3435, 3445, 3485, …
## $ name.first <chr> "Jordan", "Macdonard", "Sergio", "Jamesom", "Darren", "Okwaro", "Steven", "Edwin", "Lindsey", "Derek", "Ralp…
## $ name.last <chr> "Chipangama", "Ondara", "Reyes", "Mora", "Brown", "Raura", "Underwood", "Figueroa", "Scherf", "Bradley", "Pi…
## $ city <chr> "Flagstaff", "Grand Prairie", "Palmdale", "Arroyo Grande", "Solana Beach", "Oceanside", "Encinitas", "Commer…
## $ state <chr> "AZ", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "NY", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "AZ", "?", "…
## $ division <chr> "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile M 35…
## $ age <dbl> 25, 29, 32, 30, 28, 39, 26, 42, 27, 33, 60, 34, 33, 39, 26, 32, 41, 24, 42, 48, 51, 33, 46, 50, 44, 44, 26, …
## $ zip <chr> "86004", "75054", "93551", "93420", "92075", "92057", "92024", "90040", "12440", "92024", "91016", "92078", …
## $ time.mile.5.25 <dbl> 1512, 1512, 1531, 1607, 1606, 1661, 1674, 1750, 1721, 1722, NA, 1752, 1754, 1760, 1753, 1790, 1816, 1758, 18…
## $ time.mile.10 <dbl> 1367, 1372, 1438, 1454, 1476, 1544, 1547, 1490, 1567, 1595, 3320, 1610, 1633, 1652, 1667, 1644, 1628, 1726, …
## $ sex <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", "m…
## $ wheelchair <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ corral <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
Getting new columns from simple transformations
Let’s specify that time is in seconds in the column name, and calculate the pace and speed for the total race, and each half.
min/mile from time in seconds We can get total pace over the 10 miles via division:
<- cal1020 %>%
cal1020 mutate(time.sec = time.chip,
speed.mph = 10/(time.sec/60/60),
pace.min = (time.sec/60)/10,
speed.mph.first = 5.25/(time.mile.5.25/60/60),
speed.mph.second= (10-5.25)/(time.mile.10/60/60),
pace.min.first = (time.mile.5.25/60)/5.25,
pace.min.second = (time.mile.10/60)/(10-5.25)) %>%
select(-time.chip)
glimpse(cal1020)
## Rows: 3,252
## Columns: 19
## $ name.first <chr> "Jordan", "Macdonard", "Sergio", "Jamesom", "Darren", "Okwaro", "Steven", "Edwin", "Lindsey", "Derek", "Ra…
## $ name.last <chr> "Chipangama", "Ondara", "Reyes", "Mora", "Brown", "Raura", "Underwood", "Figueroa", "Scherf", "Bradley", "…
## $ city <chr> "Flagstaff", "Grand Prairie", "Palmdale", "Arroyo Grande", "Solana Beach", "Oceanside", "Encinitas", "Comm…
## $ state <chr> "AZ", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "NY", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "AZ", "?",…
## $ division <chr> "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile M …
## $ age <dbl> 25, 29, 32, 30, 28, 39, 26, 42, 27, 33, 60, 34, 33, 39, 26, 32, 41, 24, 42, 48, 51, 33, 46, 50, 44, 44, 26…
## $ zip <chr> "86004", "75054", "93551", "93420", "92075", "92057", "92024", "90040", "12440", "92024", "91016", "92078"…
## $ time.mile.5.25 <dbl> 1512, 1512, 1531, 1607, 1606, 1661, 1674, 1750, 1721, 1722, NA, 1752, 1754, 1760, 1753, 1790, 1816, 1758, …
## $ time.mile.10 <dbl> 1367, 1372, 1438, 1454, 1476, 1544, 1547, 1490, 1567, 1595, 3320, 1610, 1633, 1652, 1667, 1644, 1628, 1726…
## $ sex <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", …
## $ wheelchair <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ corral <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ time.sec <dbl> 2880, 2885, 2970, 3062, 3083, 3206, 3222, 3241, 3289, 3318, 3320, 3363, 3388, 3413, 3421, 3435, 3445, 3485…
## $ speed.mph <dbl> 12.500000, 12.478336, 12.121212, 11.757022, 11.676938, 11.228946, 11.173184, 11.107683, 10.945576, 10.8499…
## $ pace.min <dbl> 4.800000, 4.808333, 4.950000, 5.103333, 5.138333, 5.343333, 5.370000, 5.401667, 5.481667, 5.530000, 5.5333…
## $ speed.mph.first <dbl> 12.500000, 12.500000, 12.344873, 11.761045, 11.768369, 11.378688, 11.290323, 10.800000, 10.981987, 10.9756…
## $ speed.mph.second <dbl> 12.509144, 12.463557, 11.891516, 11.760660, 11.585366, 11.075130, 11.053652, 11.476510, 10.912572, 10.7210…
## $ pace.min.first <dbl> 4.800000, 4.800000, 4.860317, 5.101587, 5.098413, 5.273016, 5.314286, 5.555556, 5.463492, 5.466667, NA, 5.…
## $ pace.min.second <dbl> 4.796491, 4.814035, 5.045614, 5.101754, 5.178947, 5.417544, 5.428070, 5.228070, 5.498246, 5.596491, 11.649…
Fixing states.
There are a few puzzling things remaining; the states look weird.
sort(unique(cal1020$state))
## [1] "?" "AB" "AL" "AP" "ARIZONA" "AZ" "B.C.N" "BC" "CA"
## [10] "CALIFORNIA" "CO" "CT" "DC" "EUR" "FL" "FLORIDA" "GA" "HI"
## [19] "ID" "IL" "KY" "MA" "MD" "ME" "MEXICO" "MI" "MN"
## [28] "MO" "MT" "ND" "NE" "NEBRASKA" "NEVADA" "NEW JERSEY" "NJ" "NL"
## [37] "NV" "NY" "OH" "OK" "ON" "OR" "SD" "TX" "UT"
## [46] "UTAH" "VA" "VIRGINIA" "WA" "WASHINGTON" "WI" "WV"
We can clearly have some redundant coding, and a few mysterious things like “AP” and “B.C.N.” Let’s make a dictionary to remap these values. BCN is likely Baja – let’s call it MEXICO. BC is likely british columbia – call it CANADA. AB is probably alberta, ON is ontario, NL is newfoundland/labrador (zip code makes sense), “AP” is us military in the pacific… let’s call that NA.
= c("ARIZONA" = "AZ",
state.rename "CALIFORNIA" = "CA",
"FLORIDA" = "FL",
"NEBRASKA" = "NE",
"NEVADA" = "NV",
"NEW JERSEY" = "NJ",
"VIRGINIA" = "VA",
"WASHINGTON" = "WA", # could be DC, but zipcode indicates WA state.
"B.C.N" = "MEXICO",
"BC" = "CANADA",
"AB" = "CANADA",
"NL" = "CANADA",
"ON" = "CANADA",
"AP" = NA,
"UTAH" = "UT",
"EUR" = "EUROPE",
"?" = NA)
<- cal1020 %>%
cal1020 mutate(state = ifelse(state %in% names(state.rename),
state.rename[state],
state))
sort(unique(cal1020$state))
## [1] "AL" "AZ" "CA" "CANADA" "CO" "CT" "DC" "EUROPE" "FL" "GA" "HI" "ID" "IL" "KY"
## [15] "MA" "MD" "ME" "MEXICO" "MI" "MN" "MO" "MT" "ND" "NE" "NJ" "NV" "NY" "OH"
## [29] "OK" "OR" "SD" "TX" "UT" "VA" "WA" "WI" "WV"
Well, we fixed most of them. Perhaps a more thorough check of zipcode-state consistency would make sense if we really cared about state of origin.
Save the cleaned data.
So here’s what we have at the end:
glimpse(cal1020)
## Rows: 3,252
## Columns: 19
## $ name.first <chr> "Jordan", "Macdonard", "Sergio", "Jamesom", "Darren", "Okwaro", "Steven", "Edwin", "Lindsey", "Derek", "Ra…
## $ name.last <chr> "Chipangama", "Ondara", "Reyes", "Mora", "Brown", "Raura", "Underwood", "Figueroa", "Scherf", "Bradley", "…
## $ city <chr> "Flagstaff", "Grand Prairie", "Palmdale", "Arroyo Grande", "Solana Beach", "Oceanside", "Encinitas", "Comm…
## $ state <chr> "AZ", "TX", "CA", "CA", "CA", "CA", "CA", "CA", "NY", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "AZ", NA, …
## $ division <chr> "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile Overall", "10 Mile M …
## $ age <dbl> 25, 29, 32, 30, 28, 39, 26, 42, 27, 33, 60, 34, 33, 39, 26, 32, 41, 24, 42, 48, 51, 33, 46, 50, 44, 44, 26…
## $ zip <chr> "86004", "75054", "93551", "93420", "92075", "92057", "92024", "90040", "12440", "92024", "91016", "92078"…
## $ time.mile.5.25 <dbl> 1512, 1512, 1531, 1607, 1606, 1661, 1674, 1750, 1721, 1722, NA, 1752, 1754, 1760, 1753, 1790, 1816, 1758, …
## $ time.mile.10 <dbl> 1367, 1372, 1438, 1454, 1476, 1544, 1547, 1490, 1567, 1595, 3320, 1610, 1633, 1652, 1667, 1644, 1628, 1726…
## $ sex <chr> "male", "male", "male", "male", "male", "male", "male", "male", "female", "male", "male", "male", "male", …
## $ wheelchair <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ corral <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ time.sec <dbl> 2880, 2885, 2970, 3062, 3083, 3206, 3222, 3241, 3289, 3318, 3320, 3363, 3388, 3413, 3421, 3435, 3445, 3485…
## $ speed.mph <dbl> 12.500000, 12.478336, 12.121212, 11.757022, 11.676938, 11.228946, 11.173184, 11.107683, 10.945576, 10.8499…
## $ pace.min <dbl> 4.800000, 4.808333, 4.950000, 5.103333, 5.138333, 5.343333, 5.370000, 5.401667, 5.481667, 5.530000, 5.5333…
## $ speed.mph.first <dbl> 12.500000, 12.500000, 12.344873, 11.761045, 11.768369, 11.378688, 11.290323, 10.800000, 10.981987, 10.9756…
## $ speed.mph.second <dbl> 12.509144, 12.463557, 11.891516, 11.760660, 11.585366, 11.075130, 11.053652, 11.476510, 10.912572, 10.7210…
## $ pace.min.first <dbl> 4.800000, 4.800000, 4.860317, 5.101587, 5.098413, 5.273016, 5.314286, 5.555556, 5.463492, 5.466667, NA, 5.…
## $ pace.min.second <dbl> 4.796491, 4.814035, 5.045614, 5.101754, 5.178947, 5.417544, 5.428070, 5.228070, 5.498246, 5.596491, 11.649…
Note: we save this both as a csv file (to be easy to read by all other software), and as an ‘RData’ file, which is only useful in R, but preserves all the “data types” we defined.
write.csv(cal1020, file="cal1020.cleaned.csv")
save(list = "cal1020", file="cal1020.cleaned.Rdata")