# A tibble: 20,868 × 6
year species species_latin how_many_counted total_hours
<int> <chr> <chr> <int> <dbl>
1 1921 "Snow Goose\r" Chen caerulescens 0 8
2 1922 "Snow Goose\r" Chen caerulescens 0 NA
3 1924 "Snow Goose\r" Chen caerulescens 0 NA
4 1925 "Snow Goose\r" Chen caerulescens 0 NA
5 1926 "Snow Goose\r" Chen caerulescens 0 NA
6 1928 "Snow Goose\r" Chen caerulescens 0 NA
7 1930 "Snow Goose\r" Chen caerulescens 0 NA
8 1931 "Snow Goose\r" Chen caerulescens 0 NA
9 1932 "Snow Goose\r" Chen caerulescens 0 NA
10 1933 "Snow Goose\r" Chen caerulescens 0 NA
# ℹ 20,858 more rows
# ℹ 1 more variable: how_many_counted_by_hour <dbl>
This post is part of a series of five posts on this data.
Introduction
While I was visualizing the data, I realized I still needed to do a bit more cleaning. So this is a short post outlining my steps to do so.
To start, we’ll load all of the packages and the data:
Final cleaning touches
Particularly, I want to:
- Remove hybrid birds
- Consolidate the names of some species that had variations in them
Let’s see how many hybrid species we have and remove them:
hamilton_cbc <- hamilton_cbc %>%
  mutate(species = str_remove(species, "\r")) # Remove the trailing "\r"

hamilton_cbc %>%
  filter(str_detect(species, "hybrid")) %>%
  distinct(species)
# A tibble: 5 × 1
species
<chr>
1 Snow x Canada Goose (hybrid)
2 American Black Duck x Mallard (hybrid)
3 Mallard x Northern Pintail (hybrid)
4 Herring x Glaucous Gull (hybrid)
5 Herring x Great Black-backed Gull (hybrid)
hamilton_cbc <- hamilton_cbc %>%
  filter(!str_detect(species, "hybrid"))
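The two steps above can be tried on a small made-up tibble. This is only a sketch: the `toy` data and its counts are hypothetical, not taken from the Hamilton dataset.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical toy data standing in for hamilton_cbc
toy <- tibble(
  species = c("Snow Goose\r", "Mallard\r", "Mallard x Northern Pintail (hybrid)\r"),
  how_many_counted = c(0L, 12L, 1L)
)

toy_clean <- toy %>%
  mutate(species = str_remove(species, "\r")) %>%  # drop the carriage return
  filter(!str_detect(species, "hybrid"))           # drop the hybrid rows

toy_clean$species
# "Snow Goose" "Mallard"
```

Note that `str_detect()` is case-sensitive here, so a name spelled "Hybrid" would not be caught by this pattern.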
Now, onto cleaning the trickier stuff. Sometimes, species have sub-species names or groups that have different total counts. For example, the Juncos (where total_counted is the sum of the counts over all years for that species):
hamilton_cbc %>%
  filter(str_detect(species, "Junco")) %>%
  group_by(species, species_latin) %>%
  summarise(total_counted = sum(how_many_counted)) %>%
  ungroup()
# A tibble: 4 × 3
species species_latin total_counted
<chr> <chr> <int>
1 Dark-eyed Junco Junco hyemalis 14426
2 Dark-eyed Junco (Oregon) Junco hyemalis [oreganus Group 39
3 Dark-eyed Junco (Slate-colored) Junco hyemalis hyemalis/carolin… 46764
4 Dark-eyed Junco (White-winged) Junco hyemalis aikeni 1
I just want there to be one Dark-eyed Junco species in this dataset, so I am going to consolidate these four sub-species into one species. (Even though people get way more excited about seeing the Oregon sub-species in Hamilton than the Slate-colored 😄.)
The first step is to only keep the first two words of the species_latin variable:
hamilton_cbc <- hamilton_cbc %>%
  mutate(species_latin = word(species_latin, start = 1, end = 2))
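To illustrate what `word()` does here, a small sketch on a few Latin names. The full strings below are illustrative stand-ins, since the printed output above truncates some of them.

```r
library(stringr)

# Illustrative Latin names with and without a sub-species part
latin <- c(
  "Junco hyemalis",
  "Junco hyemalis aikeni",
  "Junco hyemalis [oreganus Group]"
)

# Keep only the first two space-separated words (genus and species)
latin_short <- word(latin, start = 1, end = 2)
latin_short
# "Junco hyemalis" "Junco hyemalis" "Junco hyemalis"
```

Because `word()` splits on spaces by default, a name such as "Icterus bullockii/galbula" still counts as two words and is left unchanged by this step.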
We can also see who else is in this list:
hamilton_cbc %>%
  group_by(species_latin) %>%
  filter(n_distinct(species) > 1) %>%
  group_by(species, species_latin) %>%
  summarise(total_counted = sum(how_many_counted)) %>%
  ungroup()
# A tibble: 26 × 3
species species_latin total_counted
<chr> <chr> <int>
1 American Kestrel Falco sparverius 1520
2 American Kestrel (Northern) Falco sparverius 4
3 Brant Branta bernicla 8
4 Brant (Atlantic) Branta bernicla 1
5 Common Grackle Quiscalus quiscula 173
6 Common Grackle (Purple) Quiscalus quiscula 17
7 Dark-eyed Junco Junco hyemalis 14426
8 Dark-eyed Junco (Oregon) Junco hyemalis 39
9 Dark-eyed Junco (Slate-colored) Junco hyemalis 46764
10 Dark-eyed Junco (White-winged) Junco hyemalis 1
# ℹ 16 more rows
The second step is to sum up the counts for each year across all of the sub-species, so that every row for a species in a given year carries the same combined count, and then filter to only keep the first instance of each species (which, when arranged alphabetically, is the shortest species name):
hamilton_cbc <- hamilton_cbc %>%
  group_by(year, species_latin) %>%
  mutate(how_many_counted = sum(how_many_counted)) %>%
  arrange(year, species) %>%
  filter(row_number() == 1) %>%
  ungroup()

hamilton_cbc %>%
  filter(str_detect(species, "Junco")) %>%
  group_by(species, species_latin) %>%
  summarise(total_counted = sum(how_many_counted)) %>%
  ungroup()
# A tibble: 1 × 3
species species_latin total_counted
<chr> <chr> <int>
1 Dark-eyed Junco Junco hyemalis 61230
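The consolidation step can be checked on a tiny example. This is a sketch with made-up counts (the `toy` tibble is hypothetical): three Junco rows in one year collapse into a single row holding their sum.

```r
library(dplyr)
library(tibble)

# Hypothetical counts for two years
toy <- tribble(
  ~year, ~species,                          ~species_latin,   ~how_many_counted,
  2000L, "Dark-eyed Junco (Slate-colored)", "Junco hyemalis", 40L,
  2000L, "Dark-eyed Junco",                 "Junco hyemalis", 10L,
  2000L, "Dark-eyed Junco (Oregon)",        "Junco hyemalis", 1L,
  2001L, "Dark-eyed Junco",                 "Junco hyemalis", 25L
)

toy_consolidated <- toy %>%
  group_by(year, species_latin) %>%
  mutate(how_many_counted = sum(how_many_counted)) %>%  # same total on every row of the group
  arrange(year, species) %>%                            # plain name sorts before "(...)" variants
  filter(row_number() == 1) %>%                         # keep one row per year and Latin name
  ungroup()

toy_consolidated$how_many_counted
# 51 25
```

One thing to keep in mind with this approach: if a year only contains a parenthesized variant of a name, that variant is the row that is kept, so some names with parentheses can remain afterwards.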
Perfect! No more sub-species. The last group of species to deal with is species where the name has either a ( or a /:
hamilton_cbc %>%
  group_by(species, species_latin) %>%
  summarise(total_counted = sum(how_many_counted)) %>%
  ungroup() %>%
  filter(str_detect(species, "\\(|/")) # The "|" is an "or" within the regex
# A tibble: 11 × 3
species species_latin total_counted
<chr> <chr> <int>
1 Barn Owl (American) Tyto alba 1
2 Bullock's/Baltimore Oriole Icterus bullockii… 1
3 Great Blue Heron (Blue form) Ardea herodias 362
4 Greater/Lesser Scaup Aythya marila/aff… 26558
5 Pacific/Winter Wren Troglodytes pacif… 498
6 Palm Warbler (Western) Setophaga palmarum 1
7 Rock Pigeon (Feral Pigeon) Columba livia 60114
8 Spotted/Eastern Towhee (Rufous-sided Towhee) Pipilo maculatus/… 28
9 Western/Eastern Meadowlark Sturnella neglect… 49
10 Wilson's/Common Snipe Gallinago delicat… 13
11 Yellow-rumped Warbler (Myrtle) Setophaga coronata 65
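As a quick check of the regular expression used above, here is a sketch on four names. The `(` has to be escaped as `\\(` because an unescaped parenthesis opens a group in a regex; a character class such as `[(/]` is an equivalent way to write the same test.

```r
library(stringr)

species <- c(
  "Barn Owl (American)",
  "Greater/Lesser Scaup",
  "Dark-eyed Junco",
  "Bullock's/Baltimore Oriole"
)

# Either a literal "(" or a "/"
has_paren_or_slash <- str_detect(species, "\\(|/")
has_paren_or_slash
# TRUE TRUE FALSE TRUE

# Same test written as a character class
same_result <- str_detect(species, "[(/]")
```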
I am going to make some executive decisions about what to do with these species:
- Delete species guess: Greater/Lesser Scaup
- Assume super-rare species were in fact the more common species:
  - Bullock’s/Baltimore Oriole were Baltimore Orioles
  - Western/Eastern Meadowlark were Eastern Meadowlarks
  - Wilson’s/Common Snipe were Common Snipes
  - Spotted/Eastern Towhee (Rufous-sided Towhee) were Eastern Towhees
  - Pacific/Winter Wren were Winter Wrens
- Remove parentheses on the remaining species for neatness
hamilton_cbc <- hamilton_cbc %>%
  filter(!(species == "Greater/Lesser Scaup")) %>%
  mutate(species = case_when(species == "Bullock's/Baltimore Oriole" ~ "Baltimore Oriole",
                             species == "Western/Eastern Meadowlark" ~ "Eastern Meadowlark",
                             species == "Wilson's/Common Snipe" ~ "Common Snipe",
                             species == "Spotted/Eastern Towhee (Rufous-sided Towhee)" ~ "Eastern Towhee",
                             species == "Pacific/Winter Wren" ~ "Winter Wren",
                             TRUE ~ species),
         species_latin = case_when(species_latin == "Icterus bullockii/galbula" ~ "Icterus galbula",
                                   species_latin == "Sturnella neglecta/magna" ~ "Sturnella magna",
                                   species_latin == "Gallinago delicata/gallinago" ~ "Gallinago gallinago",
                                   species_latin == "Pipilo maculatus/erythrophthalmus" ~ "Pipilo erythrophthalmus",
                                   species_latin == "Troglodytes pacificus/hiemalis" ~ "Troglodytes hiemalis",
                                   TRUE ~ species_latin),
         species = case_when(species == "Barn Owl (American)" ~ "Barn Owl",
                             species == "Great Blue Heron (Blue form)" ~ "Great Blue Heron",
                             species == "Palm Warbler (Western)" ~ "Palm Warbler",
                             species == "Rock Pigeon (Feral Pigeon)" ~ "Rock Pigeon",
                             species == "Yellow-rumped Warbler (Myrtle)" ~ "Yellow-rumped Warbler",
                             TRUE ~ species))
# Consolidate the counts between the species whose names were just updated
# This is the same step as was done in the earlier sub-species section
hamilton_cbc <- hamilton_cbc %>%
  group_by(year, species) %>%
  mutate(how_many_counted = sum(how_many_counted)) %>%
  arrange(year, species) %>%
  filter(row_number() == 1) %>%
  ungroup()
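For the names that only needed their parentheses removed, a regex is a possible alternative to listing each one in `case_when()`. This is a sketch of that alternative, not the approach used in this post, and it would strip any trailing parenthetical, so it only makes sense once the slash-separated names have been handled explicitly.

```r
library(stringr)

species <- c(
  "Barn Owl (American)",
  "Great Blue Heron (Blue form)",
  "Rock Pigeon (Feral Pigeon)",
  "Yellow-rumped Warbler (Myrtle)",
  "Mallard"
)

# Remove a space followed by "(...)" at the end of the name
species_plain <- str_remove(species, " \\(.*\\)$")
species_plain
# "Barn Owl" "Great Blue Heron" "Rock Pigeon" "Yellow-rumped Warbler" "Mallard"
```

The explicit `case_when()` version is more verbose, but it makes every renaming decision visible, which is useful when each one is a judgment call.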
Finally, I am going to recalculate the how_many_counted_by_hour variable that depends on how_many_counted:
hamilton_cbc <- hamilton_cbc %>%
  mutate(how_many_counted_by_hour = as.double(how_many_counted) / total_hours)
Number of species counted each year
While creating a plot, I noticed what I believe is an error in the total hours recorded for 1982: the total was only 64 hours, yet there was no drop in the number of species counted that year. I think it should actually have been 164 hours, because there were 167 hours in 1981 and 168 hours in 1983. So, in the chunk below, I’ve mutated 1982 to have 164 total hours.
# Mutating total_hours and how_many_counted_by_hour that depends on it
hamilton_cbc <- hamilton_cbc %>%
  mutate(total_hours = ifelse(year == 1982, 164, total_hours),
         how_many_counted_by_hour = as.double(how_many_counted) / total_hours)
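A small sketch of why both columns can be fixed in a single `mutate()`: its arguments are evaluated in order, so the rate is computed from the already-corrected hours. The hours below come from the text; the counts of 100 are made up for illustration.

```r
library(dplyr)
library(tibble)

toy <- tibble(
  year = c(1981L, 1982L, 1983L),
  how_many_counted = c(100L, 100L, 100L),  # hypothetical counts
  total_hours = c(167, 64, 168)            # hours as described in the text
)

toy_fixed <- toy %>%
  mutate(total_hours = ifelse(year == 1982, 164, total_hours),
         # uses the corrected total_hours from the line above
         how_many_counted_by_hour = as.double(how_many_counted) / total_hours)

toy_fixed$total_hours
# 167 164 168
```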
If you would like to download this final, cleaned dataset in .rds format, you can do so here.
We are now ready to visualize! Please see the next post in this series for the visualizations!
And thank you to the Christmas Bird Count! The Christmas Bird Count Data was provided by National Audubon Society and through the generous efforts of Bird Studies Canada and countless volunteers across the western hemisphere.
Session info
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.0 (2023-04-21 ucrt)
os Windows 11 x64 (build 22000)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_Canada.utf8
ctype English_Canada.utf8
tz Pacific/Honolulu
date 2023-09-21
pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.3.1)
cachem 1.0.8 2023-05-01 [1] CRAN (R 4.3.0)
callr 3.7.3 2022-11-02 [1] CRAN (R 4.3.0)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)
devtools * 2.4.5 2022-10-11 [1] CRAN (R 4.3.1)
digest 0.6.31 2022-12-11 [1] CRAN (R 4.3.0)
dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.3.0)
emo * 0.0.0.9000 2023-07-22 [1] Github (hadley/emo@3f03b11)
evaluate 0.20 2023-01-17 [1] CRAN (R 4.3.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
fs 1.6.2 2023-04-25 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)
httpuv 1.6.11 2023-05-11 [1] CRAN (R 4.3.1)
jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.3.0)
knitr 1.42 2023-01-25 [1] CRAN (R 4.3.0)
later 1.3.1 2023-05-02 [1] CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
lubridate 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.3.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgbuild 1.4.0 2022-11-27 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
pkgload 1.3.2 2022-11-16 [1] CRAN (R 4.3.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.3.0)
processx 3.8.1 2023-04-18 [1] CRAN (R 4.3.0)
profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.3.0)
ps 1.7.5 2023-04-18 [1] CRAN (R 4.3.0)
purrr 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
Rcpp 1.0.10 2023-01-22 [1] CRAN (R 4.3.0)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.3.0)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.3.0)
rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)
rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.1)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
shiny 1.7.4 2022-12-15 [1] CRAN (R 4.3.0)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.3.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.0)
usethis * 2.2.2 2023-07-06 [1] CRAN (R 4.3.1)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)
vctrs 0.6.2 2023-04-19 [1] CRAN (R 4.3.0)
xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.0)
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
[1] C:/Users/sharl/AppData/Local/R/win-library/4.3
[2] C:/Program Files/R/R-4.3.0/library
──────────────────────────────────────────────────────────────────────────────