library(dplyr)
library(janitor)
library(readr)
library(naniar)
library(lubridate)
library(stringr)
library(tidyr)
library(here)
hamilton_cbc <- read_csv(here::here(
  "posts",
  "2019-01-07-hamilton-cbc-part-1",
  "hamilton-cbc-all-years-csv.csv"))
This is a series of five posts for this data:
Introduction
About two years ago, I was taking my dog for a walk through a park and I began to notice the birds and how fascinating they were! 🐦 I began regularly going out birding (aka “bird-watching”) and reading up on these cool little flying dinosaurs.
It turns out there’s a lot of data in the birding world as well. Birding attracts the sort of detail-oriented person who likes to count and record stuff.
So there are opportunities to get involved in citizen science projects, including a long-running project called the Christmas Bird Count (CBC). It started in 1900, when Frank Chapman, an ornithologist, came up with the idea of counting birds as an alternative to hunting them at Christmas (hunting them being the previous tradition).1
Birders have been going out every year around Christmas, to spend the day walking, biking, or driving through a census area to count all the birds they see or hear.
For the past two years, I have gone out with Hamilton’s Christmas Bird Count. I learn a lot while I’m out there and it feels like we are contributing to a larger purpose because of the data we are collecting.
So I thought I would look at the data and see what it could tell me!
Specifically, I’ve noticed birders will say things like, “the House Sparrows are getting worse every year” or, “the number of Bald Eagles has increased”, and I was wondering if the Christmas Bird Count data would agree or disagree with those statements.
To access the data, I went on the Bird Studies Canada website, clicked on Citizen Science, then Christmas Bird Count, then CBC Audubon Database, and then Historical Results by Count. I downloaded all years of data that existed for the Hamilton count.
If you would like to directly access the csv file that I used from my Github page, here it is!
Data import
I started by loading all of the packages I will be using and reading in the data using the readr and here packages.
Data cleaning
As shown below, it turns out that the first row just gives information about the count name and latitude/longitude, so I extracted those two pieces of information as current_circle_name and lat_long and then sliced the file so that the first two lines were excluded from the dataset. I then used clean_names from the janitor package.
hamilton_cbc %>%
  head()
# A tibble: 6 × 9
CircleName Abbrev LatLong ...4 ...5 ...6 ...7 ...8 ...9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Hamilton ONHA 43.2678790000/-7… <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 CountYear3 LowTemp HighTemp AMCl… PMCl… AMRa… "PMR… AMSn… PMSn…
4 118 -18.0 Celsius -13.0 Celsius Clear Clear None "Non… None None
5 117 -2.0 Celsius 11.0 Celsius Clou… Clou… Light "Hea… None None
6 116 -2.0 Celsius 5.0 Celsius Part… Part… None "Lig… None None
current_circle_name <- hamilton_cbc[1, 1]
lat_long <- hamilton_cbc[1, 3]

hamilton_cbc <- hamilton_cbc %>%
  slice(3:n())

hamilton_cbc <- hamilton_cbc %>%
  clean_names()
Since I played around with the data before writing this, I know that there are actually six tables in this dataset.
The first three tables contain count day weather data. A lot of the weather data is missing and inconsistent. I will remove these three tables from hamilton_cbc.
Here is the end of the first table and the start of the second table. Notice the line of NAs between the two tables:
hamilton_cbc %>%
  slice(47:54)
# A tibble: 8 × 9
circle_name abbrev lat_long x4 x5 x6 x7 x8 x9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 66 10 25 Clear Clear Unknown Unkno… Unkn… Unkn…
2 65 18 31 Cloudy Cloudy Unknown Unkno… Unkn… Unkn…
3 64 26 34 Cloudy Cloudy Unknown Unkno… Unkn… Unkn…
4 63 21 36 Cloudy Cloudy Unknown Unkno… Unkn… Unkn…
5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 CountYear5 LowTemp3 HighTemp2 AMCloud2 PMClouds2 <NA> <NA> <NA> <NA>
7 118 12/26/2017 85 198.75 100 <NA> <NA> <NA> <NA>
8 117 12/26/2016 95 216.65 97 <NA> <NA> <NA> <NA>
Here is the end of the second table and the start of the third table. Notice the line of NAs between the two tables:
hamilton_cbc %>%
  slice(143:150)
# A tibble: 8 × 9
circle_name abbrev lat_long x4 x5 x6 x7 x8 x9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 26 12/26/1925 10 <NA> <NA> <NA> <NA> <NA> <NA>
2 25 12/27/1924 8 <NA> <NA> <NA> <NA> <NA> <NA>
3 23 12/26/1922 9 <NA> <NA> <NA> <NA> <NA> <NA>
4 22 12/23/1921 2 8 <NA> <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 CountYear4 LowTemp2 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
7 118 Hamilton Naturalists… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
8 117 Hamilton Naturalists… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
Here is the end of the third table and the start of the fourth table. Notice that there is a line of NAs between the two tables. The fourth table is where the bird count data actually starts!
hamilton_cbc %>%
  slice(239:246)
# A tibble: 8 × 9
circle_name abbrev lat_long x4 x5 x6 x7 x8 x9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "26" <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 "25" <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 "23" <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 "22" <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 "COM_NAME" "Coun… how_man… Numb… Flags <NA> <NA> <NA> <NA>
7 "Snow Goose\r\n[Chen caer… "1921… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
8 "Snow Goose\r\n[Chen caer… "1922… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
The last two tables of the six tables contain the names of the people who went out counting each year. I will also remove these two tables.
Since the tables are separated by a line of NAs, I will first figure out which rows are a line of NAs. Then I will keep only the rows of the fourth table.
blank_lines <- hamilton_cbc %>%
  mutate(row_num = row_number()) %>%
  filter(is.na(circle_name))

blank_lines
# A tibble: 5 × 10
circle_name abbrev lat_long x4 x5 x6 x7 x8 x9 row_num
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 51
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 147
3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 243
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 23463
5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 23473
starting_line <- blank_lines %>%
  filter(row_number() == 3) %>%
  pull(row_num)

ending_line <- blank_lines %>%
  filter(row_number() == 4) %>%
  pull(row_num)
So, with those values of starting_line and ending_line, we can slice our dataset to only have the rows between those two values. Here’s what it looks like:
hamilton_cbc <- hamilton_cbc %>%
  # Only keep the rows within the fourth table
  slice((starting_line + 1):(ending_line - 1))

hamilton_cbc %>%
  head(n = 3)
# A tibble: 3 × 9
circle_name abbrev lat_long x4 x5 x6 x7 x8 x9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "COM_NAME" "Coun… how_man… Numb… Flags <NA> <NA> <NA> <NA>
2 "Snow Goose\r\n[Chen caer… "1921… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 "Snow Goose\r\n[Chen caer… "1922… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
hamilton_cbc %>%
  tail(n = 3)
# A tibble: 3 × 9
circle_name abbrev lat_long x4 x5 x6 x7 x8 x9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "House Sparrow\r\n[Passer… "2015… 2326 10.5… <NA> <NA> <NA> <NA> <NA>
2 "House Sparrow\r\n[Passer… "2016… 2565 11.8… <NA> <NA> <NA> <NA> <NA>
3 "House Sparrow\r\n[Passer… "2017… 2731 13.7… <NA> <NA> <NA> <NA> <NA>
You can see that the table starts with Snow Goose data from 1921 and goes until House Sparrow data in 2017.
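The blank-row trick used above can also be written in a more general way. Here is a minimal sketch on toy data (the stacked tibble and its values are made up for illustration, and table_id is a hypothetical helper column): it flags rows where every column is NA and uses a running count of those rows to label each table.

```r
library(dplyr)
library(tibble)

# Toy data standing in for the raw CSV: two small tables stacked in one file,
# separated by a row in which every column is NA (hypothetical values).
stacked <- tribble(
  ~circle_name, ~abbrev,
  "CountYear",  "LowTemp",
  "118",        "-18.0",
  NA,           NA,
  "COM_NAME",   "CountYear",
  "Snow Goose", "1921",
  "Snow Goose", "1922"
)

tables <- stacked %>%
  mutate(
    # A separator row is one where all columns are missing
    is_blank = if_all(everything(), is.na),
    # Each separator row bumps the table counter by one
    table_id = cumsum(is_blank) + 1L
  ) %>%
  filter(!is_blank) %>%
  select(-is_blank)

# Keep only the second table (the bird counts in this toy example)
bird_table <- tables %>%
  filter(table_id == 2) %>%
  select(-table_id)
```

Checking all columns rather than only circle_name guards against a row that happens to have a missing first value but real data elsewhere.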
Now we can clean this dataset up a bit more using the janitor package ❤️! This package will remove any empty columns, convert the top row to the column names of the dataset and it will clean the names.
# Janitor package to the rescue!
hamilton_cbc <- hamilton_cbc %>%
  janitor::remove_empty(which = "cols") %>%
  janitor::row_to_names(row_number = 1) %>%
  janitor::clean_names() %>%
  rename(species = com_name)

hamilton_cbc %>%
  head()
# A tibble: 6 × 5
species count_year how_many_cw number_by_party_hours flags
<chr> <chr> <chr> <chr> <chr>
1 "Snow Goose\r\n[Chen caeru… "1921 [22… <NA> <NA> <NA>
2 "Snow Goose\r\n[Chen caeru… "1922 [23… <NA> <NA> <NA>
3 "Snow Goose\r\n[Chen caeru… "1924 [25… <NA> <NA> <NA>
4 "Snow Goose\r\n[Chen caeru… "1925 [26… <NA> <NA> <NA>
5 "Snow Goose\r\n[Chen caeru… "1926 [27… <NA> <NA> <NA>
6 "Snow Goose\r\n[Chen caeru… "1928 [29… <NA> <NA> <NA>
- species gives the species name in English and the scientific name in square brackets
- count_year has a lot of information that we will parse out in a moment
- how_many_cw provides the actual bird count
- number_by_party_hours is how many birds were counted divided by the number of person-hours that year
- flags contains values like US for “unusual” bird (as per the Christmas Bird Count documentation)
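To see what each janitor step contributes, here is a small sketch on a toy tibble (the raw object and its values are invented for illustration; only the janitor and dplyr functions are the ones used in this post).

```r
library(dplyr)
library(janitor)
library(tibble)

# Toy data (hypothetical): the real header sits in the first row,
# and the last column is entirely empty
raw <- tibble(
  circle_name = c("COM_NAME", "Snow Goose", "House Sparrow"),
  abbrev      = c("CountYear", "1921", "2017"),
  x9          = c(NA_character_, NA_character_, NA_character_)
)

cleaned <- raw %>%
  remove_empty(which = "cols") %>%   # drops x9, which is all NA
  row_to_names(row_number = 1) %>%   # promotes row 1 to column names and removes it
  clean_names() %>%                  # COM_NAME -> com_name, CountYear -> count_year
  rename(species = com_name)
```

After these steps, cleaned has two data rows and the columns species and count_year.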
Now we do some regex!
First, I want to split up the species variable into the common species name and the scientific species_latin name.
For the first mutate: I will use @kohske’s regex I found on StackOverflow (adapted here for square brackets), which, as Nettle writes:
I like @kohske’s regex, which looks behind for an open parenthesis ?<=\(, looks ahead for a closing parenthesis ?=\), and grabs everything in the middle (lazily) .+?, in other words (?<=\().+?(?=\))
For the second mutate: As you can see in the code below, there is a line break (\r\n) between every English name and every scientific name in species. We will use that to split off the English name:
hamilton_cbc %>%
  filter(row_number() == 1) %>%
  pull(species)
[1] "Snow Goose\r\n[Chen caerulescens]"
Here are the two mutates together:
# Putting it together: Mutating the two variables
hamilton_cbc <- hamilton_cbc %>%
  mutate(species_latin = str_extract(species, "(?<=\\[).+?(?=\\])"),
         species = word(species, start = 1, sep = fixed('\n[')))
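To make the two steps concrete, here is a small sketch that runs them on the single Snow Goose value shown earlier. The last line is an optional extra that is not part of the pipeline in this post: because the split happens on "\n[", the common name keeps its trailing carriage return, and str_trim is one way to drop it.

```r
library(stringr)

# Example value copied from the output shown earlier in this post
species_raw <- "Snow Goose\r\n[Chen caerulescens]"

# Look behind for "[", look ahead for "]", and lazily grab what is in between
species_latin <- str_extract(species_raw, "(?<=\\[).+?(?=\\])")

# Splitting on "\n[" keeps everything before the line feed,
# which still includes the carriage return "\r"
species_common <- word(species_raw, start = 1, sep = fixed("\n["))

# Optional extra step (not part of the original pipeline): trim the "\r"
species_trimmed <- str_trim(species_common)
```

This is why the species values later in the post still print with a "\r" at the end.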
Now we will look at the count_year variable. Let’s get a sense of what the variable looks like, using the White-breasted Nuthatch count in 2016:
hamilton_cbc %>%
  filter(row_number() == 15133) %>%
  pull(count_year)
[1] "2016 [117]\r\nCount Date: 12/26/2016\r\n# Participants: 1\r\n# Species Reported: 97\r\nTotal Hrs.: 216.65"
The count_year variable is actually several variables in one:
- calendar year
- [CBC count number]
- calendar count date
- number of participants
- number of species reported
- total hours spent that year on the count
This is all metadata and we can take most of it out of this dataset. The only pieces we will keep in the hamilton_cbc dataset are the calendar year and the total hours.
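If you did want the other pieces, here is a rough sketch of how each one could be pulled out of the example string above with stringr. The variable names (count_number, count_date, species_reported) are my own for illustration, and the patterns assume every row follows the same layout as this example.

```r
library(stringr)

# Example value copied from the output shown above
count_year <- "2016 [117]\r\nCount Date: 12/26/2016\r\n# Participants: 1\r\n# Species Reported: 97\r\nTotal Hrs.: 216.65"

# Calendar year: the first space-separated word
year <- as.integer(word(count_year))

# CBC count number: the digits inside the square brackets
count_number <- as.integer(str_extract(count_year, "(?<=\\[)\\d+(?=\\])"))

# Count date: the month/day/year date that follows "Count Date: "
count_date <- as.Date(
  str_extract(count_year, "(?<=Count Date: )\\d{1,2}/\\d{1,2}/\\d{4}"),
  format = "%m/%d/%Y"
)

# Number of species reported
species_reported <- as.integer(
  str_extract(count_year, "(?<=Species Reported: )\\d+")
)

# Total hours: everything after "Hrs.: " up to the end of the string
total_hours <- as.double(str_extract(count_year, "(?<=Hrs\\.:\\s).*$"))
```

Each pattern uses a lookbehind so that only the value, and not its label, is returned.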
And where are we at with the hamilton_cbc dataset?
hamilton_cbc %>%
  tail()
# A tibble: 6 × 6
species count_year how_many_cw number_by_party_hours flags species_latin
<chr> <chr> <chr> <chr> <chr> <chr>
1 "House Sparr… "2012 [11… 1473 7.5713 <NA> Passer domes…
2 "House Sparr… "2013 [11… 1802 9.8902 <NA> Passer domes…
3 "House Sparr… "2014 [11… 1318 7.3529 <NA> Passer domes…
4 "House Sparr… "2015 [11… 2326 10.5249 <NA> Passer domes…
5 "House Sparr… "2016 [11… 2565 11.8394 <NA> Passer domes…
6 "House Sparr… "2017 [11… 2731 13.7409 <NA> Passer domes…
Let’s clean up the variables a bit more:
hamilton_cbc <- hamilton_cbc %>%
  rename(participant_info = count_year,
         how_many_counted = how_many_cw) %>%
  mutate(year = as.integer(word(participant_info)), # We will keep year and total_hours
         total_hours = as.double(
           str_extract(
             participant_info,
             "(?<=Hrs\\.:\\s).*$")))
We almost have a clean dataset! ✨
I am going to remove the flags variable. I am also going to remove number_by_party_hours and derive it myself instead.
hamilton_cbc <- hamilton_cbc %>%
  select(year, species, species_latin, how_many_counted, total_hours)
It turns out that how_many_counted also has a cw value, which means the bird was not seen on count day itself, but was seen on a day close to the count. I am going to set these bird counts to be NA, as they don’t have a specified value.
hamilton_cbc <- hamilton_cbc %>%
  mutate(how_many_counted = ifelse(how_many_counted == "cw", NA, how_many_counted),
         how_many_counted = as.integer(how_many_counted))
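As an aside, dplyr’s na_if does the same job as the ifelse above in one call. Here is a minimal sketch on a made-up vector (the values are hypothetical, not taken from the Hamilton data):

```r
library(dplyr)

# Toy counts as they might appear in the raw column (hypothetical values)
how_many_counted <- c("12", "cw", NA, "2731")

# na_if() turns every "cw" into NA and leaves the other values untouched
counts <- as.integer(na_if(how_many_counted, "cw"))
```

Making the cw values NA explicitly before calling as.integer also avoids the coercion warning that R would otherwise raise for a non-numeric string.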
In the species variable, there are some rows that are identified only to the genus level (and not to the species level). I will exclude these records, as I believe eBird excludes them too.
hamilton_cbc %>%
  filter(str_detect(species, "sp\\.")) %>%
  distinct(species)
# A tibble: 25 × 1
species
<chr>
1 "scoter sp.\r"
2 "duck sp.\r"
3 "loon sp.\r"
4 "Accipiter sp.\r"
5 "hawk sp.\r"
6 "eagle sp.\r"
7 "jaeger sp.\r"
8 "gull sp.\r"
9 "screech-owl sp.\r"
10 "owl sp.\r"
# ℹ 15 more rows
hamilton_cbc <- hamilton_cbc %>%
  filter(!(str_detect(species, "sp\\.")))
Two final mutates:
- Using tidyr’s replace_na function, let’s make all of the NAs equal to 0 for how_many_counted. That means we are assuming that all birds in the area were successfully counted on count day.
- Let’s also calculate the number of birds counted (within each species) divided by the total number of count hours that happened that year.
hamilton_cbc <- hamilton_cbc %>%
  mutate(how_many_counted = replace_na(how_many_counted, 0),
         how_many_counted_by_hour = as.double(how_many_counted) / total_hours)
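Here is what those two mutates do on a tiny made-up example (the toy tibble and its numbers are hypothetical), which makes the effect of the zero-fill on the per-hour rate easy to see:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data (hypothetical values): one missing count and known total hours
toy <- tibble(
  year = c(2015L, 2016L, 2017L),
  how_many_counted = c(10L, NA, 30L),
  total_hours = c(200, 100, 150)
)

toy <- toy %>%
  mutate(
    # Treat a missing count as zero birds seen
    how_many_counted = replace_na(how_many_counted, 0L),
    # Birds counted per hour of effort in that year
    how_many_counted_by_hour = how_many_counted / total_hours
  )
```

The year with the missing count ends up with a rate of 0 rather than NA, which is exactly the assumption described in the first bullet.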
And that’s it! 😄 🎉 We have cleaned the dataset and are ready to do some visualizing 👀 in Part 2!
Final dataset
Here is a glimpse of our final dataset:
hamilton_cbc %>%
  tail()
# A tibble: 6 × 6
year species species_latin how_many_counted total_hours
<int> <chr> <chr> <int> <dbl>
1 2012 "House Sparrow\r" Passer domesticus 1473 195.
2 2013 "House Sparrow\r" Passer domesticus 1802 182.
3 2014 "House Sparrow\r" Passer domesticus 1318 179.
4 2015 "House Sparrow\r" Passer domesticus 2326 221
5 2016 "House Sparrow\r" Passer domesticus 2565 217.
6 2017 "House Sparrow\r" Passer domesticus 2731 199.
# ℹ 1 more variable: how_many_counted_by_hour <dbl>
And thank you to the CBC! The CBC Data was provided by National Audubon Society and through the generous efforts of Bird Studies Canada and countless volunteers across the western hemisphere.
Session info
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.0 (2023-04-21 ucrt)
os Windows 11 x64 (build 22000)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_Canada.utf8
ctype English_Canada.utf8
tz Pacific/Honolulu
date 2023-09-21
pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.3.1)
bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)
cachem 1.0.8 2023-05-01 [1] CRAN (R 4.3.0)
callr 3.7.3 2022-11-02 [1] CRAN (R 4.3.0)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)
devtools * 2.4.5 2022-10-11 [1] CRAN (R 4.3.1)
digest 0.6.31 2022-12-11 [1] CRAN (R 4.3.0)
dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.3.0)
emo * 0.0.0.9000 2023-07-22 [1] Github (hadley/emo@3f03b11)
evaluate 0.20 2023-01-17 [1] CRAN (R 4.3.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
fs 1.6.2 2023-04-25 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
ggplot2 3.4.3 2023-08-14 [1] CRAN (R 4.3.1)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.3.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.0)
httpuv 1.6.11 2023-05-11 [1] CRAN (R 4.3.1)
janitor * 2.2.0 2023-02-02 [1] CRAN (R 4.3.0)
jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.3.0)
knitr 1.42 2023-01-25 [1] CRAN (R 4.3.0)
later 1.3.1 2023-05-02 [1] CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.3.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
naniar * 1.0.0 2023-02-02 [1] CRAN (R 4.3.1)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgbuild 1.4.0 2022-11-27 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
pkgload 1.3.2 2022-11-16 [1] CRAN (R 4.3.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.3.0)
processx 3.8.1 2023-04-18 [1] CRAN (R 4.3.0)
profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.3.0)
ps 1.7.5 2023-04-18 [1] CRAN (R 4.3.0)
purrr 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
Rcpp 1.0.10 2023-01-22 [1] CRAN (R 4.3.0)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.3.0)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.3.0)
rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.3.0)
rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.1)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.1)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
shiny 1.7.4 2022-12-15 [1] CRAN (R 4.3.0)
snakecase 0.11.0 2019-05-25 [1] CRAN (R 4.3.0)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.3.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.0)
usethis * 2.2.2 2023-07-06 [1] CRAN (R 4.3.1)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)
vctrs 0.6.2 2023-04-19 [1] CRAN (R 4.3.0)
visdat 0.6.0 2023-02-02 [1] CRAN (R 4.3.1)
vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.0)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)
xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.0)
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
[1] C:/Users/sharl/AppData/Local/R/win-library/4.3
[2] C:/Program Files/R/R-4.3.0/library
──────────────────────────────────────────────────────────────────────────────
Footnotes
https://news.nationalgeographic.com/news/2014/12/141227-christmas-bird-count-anniversary-audubon-animals-science/