Hamilton Christmas Bird Count: Part 2a

Further cleaning of the Hamilton Christmas Bird Count data.
data cleaning
birding
Author

Sharleen Weatherley

Published

February 23, 2019

Introduction

While visualizing the data, I realized it still needed a bit more cleaning, so this short post outlines those steps.

To start, we’ll load all of the packages and the data:

# A tibble: 20,868 × 6
    year species        species_latin     how_many_counted total_hours
   <int> <chr>          <chr>                        <int>       <dbl>
 1  1921 "Snow Goose\r" Chen caerulescens                0           8
 2  1922 "Snow Goose\r" Chen caerulescens                0          NA
 3  1924 "Snow Goose\r" Chen caerulescens                0          NA
 4  1925 "Snow Goose\r" Chen caerulescens                0          NA
 5  1926 "Snow Goose\r" Chen caerulescens                0          NA
 6  1928 "Snow Goose\r" Chen caerulescens                0          NA
 7  1930 "Snow Goose\r" Chen caerulescens                0          NA
 8  1931 "Snow Goose\r" Chen caerulescens                0          NA
 9  1932 "Snow Goose\r" Chen caerulescens                0          NA
10  1933 "Snow Goose\r" Chen caerulescens                0          NA
# ℹ 20,858 more rows
# ℹ 1 more variable: how_many_counted_by_hour <dbl>

Final cleaning touches

In particular, I want to:

  1. Remove hybrid birds

  2. Consolidate the names of some species that had variations in them

Let’s first strip the trailing carriage return ("\r") from the species names, then see how many hybrid species we have and remove them:

hamilton_cbc <- hamilton_cbc %>%
  mutate(species = str_remove(species, "\r"))  # Remove the trailing "\r"

hamilton_cbc %>%
  filter(str_detect(species, "hybrid")) %>%
  distinct(species)
# A tibble: 5 × 1
  species                                   
  <chr>                                     
1 Snow x Canada Goose (hybrid)              
2 American Black Duck x Mallard (hybrid)    
3 Mallard x Northern Pintail (hybrid)       
4 Herring x Glaucous Gull (hybrid)          
5 Herring x Great Black-backed Gull (hybrid)

hamilton_cbc <- hamilton_cbc %>%
  filter(!str_detect(species, "hybrid"))
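
As a quick sanity check on the two string operations above, here is a minimal sketch on a made-up tibble (`toy` is hypothetical, with names based on the output above, and is not the real data). It anchors the pattern with `$` so that only a trailing carriage return is removed, which is a slightly stricter variant of the `str_remove()` call above rather than the exact code used in this post.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; names are based on the output above
toy <- tibble(species = c("Snow Goose\r",
                          "Snow x Canada Goose (hybrid)\r",
                          "Mallard\r"))

toy_clean <- toy %>%
  mutate(species = str_remove(species, "\r$")) %>%  # "$" anchors the match to the end of the string
  filter(!str_detect(species, fixed("hybrid")))     # drop rows whose name contains "hybrid"

toy_clean$species
```

Only the two non-hybrid names should remain, each without the trailing "\r".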

Now, onto cleaning the trickier stuff. Some species are also recorded under sub-species or group names, each with its own counts. For example, the Juncos (where total_counted is the sum of the counts over all years for that species):

hamilton_cbc %>%
  filter(str_detect(species, "Junco")) %>%
  group_by(species, species_latin) %>%
  summarise(total_counted = sum(how_many_counted)) %>%
  ungroup()
# A tibble: 4 × 3
  species                         species_latin                    total_counted
  <chr>                           <chr>                                    <int>
1 Dark-eyed Junco                 Junco hyemalis                           14426
2 Dark-eyed Junco (Oregon)        Junco hyemalis [oreganus Group              39
3 Dark-eyed Junco (Slate-colored) Junco hyemalis hyemalis/carolin…         46764
4 Dark-eyed Junco (White-winged)  Junco hyemalis aikeni                        1

I just want there to be one Dark-eyed Junco species in this dataset, so I am going to consolidate these four sub-species into one species. (Even though people get way more excited about seeing the Oregon sub-species in Hamilton than the Slate-colored 😄.)

The first step is to only keep the first two words of the species_latin variable:

hamilton_cbc <- hamilton_cbc %>%
  mutate(species_latin = word(species_latin, start = 1, end = 2))
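
To illustrate what `word()` does here, a minimal sketch on three of the Latin names from the Junco output above (the `latin` vector is just for illustration):

```r
library(stringr)

# Latin names copied from the Junco output above
latin <- c("Junco hyemalis",
           "Junco hyemalis aikeni",
           "Junco hyemalis [oreganus Group")

# Keep words 1 through 2; by default, words are separated by a single space
first_two <- word(latin, start = 1, end = 2)
first_two
```

All three collapse to the same two-word name. Note that a name such as "Icterus bullockii/galbula" has no space around the slash, so its second word keeps the slash; those cases are handled separately further down.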

We can also see which other species now share a Latin name with a sub-species or group:

hamilton_cbc %>%
  group_by(species_latin) %>%
  filter(n_distinct(species) > 1) %>%
  group_by(species, species_latin) %>%
  summarise(total_counted = sum(how_many_counted)) %>%
  ungroup()
# A tibble: 26 × 3
   species                         species_latin      total_counted
   <chr>                           <chr>                      <int>
 1 American Kestrel                Falco sparverius            1520
 2 American Kestrel (Northern)     Falco sparverius               4
 3 Brant                           Branta bernicla                8
 4 Brant (Atlantic)                Branta bernicla                1
 5 Common Grackle                  Quiscalus quiscula           173
 6 Common Grackle (Purple)         Quiscalus quiscula            17
 7 Dark-eyed Junco                 Junco hyemalis             14426
 8 Dark-eyed Junco (Oregon)        Junco hyemalis                39
 9 Dark-eyed Junco (Slate-colored) Junco hyemalis             46764
10 Dark-eyed Junco (White-winged)  Junco hyemalis                 1
# ℹ 16 more rows

The second step is to sum the counts for each year across all of the sub-species, so that every row for a given species and year carries the same total, and then keep only the first row for each species and year (which, when arranged alphabetically, is the one with the shortest species name):

hamilton_cbc <- hamilton_cbc %>%
  group_by(year, species_latin) %>%
  mutate(how_many_counted = sum(how_many_counted)) %>%
  arrange(year, species) %>%
  filter(row_number() == 1) %>%
  ungroup()
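
To see what this chunk does to a single year, here is a minimal sketch on hypothetical counts (the `toy` tibble and its numbers are made up for illustration and are not from the real data):

```r
library(dplyr)

# Hypothetical counts for a single year; not taken from the real data
toy <- tibble(
  year             = c(2000L, 2000L, 2000L),
  species          = c("Dark-eyed Junco (Oregon)", "Dark-eyed Junco", "Mallard"),
  species_latin    = c("Junco hyemalis", "Junco hyemalis", "Anas platyrhynchos"),
  how_many_counted = c(2L, 10L, 5L)
)

toy_consolidated <- toy %>%
  group_by(year, species_latin) %>%
  mutate(how_many_counted = sum(how_many_counted)) %>%  # every row in the group gets the group total
  arrange(year, species) %>%                            # the plain name sorts before "name (subgroup)"
  filter(row_number() == 1) %>%                         # keep one row per year and Latin name
  ungroup()

toy_consolidated
```

The two Junco rows become one "Dark-eyed Junco" row with the summed count, and the Mallard row is unchanged.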

hamilton_cbc %>%
  filter(str_detect(species, "Junco")) %>%
  group_by(species, species_latin) %>%
  summarise(total_counted = sum(how_many_counted)) %>%
  ungroup()
# A tibble: 1 × 3
  species         species_latin  total_counted
  <chr>           <chr>                  <int>
1 Dark-eyed Junco Junco hyemalis         61230

Perfect! No more sub-species. The last group to deal with is species whose names contain either a ( or a /:

hamilton_cbc %>%
  group_by(species, species_latin) %>%
  summarise(total_counted = sum(how_many_counted)) %>%
  ungroup() %>%
  filter(str_detect(species, "\\(|/"))  # The "|" is an "or" within the regex
# A tibble: 11 × 3
   species                                      species_latin      total_counted
   <chr>                                        <chr>                      <int>
 1 Barn Owl (American)                          Tyto alba                      1
 2 Bullock's/Baltimore Oriole                   Icterus bullockii…             1
 3 Great Blue Heron (Blue form)                 Ardea herodias               362
 4 Greater/Lesser Scaup                         Aythya marila/aff…         26558
 5 Pacific/Winter Wren                          Troglodytes pacif…           498
 6 Palm Warbler (Western)                       Setophaga palmarum             1
 7 Rock Pigeon (Feral Pigeon)                   Columba livia              60114
 8 Spotted/Eastern Towhee (Rufous-sided Towhee) Pipilo maculatus/…            28
 9 Western/Eastern Meadowlark                   Sturnella neglect…            49
10 Wilson's/Common Snipe                        Gallinago delicat…            13
11 Yellow-rumped Warbler (Myrtle)               Setophaga coronata            65
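
For readers less familiar with regular expressions, a minimal sketch of how that pattern behaves on three species names taken from the outputs above (the `species` vector is just for illustration):

```r
library(stringr)

# Species names copied from the outputs above
species <- c("Barn Owl (American)", "Greater/Lesser Scaup", "Dark-eyed Junco")

# "\\(" is a literal "(" (escaped once for the regex and once for the R string);
# "|" means "or"; "/" needs no escaping
has_paren_or_slash <- str_detect(species, "\\(|/")
has_paren_or_slash
```

The first two names are flagged and the plain "Dark-eyed Junco" is not.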

I am going to make some executive decisions about what to do with these species:

  1. Delete the species that is only a guess: Greater/Lesser Scaup
  2. Assume that records which could have been a super-rare species were in fact the more common species:
    • Bullock’s/Baltimore Oriole were Baltimore Orioles
    • Western/Eastern Meadowlark were Eastern Meadowlarks
    • Wilson’s/Common Snipe were Common Snipes
    • Spotted/Eastern Towhee (Rufous-sided Towhee) were Eastern Towhees
    • Pacific/Winter Wren were Winter Wrens
  3. Remove the parenthetical qualifiers from the remaining species names for neatness

hamilton_cbc <- hamilton_cbc %>%
  filter(!(species == "Greater/Lesser Scaup")) %>%
  mutate(species = case_when(species == "Bullock's/Baltimore Oriole" ~ "Baltimore Oriole",
                             species == "Western/Eastern Meadowlark" ~ "Eastern Meadowlark",
                             species == "Wilson's/Common Snipe" ~ "Common Snipe",
                             species == "Spotted/Eastern Towhee (Rufous-sided Towhee)" ~ "Eastern Towhee",
                             species == "Pacific/Winter Wren" ~ "Winter Wren",
                             TRUE ~ species),
         species_latin = case_when(species_latin == "Icterus bullockii/galbula" ~ "Icterus galbula",
                             species_latin == "Sturnella neglecta/magna" ~ "Sturnella magna",
                             species_latin == "Gallinago delicata/gallinago" ~ "Gallinago gallinago",
                             species_latin == "Pipilo maculatus/erythrophthalmus" ~ "Pipilo erythrophthalmus",
                             species_latin == "Troglodytes pacificus/hiemalis" ~ "Troglodytes hiemalis",
                             TRUE ~ species_latin),
         species = case_when(species == "Barn Owl (American)" ~ "Barn Owl",
                             species == "Great Blue Heron (Blue form)" ~ "Great Blue Heron",
                             species == "Palm Warbler (Western)" ~ "Palm Warbler",
                             species == "Rock Pigeon (Feral Pigeon)" ~ "Rock Pigeon",
                             species == "Yellow-rumped Warbler (Myrtle)" ~ "Yellow-rumped Warbler",
                             TRUE ~ species))
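
As an aside, long `case_when()` renames like the one above can also be written with a named lookup vector. This is only a sketch of that alternative on a hypothetical `toy` tibble, using two of the renames above; it is not the code used to produce the dataset in this post.

```r
library(dplyr)

# Named vector: old name on the left, new name on the right
# (two of the renames from the case_when() above)
species_lookup <- c(
  "Pacific/Winter Wren" = "Winter Wren",
  "Barn Owl (American)" = "Barn Owl"
)

# Hypothetical toy data
toy <- tibble(species = c("Pacific/Winter Wren", "Barn Owl (American)", "Mallard"))

toy_renamed <- toy %>%
  # Look up each name; names missing from the lookup come back as NA,
  # and coalesce() then falls back to the original name
  mutate(species = coalesce(unname(species_lookup[species]), species))

toy_renamed$species
```

The trade-off is that the lookup vector keeps the old and new names side by side in one place, while `case_when()` keeps everything inside the pipeline.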

# Consolidate the counts between the species whose names were just updated
# This is the same step as was done in the earlier sub-species section
hamilton_cbc <- hamilton_cbc %>%
  group_by(year, species) %>%
  mutate(how_many_counted = sum(how_many_counted)) %>%
  arrange(year, species) %>%
  filter(row_number() == 1) %>%
  ungroup()

Finally, I am going to recalculate the how_many_counted_by_hour variable that depends on how_many_counted:

hamilton_cbc <- hamilton_cbc %>%
  mutate(how_many_counted_by_hour = as.double(how_many_counted) / total_hours)

Number of species counted each year

While creating a plot, I noticed what I believe is an error in the total hours recorded for 1982: only 64 hours were recorded, yet there was no drop in the number of species counted that year. I think it should have been 164 hours, because there were 167 hours in 1981 and 168 hours in 1983. So, in the chunk below, I’ve set 1982 to have 164 total hours.

# Mutating total_hours and how_many_counted_by_hour that depends on it

hamilton_cbc <- hamilton_cbc %>%
  mutate(total_hours = ifelse(year == 1982, 164, total_hours),
         how_many_counted_by_hour = as.double(how_many_counted) / total_hours)
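
To make the correction concrete, here is a minimal sketch on a toy table that holds only the three hour totals quoted in the text above (`toy` is a hypothetical stand-in for the real data). It uses dplyr's `if_else()`, a stricter variant of base `ifelse()` that checks that both branches have the same type; that swap is my assumption, not what the chunk above uses.

```r
library(dplyr)

# Toy stand-in for hamilton_cbc, using only the hours quoted in the text above
toy <- tibble(
  year        = c(1981L, 1982L, 1983L),
  total_hours = c(167, 64, 168)
)

toy_fixed <- toy %>%
  # Replace the suspicious 1982 value; all other years keep their recorded hours
  mutate(total_hours = if_else(year == 1982L, 164, total_hours))

toy_fixed
```

After the change, 1982 sits in line with its neighbouring years.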

If you would like to download this final, cleaned dataset in .rds format, you can do so here.

We are now ready to visualize! Please see the next post in this series for the visualizations.

And thank you to the Christmas Bird Count! The Christmas Bird Count Data was provided by National Audubon Society and through the generous efforts of Bird Studies Canada and countless volunteers across the western hemisphere.


Session info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.0 (2023-04-21 ucrt)
 os       Windows 11 x64 (build 22000)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_Canada.utf8
 ctype    English_Canada.utf8
 tz       Pacific/Honolulu
 date     2023-09-21
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version    date (UTC) lib source
 assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.3.1)
 cachem        1.0.8      2023-05-01 [1] CRAN (R 4.3.0)
 callr         3.7.3      2022-11-02 [1] CRAN (R 4.3.0)
 cli           3.6.1      2023-03-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2      2022-09-29 [1] CRAN (R 4.3.0)
 devtools    * 2.4.5      2022-10-11 [1] CRAN (R 4.3.1)
 digest        0.6.31     2022-12-11 [1] CRAN (R 4.3.0)
 dplyr       * 1.1.2      2023-04-20 [1] CRAN (R 4.3.0)
 ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.3.0)
 emo         * 0.0.0.9000 2023-07-22 [1] Github (hadley/emo@3f03b11)
 evaluate      0.20       2023-01-17 [1] CRAN (R 4.3.0)
 fansi         1.0.4      2023-01-22 [1] CRAN (R 4.3.0)
 fastmap       1.1.1      2023-02-24 [1] CRAN (R 4.3.0)
 fs            1.6.2      2023-04-25 [1] CRAN (R 4.3.0)
 generics      0.1.3      2022-07-05 [1] CRAN (R 4.3.0)
 glue          1.6.2      2022-02-24 [1] CRAN (R 4.3.0)
 here        * 1.0.1      2020-12-13 [1] CRAN (R 4.3.0)
 hms           1.1.3      2023-03-21 [1] CRAN (R 4.3.0)
 htmltools     0.5.5      2023-03-23 [1] CRAN (R 4.3.0)
 htmlwidgets   1.6.2      2023-03-17 [1] CRAN (R 4.3.0)
 httpuv        1.6.11     2023-05-11 [1] CRAN (R 4.3.1)
 jsonlite      1.8.4      2022-12-06 [1] CRAN (R 4.3.0)
 knitr         1.42       2023-01-25 [1] CRAN (R 4.3.0)
 later         1.3.1      2023-05-02 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3      2022-10-07 [1] CRAN (R 4.3.0)
 lubridate     1.9.2      2023-02-10 [1] CRAN (R 4.3.0)
 magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.3.0)
 memoise       2.0.1      2021-11-26 [1] CRAN (R 4.3.0)
 mime          0.12       2021-09-28 [1] CRAN (R 4.3.0)
 miniUI        0.1.1.1    2018-05-18 [1] CRAN (R 4.3.0)
 pillar        1.9.0      2023-03-22 [1] CRAN (R 4.3.0)
 pkgbuild      1.4.0      2022-11-27 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.3.0)
 pkgload       1.3.2      2022-11-16 [1] CRAN (R 4.3.0)
 prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.3.0)
 processx      3.8.1      2023-04-18 [1] CRAN (R 4.3.0)
 profvis       0.3.8      2023-05-02 [1] CRAN (R 4.3.0)
 promises      1.2.0.1    2021-02-11 [1] CRAN (R 4.3.0)
 ps            1.7.5      2023-04-18 [1] CRAN (R 4.3.0)
 purrr         1.0.1      2023-01-10 [1] CRAN (R 4.3.0)
 R6            2.5.1      2021-08-19 [1] CRAN (R 4.3.0)
 Rcpp          1.0.10     2023-01-22 [1] CRAN (R 4.3.0)
 readr       * 2.1.4      2023-02-10 [1] CRAN (R 4.3.0)
 remotes       2.4.2      2021-11-30 [1] CRAN (R 4.3.0)
 rlang         1.1.1      2023-04-28 [1] CRAN (R 4.3.0)
 rmarkdown     2.21       2023-03-26 [1] CRAN (R 4.3.0)
 rprojroot     2.0.3      2022-04-02 [1] CRAN (R 4.3.0)
 rstudioapi    0.15.0     2023-07-07 [1] CRAN (R 4.3.1)
 sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.3.0)
 shiny         1.7.4      2022-12-15 [1] CRAN (R 4.3.0)
 stringi       1.7.12     2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0      2022-12-02 [1] CRAN (R 4.3.0)
 tibble        3.2.1      2023-03-20 [1] CRAN (R 4.3.0)
 tidyselect    1.2.0      2022-10-10 [1] CRAN (R 4.3.0)
 timechange    0.2.0      2023-01-11 [1] CRAN (R 4.3.0)
 tzdb          0.3.0      2022-03-28 [1] CRAN (R 4.3.0)
 urlchecker    1.0.1      2021-11-30 [1] CRAN (R 4.3.0)
 usethis     * 2.2.2      2023-07-06 [1] CRAN (R 4.3.1)
 utf8          1.2.3      2023-01-31 [1] CRAN (R 4.3.0)
 vctrs         0.6.2      2023-04-19 [1] CRAN (R 4.3.0)
 xfun          0.39       2023-04-20 [1] CRAN (R 4.3.0)
 xtable        1.8-4      2019-04-21 [1] CRAN (R 4.3.0)
 yaml          2.3.7      2023-01-23 [1] CRAN (R 4.3.0)

 [1] C:/Users/sharl/AppData/Local/R/win-library/4.3
 [2] C:/Program Files/R/R-4.3.0/library

──────────────────────────────────────────────────────────────────────────────