class: center middle main-title section-title-2

# Transform data<br>with dplyr

.class-info[

<figure>
  <img src="img/01-class/04/dplyr.png" alt="dplyr" title="dplyr" width="15%">
</figure>

]

---

layout: false
class: title title-2 section-title-inv-2

# Your turn #0: Load data

1. Run the setup chunk
2. Take a look at the `gapminder` data
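---
class: title title-2

# A possible setup chunk

The setup chunk itself is not shown in these slides. As a minimal sketch, assuming it only attaches dplyr and the gapminder package (both are assumptions), it might look like the code below, followed by two quick ways to look at the data.

```r
# Hypothetical setup chunk: the real one is not shown in these slides
library(dplyr)      # filter(), mutate(), summarize(), group_by()
library(gapminder)  # provides the gapminder data frame

# Two quick ways to look at the data
dim(gapminder)      # number of rows, then number of columns
glimpse(gapminder)  # one line per column, with its type
```

Typing `gapminder` on its own prints the first ten rows.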
--- .small[ ```r gapminder ``` ``` ## # A tibble: 1,704 × 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786. ## 7 Afghanistan Asia 1982 39.9 12881816 978. ## 8 Afghanistan Asia 1987 40.8 13867957 852. ## 9 Afghanistan Asia 1992 41.7 16317921 649. ## 10 Afghanistan Asia 1997 41.8 22227415 635. ## # ℹ 1,694 more rows ``` ] --- class: title title-2 # The tidyverse <figure> <img src="img/01-class/02/tidyverse-language.png" alt="tidyverse and language" title="tidyverse and language" width="100%"> </figure> ??? From "Master the Tidyverse" by RStudio --- class: title title-2 # The tidyverse .center[ <figure> <img src="img/01-class/02/tidyverse.png" alt="The tidyverse" title="The tidyverse" width="50%"> </figure> ] --- class: title title-2 # dplyr: verbs for manipulating data <table> <tr> <td>Extract rows with <code>filter()</code></td> <td><img src="img/01-class/04/filter.png" alt="filter" title="filter" height="80px"></td> </tr> <tr> <td>Extract columns with <code>select()</code></td> <td><img src="img/01-class/04/select.png" alt="select" title="select" height="80px"></td> </tr> <tr> <td>Arrange/sort rows with <code>arrange()</code></td> <td><img src="img/01-class/04/arrange.png" alt="arrange" title="arrange" height="80px"></td> </tr> <tr> <td>Make new columns with <code>mutate()</code></td> <td><img src="img/01-class/04/mutate.png" alt="mutate" title="mutate" height="80px"></td> </tr> <tr> <td>Make group summaries with<br><code>group_by() |> summarize()</code></td> <td><img src="img/01-class/04/summarize.png" alt="summarize" title="summarize" height="80px"></td> </tr> </table> --- class: center middle section-title section-title-2 # `filter()` --- layout: false 
class: title title-2 # `filter()` .box-inv-2[Extract rows that meet some sort of test] .pull-left[ <code class ='r hljs remark-code'>filter(.data = <b><span style="background-color:#FFDFD1">DATA</span></b>, <b><span style="background-color:#FFD0CF">...</span></b>)</code> ] .pull-right[ - <b><span style="background: #FFDFD1">`DATA`</span></b> = Data frame to transform - <b><span style="background: #FFD0CF">`...`</span></b> = One or more tests <br>.small[`filter()` returns each row for which the test is TRUE] ] --- <code class ='r hljs remark-code'>filter(.data = <b><span style="background-color:#FFDFD1">gapminder</span></b>, <b><span style="background-color:#FFD0CF">country == "Denmark"</span></b>)</code> .pull-left[ <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:left;"> year </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1952 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1957 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1962 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1967 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1972 </td> </tr> <tr> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> </tr> </tbody> </table> ] -- .pull-right[ <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Denmark </td> <td 
style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1952 </td> </tr> <tr> <td style="text-align:left;"> Denmark </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1957 </td> </tr> <tr> <td style="text-align:left;"> Denmark </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1962 </td> </tr> <tr> <td style="text-align:left;"> Denmark </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1967 </td> </tr> <tr> <td style="text-align:left;"> Denmark </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1972 </td> </tr> <tr> <td style="text-align:left;"> Denmark </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 1977 </td> </tr> </tbody> </table> ] --- class: title title-2 # `filter()` .pull-left[ <code class ='r hljs remark-code'>filter(<br>  .data = <b><span style="background-color:#FFDFD1">gapminder</span></b>, <br>  <b><span style="background-color:#FFD0CF">country == "Denmark"</span></b><br>)</code> ] .pull-right[ .box-inv-2[One `=` sets an argument] .box-inv-2[Two `==` tests if equal<br>.small[(returns TRUE or FALSE)]] ] --- class: title title-2 # Logical tests <table> <tr> <th class="cell-center">Test</th> <th class="cell-left">Meaning</th> <th class="cell-center">Test</th> <th class="cell-left">Meaning</th> </tr> <tr> <td class="cell-center"><code class="remark-inline-code">x < y</code></td> <td class="cell-left">Less than</td> <td class="cell-center"><code class="remark-inline-code">x %in% y</code></td> <td class="cell-left">In (group membership)</td> </tr> <tr> <td class="cell-center"><code class="remark-inline-code">x > y</code></td> <td class="cell-left">Greater than</td> <td class="cell-center"><code class="remark-inline-code">is.na(x)</code></td> <td class="cell-left">Is missing</td> </tr> <tr> <td class="cell-center"><code class="remark-inline-code">x == y</code></td> <td class="cell-left">Equal to</td> <td 
class="cell-center"><code class="remark-inline-code">!is.na(x)</code></td> <td class="cell-left">Is not missing</td> </tr> <tr> <td class="cell-center"><code class="remark-inline-code">x <= y</code></td> <td class="cell-left">Less than or equal to</td> </tr> <tr> <td class="cell-center"><code class="remark-inline-code">x >= y</code></td> <td class="cell-left">Greater than or equal to</td> </tr> <tr> <td class="cell-center"><code class="remark-inline-code">x != y</code></td> <td class="cell-left">Not equal to</td> </tr> </table> --- class: title title-2 section-title-inv-2 # Your turn #1: Filtering .box-2[Use `filter()` and logical tests to show…] 1. The data for Canada 2. All data for countries in Oceania 3. Rows where the life expectancy is greater than 82
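---
class: title title-2

# Logical tests in action

The logical tests table lists `%in%`, `is.na()`, and `!=` alongside the comparison operators. The sketch below tries them on a small stand-in data frame; `toy` and its values, including the `NA`, are invented for illustration and are not taken from gapminder.

```r
library(dplyr)

# Toy stand-in for gapminder; every value here is made up
toy <- data.frame(
  country   = c("Canada", "Denmark", "Mexico", "Australia"),
  continent = c("Americas", "Europe", "Americas", "Oceania"),
  lifeExp   = c(80.7, 78.3, NA, 81.2)
)

# Group membership: keep rows whose country appears in the vector
in_group <- filter(toy, country %in% c("Canada", "Mexico"))

# Missing values: comparing with == NA never gives TRUE, so use is.na()
missing_life <- filter(toy, is.na(lifeExp))
known_life   <- filter(toy, !is.na(lifeExp))

# Not equal to
not_americas <- filter(toy, continent != "Americas")
```

Each result is a data frame, so it can go straight into another verb.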
--- .medium[ ```r filter(gapminder, country == "Canada") ``` ] -- .medium[ ```r filter(gapminder, continent == "Oceania") ``` ] -- .medium[ ```r filter(gapminder, lifeExp > 82) ``` ] --- class: title title-2 # Common mistakes .pull-left[ .box-inv-2[Using `=` instead of `==`] <code class ='r hljs remark-code'>filter(gapminder, <br>       country <b><span style="color:#FF4136">=</span></b> "Canada")</code> <code class ='r hljs remark-code'>filter(gapminder, <br>       country <b><span style="color:#2ECC40">==</span></b> "Canada")</code> ] -- .pull-right[ .box-inv-2[Quote use] <code class ='r hljs remark-code'>filter(gapminder, <br>       country == <b><span style="color:#FF4136">Canada</span></b>)</code> <code class ='r hljs remark-code'>filter(gapminder, <br>       country == <b><span style="color:#2ECC40">"Canada"</span></b>)</code> ] --- class: title title-2 # `filter()` with multiple conditions .box-inv-2[Extract rows that meet *every* test] <code class ='r hljs remark-code'>filter(<b><span style="background-color:#FFDFD1">gapminder</span></b>, <b><span style="background-color:#FFD0CF">country == "Denmark", year > 2000</span></b>)</code> --- <code class ='r hljs remark-code'>filter(<b><span style="background-color:#FFDFD1">gapminder</span></b>, <b><span style="background-color:#FFD0CF">country == "Denmark", year > 2000</span></b>)</code> .pull-left[ <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:left;"> year </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1952 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1957 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1962 </td> </tr> <tr> <td 
style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1967 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1972 </td> </tr> <tr> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> </tr> </tbody> </table> ] -- .pull-right[ <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> year </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Denmark </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 2002 </td> </tr> <tr> <td style="text-align:left;"> Denmark </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 2007 </td> </tr> </tbody> </table> ] --- class: title title-2 # Boolean operators <table> <tr> <th class="cell-center">Operator</th> <th class="cell-center">Meaning</th> </tr> <tr> <td class="cell-center"><code class="remark-inline-code">a & b</code></td> <td class="cell-center">and</td> </tr> <tr> <td class="cell-center"><code class="remark-inline-code">a | b</code></td> <td class="cell-center">or</td> </tr> <tr> <td class="cell-center"><code class="remark-inline-code">!a</code></td> <td class="cell-center">not</td> </tr> </table> --- class: title title-2 # Default is "and" .box-inv-2[These do the same thing:] <code class ='r hljs remark-code'>filter(<b><span style="background-color:#FFDFD1">gapminder</span></b>, <b><span style="background-color:#FFD0CF">country == "Denmark", year > 2000</span></b>)</code> <code class ='r hljs remark-code'>filter(<b><span style="background-color:#FFDFD1">gapminder</span></b>, <b><span style="background-color:#FFD0CF">country == "Denmark" & year > 2000</span></b>)</code> --- class: title title-2 section-title-inv-2 # Your turn #2: Filtering .box-2[Use `filter()` and Boolean 
logical tests to show…]

1. Canada before 1970
2. Countries where life expectancy in 2007 is below 50
3. Countries where life expectancy in 2007 is below 50 and that are not in Africa
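---
class: title title-2

# Boolean operators in action

The Boolean operators slide lists `&`, `|`, and `!`, but the examples so far only combine tests with "and". As a rough sketch on an invented four-row data frame (`toy` is hypothetical), this is how "or" and "not" behave inside `filter()`:

```r
library(dplyr)

# Toy data, invented for illustration
toy <- data.frame(
  country = c("Canada", "Canada", "Denmark", "Denmark"),
  year    = c(1952, 2007, 1952, 2007)
)

# "and": a comma and & keep the same rows
both_comma <- filter(toy, country == "Denmark", year > 2000)
both_amp   <- filter(toy, country == "Denmark" & year > 2000)

# "or": a row is kept if either test is TRUE
either <- filter(toy, country == "Denmark" | year > 2000)

# "not": wrap a test in !( ) to flip it
neither <- filter(toy, !(country == "Denmark" | year > 2000))
```

Here `either` keeps three rows and `neither` keeps the one row left over (Canada in 1952).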
--- ```r filter(gapminder, country == "Canada", year < 1970) ``` -- ```r filter(gapminder, year == 2007, lifeExp < 50) ``` -- ```r filter(gapminder, year == 2007, lifeExp < 50, continent != "Africa") ``` --- class: title title-2 # Common mistakes .pull-left[ .box-inv-2[Collapsing multiple tests<br>into one] .small-code[ <code class ='r hljs remark-code'>filter(gapminder, <b><span style="color:#FF4136">1960 < year < 1980</span></b>)</code> ] .small-code[ <code class ='r hljs remark-code'>filter(gapminder, <br>       <b><span style="color:#2ECC40">year > 1960, year < 1980</span></b>)</code> ] ] -- .pull-right[ .box-inv-2[Using multiple tests<br>instead of `%in%`] .small-code[ <code class ='r hljs remark-code'>filter(gapminder, <br>       <b><span style="color:#FF4136">country == "Mexico", <br>       country == "Canada", <br>       country == "United States"</span></b>)</code> ] .small-code[ <code class ='r hljs remark-code'>filter(gapminder, <br>       <b><span style="color:#2ECC40">country %in% c("Mexico", "Canada", <br>                      "United States")</span></b>)</code> ] ] --- class: title title-2 # Common syntax .box-inv-2[Every dplyr verb function follows the same pattern] .box-inv-2[First argument is a data frame; returns a data frame] .pull-left[ <code class ='r hljs remark-code'><b><span style="background-color:#EFB3FF">VERB</span></b>(<b><span style="background-color:#FFDFD1">DATA</span></b>, <b><span style="background-color:#FFD0CF">...</span></b>)</code> ] .pull-right[ - <b><span style="background: #EFB3FF">`VERB`</span></b> = dplyr function/verb - <b><span style="background: #FFDFD1">`DATA`</span></b> = Data frame to transform - <b><span style="background: #FFD0CF">`...`</span></b> = Stuff the verb does ] --- class: title title-2 # `mutate()` .box-inv-2[Create new columns] .pull-left[ <code class ='r hljs remark-code'>mutate(.data = <b><span style="background-color:#FFDFD1">DATA</span></b>, <b><span style="background-color:#FFD0CF">...</span></b>)</code> ] 
.pull-right[ - <b><span style="background: #FFDFD1">`DATA`</span></b> = Data frame to transform - <b><span style="background: #FFD0CF">`...`</span></b> = Columns to make ] --- <code class ='r hljs remark-code'>mutate(<b><span style="background-color:#FFDFD1">gapminder</span></b>, <b><span style="background-color:#FFD0CF">gdp = gdpPercap * pop</span></b>)</code> .pull-left.small[ <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> gdpPercap </th> <th style="text-align:left;"> pop </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> 1952 </td> <td style="text-align:left;"> 779.4453145 </td> <td style="text-align:left;"> 8425333 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 820.8530296 </td> <td style="text-align:left;"> 9240934 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> 1962 </td> <td style="text-align:left;"> 853.10071 </td> <td style="text-align:left;"> 10267083 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> 1967 </td> <td style="text-align:left;"> 836.1971382 </td> <td style="text-align:left;"> 11537966 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 739.9811058 </td> <td style="text-align:left;"> 13079460 </td> </tr> <tr> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> </tr> </tbody> </table> ] -- .pull-right.small[ <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> year </th> <th style="text-align:left;"> … </th> <th style="text-align:right;"> gdp </th> </tr> </thead> <tbody> <tr> <td 
style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1952 </td> <td style="text-align:left;"> … </td> <td style="text-align:right;"> 6567086330 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1957 </td> <td style="text-align:left;"> … </td> <td style="text-align:right;"> 7585448670 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1962 </td> <td style="text-align:left;"> … </td> <td style="text-align:right;"> 8758855797 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1967 </td> <td style="text-align:left;"> … </td> <td style="text-align:right;"> 9648014150 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1972 </td> <td style="text-align:left;"> … </td> <td style="text-align:right;"> 9678553274 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1977 </td> <td style="text-align:left;"> … </td> <td style="text-align:right;"> 11697659231 </td> </tr> </tbody> </table> ] --- <code class ='r hljs remark-code'>mutate(<b><span style="background-color:#FFDFD1">gapminder</span></b>, <b><span style="background-color:#FFD0CF">gdp = gdpPercap * pop,</span></b><br>                  <b><span style="background-color:#FFD0CF">pop_mil = round(pop / 1000000)</span></b>)</code> .pull-left.small[ <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> gdpPercap </th> <th style="text-align:left;"> pop </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> 1952 </td> <td style="text-align:left;"> 779.4453145 </td> <td style="text-align:left;"> 8425333 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 820.8530296 </td> 
<td style="text-align:left;"> 9240934 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> 1962 </td> <td style="text-align:left;"> 853.10071 </td> <td style="text-align:left;"> 10267083 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> 1967 </td> <td style="text-align:left;"> 836.1971382 </td> <td style="text-align:left;"> 11537966 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 739.9811058 </td> <td style="text-align:left;"> 13079460 </td> </tr> <tr> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> </tr> </tbody> </table> ] -- .pull-right.small[ <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> year </th> <th style="text-align:left;"> … </th> <th style="text-align:right;"> gdp </th> <th style="text-align:right;"> pop_mil </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1952 </td> <td style="text-align:left;"> … </td> <td style="text-align:right;"> 6567086330 </td> <td style="text-align:right;"> 8 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1957 </td> <td style="text-align:left;"> … </td> <td style="text-align:right;"> 7585448670 </td> <td style="text-align:right;"> 9 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1962 </td> <td style="text-align:left;"> … </td> <td style="text-align:right;"> 8758855797 </td> <td style="text-align:right;"> 10 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1967 </td> <td style="text-align:left;"> … </td> <td style="text-align:right;"> 9648014150 </td> <td style="text-align:right;"> 12 </td> </tr> <tr> 
<td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1972 </td> <td style="text-align:left;"> … </td> <td style="text-align:right;"> 9678553274 </td> <td style="text-align:right;"> 13 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:right;"> 1977 </td> <td style="text-align:left;"> … </td> <td style="text-align:right;"> 11697659231 </td> <td style="text-align:right;"> 15 </td> </tr> </tbody> </table> ] --- class: title title-2 # `ifelse()` .box-inv-2[Do conditional tests within `mutate()`] .pull-left[ <code class ='r hljs remark-code'>ifelse(<b><span style="background-color:#FFC0DC">TEST</span></b>, <br>       <b><span style="background-color:#FFDFD1">VALUE_IF_TRUE</span></b>, <br>       <b><span style="background-color:#CBB5FF">VALUE_IF_FALSE</span></b>)</code> ] .pull-right[ - <b><span style="background: #FFC0DC">`TEST`</span></b> = A logical test - <b><span style="background: #FFDFD1">`VALUE_IF_TRUE`</span></b> = What happens if test is true - <b><span style="background: #CBB5FF">`VALUE_IF_FALSE`</span></b> = What happens if test is false ] --- <code class ='r hljs remark-code'>mutate(gapminder, <br>       after_1960 = ifelse(<b><span style="background-color:#FFC0DC">year > 1960</span></b>, <b><span style="background-color:#FFDFD1">TRUE</span></b>, <b><span style="background-color:#CBB5FF">FALSE</span></b>))</code> <code class ='r hljs remark-code'>mutate(gapminder, <br>       after_1960 = ifelse(<b><span style="background-color:#FFC0DC">year > 1960</span></b>, <br>                           <b><span style="background-color:#FFDFD1">"After 1960"</span></b>, <br>                           <b><span style="background-color:#CBB5FF">"Before 1960"</span></b>))</code> --- class: title title-2 section-title-inv-2 # Your turn #3: Mutating .box-2[Use `mutate()` to…] 1. Add an `africa` column that is TRUE if the country is on the African continent 2. 
Add a column for logged GDP per capita (hint: use `log()`) 3. Add an `africa_asia` column that says “Africa or Asia” if the country is in Africa or Asia, and “Not Africa or Asia” if it’s not
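---
class: title title-2

# `ifelse()` works row by row

The `ifelse()` slides show the pattern with `year > 1960`. As a small sketch on three rows (the countries and years are copied from the Afghanistan rows shown earlier; `toy` and `era` are illustrative names), note that the test on its own already gives TRUE or FALSE for every row, while `ifelse()` swaps those values for something else:

```r
library(dplyr)

# Three rows borrowed from the Afghanistan examples in these slides
toy <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Afghanistan"),
  year    = c(1952, 1957, 1962)
)

labelled <- mutate(
  toy,
  # The test alone already gives TRUE or FALSE for each row
  after_1960 = year > 1960,
  # ifelse() replaces each TRUE/FALSE with the matching label
  era = ifelse(year > 1960, "After 1960", "Before 1960")
)
```

Both new columns have one value per row, which is what `mutate()` expects.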
--- ```r mutate(gapminder, africa = ifelse(continent == "Africa", TRUE, FALSE)) ``` -- ```r mutate(gapminder, log_gdpPercap = log(gdpPercap)) ``` -- ```r mutate(gapminder, africa_asia = ifelse(continent %in% c("Africa", "Asia"), "Africa or Asia", "Not Africa or Asia")) ``` --- class: title title-2 # What if you have multiple verbs? .box-inv-2.sp-after[Make a dataset for just 2002 *and* calculate logged GDP per capita] -- .box-inv-2[Solution 1: Intermediate variables] <code class ='r hljs remark-code'><b><span style="background-color:#FFC0DC">gapminder_2002</span></b> <- filter(gapminder, year == 2002)<br><br><b><span style="background-color:#FFC0DC">gapminder_2002</span></b>_log <- mutate(<b><span style="background-color:#FFC0DC">gapminder_2002</span></b>,<br>                             log_gdpPercap = log(gdpPercap))</code> --- class: title title-2 # What if you have multiple verbs? .box-inv-2.sp-after[Make a dataset for just 2002 *and* calculate logged GDP per capita] .box-inv-2[Solution 2: Nested functions] <code class ='r hljs remark-code'><b><span style="background-color:#FFC0DC">filter(</span></b><b><span style="background-color:#FFDFD1">mutate(gapminder,</span></b> <br>              <b><span style="background-color:#FFDFD1">log_gdpPercap = log(gdpPercap))</span></b>, <br>       <b><span style="background-color:#FFC0DC">year == 2002)</span></b></code> --- class: title title-2 # What if you have multiple verbs? .box-inv-2.sp-after[Make a dataset for just 2002 *and* calculate logged GDP per capita] .box-inv-2[Solution 3: Pipes!] .box-inv-2[The `|>` operator (pipe) takes an object on the left<br>and passes it as the first argument of the function on the right] <code class ='r hljs remark-code'><b><span style="background-color:#FFC0DC">gapminder</span></b> |> filter(<b><span style="background-color:#FFC0DC">_____</span></b>, country == "Canada")</code> --- class: title title-2 # What if you have multiple verbs? .box-inv-2[These do the same thing!] 
<code class ='r hljs remark-code'>filter(<b><span style="background-color:#FFC0DC">gapminder</span></b>, country == "Canada")</code> <code class ='r hljs remark-code'><b><span style="background-color:#FFC0DC">gapminder</span></b> |> filter(country == "Canada")</code> --- class: title title-2 # What if you have multiple verbs? .box-inv-2.sp-after[Make a dataset for just 2002 *and* calculate logged GDP per capita] .box-inv-2[Solution 3: Pipes!] <code class ='r hljs remark-code'>gapminder |> <br>  filter(year == 2002) |> <br>  mutate(log_gdpPercap = log(gdpPercap))</code> --- class: title title-2 # `|>` <code class ='r hljs remark-code'><b>leave_house</b>(<b>get_dressed</b>(<b>get_out_of_bed</b>(<b>wake_up</b>(<span style="color:#E16462">me</span>, <span style="color:#0D0887">time</span> = <span style="color:#E16462">"8:00"</span>), <span style="color:#0D0887">side</span> = <span style="color:#E16462">"correct"</span>), <span style="color:#0D0887">pants</span> = <span style="color:#E16462">TRUE</span>, <span style="color:#0D0887">shirt</span> = <span style="color:#E16462">TRUE</span>), <span style="color:#0D0887">car</span> = <span style="color:#E16462">TRUE</span>, <span style="color:#0D0887">bike</span> = <span style="color:#E16462">FALSE</span>)</code> -- <code class ='r hljs remark-code'>me |> <br>  <b>wake_up</b>(<span style="color:#0D0887">time</span> = <span style="color:#E16462">"8:00"</span>) |> <br>  <b>get_out_of_bed</b>(<span style="color:#0D0887">side</span> = <span style="color:#E16462">"correct"</span>) |> <br>  <b>get_dressed</b>(<span style="color:#0D0887">pants</span> = <span style="color:#E16462">TRUE</span>, <span style="color:#0D0887">shirt</span> = <span style="color:#E16462">TRUE</span>) |> <br>  <b>leave_house</b>(<span style="color:#0D0887">car</span> = <span style="color:#E16462">TRUE</span>, <span style="color:#0D0887">bike</span> = <span style="color:#E16462">FALSE</span>)</code> --- class: title title-2 # `|>` vs `%>%` -- 
.box-inv-2.medium[There are actually multiple pipes!] -- .box-inv-2[`%>%` was invented first, but requires a package to use] .box-inv-2.sp-after[`|>` is part of base R] -- .box-2.medium[They're interchangeable 99% of the time] .box-2.small[(Just be consistent)] --- class: title title-2 # `summarize()` .box-inv-2[Compute a table of summaries] <code class ='r hljs remark-code'><b><span style="background-color:#FFDFD1">gapminder</span></b> |> summarize(<b><span style="background-color:#FFD0CF">mean_life = mean(lifeExp)</span></b>)</code> .pull-left.small[ <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> lifeExp </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1952 </td> <td style="text-align:left;"> 28.801 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 30.332 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1962 </td> <td style="text-align:left;"> 31.997 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1967 </td> <td style="text-align:left;"> 34.02 </td> </tr> <tr> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> </tr> </tbody> </table> ] -- .pull-right.small[ <table> <thead> <tr> <th style="text-align:right;"> mean_life </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 59.47444 </td> </tr> </tbody> </table> ] --- class: title title-2 # `summarize()` <code class ='r hljs remark-code'><b><span 
style="background-color:#FFDFD1">gapminder</span></b> |> summarize(<b><span style="background-color:#FFD0CF">mean_life = mean(lifeExp),</span></b><br>                        <b><span style="background-color:#FFD0CF">min_life = min(lifeExp)</span></b>)</code> .pull-left.small[ <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> lifeExp </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1952 </td> <td style="text-align:left;"> 28.801 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 30.332 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1962 </td> <td style="text-align:left;"> 31.997 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1967 </td> <td style="text-align:left;"> 34.02 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 36.088 </td> </tr> <tr> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> <td style="text-align:left;"> … </td> </tr> </tbody> </table> ] -- .pull-right.small[ <table> <thead> <tr> <th style="text-align:right;"> mean_life </th> <th style="text-align:right;"> min_life </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 59.47444 </td> <td style="text-align:right;"> 23.599 </td> </tr> </tbody> </table> ] --- class: title title-2 section-title-inv-2 # Your turn #4: Summarizing .box-2[Use `summarize()` to calculate…] 1. 
The first (minimum) year in the dataset 2. The last (maximum) year in the dataset 3. The number of rows in the dataset (use the cheatsheet) 4. The number of distinct countries in the dataset (use the cheatsheet)
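---
class: title title-2

# Nested, `|>`, and `%>%` side by side

The pipe slides say that `|>` passes the object on its left as the first argument of the function on its right, and that `|>` and `%>%` are usually interchangeable. A minimal sketch of that claim, assuming R 4.1 or later for `|>` and using four Afghanistan rows copied from the slides (`toy` is an illustrative name):

```r
library(dplyr)  # attaching dplyr also makes %>% available

# Four rows borrowed from the summarize() slides
toy <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"),
  year    = c(1952, 1957, 1962, 1967),
  lifeExp = c(28.801, 30.332, 31.997, 34.02)
)

# Nested calls: read from the inside out
nested <- summarize(filter(toy, year > 1955), mean_life = mean(lifeExp))

# Base R pipe: read from top to bottom
native <- toy |>
  filter(year > 1955) |>
  summarize(mean_life = mean(lifeExp))

# magrittr pipe: same steps, different operator
magrittr <- toy %>%
  filter(year > 1955) %>%
  summarize(mean_life = mean(lifeExp))
```

All three objects hold the same one-row summary; only the way the steps are written differs.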
--- ```r gapminder |> summarize(first = min(year), last = max(year), num_rows = n(), num_unique = n_distinct(country)) ``` .small[ <table> <thead> <tr> <th style="text-align:right;"> first </th> <th style="text-align:right;"> last </th> <th style="text-align:right;"> num_rows </th> <th style="text-align:right;"> num_unique </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1952 </td> <td style="text-align:right;"> 2007 </td> <td style="text-align:right;"> 1704 </td> <td style="text-align:right;"> 142 </td> </tr> </tbody> </table> ] --- class: title title-2 section-title-inv-2 # Your turn #5: Summarizing .box-2[Use `filter()` and `summarize()` to calculate<br>(1) the number of unique countries and<br>(2) the median life expectancy on the<br>African continent in 2007]
--- ```r gapminder |> filter(continent == "Africa", year == 2007) |> summarise(n_countries = n_distinct(country), med_le = median(lifeExp)) ``` .small[ <table> <thead> <tr> <th style="text-align:right;"> n_countries </th> <th style="text-align:right;"> med_le </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 52 </td> <td style="text-align:right;"> 52.9265 </td> </tr> </tbody> </table> ] --- class: title title-2 # `group_by()` .box-inv-2[Put rows into groups based on values in a column] <code class ='r hljs remark-code'><b><span style="background-color:#FFDFD1">gapminder</span></b> |> group_by(<b><span style="background-color:#FFD0CF">continent</span></b>)</code>   -- .box-inv-2[Nothing happens by itself!] -- .box-inv-2[Powerful when combined with `summarize()`] --- ```r gapminder |> group_by(continent) |> summarize(n_countries = n_distinct(country)) ``` -- .small[ <table> <thead> <tr> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> n_countries </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 52 </td> </tr> <tr> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 25 </td> </tr> <tr> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 33 </td> </tr> <tr> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 30 </td> </tr> <tr> <td style="text-align:left;"> Oceania </td> <td style="text-align:right;"> 2 </td> </tr> </tbody> </table> ] --- <code class ='r hljs remark-code'><b><span style="background-color:#FFDFD1">pollution</span></b> |> <br>  summarize(<b><span style="background-color:#FFD0CF">mean = mean(amount), sum = sum(amount), n = n()</span></b>)</code> .pull-left.small[ <table> <thead> <tr> <th style="text-align:left;"> city </th> <th style="text-align:left;"> particle_size </th> <th style="text-align:right;"> amount </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> New York </td> 
<td style="text-align:left;"> Large </td> <td style="text-align:right;"> 23 </td> </tr> <tr> <td style="text-align:left;"> New York </td> <td style="text-align:left;"> Small </td> <td style="text-align:right;"> 14 </td> </tr> <tr> <td style="text-align:left;"> London </td> <td style="text-align:left;"> Large </td> <td style="text-align:right;"> 22 </td> </tr> <tr> <td style="text-align:left;"> London </td> <td style="text-align:left;"> Small </td> <td style="text-align:right;"> 16 </td> </tr> <tr> <td style="text-align:left;"> Beijing </td> <td style="text-align:left;"> Large </td> <td style="text-align:right;"> 121 </td> </tr> <tr> <td style="text-align:left;"> Beijing </td> <td style="text-align:left;"> Small </td> <td style="text-align:right;"> 56 </td> </tr> </tbody> </table> ] -- .pull-right.small[ <table> <thead> <tr> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sum </th> <th style="text-align:right;"> n </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 42 </td> <td style="text-align:right;"> 252 </td> <td style="text-align:right;"> 6 </td> </tr> </tbody> </table> ] --- <code class ='r hljs remark-code'><b><span style="background-color:#FFDFD1">pollution</span></b> |> <br>  group_by(<b><span style="background-color:#FFD0CF">city</span></b>) |> <br>  summarize(<b><span style="background-color:#FFD0CF">mean = mean(amount), sum = sum(amount), n = n()</span></b>)</code> .pull-left.small[ <table> <thead> <tr> <th style="text-align:left;"> city </th> <th style="text-align:left;"> particle_size </th> <th style="text-align:right;"> amount </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: #B2B1F9 !important;"> New York </td> <td style="text-align:left;background-color: #B2B1F9 !important;"> Large </td> <td style="text-align:right;background-color: #B2B1F9 !important;"> 23 </td> </tr> <tr> <td style="text-align:left;background-color: #B2B1F9 !important;"> New York </td> <td 
style="text-align:left;background-color: #B2B1F9 !important;"> Small </td> <td style="text-align:right;background-color: #B2B1F9 !important;"> 14 </td> </tr> <tr> <td style="text-align:left;background-color: #EFB3FF !important;"> London </td> <td style="text-align:left;background-color: #EFB3FF !important;"> Large </td> <td style="text-align:right;background-color: #EFB3FF !important;"> 22 </td> </tr> <tr> <td style="text-align:left;background-color: #EFB3FF !important;"> London </td> <td style="text-align:left;background-color: #EFB3FF !important;"> Small </td> <td style="text-align:right;background-color: #EFB3FF !important;"> 16 </td> </tr> <tr> <td style="text-align:left;background-color: #FFD0CF !important;"> Beijing </td> <td style="text-align:left;background-color: #FFD0CF !important;"> Large </td> <td style="text-align:right;background-color: #FFD0CF !important;"> 121 </td> </tr> <tr> <td style="text-align:left;background-color: #FFD0CF !important;"> Beijing </td> <td style="text-align:left;background-color: #FFD0CF !important;"> Small </td> <td style="text-align:right;background-color: #FFD0CF !important;"> 56 </td> </tr> </tbody> </table> ] -- .pull-right.small[ <table> <thead> <tr> <th style="text-align:left;"> city </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sum </th> <th style="text-align:right;"> n </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: #FFD0CF !important;"> Beijing </td> <td style="text-align:right;background-color: #FFD0CF !important;"> 88.5 </td> <td style="text-align:right;background-color: #FFD0CF !important;"> 177 </td> <td style="text-align:right;background-color: #FFD0CF !important;"> 2 </td> </tr> <tr> <td style="text-align:left;background-color: #EFB3FF !important;"> London </td> <td style="text-align:right;background-color: #EFB3FF !important;"> 19.0 </td> <td style="text-align:right;background-color: #EFB3FF !important;"> 38 </td> <td 
style="text-align:right;background-color: #EFB3FF !important;"> 2 </td> </tr> <tr> <td style="text-align:left;background-color: #B2B1F9 !important;"> New York </td> <td style="text-align:right;background-color: #B2B1F9 !important;"> 18.5 </td> <td style="text-align:right;background-color: #B2B1F9 !important;"> 37 </td> <td style="text-align:right;background-color: #B2B1F9 !important;"> 2 </td> </tr> </tbody> </table> ] --- <code class ='r hljs remark-code'><b><span style="background-color:#FFDFD1">pollution</span></b> |> <br>  group_by(<b><span style="background-color:#FFD0CF">particle_size</span></b>) |> <br>  summarize(<b><span style="background-color:#FFD0CF">mean = mean(amount), sum = sum(amount), n = n()</span></b>)</code> .pull-left.small[ <table> <thead> <tr> <th style="text-align:left;"> city </th> <th style="text-align:left;"> particle_size </th> <th style="text-align:right;"> amount </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: #FFDFD1 !important;"> New York </td> <td style="text-align:left;background-color: #FFDFD1 !important;"> Large </td> <td style="text-align:right;background-color: #FFDFD1 !important;"> 23 </td> </tr> <tr> <td style="text-align:left;background-color: #FFF0D4 !important;"> New York </td> <td style="text-align:left;background-color: #FFF0D4 !important;"> Small </td> <td style="text-align:right;background-color: #FFF0D4 !important;"> 14 </td> </tr> <tr> <td style="text-align:left;background-color: #FFDFD1 !important;"> London </td> <td style="text-align:left;background-color: #FFDFD1 !important;"> Large </td> <td style="text-align:right;background-color: #FFDFD1 !important;"> 22 </td> </tr> <tr> <td style="text-align:left;background-color: #FFF0D4 !important;"> London </td> <td style="text-align:left;background-color: #FFF0D4 !important;"> Small </td> <td style="text-align:right;background-color: #FFF0D4 !important;"> 16 </td> </tr> <tr> <td style="text-align:left;background-color: #FFDFD1 !important;"> 
Beijing </td> <td style="text-align:left;background-color: #FFDFD1 !important;"> Large </td> <td style="text-align:right;background-color: #FFDFD1 !important;"> 121 </td> </tr> <tr> <td style="text-align:left;background-color: #FFF0D4 !important;"> Beijing </td> <td style="text-align:left;background-color: #FFF0D4 !important;"> Small </td> <td style="text-align:right;background-color: #FFF0D4 !important;"> 56 </td> </tr> </tbody> </table> ] -- .pull-right.small[ <table> <thead> <tr> <th style="text-align:left;"> particle_size </th> <th style="text-align:right;"> mean </th> <th style="text-align:right;"> sum </th> <th style="text-align:right;"> n </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: #FFDFD1 !important;"> Large </td> <td style="text-align:right;background-color: #FFDFD1 !important;"> 55.33333 </td> <td style="text-align:right;background-color: #FFDFD1 !important;"> 166 </td> <td style="text-align:right;background-color: #FFDFD1 !important;"> 3 </td> </tr> <tr> <td style="text-align:left;background-color: #FFF0D4 !important;"> Small </td> <td style="text-align:right;background-color: #FFF0D4 !important;"> 28.66667 </td> <td style="text-align:right;background-color: #FFF0D4 !important;"> 86 </td> <td style="text-align:right;background-color: #FFF0D4 !important;"> 3 </td> </tr> </tbody> </table> ] --- class: title title-2 section-title-inv-2 # Your turn #6: Grouping and summarizing .box-2[Find the minimum, maximum, and median<br>life expectancy for each continent] .box-2[Find the minimum, maximum, and median<br>life expectancy for each continent in 2007 only]
05:00
---

```r
gapminder |> 
  group_by(continent) |> 
  summarize(min_le = min(lifeExp),
            max_le = max(lifeExp),
            med_le = median(lifeExp))
```

--

```r
gapminder |> 
  filter(year == 2007) |> 
  group_by(continent) |> 
  summarize(min_le = min(lifeExp),
            max_le = max(lifeExp),
            med_le = median(lifeExp))
```

---
class: title title-2

# dplyr: verbs for manipulating data

<table>
  <tr>
    <td>Extract rows with <code>filter()</code></td>
    <td><img src="img/01-class/04/filter.png" alt="filter" title="filter" height="80px"></td>
  </tr>
  <tr>
    <td>Extract columns with <code>select()</code></td>
    <td><img src="img/01-class/04/select.png" alt="select" title="select" height="80px"></td>
  </tr>
  <tr>
    <td>Arrange/sort rows with <code>arrange()</code></td>
    <td><img src="img/01-class/04/arrange.png" alt="arrange" title="arrange" height="80px"></td>
  </tr>
  <tr>
    <td>Make new columns with <code>mutate()</code></td>
    <td><img src="img/01-class/04/mutate.png" alt="mutate" title="mutate" height="80px"></td>
  </tr>
  <tr>
    <td>Make group summaries with<br><code>group_by() |> summarize()</code></td>
    <td><img src="img/01-class/04/summarize.png" alt="summarize" title="summarize" height="80px"></td>
  </tr>
</table>
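---

.box-inv-2[Try the `pollution` examples yourself]

The `pollution` data from the `group_by()` slides is never created in this deck, so here is a minimal sketch for rebuilding it by hand. The `data.frame()` call and the object names `overall`, `grouped`, and `by_city` are assumptions made for illustration; the values are copied from the tables on those slides.

```r
library(dplyr)

# Assumption: pollution is rebuilt by hand from the values shown on the slides
pollution <- data.frame(
  city          = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  particle_size = c("Large", "Small", "Large", "Small", "Large", "Small"),
  amount        = c(23, 14, 22, 16, 121, 56)
)

# No grouping: summarize() collapses the whole table into a single row
overall <- pollution |> 
  summarize(mean = mean(amount), sum = sum(amount), n = n())

# group_by() alone only records the grouping; rows and columns are unchanged
grouped <- pollution |> 
  group_by(city)

# With grouping: summarize() returns one row per city
by_city <- grouped |> 
  summarize(mean = mean(amount), sum = sum(amount), n = n())
```

Printing `grouped` shows the same six rows as `pollution`, plus a `Groups:` line in the header. That is all "nothing happens by itself" means: the grouping is stored and only takes effect in the next verb. Swapping `city` for `particle_size` should give the two-row summary from the last `pollution` slide.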