class: center, middle, inverse, title-slide

.title[
# Introduction to the Tidyverse
]
.subtitle[
## Adapted material
Introduction to the Tidyverse - How to be a tidy data scientist
]
.author[
### Olivier Gimenez, adapted by Milan Kilibarda
]
.date[
### 2025
]

---

# **Tidyverse** infographic

<img src="assets/img/TidyInfo.png" width="90%" style="display: block; margin: auto;" />

---

# **Tidyverse** video

---

# **Tidyverse**

- **Tidyverse**: _tidy_ means "neat, orderly" and _verse_ stands for "universe".

- A collection of R packages 📦 developed by **H. Wickham** and others on the **RStudio** team.

<img src="assets/img/wickham_president.jpg" width="50%" style="display: block; margin: auto;" />

---

# **Tidyverse**

* "The **Tidyverse** is an environment (a set of packages) for data handling that aims to make the data cleaning and preparation steps easier" (Julien Barnier).

* Main characteristics of a **tidy dataset**:
  - Each variable is a **column**
  - Each observation is a **row**
  - Each value is in its **own cell**

<img src="assets/img/tidydata.png" width="80%" style="display: block; margin: auto;" />

---

# The **Tidyverse** is a collection of R packages 📦

* `ggplot2` - data visualization
* `dplyr`, `tidyr` - data manipulation
* `purrr` - functional programming
* `readr` - data import
* `tibble` - an enhanced `data.frame`
* `forcats` - working with factors
* `stringr` - working with text data

---

# The **Tidyverse** is a collection of R packages 📦

* [`ggplot2` - data visualization](https://ggplot2.tidyverse.org/)
* [`dplyr`, `tidyr` - data manipulation](https://dplyr.tidyverse.org/)
* [`purrr` - functional programming](https://purrr.tidyverse.org/)
* [`readr` - data import](https://readr.tidyverse.org/)
* [`tibble` - an enhanced `data.frame`](https://tibble.tidyverse.org/)
* [`forcats` - working with factors](https://forcats.tidyverse.org/)
* [`stringr` - working with text data](https://stringr.tidyverse.org/)

---

class: middle

# **The data science workflow**

<img src="assets/img/data-science-workflow.png" width="100%" style="display: block; margin: auto;" />

---

class: middle

# The data science workflow with the **Tidyverse**

<img src="assets/img/01_tidyverse_data_science.png" width="90%" style="display: block; margin: auto;" />

---

background-image: url(https://github.com/rstudio/hex-stickers/raw/master/SVG/tidyverse.svg?sanitize=true)
background-size: 100px
background-position: 90% 3%

# Loading the [tidyverse](https://www.tidyverse.org) 📦

``` r
#install.packages("tidyverse")
library(tidyverse)
```

---

class: middle

## Case study:

# [Using Twitter to predict citation rates of ecological research](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0166570)

<img src="assets/img/paper_workflow.png" width="85%" style="display: block; margin: auto;" />

---

# The paper

The paper *"Using Twitter to predict citation rates of ecological research"* examines the effect of Twitter activity on the number of citations of scientific papers in ecology. The authors analyze whether the number of tweets about a paper can predict how many times that paper will be cited in the academic literature.

🔹 Key findings:

- ✅ There is a positive correlation between the number of tweets and the number of citations, but it is not perfect.
- ✅ Tweeting can increase a paper's visibility, but it is not the main driver of academic citation.
- ✅ Papers from better-known journals and those with a higher impact factor have more citations regardless of Twitter activity.

---

class: inverse, center, middle

# Data import

---

# Data import

**The `readr::read_csv` function**:

* ~~keeps input data types as they are (no conversion to factor)~~ (no longer a difference since `R` 4.0.0, where `stringsAsFactors` defaults to `FALSE`)

* Creates a **tibble** instead of a `data.frame`:
  - No row names
  - Allows column names with special characters (see next slide)
  - Smarter on-screen printing than a `data.frame` (see next slide)
  - [No partial matching of column names](https://stackoverflow.com/questions/58513997/how-to-make-r-stop-accepting-partial-matches-for-column-names)
  - Warns if we try to access a column that does not exist

* It is remarkably fast 🚀 🏎

---

# Data import

``` r
citations_raw <- read_csv('https://raw.githubusercontent.com/oliviergimenez/intro_tidyverse/master/journal.pone.0166570.s001.CSV')
citations_raw
```

```
## # A tibble: 1,599 × 12
## `Journal identity` 5-year journal impact fact…¹ `Year published` Volume Issue
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Ecology Letters 16.7 2014 17 12
## 2 Ecology Letters 16.7 2014 17 12
## 3 Ecology Letters 16.7 2014 17 12
## 4 Ecology Letters 16.7 2014 17 11
## 5 Ecology Letters 16.7 2014 17 11
## 6 Ecology Letters 16.7 2014 17 10
## 7 Ecology Letters 16.7 2014 17 10
## 8 Ecology Letters 16.7 2014 17 9
## 9 Ecology Letters 16.7 2014 17 9
## 10 Ecology Letters 16.7 2014 17 9
## # ℹ 1,589 more rows
## # ℹ abbreviated name: ¹`5-year journal impact factor`
## # ℹ 7 more variables: Authors <chr>, `Collection date` <chr>,
## #   `Publication date` <chr>, `Number of tweets` <dbl>,
## #   `Number of users` <dbl>, `Twitter reach` <dbl>,
## #   `Number of Web of Science citations` <dbl>
```

---

class: inverse, center, middle

# Tidy, transform

---

# Renaming columns

``` r
citations_temp <- rename(citations_raw,
                         journal = 'Journal identity',
                         impactfactor = '5-year journal impact factor',
                         pubyear = 'Year published',
                         colldate = 'Collection date',
                         pubdate = 'Publication date',
                         nbtweets = 'Number of tweets',
                         woscitations =
                         'Number of Web of Science citations')
citations_temp
```

Create a new object `citations_temp` with new column names.

`citations_temp` is a new dataset in which some columns have been renamed:

- 'Journal identity' → journal
- '5-year journal impact factor' → impactfactor
- 'Year published' → pubyear
- 'Collection date' → colldate
- 'Publication date' → pubdate
- 'Number of tweets' → nbtweets
- 'Number of Web of Science citations' → woscitations

---

# Renaming columns

``` r
citations_temp <- rename(citations_raw,
                         journal = 'Journal identity',
                         impactfactor = '5-year journal impact factor',
                         pubyear = 'Year published',
                         colldate = 'Collection date',
                         pubdate = 'Publication date',
                         nbtweets = 'Number of tweets',
                         woscitations = 'Number of Web of Science citations')
citations_temp
```

```
## # A tibble: 1,599 × 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Ecology … 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18
## 2 Ecology … 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15
## 3 Ecology … 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5
## 4 Ecology … 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9
## 5 Ecology … 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3
## 6 Ecology … 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27
## 7 Ecology … 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6
## 8 Ecology … 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19
## 9 Ecology … 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26
## 10 Ecology … 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44
## # ℹ 1,589 more rows
## # ℹ 3 more variables: `Number of users` <dbl>, `Twitter reach` <dbl>,
## #   woscitations <dbl>
```

---

# Creating (or modifying) columns

``` r
citations <- mutate(citations_temp, journal = as.factor(journal))
citations
```

- 1️⃣ `citations_temp`: the input table (a tibble or data.frame).
- 2️⃣ `mutate()`: a function that adds new columns or modifies existing ones.
- 3️⃣ `journal = as.factor(journal)`: the `journal` column is converted from text (character) to a factor.
- 4️⃣ `citations`: the new table, in which `journal` is now a factor variable.

---

# Creating (or modifying) columns

``` r
citations <- mutate(citations_temp, journal = as.factor(journal))
citations
```

```
## # A tibble: 1,599 × 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Ecology … 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18
## 2 Ecology … 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15
## 3 Ecology … 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5
## 4 Ecology … 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9
## 5 Ecology … 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3
## 6 Ecology … 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27
## 7 Ecology … 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6
## 8 Ecology … 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19
## 9 Ecology … 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26
## 10 Ecology … 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44
## # ℹ 1,589 more rows
## # ℹ 3 more variables: `Number of users` <dbl>, `Twitter reach` <dbl>,
## #   woscitations <dbl>
```

---

# The new column is a factor

``` r
levels(citations$journal)
```

```
## [1] "Animal Conservation" "Conservation Letters"
## [3] "Diversity and Distributions" "Ecological Applications"
## [5] "Ecology" "Ecology Letters"
## [7] "Evolution" "Evolutionary Applications"
## [9] "Fish and Fisheries" "Functional Ecology"
## [11] "Global Change Biology" "Global Ecology and Biogeography"
## [13] "Journal of Animal Ecology" "Journal of Applied Ecology"
## [15] "Journal of Biogeography" "Limnology and Oceanography"
## [17] "Mammal Review" "Methods in Ecology and Evolution"
## [19] "Molecular Ecology Resources" "New Phytologist"
```

---

# Cleaner code with the "pipe" operator `|>` (formerly `%>%`)

``` r
citations_raw |>
  rename(journal = 'Journal identity',
         impactfactor = '5-year journal impact factor',
         pubyear = 'Year published',
         colldate = 'Collection date',
         pubdate = 'Publication date',
         nbtweets = 'Number of tweets',
         woscitations = 'Number of Web of Science citations') |>
  mutate(journal = as.factor(journal))
```

- ✅ **Step 1:** The starting dataset `citations_raw` goes into `rename()`.
- ✅ **Step 2:** `rename()` changes the column names and passes the result on.
- ✅ **Step 3:** `mutate()` converts the `journal` column to a factor.
- ✅ **Final result:** The dataset is printed to the console without being stored in a new variable.

---

# Cleaner code with the "pipe" operator `|>` (formerly `%>%`)

``` r
citations_raw |>
  rename(journal = 'Journal identity',
         impactfactor = '5-year journal impact factor',
         pubyear = 'Year published',
         colldate = 'Collection date',
         pubdate = 'Publication date',
         nbtweets = 'Number of tweets',
         woscitations = 'Number of Web of Science citations') |>
  mutate(journal = as.factor(journal))
```

```
## # A tibble: 1,599 × 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Ecology … 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18
## 2 Ecology … 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15
## 3 Ecology … 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5
## 4 Ecology … 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9
## 5 Ecology … 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3
## 6 Ecology … 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27
## 7 Ecology … 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6
## 8 Ecology … 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19
## 9 Ecology … 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26
## 10 Ecology … 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44
## # ℹ 1,589 more rows
## # ℹ 3 more variables: `Number of users` <dbl>, `Twitter reach` <dbl>,
## #   woscitations <dbl>
```

---

# Create a new object and store it in the variable `citations`

``` r
*citations <- citations_raw |>
  rename(journal = 'Journal identity',
         impactfactor = '5-year journal impact factor',
         pubyear = 'Year published',
         colldate = 'Collection date',
         pubdate = 'Publication date',
         nbtweets = 'Number of tweets',
         woscitations = 'Number of Web of Science citations') |>
  mutate(journal = as.factor(journal))
```

---

# Base R from [Lise Vaudor's blog](http://perso.ens-lyon.fr/lise.vaudor/)

``` r
white_and_yolk <- crack(egg, add_seasoning)
omelette_batter <- beat(white_and_yolk)
omelette_with_chives <- cook(omelette_batter,add_chives)
```

<img src="assets/img/piping_successive.jpg" width="500px" style="display: block; margin: auto;" />

---

# Piping from [Lise Vaudor's blog](http://perso.ens-lyon.fr/lise.vaudor/)

``` r
egg |>
  crack(add_seasoning) |>
  beat() |>
  cook(add_chives) -> omelette_with_chives
```

<img src="assets/img/piping_piped.png" width="250px" style="display: block; margin: auto;" />

---

class: inverse, center, middle

# Tidy, transform

---

# Selecting columns

``` r
citations |>
  select(journal, impactfactor, nbtweets)
```

```
## # A tibble: 1,599 × 3
## journal impactfactor nbtweets
## <fct> <dbl> <dbl>
## 1 Ecology Letters 16.7 18
## 2 Ecology Letters 16.7 15
## 3 Ecology Letters 16.7 5
## 4 Ecology Letters 16.7 9
## 5 Ecology Letters 16.7 3
## 6 Ecology Letters 16.7 27
## 7 Ecology Letters 16.7 6
## 8 Ecology Letters 16.7 19
## 9 Ecology Letters 16.7 26
## 10 Ecology Letters 16.7 44
## # ℹ 1,589 more rows
```

---

# Dropping columns

``` r
citations |>
  select(-Volume, -Issue, -Authors)
```

```
## # A tibble: 1,599 × 9
## journal impactfactor pubyear colldate pubdate nbtweets `Number of users`
## <fct> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 Ecology Let… 16.7 2014 2/1/2016 9/16/2… 18 16
## 2 Ecology Let… 16.7 2014 2/1/2016 10/13/… 15 12
## 3 Ecology Let… 16.7 2014 2/1/2016 10/21/… 5 4
## 4 Ecology Let… 16.7 2014 2/1/2016 8/28/2… 9 8
## 5 Ecology Let… 16.7 2014 2/1/2016 8/28/2… 3 3
## 6 Ecology Let… 16.7 2014 2/2/2016 7/28/2… 27 23
## 7 Ecology Let… 16.7 2014 2/2/2016 8/6/20… 6 6
## 8 Ecology Let… 16.7 2014 2/2/2016 6/17/2… 19 18
## 9 Ecology Let… 16.7 2014 2/2/2016 6/12/2… 26 23
## 10 Ecology Let… 16.7 2014 2/2/2016 7/17/2… 44 42
## # ℹ 1,589 more rows
## # ℹ 2 more variables: `Twitter reach` <dbl>, woscitations <dbl>
```

---

# Splitting one column into three new ones

``` r
citations |>
  separate(pubdate,c('month','day','year'),'/')
```

```
## # A tibble: 1,599 × 14
## journal impactfactor pubyear Volume Issue Authors colldate month day year
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Ecology… 16.7 2014 17 12 Morin … 2/1/2016 9 16 2014
## 2 Ecology… 16.7 2014 17 12 Jucker… 2/1/2016 10 13 2014
## 3 Ecology… 16.7 2014 17 12 Calcag… 2/1/2016 10 21 2014
## 4 Ecology… 16.7 2014 17 11 Segre … 2/1/2016 8 28 2014
## 5 Ecology… 16.7 2014 17 11 Kaufma… 2/1/2016 8 28 2014
## 6 Ecology… 16.7 2014 17 10 Nasto … 2/2/2016 7 28 2014
## 7 Ecology… 16.7 2014 17 10 Tschir… 2/2/2016 8 6 2014
## 8 Ecology… 16.7 2014 17 9 Barnec… 2/2/2016 6 17 2014
## 9 Ecology… 16.7 2014 17 9 Pinto-… 2/2/2016 6 12 2014
## 10 Ecology… 16.7 2014 17 9 Clough… 2/2/2016 7 17 2014
## # ℹ 1,589 more rows
## # ℹ 4 more variables: nbtweets <dbl>, `Number of users` <dbl>,
## #   `Twitter reach` <dbl>, woscitations <dbl>
```

---

# Working with dates

``` r
library(lubridate)
citations |>
  mutate(pubdate = mdy(pubdate),
         colldate = mdy(colldate))
```

```
## # A tibble: 1,599 × 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <date> <date>
## 1 Ecology Lett… 16.7 2014 17 12 Morin … 2016-02-01 2014-09-16
## 2 Ecology Lett… 16.7 2014 17 12 Jucker… 2016-02-01 2014-10-13
## 3 Ecology Lett… 16.7 2014 17 12 Calcag… 2016-02-01 2014-10-21
## 4 Ecology Lett… 16.7 2014 17 11 Segre … 2016-02-01 2014-08-28
## 5 Ecology Lett… 16.7 2014 17 11 Kaufma… 2016-02-01 2014-08-28
## 6 Ecology Lett… 16.7 2014 17 10 Nasto … 2016-02-02 2014-07-28
## 7 Ecology Lett… 16.7 2014 17 10 Tschir… 2016-02-02 2014-08-06
## 8 Ecology Lett… 16.7 2014 17 9 Barnec… 2016-02-02 2014-06-17
## 9 Ecology Lett… 16.7 2014 17 9 Pinto-… 2016-02-02 2014-06-12
## 10 Ecology Lett… 16.7 2014 17 9 Clough… 2016-02-02 2014-07-17
## # ℹ 1,589 more rows
## # ℹ 4 more variables: nbtweets <dbl>, `Number of users` <dbl>,
## #   `Twitter reach` <dbl>, woscitations <dbl>
```

---

# Manipulating dates

``` r
library(lubridate)
citations |>
  mutate(pubdate = mdy(pubdate),
         colldate = mdy(colldate),
*        pubyear2 = year(pubdate))
```

```
## # A tibble: 1,599 × 13
## journal impactfactor pubyear Volume Issue Authors colldate pubdate
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <date> <date>
## 1 Ecology Lett… 16.7 2014 17 12 Morin … 2016-02-01 2014-09-16
## 2 Ecology Lett… 16.7 2014 17 12 Jucker… 2016-02-01 2014-10-13
## 3 Ecology Lett… 16.7 2014 17 12 Calcag… 2016-02-01 2014-10-21
## 4 Ecology Lett… 16.7 2014 17 11 Segre … 2016-02-01 2014-08-28
## 5 Ecology Lett… 16.7 2014 17 11 Kaufma… 2016-02-01 2014-08-28
## 6 Ecology Lett… 16.7 2014 17 10 Nasto … 2016-02-02 2014-07-28
## 7 Ecology Lett… 16.7 2014 17 10 Tschir… 2016-02-02 2014-08-06
## 8 Ecology Lett… 16.7 2014 17 9 Barnec… 2016-02-02 2014-06-17
## 9 Ecology Lett… 16.7 2014 17 9 Pinto-… 2016-02-02 2014-06-12
## 10 Ecology Lett… 16.7 2014 17 9 Clough… 2016-02-02 2014-07-17
## # ℹ 1,589 more rows
## # ℹ 5 more variables: nbtweets <dbl>, `Number of users` <dbl>,
## #   `Twitter reach` <dbl>, woscitations <dbl>, pubyear2 <dbl>
```

* See `?lubridate::lubridate` for more details

---

## <https://www.garrickadenbuie.com/project/tidyexplain/>

<img src="assets/img/left-join.gif" width="70%" style="display: block; margin: auto;" />

---

class: inverse, center, middle

# String manipulation

---

# Selecting the **rows** for papers with 3 or more authors

``` r
citations |>
*  filter(str_detect(Authors,'et al'))
```

```
## # A tibble: 1,280 × 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Ecology … 16.7 2014 17 12 Morin … 2/1/2016 9/16/2… 18
## 2 Ecology … 16.7 2014 17 12 Jucker… 2/1/2016 10/13/… 15
## 3 Ecology … 16.7 2014 17 12 Calcag… 2/1/2016 10/21/… 5
## 4 Ecology … 16.7 2014 17 11 Segre … 2/1/2016 8/28/2… 9
## 5 Ecology … 16.7 2014 17 11 Kaufma… 2/1/2016 8/28/2… 3
## 6 Ecology … 16.7 2014 17 10 Nasto … 2/2/2016 7/28/2… 27
## 7 Ecology … 16.7 2014 17 10 Tschir… 2/2/2016 8/6/20… 6
## 8 Ecology … 16.7 2014 17 9 Barnec… 2/2/2016 6/17/2… 19
## 9 Ecology … 16.7 2014 17 9 Pinto-… 2/2/2016 6/12/2… 26
## 10 Ecology … 16.7 2014 17 9 Clough… 2/2/2016 7/17/2… 44
## # ℹ 1,270 more rows
## # ℹ 3 more variables: `Number of users` <dbl>, `Twitter reach` <dbl>,
## #   woscitations <dbl>
```

---

# Getting the **column** of authors for papers with 3 or more authors

``` r
citations |>
*  filter(str_detect(Authors,'et al')) |>
*  select(Authors)
```

```
## # A tibble: 1,280 × 1
## Authors
## <chr>
## 1 Morin et al
## 2 Jucker et al
## 3 Calcagno et al
## 4 Segre et al
## 5 Kaufman et al
## 6 Nasto et al
## 7 Tschirren et al
## 8 Barnechi et al
## 9 Pinto-Sanchez et al
## 10 Clough et al
## # ℹ 1,270 more rows
```

---

# Selecting the rows for papers **with fewer than 3 authors**

``` r
citations |>
*  filter(!str_detect(Authors,'et al'))
```

```
## # A tibble: 319 × 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Ecology … 16.7 2014 17 6 Neutle… 2/15/20… 3/17/2… 8
## 2 Ecology … 16.7 2014 17 5 Kellne… 2/15/20… 2/20/2… 18
## 3 Ecology … 16.7 2014 17 4 Griffi… 2/15/20… 1/16/2… 4
## 4 Ecology … 16.7 2014 17 3 Gremer… 2/15/20… 1/17/2… 4
## 5 Ecology … 16.7 2014 17 2 Cavier… 2/15/20… 10/17/… 16
## 6 Ecology … 16.7 2014 17 2 Haegma… 2/15/20… 12/5/2… 9
## 7 Ecology … 16.7 2013 16 12 Kearney 2/15/20… 10/1/2… 13
## 8 Ecology … 16.7 2013 16 9 Locey … 2/15/20… 7/15/2… 28
## 9 Ecology … 16.7 2013 16 8 Quinte… 2/15/20… 6/26/2… 120
## 10 Ecology … 16.7 2013 16 3 Lesser… 2/15/20… 12/22/… 9
## # ℹ 309 more rows
## # ℹ 3 more variables: `Number of users` <dbl>, `Twitter reach` <dbl>,
## #   woscitations <dbl>
```

---

# Getting the **column** of authors for papers **with fewer than 3 authors**

``` r
citations |>
*  filter(!str_detect(Authors,'et al')) |>
*  select(Authors)
```

```
## # A tibble: 319 × 1
## Authors
## <chr>
## 1 Neutle and Thorne
## 2 Kellner and Asner
## 3 Griffin and Willi
## 4 Gremer and Venable
## 5 Cavieres
## 6 Haegman and Loreau
## 7 Kearney
## 8 Locey and White
## 9 Quintero and Weins
## 10 Lesser and Jackson
## # ℹ 309 more rows
```

---

# Getting the authors of papers with fewer than 3 authors as a vector

``` r
citations |>
  filter(!str_detect(Authors,'et al')) |>
  # pull() extracts the Authors column as a plain vector, dropping the tibble/data.frame structure.
*  pull(Authors) |>
  head(10)
```

```
## [1] "Neutle and Thorne" "Kellner and Asner" "Griffin and Willi"
## [4] "Gremer and Venable" "Cavieres" "Haegman and Loreau"
## [7] "Kearney" "Locey and White" "Quintero and Weins"
## [10] "Lesser and Jackson"
```

---

# Selecting the rows for papers with fewer than 3 authors in journals with an IF below 5

``` r
citations |>
*  filter(!str_detect(Authors,'et al'), impactfactor < 5)
```

```
## # A tibble: 77 × 12
## journal impactfactor pubyear Volume Issue Authors colldate pubdate nbtweets
## <fct> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Molecula… 4.9 2014 14 6 Gautier 2/27/20… 5/14/2… 2
## 2 Molecula… 4.9 2014 14 5 Gambel… 2/27/20… 3/7/20… 7
## 3 Molecula… 4.9 2014 14 4 Kekkon… 2/27/20… 3/10/2… 4
## 4 Molecula… 4.9 2014 14 3 Bhatta… 2/27/20… 12/8/2… 0
## 5 Molecula… 4.9 2014 14 1 Christ… 2/28/20… 10/25/… 0
## 6 Molecula… 4.9 2013 13 4 Villar… 2/28/20… 5/2/20… 0
## 7 Molecula… 4.9 2013 13 4 Wang 2/28/20… 4/25/2… 0
## 8 Molecula… 4.9 2012 12 1 Joly 2/28/20… 9/7/20… 3
## 9 Animal C… 3.21 2014 17 6 Plavsic 2/9/2016 4/17/2… 9
## 10 Animal C… 3.21 2014 17 Supp… Knox a… 2/11/20… 11/13/… 1
## # ℹ 67 more rows
## # ℹ 3 more variables: `Number of users` <dbl>, `Twitter reach` <dbl>,
## #   woscitations <dbl>
```

---

# Converting words to lowercase

``` r
citations |>
*  mutate(authors_lowercase = str_to_lower(Authors)) |>
  select(authors_lowercase)
```

```
## # A tibble: 1,599 × 1
## authors_lowercase
## <chr>
## 1 morin et al
## 2 jucker et al
## 3 calcagno et al
## 4 segre et al
## 5 kaufman et al
## 6 nasto et al
## 7 tschirren et al
## 8 barnechi et al
## 9 pinto-sanchez et al
## 10 clough et al
## # ℹ 1,589 more rows
```

---

# Removing all spaces from journal names

``` r
citations |>
*  mutate(journal = str_remove_all(journal," ")) |>
  select(journal) |>
  unique() |>
  head(5)
```

```
## # A tibble: 5 × 1
## journal
## <chr>
## 1 EcologyLetters
## 2 GlobalChangeBiology
## 3 GlobalEcologyandBiogeography
## 4 MolecularEcologyResources
## 5 DiversityandDistributions
```

---

# Exploring the stringr package 📦 and regular expressions

* See the [stringr vignette](https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html) for more examples of text manipulation and pattern-matching functions.

* See the [regular expressions vignette](https://stringr.tidyverse.org/articles/regular-expressions.html); regular expressions are a concise and flexible tool for describing patterns in strings.

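---

# A small regular-expression sketch

The stringr functions used on the previous slides all accept regular expressions. The sketch below is a minimal illustration on a made-up vector (`authors_demo` and its contents are assumptions, not rows taken from the dataset): the pattern `et al$` matches only at the end of a string, and `str_extract()` pulls out the first run of letters.

```r
library(stringr)

# Hypothetical author strings, invented for illustration
authors_demo <- c("Morin et al", "Kellner and Asner", "Kearney")

# TRUE where the string ends with "et al"
has_et_al <- str_detect(authors_demo, "et al$")

# First run of letters in each string (here, the first author's surname)
first_author <- str_extract(authors_demo, "^[A-Za-z]+")

has_et_al
first_author
```

Anchoring the pattern with `$` is a design choice: it avoids matching "et al" in the middle of a longer string, which the unanchored pattern on the earlier slides would accept.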
---

class: inverse, center, middle

# Basic exploratory data analysis

---

# Counting papers per journal

``` r
citations |>
  count(journal, sort = TRUE)
```

```
## # A tibble: 20 × 2
## journal n
## <fct> <int>
## 1 New Phytologist 144
## 2 Ecology 108
## 3 Evolution 108
## 4 Global Change Biology 108
## 5 Global Ecology and Biogeography 108
## 6 Journal of Biogeography 108
## 7 Ecology Letters 106
## 8 Diversity and Distributions 105
## 9 Animal Conservation 102
## 10 Methods in Ecology and Evolution 90
## 11 Evolutionary Applications 74
## 12 Functional Ecology 54
## 13 Journal of Animal Ecology 54
## 14 Journal of Applied Ecology 54
## 15 Limnology and Oceanography 54
## 16 Molecular Ecology Resources 54
## 17 Conservation Letters 53
## 18 Ecological Applications 48
## 19 Fish and Fisheries 36
## 20 Mammal Review 31
```

---

# Counting papers per journal and per year

``` r
citations |>
  count(journal, pubyear) |>
  head()
```

```
## # A tibble: 6 × 3
## journal pubyear n
## <fct> <dbl> <int>
## 1 Animal Conservation 2012 18
## 2 Animal Conservation 2013 18
## 3 Animal Conservation 2014 66
## 4 Conservation Letters 2012 17
## 5 Conservation Letters 2013 18
## 6 Conservation Letters 2014 18
```

---

# Computing the total number of tweets per journal

``` r
citations |>
  count(journal, wt = nbtweets, sort = TRUE)
```

```
## # A tibble: 20 × 2
## journal n
## <fct> <dbl>
## 1 Ecology Letters 1538
## 2 Animal Conservation 1268
## 3 Journal of Applied Ecology 1012
## 4 Methods in Ecology and Evolution 699
## 5 Global Change Biology 613
## 6 Conservation Letters 542
## 7 New Phytologist 509
## 8 Global Ecology and Biogeography 379
## 9 Ecology 335
## 10 Evolution 335
## 11 Journal of Animal Ecology 323
## 12 Fish and Fisheries 261
## 13 Evolutionary Applications 238
## 14 Journal of Biogeography 209
## 15 Diversity and Distributions 200
## 16 Mammal Review 166
## 17 Functional Ecology 155
## 18 Molecular Ecology Resources 139
## 19 Ecological Applications 125
## 20 Limnology and Oceanography 0
```

---

# Grouping by a variable to compute a statistic

``` r
citations |>
*  group_by(journal) |>
*  summarise(avg_tweets = mean(nbtweets)) |>
  head(10)
```

```
## # A tibble: 10 × 2
## journal avg_tweets
## <fct> <dbl>
## 1 Animal Conservation 12.4
## 2 Conservation Letters 10.2
## 3 Diversity and Distributions 1.90
## 4 Ecological Applications 2.60
## 5 Ecology 3.10
## 6 Ecology Letters 14.5
## 7 Evolution 3.10
## 8 Evolutionary Applications 3.22
## 9 Fish and Fisheries 7.25
## 10 Functional Ecology 2.87
```

---

# Sorting

``` r
citations |>
  group_by(journal) |>
  summarise(avg_tweets = mean(nbtweets)) |>
*  arrange(desc(avg_tweets)) |> # descending order (drop desc() for ascending)
  head(10)
```

```
## # A tibble: 10 × 2
## journal avg_tweets
## <fct> <dbl>
## 1 Journal of Applied Ecology 18.7
## 2 Ecology Letters 14.5
## 3 Animal Conservation 12.4
## 4 Conservation Letters 10.2
## 5 Methods in Ecology and Evolution 7.77
## 6 Fish and Fisheries 7.25
## 7 Journal of Animal Ecology 5.98
## 8 Global Change Biology 5.68
## 9 Mammal Review 5.35
## 10 New Phytologist 3.53
```

---

# Working with multiple columns

<img src="assets/img/dplyr_across.png" width="85%" style="display: block; margin: auto;" />

---

# Computing per-journal means for every numeric column

``` r
citations |>
*  group_by(journal) |>
*  summarize(across(where(is.numeric), mean)) |>
  head()
```

```
## # A tibble: 6 × 8
## journal impactfactor pubyear Volume nbtweets `Number of users` `Twitter reach`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Animal… 3.21 2013. 16.5 12.4 9.71 28345.
## 2 Conser… 6.4 2013. 6.02 10.2 8.85 23234.
## 3 Divers… 5.4 2013 19 1.90 1.77 2350.
## 4 Ecolog… 5.06 2013 23 2.60 2.5 5727.
## 5 Ecology 6.16 2013 94 3.10 2.87 6176.
## 6 Ecolog… 16.7 2013. 16.0 14.5 14.0 44748.
## # ℹ 1 more variable: woscitations <dbl>
```

---

## <https://github.com/courtiol/Rguides>

<img src="assets/img/dplyr_guide_for_one_table_part2.png" width="85%" style="display: block; margin: auto;" />

---

# Tidying tibbles

<img src="assets/img/original-dfs-tidy.png" width="70%" style="display: block; margin: auto;" />

---

# Switching from **long** to **wide** format and back

<img src="assets/img/tidyr-longer-wider.gif" width="70%" style="display: block; margin: auto;" />

---

# The same data in 3 ways

The data hold values for four variables (country, year, number of cases, and population), but each table organizes the values differently (source: the World Health Organization Global Tuberculosis Report).

``` r
table1
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <dbl> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362

table2
#> # A tibble: 12 × 4
#> country year type count
#> <chr> <dbl> <chr> <dbl>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666

table3
#> # A tibble: 6 × 3
#> country year rate
#> <chr> <dbl> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
```

---

# Which table is best?

All of these tables represent the same underlying data, but they are not equally easy to use. One of them, `table1`, is easier to work with inside the **Tidyverse** because it is *tidy*.

Three interrelated rules make a dataset **tidy**:

- **Each variable is a column**: every column holds one variable.
- **Each observation is a row**: every row holds one observation.
- **Each value is a cell**: every cell holds exactly one value.

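As a hedged illustration of the third rule: `table3` (shipped with tidyr) packs two values into each `rate` cell. The sketch below shows one possible way to tidy it with `separate()`, the same function used earlier for dates; the object name `table3_tidy` is an assumption made for this example.

```r
library(tidyr)

# table3 ships with tidyr; its `rate` column stores "cases/population" as text
table3_tidy <- table3 |>
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

table3_tidy
```

With `convert = TRUE` the two new columns are converted from text to numbers, so the result has the same four variables as `table1`.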
``` r
table1
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <dbl> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
```

---

## Switching between long and wide data formats

The Tidyverse functions **`pivot_longer()` and `pivot_wider()`**.

The table `table2` is not tidy because the "cases" and "population" values do not sit in separate columns.

``` r
# First rows of table2:
#> # A tibble: 12 × 4
#> country year type count
#> <chr> <dbl> <chr> <dbl>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071

table2_tidy <- table2 |>
*  pivot_wider(names_from = type, values_from = count)

# Show the data
head(table2_tidy)
```

```
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
```

---

## 📌 **Example: employee salaries over the years**

📍 **Wide format**

``` r
library(tidyr)
library(dplyr)

# Create the wide-format data (ime = name, plata = salary)
df_wide <- tibble(
  ime = c("Ana", "Marko", "Jovana"),
  plata_2022 = c(60000, 70000, 50000),
  plata_2023 = c(62000, 73000, 52000)
)
df_wide
```

```
## # A tibble: 3 × 3
## ime plata_2022 plata_2023
## <chr> <dbl> <dbl>
## 1 Ana 60000 62000
## 2 Marko 70000 73000
## 3 Jovana 50000 52000
```

---

## Transforming to long format

Data in **wide format** are often inconvenient for analysis. The **`pivot_longer()`** function transforms them into **long format**, where **several columns are collapsed into two columns**: **the category name and its value**.

``` r
# Convert the wide format to long format
df_long <- df_wide |>                # The starting dataset `df_wide` is passed to the next function
  pivot_longer(
    cols = starts_with("plata"),     # Columns to pivot (all those starting with "plata")
    names_to = "godina",             # Original column names (e.g. "plata_2022") go into the new column "godina" (year)
    values_to = "iznos"              # Values from the original columns go into the new column "iznos" (amount)
  )
df_long
```

```
## # A tibble: 6 × 3
## ime godina iznos
## <chr> <chr> <dbl>
## 1 Ana plata_2022 60000
## 2 Ana plata_2023 62000
## 3 Marko plata_2022 70000
## 4 Marko plata_2023 73000
## 5 Jovana plata_2022 50000
## 6 Jovana plata_2023 52000
```

---

## Further tidying

``` r
df_long <- df_wide |>
  pivot_longer(
    cols = starts_with("plata"),     # Select the columns starting with "plata"
    names_to = "godina",             # Original column names (plata_2022, ...) move into "godina"
    values_to = "iznos"              # Salary values go into "iznos"
  ) |>
  mutate(godina = str_extract(godina, "\\d+"))  # Extract the digits (2022, ...) from the column name
df_long
```

```
## # A tibble: 6 × 3
## ime godina iznos
## <chr> <chr> <dbl>
## 1 Ana 2022 60000
## 2 Ana 2023 62000
## 3 Marko 2022 70000
## 4 Marko 2023 73000
## 5 Jovana 2022 50000
## 6 Jovana 2023 52000
```

---

# Plotting the tidied data

``` r
ggplot(df_long, aes(x = godina, y = iznos, fill = ime)) +
  geom_col(position = "dodge") + # Places the bars side by side for comparison
  labs(title = "Promena plata kroz godine", x = "Godina", y = "Plata") +
  theme_minimal()
```

<img src="assets/chunks/unnamed-chunk-52-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" />

---

class: inverse, center, middle

# Visualization

---

# Visualization with ggplot2

* The **ggplot2** package implements the **g**rammar of **g**raphics (*Grammar of Graphics*).

* It works with `data.frame` or `tibble` objects rather than directly with vectors as **base R** graphics do.

* It clearly separates the **data** from **the way they are displayed**.

<img src="assets/img/ggplot2_logo.jpg" width="30%" style="display: block; margin: auto;" />

---

# The grammar of ggplot2

Graphic element | Description
:---------------- | :-----------------------------
**Data** | The `data.frame` shown in the plot
**Geometries** | The geometric shape that represents the data
 | (e.g. points, boxplot, histogram)
**Aesthetics** | Visual properties of the geometric object
 | (e.g. color, size, shape)

<img src="assets/img/ggplot2_logo.jpg" width="30%" style="display: block; margin: auto;" />

---

# Scatter plots

``` r
*citations |>
*  ggplot() +
  aes(x = nbtweets, y = woscitations) +
  geom_point()
```

* Pass the `data.frame` as the first argument to `ggplot()`.

---

# Scatter plots

``` r
citations |>
  ggplot() +
*  aes(x = nbtweets, y = woscitations) +
  geom_point()
```

* Pass the `data.frame` as the first argument to `ggplot()`.

* The aesthetics map the data onto properties of the plot, here the x and y axes.

---

# Scatter plots

``` r
citations |>
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
*  geom_point()
```

* Pass the `data.frame` as the first argument to `ggplot()`.

* The aesthetics map the data onto properties of the plot, here the x and y axes.

* The data are drawn with a point geometry.

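---

# Scatter plots: the layers step by step

The three steps above can be sketched on a tiny made-up tibble, so that the example runs without the citation data. The tibble `toy` and its values are assumptions for illustration; its column names merely mimic the dataset.

```r
library(ggplot2)
library(tibble)

# Hypothetical data, invented for illustration
toy <- tibble(
  nbtweets     = c(0, 3, 9, 18, 44),
  woscitations = c(2, 5, 4, 12, 20)
)

p_data   <- ggplot(toy)                                    # 1. data only: an empty canvas
p_mapped <- p_data + aes(x = nbtweets, y = woscitations)   # 2. aesthetics: the axes appear
p_points <- p_mapped + geom_point()                        # 3. geometry: the points are drawn

p_points
```

Printing `p_data`, `p_mapped`, and `p_points` in turn shows what each layer contributes to the final plot.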
---

# Scatterplots

``` r
citations |>
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
  geom_point()
```

<img src="assets/chunks/unnamed-chunk-58-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" />

---

# Scatterplots with color

``` r
citations |>
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
* geom_point(color = "red")
```

<img src="assets/chunks/unnamed-chunk-59-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" />

---

# Scatterplots with journal-specific colors

``` r
citations |>
  ggplot() +
* aes(x = nbtweets, y = woscitations, color = journal) +
  geom_point()
```

<img src="assets/chunks/unnamed-chunk-60-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" />

* Placing color inside the aesthetics maps it to the data.

---

# Pick a few journals

``` r
citations_ecology <- citations |>
  mutate(journal = str_to_lower(journal)) |>  # lowercase all journal names
  filter(journal %in% c('journal of animal ecology','journal of applied ecology','ecology'))  # keep three journals
citations_ecology
```

```
## # A tibble: 216 × 12
##    journal impactfactor pubyear Volume Issue Authors   colldate pubdate nbtweets
##    <chr>          <dbl>   <dbl>  <dbl> <chr> <chr>     <chr>    <chr>      <dbl>
##  1 ecology         6.16    2014     95 12    Magliane… 3/19/20… 12/1/2…        1
##  2 ecology         6.16    2014     95 12    Soinen    3/19/20… 12/1/2…        6
##  3 ecology         6.16    2014     95 12    Graham a… 3/19/20… 12/1/2…        1
##  4 ecology         6.16    2014     95 11    White et… 3/19/20… 11/1/2…        9
##  5 ecology         6.16    2014     95 11    Einarson… 3/19/20… 11/1/2…       15
##  6 ecology         6.16    2014     95 11    Haav and… 3/19/20… 11/1/2…        2
##  7 ecology         6.16    2014     95 10    Dodds et… 3/19/20… 10/1/2…        1
##  8 ecology         6.16    2014     95 10    Brown et… 3/19/20… 10/1/2…        1
##  9 ecology         6.16    2014     95 10    Wright e… 3/19/20… 10/1/2…        0
## 10 ecology         6.16    2014     95 9     Ramahlo … 3/19/20… 9/1/20…       27
## # ℹ 206 more rows
## # ℹ 3 more variables: `Number of users` <dbl>, `Twitter reach` <dbl>,
## #   woscitations <dbl>
```

---

# Scatterplots with journal-specific shapes

``` r
citations_ecology |>
  ggplot() +
* aes(x = nbtweets, y = woscitations, shape = journal) +
  geom_point(size=2)
```

<img src="assets/chunks/unnamed-chunk-62-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" />

---

# Scatterplots, lines instead of points

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
* geom_line() +
  scale_x_log10()
```

<img src="assets/chunks/unnamed-chunk-63-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# Scatterplots, lines after sorting first

``` r
citations_ecology |>
* arrange(woscitations) |>  # sort the rows by number of citations first
  # note: geom_line() connects points in order of x; geom_path() would follow the row order
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
  geom_line() +
  scale_x_log10()
```

<img src="assets/chunks/unnamed-chunk-64-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# Adding points

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
  geom_line() +
* geom_point() +
  scale_x_log10()
```

<img src="assets/chunks/unnamed-chunk-65-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# Adding a linear trend

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
  geom_point() +
* geom_smooth(method = "lm") +
  scale_x_log10()
```

<img src="assets/chunks/unnamed-chunk-66-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# Adding a smoother

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets, y = woscitations) +
  geom_point() +
* geom_smooth() +
  scale_x_log10()
```

<img src="assets/chunks/unnamed-chunk-67-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# aes or not aes?

* To link the values of a variable to a graphical property, i.e. to create a mapping, use `aes()`.
* If a graphical property is set independently of the data, `aes()` is not needed.
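The two bullets above can be compared side by side. A minimal sketch, assuming a small made-up tibble `toy` (hypothetical values; only the column names follow the slides):

```r
library(ggplot2)
library(tibble)

# Hypothetical data, invented for illustration only
toy <- tibble(
  nbtweets     = c(1, 4, 9, 2, 7, 12),
  woscitations = c(2, 5, 11, 1, 6, 10),
  journal      = rep(c("ecology", "journal of applied ecology"), each = 3)
)

# Mapping: color depends on a column, so it goes inside aes() and gets a legend
p_mapped <- ggplot(toy) +
  aes(x = nbtweets, y = woscitations, color = journal) +
  geom_point()

# Setting: color is a constant, so it is passed to the geom, outside aes()
p_fixed <- ggplot(toy) +
  aes(x = nbtweets, y = woscitations) +
  geom_point(color = "red")

# Inspect the colors each plot actually uses
unique(layer_data(p_mapped)$colour)  # one color per journal
unique(layer_data(p_fixed)$colour)   # the single constant color
```

A common slip is writing `aes(color = "red")`: that maps the literal string "red" as if it were a data value, so ggplot2 picks a color from its default palette and adds a legend instead of painting the points red.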
<img src="assets/img/ggplot2_logo.jpg" width="30%" style="display: block; margin: auto;" />

---

# Histograms

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets) +
* geom_histogram()
```

<img src="assets/chunks/unnamed-chunk-69-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" />

---

# Histograms, with color

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets) +
* geom_histogram(fill = "orange")
```

<img src="assets/chunks/unnamed-chunk-70-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" />

---

# Histograms, with fill and outline color

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets) +
* geom_histogram(fill = "orange", color = "brown")
```

<img src="assets/chunks/unnamed-chunk-71-1.png" width="400cm" height="400cm" style="display: block; margin: auto;" />

---

# Histograms, title and labels

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets) +
  geom_histogram(fill = "orange", color = "brown") +
* labs(x = "Number of tweets",
*      y = "Count",
*      title = "Histogram of the number of tweets")
```

<img src="assets/chunks/unnamed-chunk-72-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# Histograms, by journal

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets) +
  geom_histogram(fill = "orange", color = "brown") +
  labs(x = "Number of tweets",
       y = "Count",
       title = "Histogram of the number of tweets") +
* facet_wrap(vars(journal))
```

<img src="assets/chunks/unnamed-chunk-73-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" />

---

# Boxplots

``` r
citations_ecology |>
  ggplot() +
  aes(x = "", y = nbtweets) +
* geom_boxplot() +
  scale_y_log10()
```

<img src="assets/chunks/unnamed-chunk-74-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# Boxplots with color

``` r
citations_ecology |>
  ggplot() +
  aes(x = "", y = nbtweets) +
* geom_boxplot(fill = "green") +
  scale_y_log10()
```

<img src="assets/chunks/unnamed-chunk-75-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# Boxplots by journal

``` r
citations_ecology |>
  ggplot() +
* aes(x = journal, y = nbtweets, fill = journal) +
  geom_boxplot() +
  scale_y_log10()
```

<img src="assets/chunks/unnamed-chunk-76-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" />

---

# Removing the x-axis labels

``` r
citations_ecology |>
  ggplot() +
  aes(x = journal, y = nbtweets, fill = journal) +
  geom_boxplot() +
  scale_y_log10() +
* theme(axis.text.x = element_blank()) +
* labs(x = "")
```

<img src="assets/chunks/unnamed-chunk-77-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" />

---

# Boxplots with user-defined colors by journal

``` r
citations_ecology |>
  ggplot() +
  aes(x = journal, y = nbtweets, fill = journal) +
  geom_boxplot() +
  scale_y_log10() +
* scale_fill_manual(
*   values = c("red", "blue", "purple")) +
  theme(axis.text.x = element_blank()) +
  labs(x = "")
```

<img src="assets/chunks/unnamed-chunk-78-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" />

---

# Legend

``` r
citations_ecology |>
  ggplot() +
  aes(x = journal, y = nbtweets, fill = journal) +
  geom_boxplot() +
  scale_y_log10() +
* scale_fill_manual(
    values = c("red", "blue", "purple"),
*   name = "Journal name",
*   labels = c("Ecology", "J Animal Ecology", "J Applied Ecology")) +
  theme(axis.text.x = element_blank()) +
  labs(x = "")
```

<img src="assets/chunks/unnamed-chunk-79-1.png" width="270cm" height="270cm" style="display: block; margin: auto;" />

---

# Bar plots

``` r
citations |>
  count(journal) |>
  ggplot() +
  aes(x = journal, y = n) +
* geom_col()
```

<img src="assets/chunks/unnamed-chunk-80-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# Same, with flipped axes

``` r
citations |>
  count(journal) |>
  ggplot() +
* aes(x = n, y = journal) +
  geom_col()
```

<img src="assets/chunks/unnamed-chunk-81-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# Same, with reordered factor levels and flipped axes

``` r
citations |>
  count(journal) |>
  ggplot() +
* aes(x = n, y = fct_reorder(journal, n)) +
  geom_col()
```

<img src="assets/chunks/unnamed-chunk-82-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# Further cleanup

``` r
citations |>
  count(journal) |>
  ggplot() +
  aes(x = n, y = fct_reorder(journal, n)) +
  geom_col() +
  labs(x = "counts", y = "")
```

<img src="assets/chunks/unnamed-chunk-83-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# More on (tidy) work with factors

* [Be the boss of your factors](https://stat545.com/block029_factors.html)
* [forcats, forcats, did you say forcats?](https://thinkr.fr/forcats-forcats-vous-avez-dit-forcats/)

---

# Density plots

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets, fill = journal) +
* geom_density() +
  scale_x_log10()
```

<img src="assets/chunks/unnamed-chunk-84-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# Density plots, transparency

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets, fill = journal) +
* geom_density(alpha = 0.5) +
  scale_x_log10()
```

<img src="assets/chunks/unnamed-chunk-85-1.png" width="350cm" height="350cm" style="display: block; margin: auto;" />

---

# Density plots, background

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets, fill = journal) +
  geom_density(alpha = 0.5) +
  scale_x_log10() +
* theme_bw()
```

<img src="assets/chunks/unnamed-chunk-86-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" />

---

# Density plots, `classic theme`

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets, fill = journal) +
  geom_density(alpha = 0.5) +
  scale_x_log10() +
* theme_classic()
```

<img src="assets/chunks/unnamed-chunk-87-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" />

---

# Density plots, `dark theme`

``` r
citations_ecology |>
  ggplot() +
  aes(x = nbtweets, fill = journal) +
  geom_density(alpha = 0.5) +
  scale_x_log10() +
* theme_dark()
```

<img src="assets/chunks/unnamed-chunk-88-1.png" width="300cm" height="300cm" style="display: block; margin: auto;" />

---

# More on data visualization with ggplot2

* [Portfolio](https://www.r-graph-gallery.com/portfolio/ggplot2-package/) of ggplot2 graphics
* [Cédric Scherer's portfolio](https://cedricscherer.netlify.app/top/dataviz/) of data visualizations
* [Top](http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html) ggplot2 visualizations
* [Interactive](https://dreamrs.github.io/esquisse/) plot building

<img src="assets/img/ggplot2_logo.jpg" width="30%" style="display: block; margin: auto;" />

---

background-image: url(https://github.com/rstudio/hex-stickers/raw/master/SVG/tidyverse.svg?sanitize=true)
background-size: 550px
background-position: 50% 50%

---

# Going deeper into the tidyverse

* [Learn the tidyverse](https://www.tidyverse.org/learn/): books, workshops and online courses
* Books:
  - [R for Data Science](https://r4ds.had.co.nz/) and [Advanced R](http://adv-r.had.co.nz/)
  - [Introduction to R and the tidyverse](https://juba.github.io/tidyverse/)
  - [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/)
  - [Data Visualization: A Practical Introduction](http://socviz.co/)
* [Tidy Tuesday screencasts](https://www.youtube.com/user/safe4democracy/videos) by D. Robinson, chief data scientist at DataCamp.

---

# [How to switch from base R to tidyverse?](https://www.significantdigits.org/2017/10/switching-from-base-r-to-tidyverse/)

<img src="assets/img/switch_baseR_tidyverse.png" width="800px" style="display: block; margin: auto;" />

---

# The [RStudio Cheat Sheets](https://www.rstudio.com/resources/cheatsheets/)

<img src="assets/img/cheatsheet_dplyr.png" width="600px" style="display: block; margin: auto;" />