Data collection, data types

class: center, middle, inverse, title-slide

# Data collection, data types
## ⚔<br/>with xaringan
### Goran Kardum
### Department of Psychology
### 2021-10-18

---

# Important terms, definitions... from the last lecture

- Data collection

- Quantitative variable

- Qualitative variable

- Discrete vs continuous

- Scales of measurement (a concept for distinguishing between different types of variables)

---
## Scales of measurement

- Nominal scale

- Ordinal scale

- Interval scale

- Ratio scale
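
One possible way to represent the four scales in R is sketched below. The variable names and values are made up for illustration, and the mapping (factor for nominal, ordered factor for ordinal, numeric for interval and ratio) is a common convention rather than a fixed rule.

```r
# Hypothetical example data, one vector per scale of measurement
eye_color <- factor(c("blue", "brown", "green", "brown"))        # nominal
education <- ordered(c("primary", "secondary", "tertiary", "secondary"),
                     levels = c("primary", "secondary", "tertiary"))  # ordinal
temp_c      <- c(21.5, 19.0, 23.2, 20.1)   # interval (no true zero)
reaction_ms <- c(350, 420, 390, 510)       # ratio (true zero)

is.ordered(eye_color)   # FALSE: the categories have no order
is.ordered(education)   # TRUE: the categories are ranked
max(education)          # works only because the factor is ordered
mean(reaction_ms)       # arithmetic is meaningful for numeric scales
```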

---
# Types of variables according to models in research

- Do not confuse these with dependent- and independent-measures research designs

- Independent variable

- Dependent variable

---
# R structure

<div id="htmlwidget-14cf9eab7f2682f7665e" style="width:504px;height:504px;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-14cf9eab7f2682f7665e">{"x":{"diagram":"digraph flowchart {\n       # node definitions with substituted label text\n      node [fontname = Helvetica, shape = rectangle]        \n      tab1 [label = \"R structure\"]\n      tab2 [label = \"Data type\"]\n      tab3 [label = \"Data structure\"]\n      \n      # edge definitions with the node IDs\n      tab1 -> tab2;\n      tab1 -> tab3;\n      }","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

---
# R data types

R atomic data types:

character (characters and strings; "a", "name"...)

numeric (real or decimal; 2, 3, 7, 8.15)

integer (explicitly integer; 8L, 148L)

logical (Boolean values; TRUE/FALSE)

complex (real + imaginary part; 5+7i)

raw (data stored as raw bytes)
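
As a quick check of the list above, the sketch below stores one value of each type in a list (the name `examples` is only illustrative) and asks R for the storage type. Note that `typeof()` reports numeric values as `"double"`.

```r
# One example value per atomic data type
examples <- list(
  character = "name",
  numeric   = 8.15,
  integer   = 8L,
  logical   = TRUE,
  complex   = 5+7i,
  raw       = as.raw(255)
)

# typeof() reports the low-level storage type of each value
sapply(examples, typeof)
```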

---
# R data structures

R objects:

atomic vector

list

matrix

array

data frame

factor

---
# R functions

The R language has several important functions for inspecting objects or vectors:

class() - what kind of object is it (high-level)?

typeof() - what is the object’s data type (low-level)?

length() - how long is it? What about two-dimensional objects?

attributes() - does it have any metadata?
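
A minimal sketch of the four functions applied to a two-dimensional object. The matrix `m` is a made-up example; `dim()` is added here because `length()` alone does not report rows and columns.

```r
m <- matrix(1:6, nrow = 2)   # a small two-dimensional object

class(m)       # high-level kind of object
typeof(m)      # low-level storage type of the elements
length(m)      # total number of elements, not rows or columns
dim(m)         # rows and columns of a two-dimensional object
attributes(m)  # metadata: here only the dim attribute
```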

---
# Examples

```r
a <- "abcdefgh"
typeof(a)
```

```
## [1] "character"
```

```r
i <- 1:20
typeof(i)
```

```
## [1] "integer"
```

```r
i
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
```

```r
x <- c("a", "b", "c")
typeof(x)
```

```
## [1] "character"
```

---
# Vectors

Vectors are the most important family of data types in base R.

<div id="htmlwidget-3c6a143f0e3f5ab66b01" style="width:504px;height:504px;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-3c6a143f0e3f5ab66b01">{"x":{"diagram":"digraph flowchart {\n       # node definitions with substituted label text\n      node [fontname = Helvetica, shape = rectangle]        \n      tab1 [label = \"Vector\"]\n      tab2 [label = \"Atomic\"]\n      tab3 [label = \"List\"]\n       \n      # edge definitions with the node IDs\n      tab1 -> tab2;\n      tab1 -> tab3;\n      }","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

---
# Vectors

- Atomic vectors: all elements must have the same type

- Lists: elements can have different types
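
The difference can be seen by giving the same mixed input to `c()` and to `list()`. This is a small illustrative sketch with arbitrary values.

```r
# An atomic vector coerces mixed input to one common type
v <- c(1, "a", TRUE)
typeof(v)   # "character"
v           # all three elements are now strings

# A list keeps the original type of every element
l <- list(1, "a", TRUE)
sapply(l, typeof)   # "double" "character" "logical"
```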

---
## Atomic vectors

There are four primary types of atomic vectors:

- logical

- integer

- double

- character (strings)

Integer and double vectors are together known as numeric vectors.
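
A short sketch of how integer and double relate to numeric. The vector names are illustrative only.

```r
int_vec <- 1:3           # integer
dbl_vec <- c(1, 2.5, 3)  # double

typeof(int_vec)      # "integer"
typeof(dbl_vec)      # "double"

# Both count as numeric
is.numeric(int_vec)  # TRUE
is.numeric(dbl_vec)  # TRUE
is.numeric("1")      # FALSE: a string is not numeric
```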

---
## Character, string

```r
months <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")

months
```

```
##  [1] "January"   "February"  "March"     "April"     "May"       "June"     
##  [7] "July"      "August"    "September" "October"   "November"  "December"
```

- An element can be accessed by its position (index)

```r
months[7]
```

```
## [1] "July"
```

- The number of characters in each string of the vector

```r
nchar(x = months)
```

```
##  [1] 7 8 5 5 3 4 4 6 9 7 8 8
```

---
## Names in vectors

There are three ways to name the elements of a vector:

```r
# 1. when we create the vector
i <- c(a = 1, b = 2, c = 3, d = 4)

# 2. by assigning a character vector to names()
i <- 1:4
names(i) <- c("a", "b", "c", "d")

# 3. with the setNames() function
i <- setNames(1:4, c("a", "b", "c", "d"))
```
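
Once a vector has names, they can be used for subsetting. A minimal sketch using the same `i` as above:

```r
i <- setNames(1:4, c("a", "b", "c", "d"))

names(i)          # the names attribute
i["b"]            # select one element by name
i[c("a", "d")]    # select several elements by name
unname(i)         # drop the names again
```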

---
## Matrix and array

Adding a `dim` attribute to an atomic vector turns it into a 2-dimensional **matrix** or a multi-dimensional **array**.

```r
ex_matrix <- matrix(1:8, nrow = 2, ncol = 4)
ex_matrix
```

```
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
```
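
The role of the `dim` attribute can be sketched by setting it directly on a plain vector. The name `x` is only illustrative; the result has the same layout as `ex_matrix`.

```r
x <- 1:8
dim(x) <- c(2, 4)   # 2 rows, 4 columns; values fill column by column
x

# Elements are addressed as [row, column]
x[2, 3]   # 6
dim(x)
```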

---
## Array

```r
ex_array <- array(1:16, c(2, 4, 2))
ex_array
```

```
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]    9   11   13   15
## [2,]   10   12   14   16
```
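
An array is indexed with one position per dimension. A small sketch on the same `ex_array`:

```r
ex_array <- array(1:16, c(2, 4, 2))

dim(ex_array)         # 2 rows, 4 columns, 2 layers
ex_array[2, 3, 2]     # row 2, column 3, layer 2: 14
ex_array[, , 1]       # the whole first layer, a 2 x 4 matrix
```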

---
## Factors

A factor is a vector that can contain only predefined values (Wickham, 2019).

- it is used to store categorical data

```r
s <- factor(c("a","d","f","g"))
s
```

```
## [1] a d f g
## Levels: a d f g
```

- ordered factors are a minor variation of factors: the levels have a meaningful order

```r
# ordered factors
sch_grade <- ordered(c("d", "d", "b", "c"), levels = c("d", "c", "b"))
sch_grade
```

```
## [1] d d b c
## Levels: d < c < b
```
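
A brief sketch of what is stored inside a factor, reusing `sch_grade` from above. The comparison in the second-to-last line relies on the levels being ordered.

```r
sch_grade <- ordered(c("d", "d", "b", "c"), levels = c("d", "c", "b"))

typeof(sch_grade)       # "integer": a factor is stored as integer codes
levels(sch_grade)       # the predefined values
as.integer(sch_grade)   # 1 1 3 2: positions in the levels
sch_grade < "b"         # comparison works for ordered factors
table(sch_grade)        # counts per level
```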

---
## Lists

Lists are more complex than atomic vectors because each element can be of any type. They can store character strings, numbers, logical values...

```r
list1 <- list(1:3, "x", c(TRUE, TRUE, FALSE), c(7.8, 8.9))
list1
```

```
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] "x"
## 
## [[3]]
## [1]  TRUE  TRUE FALSE
## 
## [[4]]
## [1] 7.8 8.9
```
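
Elements of a list can be reached with single or double brackets. The sketch below uses the same `list1` to show the difference.

```r
list1 <- list(1:3, "x", c(TRUE, TRUE, FALSE), c(7.8, 8.9))

list1[1]        # single brackets return a shorter list
list1[[1]]      # double brackets return the element itself
list1[[4]][2]   # second value of the fourth element: 8.9
sapply(list1, typeof)   # each element keeps its own type
```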

Data frames are a specific type of list.

---
## Data frames

A data frame is a collection of variables and a special type of list.

Until now, we had only separate variables in the workspace (the Environment pane in RStudio).

For example...

```r
gender <- factor(c(1, 2, 2, 2, 1))
levels(gender) <- c("male", "female")

# group is also stored as a factor so that the level labels are used
group <- factor(c(1, 2, 1, 2, 1))
levels(group) <- c("control", "experimental")

age <- c(21, 24, 23, 27, 31)
```

These variables exist only as separate variables in the R workspace... until...

---
## Data frames

Now we combine the variables into a **data.frame**

```r
df_example <- data.frame(gender, group, age)
df_example
```

```
##   gender        group age
## 1   male      control  21
## 2 female experimental  24
## 3 female      control  23
## 4 female experimental  27
## 5   male      control  31
```
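
That a data frame is a list of equal-length columns can be checked directly. The sketch below uses a small made-up data frame (`df`, `id`, and `score` are hypothetical names).

```r
# Hypothetical small data frame
df <- data.frame(id = 1:3, score = c(4.5, 3.0, 5.0))

is.list(df)   # TRUE
typeof(df)    # "list"
length(df)    # 2: the number of columns
nrow(df)      # 3
df$score      # a column is an ordinary vector
str(df)       # compact overview of the structure
```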

---

## Tibbles

The tibble is part of the **tidyverse** collection of packages and is another special type of list.

A tibble is a modern reimagining of the data frame (Wickham, 2021).

There are two main differences in the usage of a **tibble** vs. a classic **data.frame**: printing and subsetting (Wickham & Grolemund, 2017).

```r
library(tidyverse)
```

```r
# create an example data frame
data_df <- data.frame(a = 1:3, b = letters[1:3], c = Sys.Date() - 1:3)
data_df
```

```
##   a b          c
## 1 1 a 2021-10-17
## 2 2 b 2021-10-16
## 3 3 c 2021-10-15
```

```r
# convert the data frame to a tibble
as_tibble(data_df)
```

```
## # A tibble: 3 × 3
##       a b     c         
##   <int> <chr> <date>    
## 1     1 a     2021-10-17
## 2     2 b     2021-10-16
## 3     3 c     2021-10-15
```
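
The printing difference is visible above. The subsetting difference can be sketched as follows, assuming the **tibble** package is installed; `data_tbl` is an illustrative name.

```r
library(tibble)

data_df  <- data.frame(a = 1:3, b = letters[1:3])
data_tbl <- as_tibble(data_df)

# Selecting a single column with [ ]
class(data_df[, "a"])    # a data frame drops to a plain vector
class(data_tbl[, "a"])   # a tibble stays a tibble

# [[ ]] extracts the column itself in both cases
data_tbl[["a"]]
```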

---
## Tibbles

![Data entry](pic/tidy.png)
---
## More objects in R

Formulas

Functions

- generic functions (e.g. summary(), plot())
  
--

- functions from a package (library)
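
A minimal sketch of both kinds of object. The formula `y ~ x` and the vectors `num` and `fac` are made up for illustration; the point is that the generic `summary()` behaves differently depending on the class of its argument.

```r
# A formula is an object of its own class
f <- y ~ x
class(f)   # "formula"

# summary() is generic: the result depends on the class of its argument
num <- c(21, 24, 23, 27, 31)
fac <- factor(c("male", "female", "female", "female", "male"))

summary(num)   # minimum, quartiles, mean, maximum
summary(fac)   # counts per level
```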

---
## Introduction to descriptive statistics

- distribution of the data (vectors, factors)

- normal distribution (Gaussian distribution)

- sampling (sample size - N)
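
As a first look at these ideas, the sketch below draws a sample of size N from a normal distribution. The seed, mean, and standard deviation are arbitrary values chosen for illustration.

```r
set.seed(1)    # arbitrary seed, only for reproducibility
N <- 100       # sample size

# Draw N values from a normal distribution (hypothetical mean and sd)
x <- rnorm(N, mean = 100, sd = 15)

length(x)    # equals N
mean(x)      # close to, but not exactly, 100
sd(x)        # close to, but not exactly, 15
summary(x)   # five-number summary plus the mean
```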

---
class: center, middle

# Thanks!

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).

The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr/), and [R Markdown](https://rmarkdown.rstudio.com).