Descriptive statistics

class: center, middle, inverse, title-slide

# Descriptive statistics
## ⚔<br/>with xaringan
### Goran Kardum
### Department of Psychology
### 2021-10-25

---

```
## Loading required namespace: bibtex
```

# Descriptive statistics

The aim of calculating descriptive statistics is to summarize the sample you have collected (Cox, 2017)

- The main characteristics that are of interest when dealing with continuous data are the shape (distribution), the location, statistics (central tendency and variability) and the spread (dispersion) of the data.

- Appropriate presenting the results of descriptive statistics (table, figures)

- Descriptive statistics aim to summarize a sample, rather than use the data to learn about the population

- Descriptive statistics are not developed on the basis of probability theory

---
# Descriptive statistics

- Choosing which summary statistics are appropriate depend on the type of variable being examined.

- Different statistics should be used for interval/ratio, ordinal, and nominal data.

---
# Mean

- In statistics, the term average refers to any of the measures of central tendency!

- Psychologists are often interested in looking at how data points tend to group around a central value.

- The arithmetic mean of a set of observed data is defined as being equal to the sum of the numerical values of each and every observation, divided by the total number of observations.

- It is greatly influenced by **outliers** (values that are very much larger or smaller than most of the values)

---
# Mean (Navarro, 2015)

- If your data are nominal scale, you probably shouldn’t be using either the mean or the median.

- If your data are ordinal scale, you’re more likely to want to use the median than the mean.

- For interval and ratio scale data, either one is generally acceptable

---
## Mean

- It is not appropriate for skewed distributions. In probability theory and statistics, **skewness** is a measure of the asymmetry of the distribution.

- It is appropriate for larger number of observations (N>20 or N>30)

- Arithmetic mean for ungrouped data and arithmetic mean for grouped data

---
# Formula for mean

- Mean x̄ = Sum of all observations / Number of observations

$$ \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i  $$

---
# How mean is the mean?

- The most common form of the mean that is used in psychology is the arithmetic mean.

- In a linear world, averages make perfect sense.

- The uncritical or unreflective use of the mean in much psychological research makes us problematically blind to variation and distribution amongst the data we collect (Speelman, McGann, 2013).

- Alternative model: Psychology are used to explore methods of research less mean-dependent and suggest that a critical assessment of the assumptions underlying its use in research play a more explicit role in the process of study design and review.

---
## Assumptions underlying the use of the mean in psychology research (Speelman, McGann, 2013)

- Assumption 1: There is a **True Value** that we are Trying to Approximate When we Measure Humans on Some Dimension

- Assumption 2: Averaging Helps us to Eliminate the Noise in Our Measures to See the True Value

- Assumption 3: Any Inability to Use the Mean as a Reliable Measure of a Stable Characteristic is a Product of Weaknesses in Methodology or Calculation (i.e., it does not Represent a Failure in the Initial Assumption that a True Value Exists)

- Assumption 4: The Noise in Our Measurements Represents the Effects of Variables Unrelated to the One Being Measured

---
# Median

- the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution

- The Median is the **middle** of a sorted list of numbers.

- it is not dependent/affected of extreme values

- appropriate for lower number of observation

- appropriate when we could not use arithmetic mean

---
# Median - example

```r
sample_x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
median(sample_x)
```

```
## [1] 5
```

```r
sample_xy <- c(sample_x, 10)
sample_xy
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

```r
median(sample_xy)
```

```
## [1] 5.5
```

---
# Mode

The mode of a sample is very simple: it is the value that occurs most frequently.

- appropriate for variable on nominal scale

- in some cases, examples and real life situation we would like to calculate on ordinal, interval and ratio scale

- usage on frequency - in tables

---
# Comparison of means

![Comparison of means](pic/dc2.png)
---
## Practical usage in table

```r
library(psych)
library(kableExtra)
good_tbl <- describe(sat.act)
kable(good_tbl)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> vars </th>
   <th style="text-align:right;"> n </th>
   <th style="text-align:right;"> mean </th>
   <th style="text-align:right;"> sd </th>
   <th style="text-align:right;"> median </th>
   <th style="text-align:right;"> trimmed </th>
   <th style="text-align:right;"> mad </th>
   <th style="text-align:right;"> min </th>
   <th style="text-align:right;"> max </th>
   <th style="text-align:right;"> range </th>
   <th style="text-align:right;"> skew </th>
   <th style="text-align:right;"> kurtosis </th>
   <th style="text-align:right;"> se </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> gender </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 700 </td>
   <td style="text-align:right;"> 1.647143 </td>
   <td style="text-align:right;"> 0.4782004 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 1.683929 </td>
   <td style="text-align:right;"> 0.0000 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> -0.6145233 </td>
   <td style="text-align:right;"> -1.6246760 </td>
   <td style="text-align:right;"> 0.0180743 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> education </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 700 </td>
   <td style="text-align:right;"> 3.164286 </td>
   <td style="text-align:right;"> 1.4253515 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 3.307143 </td>
   <td style="text-align:right;"> 1.4826 </td>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> -0.6807999 </td>
   <td style="text-align:right;"> -0.0748912 </td>
   <td style="text-align:right;"> 0.0538732 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> age </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 700 </td>
   <td style="text-align:right;"> 25.594286 </td>
   <td style="text-align:right;"> 9.4986466 </td>
   <td style="text-align:right;"> 22 </td>
   <td style="text-align:right;"> 23.862500 </td>
   <td style="text-align:right;"> 5.9304 </td>
   <td style="text-align:right;"> 13 </td>
   <td style="text-align:right;"> 65 </td>
   <td style="text-align:right;"> 52 </td>
   <td style="text-align:right;"> 1.6430573 </td>
   <td style="text-align:right;"> 2.4243053 </td>
   <td style="text-align:right;"> 0.3590151 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ACT </td>
   <td style="text-align:right;"> 4 </td>
   <td style="text-align:right;"> 700 </td>
   <td style="text-align:right;"> 28.547143 </td>
   <td style="text-align:right;"> 4.8235599 </td>
   <td style="text-align:right;"> 29 </td>
   <td style="text-align:right;"> 28.842857 </td>
   <td style="text-align:right;"> 4.4478 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 36 </td>
   <td style="text-align:right;"> 33 </td>
   <td style="text-align:right;"> -0.6564026 </td>
   <td style="text-align:right;"> 0.5349691 </td>
   <td style="text-align:right;"> 0.1823134 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> SATV </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 700 </td>
   <td style="text-align:right;"> 612.234286 </td>
   <td style="text-align:right;"> 112.9025659 </td>
   <td style="text-align:right;"> 620 </td>
   <td style="text-align:right;"> 619.453571 </td>
   <td style="text-align:right;"> 118.6080 </td>
   <td style="text-align:right;"> 200 </td>
   <td style="text-align:right;"> 800 </td>
   <td style="text-align:right;"> 600 </td>
   <td style="text-align:right;"> -0.6438111 </td>
   <td style="text-align:right;"> 0.3251946 </td>
   <td style="text-align:right;"> 4.2673159 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> SATQ </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 687 </td>
   <td style="text-align:right;"> 610.216885 </td>
   <td style="text-align:right;"> 115.6392972 </td>
   <td style="text-align:right;"> 620 </td>
   <td style="text-align:right;"> 617.254083 </td>
   <td style="text-align:right;"> 118.6080 </td>
   <td style="text-align:right;"> 200 </td>
   <td style="text-align:right;"> 800 </td>
   <td style="text-align:right;"> 600 </td>
   <td style="text-align:right;"> -0.5929212 </td>
   <td style="text-align:right;"> -0.0177603 </td>
   <td style="text-align:right;"> 4.4119144 </td>
  </tr>
</tbody>
</table>

---

- draw example of hist function

---
# Standard deviation

- The standard deviation is a measure of variation which is commonly used with interval/ratio data

It’s a measurement of how close the observations in the data set are to the mean.

- If the deviations from the mean are generally large, the standard deviation will be large.

- Population mean is often represented with the letter mu $$ {\mu} $$, and the population standard deviation is represented with the letter sigma $$ {\sigma} $$

- The value of SD = 7.5. Is it small or large? Depends of mean value.

- Standard deviation may not be appropriate for skewed data (This is the reason among the others that we must first draw the figure - distribution of data.

-- 
## Formula for SD

`$$\sigma = \sqrt{\frac{\sum\limits_{i=1}^{n} \left(x_{i} - \bar{x}\right)^{2}} {n-1}}$$`

---
# Example

Create variable with some data

```r
data <- c(5.56, 3.45, 4.55, 6.2, 4.8, 6.8, 7.2, 8.3, 3.2, 7.7, 6.9, 4.1, 8.3, 7.5, 7.7, 4.3, 3.8, 2.8, 3.2, 8.9)
mean(data)
```

```
## [1] 5.763
```

```r
median(data)
```

```
## [1] 5.88
```

```r
library(psych)
datas_1<- rnorm(20, mean = 5.7, sd = 2)
SD(data)
```

```
## [1] 2.004275
```

```r
median(datas_1)
```

```
## [1] 5.744962
```

---
# References

```
## You haven't cited any references in this bibliography yet.
```

NULL

---
class: center, middle

# Thanks!

Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).

The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](https://yihui.org/knitr/), and [R Markdown](https://rmarkdown.rstudio.com).