1 Introduction

1.1 Dirty dataset

The decathlon data set comes from the FactoMineR package and records athletes' results in two competitions: Decastar and the Olympic Games.

1.2 Source

Department of statistics and computer science, Agrocampus Rennes

2 Setup

2.1 Loading libraries

library(readr)
library(here)
library(janitor)
library(tidyverse)
library(data.table)

2.2 Loading cleaned data

decathlon <- here("clean_data/clean_data.rds") %>%
    read_rds()

3 Data

3.1 Raw

"raw_data/decathlon.rds" %>% 
    here() %>%
    read_rds() %>% 
    data.table()

3.2 Clean data

decathlon %>% 
    data.table()

4 Questions

4.1 Long Jump

Finding the longest long jump in the data

decathlon %>%
    group_by(event) %>% 
    filter(event == "long_jump",
           event_points == max(event_points)) %>% 
    select(-ranking:-overall_competition_points)

4.2 100m sprint

Finding the average 100m time for each competition

decathlon %>%
    filter(event == "100m_sprint") %>%
    group_by(competition) %>%
    summarise(average_100m_time = round(mean(event_points), 2))

4.3 Best all-rounder

Finding the competitor with the highest total points across both competitions

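decathlon %>%
    # overall_competition_points is assumed to repeat on every event row,
    # so keep one row per competitor and competition before summing
    distinct(competitor, competition, overall_competition_points) %>%
    group_by(competitor) %>%
    summarise(total_competition_points = sum(overall_competition_points)) %>%
    filter(total_competition_points == max(total_competition_points)) %>%
    head(3)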
## `summarise()` ungrouping output (override with `.groups` argument)
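
The cleaned data appears to be in long format (one row per competitor, competition and event), so a competitor's overall points would be repeated on each event row. The sketch below uses a small hypothetical table (`toy`, with made-up competitors and scores) to show how de-duplicating with `distinct()` before summing counts each competition total only once. The column names mirror those used above; the data itself is an assumption for illustration only.

```r
library(dplyr)
library(tibble)

# Hypothetical long-format data: the competition total is repeated
# on every event row for the same competitor and competition.
toy <- tribble(
  ~competitor, ~competition, ~event,      ~overall_competition_points,
  "A",         "Decastar",   "long_jump", 8000,
  "A",         "Decastar",   "shot_put",  8000,
  "A",         "OlympicG",   "long_jump", 8500,
  "A",         "OlympicG",   "shot_put",  8500,
  "B",         "Decastar",   "long_jump", 8200,
  "B",         "Decastar",   "shot_put",  8200
)

totals <- toy %>%
  # keep one row per competitor and competition
  distinct(competitor, competition, overall_competition_points) %>%
  group_by(competitor) %>%
  summarise(total_competition_points = sum(overall_competition_points),
            .groups = "drop")

totals
```

Without the `distinct()` step, each total in this toy table would be counted twice, once per event row.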

4.4 Shot-put

Finding the three best shot-put results in each competition

decathlon %>% 
    select(-overall_competition_points, -ranking) %>% 
    filter(event == "shot_put") %>%
    group_by(competition) %>%
    top_n(3, event_points) %>% 
    arrange(desc(event_points))
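
Since dplyr 1.0.0, `top_n()` is superseded by `slice_max()`, which does the same per-group selection with a more explicit interface. A minimal sketch on hypothetical shot-put data (`toy_shot_put`, with made-up competitors and distances) might look like this:

```r
library(dplyr)
library(tibble)

# Hypothetical shot-put results, four competitors per competition.
toy_shot_put <- tribble(
  ~competitor, ~competition, ~event_points,
  "A",         "Decastar",   14.8,
  "B",         "Decastar",   15.2,
  "C",         "Decastar",   13.9,
  "D",         "Decastar",   16.0,
  "E",         "OlympicG",   15.9,
  "F",         "OlympicG",   14.1,
  "G",         "OlympicG",   16.4,
  "H",         "OlympicG",   15.0
)

top_three <- toy_shot_put %>%
  group_by(competition) %>%
  # keep the three largest event_points within each competition
  slice_max(event_points, n = 3) %>%
  ungroup()

top_three
```

Like `top_n()`, `slice_max()` keeps ties by default, so a group can return more than three rows if results are tied.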

4.5 400m sprint

Calculating the average overall points for competitors who ran the 400m in under 50 seconds versus those who took 50 seconds or longer

decathlon %>%
    filter(event == "400m_sprint") %>%
    group_by(event_points < 50) %>%
    summarise(average_points = round(mean(overall_competition_points))) %>%
    arrange(desc(average_points))
## `summarise()` ungrouping output (override with `.groups` argument)
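
Grouping by a bare expression yields an output column literally named after the expression. As a sketch, naming the grouping variable inside `group_by()` gives a more readable result; the table `toy_400m` and the column name `under_50_seconds` are hypothetical and used only for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical 400m times (seconds) and overall competition points.
toy_400m <- tribble(
  ~competitor, ~event_points, ~overall_competition_points,
  "A",         48.9,          8300,
  "B",         49.5,          8100,
  "C",         50.0,          7900,
  "D",         51.2,          7700
)

by_speed <- toy_400m %>%
  # TRUE for times under 50 seconds, FALSE for 50 seconds or longer
  group_by(under_50_seconds = event_points < 50) %>%
  summarise(average_points = round(mean(overall_competition_points)),
            .groups = "drop") %>%
  arrange(desc(average_points))

by_speed
```

Passing `.groups = "drop"` also silences the `summarise()` ungrouping message.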