---
title: "A Zoom Into Data-Related Jobs"
output: 
  html_notebook:
    toc: true
    toc_depth: 6
    toc_float: true
    css: styles.css
---

**Name**: Samaher Brahem

**Course**: Coding for DS & DM - R Module

**Professor**: Giancarlo Manzi

### **1. Introduction**

This study analyzes salary differences among Data Scientists and other data-related roles across countries. Using a dataset that covers several employment types, experience levels, and company sizes, it applies statistical summaries and visualizations in R to identify salary disparities and trends, and to offer insight into how compensation for data-related roles varies on a global scale.

I used [**GitHub**](https://github.com/SamaherUNIMI/DSE_R_Project/tree/main) as a tool for version control.

### **2. Data Used**

The *Salary of Data Scientists* dataset is used for this analysis. It is available on [Kaggle](https://www.kaggle.com/datasets/piyushborhade/salary-of-data-scientists), published by the user Piyush Borhade.

#### **2.1 Features of the dataset**:

-   `work_year`: The year the salary was paid.

-   `experience_level`: The level of work experience during the year.

-   `employment_type`: The type of employment for the role.

-   `job_title`: The role held during the year.

-   `salary`: The total gross amount of salary paid.

-   `salary_currency`: The currency of the salary paid, as an ISO 4217 currency code.

-   `salary_in_usd`: The salary converted to USD.

-   `employee_residence`: The country of primary residence of the employee during the employment year, as an ISO 3166 country code.

-   `remote_ratio`: The share of work done remotely, as a percentage.

-   `company_location`: The country of headquarters of the employer or contracting branch.

-   `company_size`: The average number of people who worked for the company during the year.

#### **2.2 Limits of the dataset**:

##### **2.2.1 No Shared Sources**:

The information about how the data was collected and who collected it isn't clear. This lack of clarity raises doubts about how dependable the data is.

##### **2.2.2 Geographical Bias**:

The dataset primarily comprises data from the United States, resulting in an imbalance where other countries are underrepresented. This skew might limit the depth and accuracy of analyses, potentially leading to an overemphasis on American-centric insights and a lack of comprehensive understanding regarding other regions or global perspectives.

### **3. Data Processing**

#### **3.1 Installing packages and opening libraries:**

For the purpose of this study, I installed and loaded the following libraries:

```{r}
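library(tidyverse)      # Core packages, including ggplot2 and dplyr
library(ggplot2)        # Plotting (already attached by tidyverse)
library(dplyr)          # Data manipulation (already attached by tidyverse)
library(sf)             # Simple features for spatial data
library(rnaturalearth)  # World map data
library(scales)         # Axis label formatting
library(viridis)        # Color palettes
library(wordcloud)      # Word clouds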
```

#### **3.2 Importing the dataset:**

```{r}
# Importing the dataset from a local path in my computer
data <- read.csv2("C:/Users/Samaher/Documents/DSE_R_Project/data.csv")

# Converting it into a tibble (as_tibble() replaces the deprecated as.tibble())
data <- as_tibble(data)

```
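Note that `read.csv2()` assumes a semicolon as the field separator and a comma as the decimal mark, whereas `read.csv()` assumes a comma and a period. A minimal sketch on an inline string, where the `toy_` names and values are made up for illustration and are not taken from the dataset:

```{r}
# Hypothetical two-row file using ";" as separator and "," as decimal mark
toy_text <- "job_title;salary_in_usd\nData Scientist;85000,5\nData Engineer;120000"

# read.csv2() defaults: sep = ";" and dec = ","
toy_import <- read.csv2(text = toy_text)

# salary_in_usd is parsed as a numeric column
str(toy_import)
```

If a file is comma-separated instead, the same call would typically return a single wide column, which is a quick way to spot a separator mismatch.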

#### **3.3 Previewing the dataset:**

```{r}
# Understanding the structure of the dataset
str(data)
```

```{r}
# Displaying a sample of the data
head(data)
```

Now, let's check the dimensions of the dataset.

```{r}
# Checking the dimensions of the dataset
dim(data)
```

The dataset has 3753 rows (observations) and 12 columns (variables).

#### **3.4 Cleaning and formatting the dataset:**

##### **3.4.1 Experience Level:**

There are 4 experience levels in this dataset:

-   EN = Entry Level

-   MI = Mid-Level

-   SE = Senior

-   EX = Executive

```{r}
# Checking how many experience levels are included in this dataset
table(data$experience_level)
```

Let's make those levels understandable by anyone who reads them!

```{r}
# Replacing abbreviations with full descriptions in the 'experience_level' column
data$experience_level <- ifelse(data$experience_level == "EN", "entry level",
                           ifelse(data$experience_level == "EX", "executive",
                           ifelse(data$experience_level == "MI", "mid-level",
                           ifelse(data$experience_level == "SE", "senior", data$experience_level))))

# Checking the updated 'experience_level' column
table(data$experience_level)
```
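As an alternative to nested `ifelse()` calls, the same replacement can be sketched with a named-vector lookup. This is only an illustration on a made-up vector; `toy_exp_codes` and `toy_exp_labels` are hypothetical names, not objects used elsewhere in this notebook.

```{r}
# Hypothetical vector of abbreviated experience levels
toy_exp_codes <- c("EN", "SE", "MI", "EX", "SE")

# Lookup table: names are the abbreviations, values are the full descriptions
toy_exp_labels <- c(EN = "entry level", MI = "mid-level", SE = "senior", EX = "executive")

# Indexing by name replaces each code; unname() drops the leftover names
toy_exp_full <- unname(toy_exp_labels[toy_exp_codes])
toy_exp_full
```

One difference from the `ifelse()` version: a code that is missing from the lookup table becomes `NA` here instead of being kept as-is.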

##### **3.4.2 Employment Type:**

There are 4 employment types in this dataset:

-   FT = full-time

-   PT = part-time

-   CT = contract

-   FL = freelance

```{r}
table(data$employment_type)
```

I will replace those abbreviations with the full word for better readability.

```{r}
# Replacing abbreviations with full descriptions in the 'employment_type' column
data$employment_type <- ifelse(data$employment_type == "CT", "contract",
                               ifelse(data$employment_type == "FL", "freelance",
                                      ifelse(data$employment_type == "FT", "full-time",
                                             ifelse(data$employment_type == "PT", "part-time", data$employment_type))))

# Checking the updated 'employment_type' column
table(data$employment_type)
```

##### **3.4.3 Salaries:**

When it comes to salaries, there are 3 columns: `salary`, `salary_currency`, and `salary_in_usd`.

The salaries stored in the `salary` column are recorded in local currencies, which are listed in the `salary_currency` column. All of those salaries were then converted to USD and stored in `salary_in_usd`.

```{r}
table(data$salary_currency)
```

For the purpose of this study, it is more convenient to have the salaries in EUR. I will therefore convert the values in `salary_in_usd` to EUR and rename the column to `salary_in_eur`. I will also remove the `salary` and `salary_currency` columns since they are no longer needed.

The exchange rate for the conversion from USD to EUR is as follows:

1 United States Dollar equals 0.92 Euro

P.S. This rate was retrieved on Dec 16 at 09:12 UTC.

```{r}
# Given exchange rate
exchange_rate <- 0.92

# Converting USD salaries to EUR
data$salary_in_usd <- data$salary_in_usd * exchange_rate

# Changing column name to better reflect this change
names(data)[names(data) == "salary_in_usd"] <- "salary_in_eur"
```
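To make the conversion concrete, here is a small worked sketch on made-up salaries. The `toy_` objects are hypothetical; only the 0.92 rate comes from the text above.

```{r}
# Hypothetical USD salaries
toy_salaries <- data.frame(salary_in_usd = c(50000, 100000, 135000))
toy_rate <- 0.92

# Adding the EUR column, then dropping the USD column
toy_salaries <- transform(toy_salaries, salary_in_eur = salary_in_usd * toy_rate)
toy_salaries$salary_in_usd <- NULL

# The three salaries become roughly 46,000, 92,000, and 124,200 EUR
toy_salaries
```

Creating a new column before dropping the old one avoids a moment where a column named `salary_in_usd` temporarily holds EUR values.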

```{r}
# Removing the salary and salary_currency columns
data <- subset(data, select = -c(salary, salary_currency))
```

##### **3.4.4 Countries:**

###### **3.4.4.1 Company Location:**

The `company_location` column records the countries where the companies are established, written as two-letter country codes. While some of these codes are widely known, others are less familiar.

```{r}
table(data$company_location)
```

To ensure clarity, I decided to replace the abbreviations with the full country names.

```{r}
# Defining a mapping of country codes to country names for company locations
cl_country_mapping <- c(
  AE = "United Arab Emirates",
  AL = "Albania",
  AM = "Armenia",
  AR = "Argentina",
  AS = "American Samoa",
  AT = "Austria",
  AU = "Australia",
  BA = "Bosnia and Herzegovina",
  BE = "Belgium",
  BO = "Bolivia",
  BR = "Brazil",
  BS = "Bahamas",
  CA = "Canada",
  CF = "Central African Republic",
  CH = "Switzerland",
  CL = "Chile",
  CN = "China",
  CO = "Colombia",
  CR = "Costa Rica",
  CZ = "Czech Republic",
  DE = "Germany",
  DK = "Denmark",
  DZ = "Algeria",
  EE = "Estonia",
  EG = "Egypt",
  ES = "Spain",
  FI = "Finland",
  FR = "France",
  GB = "United Kingdom",
  GH = "Ghana",
  GR = "Greece",
  HK = "Hong Kong",
  HN = "Honduras",
  HR = "Croatia",
  HU = "Hungary",
  ID = "Indonesia",
  IE = "Ireland",
  IN = "India",
  IQ = "Iraq",
  IR = "Iran",
  IT = "Italy",
  JP = "Japan",
  KE = "Kenya",
  LT = "Lithuania",
  LU = "Luxembourg",
  LV = "Latvia",
  MA = "Morocco",
  MD = "Moldova",
  MK = "North Macedonia",
  MT = "Malta",
  MX = "Mexico",
  MY = "Malaysia",
  NG = "Nigeria",
  NL = "Netherlands",
  NZ = "New Zealand",
  PH = "Philippines",
  PK = "Pakistan",
  PL = "Poland",
  PR = "Puerto Rico",
  PT = "Portugal",
  RO = "Romania",
  RU = "Russia",
  SE = "Sweden",
  SG = "Singapore",
  SI = "Slovenia",
  SK = "Slovakia",
  TH = "Thailand",
  TR = "Turkey",
  UA = "Ukraine",
  US = "United States",
  VN = "Vietnam"
)

# Replacing values in the 'company_location' column using the mapping created above
data$company_location <- cl_country_mapping[data$company_location]

```
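One thing to keep in mind with this kind of lookup is that a code missing from the mapping turns into `NA`. The following sketch shows one way to check for unmapped codes before replacing anything. The vectors are made up, and `"XK"` is just a placeholder for a code that is not in the mapping.

```{r}
# Hypothetical codes and a deliberately incomplete mapping
toy_codes <- c("US", "DE", "XK")
toy_mapping <- c(US = "United States", DE = "Germany")

# Codes that appear in the data but not in the mapping
setdiff(toy_codes, names(toy_mapping))

# The lookup itself: the unmapped code becomes NA
toy_country_names <- unname(toy_mapping[toy_codes])
toy_country_names
```

An empty result from `setdiff()` means every code has a full name and the replacement will not introduce missing values.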

###### **3.4.4.2 Employee Residence:**

Now, we'll follow the same process for the `employee_residence` column.

```{r}
# Checking which countries are included
table(data$employee_residence)
```

```{r}
# Defining a mapping of country codes to country names for employee residence

er_country_mapping <- c(
  AE = "United Arab Emirates",
  AM = "Armenia",
  AR = "Argentina",
  AS = "American Samoa",
  AT = "Austria",
  AU = "Australia",
  BA = "Bosnia and Herzegovina",
  BE = "Belgium",
  BG = "Bulgaria",
  BO = "Bolivia",
  BR = "Brazil",
  CA = "Canada",
  CF = "Central African Republic",
  CH = "Switzerland",
  CL = "Chile",
  CN = "China",
  CO = "Colombia",
  CR = "Costa Rica",
  CY = "Cyprus",
  CZ = "Czech Republic",
  DE = "Germany",
  DK = "Denmark",
  DO = "Dominican Republic",
  DZ = "Algeria",
  EE = "Estonia",
  EG = "Egypt",
  ES = "Spain",
  FI = "Finland",
  FR = "France",
  GB = "United Kingdom",
  GH = "Ghana",
  GR = "Greece",
  HK = "Hong Kong",
  HN = "Honduras",
  HR = "Croatia",
  HU = "Hungary",
  ID = "Indonesia",
  IE = "Ireland",
  IN = "India",
  IQ = "Iraq",
  IR = "Iran",
  IT = "Italy",
  JE = "Jersey",
  JP = "Japan",
  KE = "Kenya",
  KW = "Kuwait",
  LT = "Lithuania",
  LU = "Luxembourg",
  LV = "Latvia",
  MA = "Morocco",
  MD = "Moldova",
  MK = "North Macedonia",
  MT = "Malta",
  MX = "Mexico",
  MY = "Malaysia",
  NG = "Nigeria",
  NL = "Netherlands",
  NZ = "New Zealand",
  PH = "Philippines",
  PK = "Pakistan",
  PL = "Poland",
  PR = "Puerto Rico",
  PT = "Portugal",
  RO = "Romania",
  RS = "Serbia",
  RU = "Russia",
  SE = "Sweden",
  SG = "Singapore",
  SI = "Slovenia",
  SK = "Slovakia",
  TH = "Thailand",
  TN = "Tunisia",
  TR = "Turkey",
  UA = "Ukraine",
  US = "United States",
  UZ = "Uzbekistan",
  VN = "Vietnam"
)

# Replacing values in the 'employee_residence' column using the mapping created above
data$employee_residence <- er_country_mapping[data$employee_residence]

```

##### **3.4.5 Checking for Missing Values:**

After cleaning and formatting the data, I want to make sure there are no null values.

```{r}
# Making sure there are no missing values 
any(is.na(data))
```

The outcome is `FALSE`, so there are no missing values to deal with. Let's move to the fun part!

### **4. Data Analysis**

#### **4.1 Exploratory Data Analysis (EDA)**

##### **4.1.1 Overview and Summary Statistics**

In this section, we will be examining the dataset's basic characteristics and computing summary statistics for better comprehension.

###### **4.1.1.1 Overview**

Let's check the dimensions and structure of the dataset again after the data processing phase.

```{r}
dim(data)
```

The dataset now has 3753 rows (observations) and 10 columns (variables).

Let's check the columns' names and the structure of the processed dataset.

```{r}
names(data)
```

```{r}
str(data)
```

###### **4.1.1.2 Summary Statistics**

In this section, let's compute summary statistics (minimum, quartiles, median, mean, and maximum) for the numerical columns (`salary_in_eur`, `remote_ratio`).

```{r}
# Checking the summary statistics for the salary variable
summary(data$salary_in_eur)
```

These statistics provide insights into the distribution of total gross salaries within the dataset. The minimum salary recorded is €4,721, while the maximum is €414,000. The median salary of €124,200 indicates that half of the recorded total gross salaries fall below this value and the other half above it. The mean salary (€126,499) is slightly higher than the median, indicating potential influence from higher salaries within the dataset. The interquartile range (IQR), between the 25th and 75th percentiles, spans from €87,400 to €161,000, encapsulating the middle 50% of the total gross salary data.
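The figures quoted above can be turned into a quick worked calculation. The sketch below only re-uses the numbers from the text. The last part uses a made-up vector (`toy_x`) to show that the standard deviation, which `summary()` does not report, can be obtained with `sd()`.

```{r}
# Quartiles, median, and mean quoted in the text above (in EUR)
toy_q1 <- 87400
toy_q3 <- 161000
toy_median <- 124200
toy_mean <- 126499

toy_q3 - toy_q1        # Width of the interquartile range: 73600
toy_mean - toy_median  # Gap between mean and median: 2299

# summary() does not include the standard deviation; sd() computes it
toy_x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sd(toy_x)
```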

```{r}
# Checking the summary statistics for the remote ratio variable
summary(data$remote_ratio)
```

This shows that at least half of the values sit at the lower end, since the 1st quartile and the median are both 0.00. However, the top 25% of the data corresponds to entirely remote positions. The dataset therefore mixes roles ranging from no remote work to completely remote setups.

##### **4.1.2 Univariate Analysis**

###### **4.1.2.1 Numerical Variables**

**A. Salary Distribution**

```{r}
ggplot(data, aes(x = salary_in_eur)) +
    geom_density(fill = "#5bc8af", color = "#202060") +
    labs(title = "Density Plot of Salary in EUR", x = "Salary in EUR", y = "Density") +
    scale_x_continuous(labels = scales::comma, breaks = seq(0, 500000, by = 100000)) # Modifying the x-axis scale to display breaks at intervals of 100,000 with comma formatting
```

The density plot for the `salary_in_eur` variable is consistent with the summary statistics. It shows a right-skewed distribution: most salaries are concentrated toward the lower end of the range, and the density decreases gradually as salaries increase. The long right tail reflects a small number of much higher salaries, which are notably less frequent in the dataset.

**B. Remote Ratio Distribution**

```{r}
ggplot(data, aes(x = remote_ratio)) +
    geom_density(fill = "#5bc8af", color = "#202060") +
    labs(title = "Density Plot of Remote Ratio", x = "Remote Ratio", y = "Density")

```

We can see peaks at both ends (0% and 100%), so the density plot shows a bimodal distribution. A large number of individuals work either fully on-site or fully remotely, while a relatively small portion has a moderate remote-work share of around 50%.
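When a numeric variable takes only a handful of distinct values, a table of shares can complement the density plot. The sketch below assumes that situation and uses a made-up vector (`toy_remote`); it does not use the actual `remote_ratio` column.

```{r}
# Hypothetical remote ratios (percent of work done remotely)
toy_remote <- c(0, 0, 0, 0, 50, 100, 100, 100)

# Share of each distinct value, in percent
toy_remote_share <- round(100 * prop.table(table(toy_remote)), 1)
toy_remote_share
```

Unlike a smoothed density, this table reports exact proportions per value, so the two modes can be compared directly.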

###### **4.1.2.2 Categorical Variables**

**A. Company Size**

```{r}

# Creating a table of company sizes and their frequency
company_sizes_freq <- table(data$company_size)

# Converting the table to a data frame
company_sizes_df <- data.frame(company_size = names(company_sizes_freq), frequency = as.numeric(company_sizes_freq))

# Defining a custom color palette (this step is optional, but, they're my favorite colors)
custom_palette <- c("#5bc8af","#202060", "#b030b0","#6c91bf", "#602080")

# Creating the pie chart for company sizes with percentages
ggplot(company_sizes_df, aes(x = "", y = frequency, fill = company_size)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  labs(title = "Company Size Distribution", fill = "Company Size") +
  theme_void() +
  scale_fill_manual(values = custom_palette) +
  geom_text(aes(label = paste0(round((frequency/sum(frequency)) * 100), "%")),
            position = position_stack(vjust = 0.5), color = "white" )

```

The data is composed mainly of medium-sized companies, which account for 84% of the records. Large companies come second with 12%, and small companies form the smallest category with only 4%.
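The table-to-data-frame steps used for this pie chart recur for the other categorical variables in this section. As a sketch, they could be wrapped in a small helper. `toy_freq_df()` is a hypothetical function written for illustration; it is not part of any package and is not used by the chunks in this notebook.

```{r}
# Hypothetical helper: frequency table of a vector as a data frame with percentages
toy_freq_df <- function(x) {
  freq <- table(x)
  data.frame(
    category = names(freq),
    frequency = as.numeric(freq),
    percentage = round(100 * as.numeric(freq) / sum(freq), 2)
  )
}

# Made-up company sizes
toy_sizes <- c("M", "M", "M", "L", "S")
toy_freq_df(toy_sizes)
```

The resulting data frame has one row per category, so it can be passed to `ggplot()` in the same way as the data frames built by hand above.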

**B. Work Year**

Let's do the same for the `work_year` variable.

```{r}
# Creating a table of work years and their frequency
work_years_freq <- table(data$work_year)

# Converting the table to a data frame
work_years_df <- data.frame(work_year = names(work_years_freq), frequency = as.numeric(work_years_freq))

# Creating the pie chart for work years with percentages
ggplot(work_years_df, aes(x = "", y = frequency, fill = work_year)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  labs(title = "Work Year Distribution", fill = "Work Year") +
  theme_void() +
  scale_fill_manual(values = custom_palette) +
  geom_text(aes(label = paste0(round((frequency/sum(frequency)) * 100), "%")),
            position = position_stack(vjust = 0.5), color = "white" ) 

```

This pie chart depicts the dominance of 2023 and 2022 in the dataset compared to 2021 and 2020. 2023 has the most significant portion with 48%. 2022 follows with 44%. 2021 and 2020 have a relatively small representation in this data with the percentages 6% and 2% respectively.

**C. Experience Level**

```{r}
# Creating a table of experience levels and their frequency
experience_levels_freq <- table(data$experience_level)

# Converting the table to a data frame
experience_levels_df <- data.frame(experience_level = names(experience_levels_freq), frequency = as.numeric(experience_levels_freq))

# Creating the pie chart for experience levels with percentages
ggplot(experience_levels_df, aes(x = "", y = frequency, fill = experience_level)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  labs(title = "Experience Level Distribution", fill = "Experience Level") +
  theme_void() +
  scale_fill_manual(values = custom_palette) +
  geom_text(aes(label = paste0(round((frequency/sum(frequency)) * 100), "%")),
            position = position_stack(vjust = 0.5), color = "white" )

```

The pie chart shows that most individuals, around 67%, are in senior positions. About 21% are in mid-level roles, while entry-level positions make up 9%. Executive roles are the smallest group, at just 3% of the dataset.

**D. Employment Type**

```{r}
# Creating a table of employment types and their frequency
employment_types_freq <- table(data$employment_type)

# Converting the table to a data frame
employment_types_df <- data.frame(employment_type = names(employment_types_freq), frequency = as.numeric(employment_types_freq))

# Creating the pie chart for employment types with percentages
ggplot(employment_types_df, aes(x = "", y = frequency, fill = employment_type)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  labs(title = "Employment Type Distribution", fill = "Employment Type") +
  theme_void() +
  scale_fill_manual(values = custom_palette) +
  geom_text(aes(label = paste0(round((frequency/sum(frequency)) * 100), "%")),
            position = position_stack(vjust = 0.5), color = "white" )

```

The majority of individuals, constituting approximately 99%, are categorized as "Full-time" employees. Now, let's dig a bit deeper to see the percentages of the other categories.

```{r}
# Calculating percentages
employment_types_df$percentage <- round((employment_types_df$frequency / sum(employment_types_df$frequency)) * 100, 2)

employment_types_df
```

This clarifies that part-time roles come in second with 0.45%. Lastly, we have both contract and freelance roles with 0.27% each.

**E. Company Location**

In this section, we will explore where the companies present in this dataset have their headquarters. For this purpose, I chose to visualize the countries on a map. Instead of using a linear scale, I will use a logarithmic scale for the frequency values. The logarithmic scaling helps highlight the distribution more evenly across the map, emphasizing the presence of multiple countries while downplaying the overwhelming frequency of the United States, enabling a clearer visual comparison of the dataset's geographic spread.

```{r}
# Creating a df that has the company locations along with their freq
company_location_df <- data %>%
  group_by(company_location) %>%
  summarise(frequency = n())

# Getting world map data
world <- ne_countries(scale = "medium", returnclass = "sf")

# Merging company_location_df with world map data
map_data <- merge(world, company_location_df, by.x = "name", by.y = "company_location", all.x = TRUE)

# Plotting
ggplot() +
  geom_sf(data = map_data, aes(fill = log(frequency + 1))) +
  scale_fill_gradient(low = "#5bc8af", high = "#202060") +
  labs(title="Company Location Distribution") +
  theme_minimal()
```

The United States stands out as the most prominent location, reflecting a high occurrence frequency, followed by Canada, Spain, and several other countries that also have notable but comparatively lower representation.
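To illustrate why the logarithmic scale helps on this map, the sketch below compares the spread of some made-up frequencies before and after applying `log(frequency + 1)`. The numbers in `toy_freq` are hypothetical and are not the actual counts.

```{r}
# Hypothetical frequencies: a few small countries and one dominant one
toy_freq <- c(1, 10, 100, 3000)

# On the raw scale the largest value is 3000 times the smallest
max(toy_freq) / min(toy_freq)

# After the transformation the ratio shrinks to roughly a dozen
toy_log <- log(toy_freq + 1)
max(toy_log) / min(toy_log)

# log1p() is a base R shortcut for log(x + 1)
all.equal(toy_log, log1p(toy_freq))
```

Adding 1 before taking the logarithm keeps a frequency of 0 defined, since `log(0)` would be `-Inf`.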

**F. Employee Residence**

Let's do the same for `employee_residence`.

```{r}
# Creating a data frame with employee residence locations and their frequency
employee_residence_df <- data %>%
  group_by(employee_residence) %>%
  summarise(frequency = n())

# Getting world map data
world <- ne_countries(scale = "medium", returnclass = "sf")

# Merging employee_residence_df with world map data
map_data_residence <- merge(world, employee_residence_df, by.x = "name", by.y = "employee_residence", all.x = TRUE)

# Plotting
ggplot() +
  geom_sf(data = map_data_residence, aes(fill = log(frequency + 1))) +
  scale_fill_gradient(low = "#5bc8af", high = "#202060") +
  labs(title="Employee Residence Distribution") +
  theme_minimal()

```

Looking at the distribution, it's apparent that the United States has the highest frequency, indicated by the darkest shade on the map. It's followed by Canada, Germany, the United Kingdom, and France, showing relatively high frequencies as well. The lighter shades across various countries denote lower frequencies. Countries such as Tunisia, Algeria, Iraq, Kuwait, American Samoa, Armenia, Bulgaria, Cyprus, Dominican Republic, Iran, Moldova, North Macedonia, Slovakia, and Uzbekistan have the lightest shades, representing the lowest frequency of employee residences.
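Because the maps are built by merging on country names, a country only gets colored when its name is spelled identically in both tables. Whether every name in this notebook matches the map data is not verified here. The sketch below shows one way to check, using made-up tables in which the spellings are assumptions chosen for illustration.

```{r}
# Hypothetical map table and frequency table with one spelling mismatch
toy_world <- data.frame(name = c("United States of America", "Germany", "France"))
toy_counts <- data.frame(country = c("United States", "Germany"),
                         frequency = c(3000, 50))

# Names in the frequency table without an exact match in the map table
setdiff(toy_counts$country, toy_world$name)

# all.x = TRUE keeps every map row; rows without a match get NA as frequency
toy_merged <- merge(toy_world, toy_counts, by.x = "name", by.y = "country", all.x = TRUE)
toy_merged
```

Any name returned by `setdiff()` would show up on the map with the missing-value color, even though the data contains rows for that country.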

**G. Job Titles**

In this section, we will look at the 10 most frequent job titles in this dataset.

```{r}
# Creating a table of job titles and their frequency
job_titles_freq <- table(data$job_title)

# Converting the table to a data frame and sorting by frequency
job_titles_df <- data.frame(job_title = names(job_titles_freq), frequency = as.numeric(job_titles_freq))
job_titles_df <- job_titles_df[order(job_titles_df$frequency, decreasing = TRUE), ]

# Calculating percentages
job_titles_df$percentage <- (job_titles_df$frequency / sum(job_titles_df$frequency)) * 100

# Selecting the top 10 most occurring job titles
top_10_jobs <- head(job_titles_df, 10)


# Creating a bar plot for the top 10 job titles with percentages on the y-axis
ggplot(top_10_jobs, aes(x = reorder(job_title, -frequency), y = percentage)) +
  geom_bar(stat = "identity", fill = "#5bc8af") +
  labs(title = "Top 10 Most Occurring Job Titles", x = "Job Titles", y = "Percentage") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

The most dominant roles include "Data Engineer," "Data Scientist," and "Data Analyst," constituting a significant portion of the dataset's job titles. Additionally, positions like "Machine Learning Engineer," "Analytics Engineer," "Data Architect," and roles related to research, such as "Research Scientist" and "Research Engineer," also hold notable but comparatively smaller percentages.
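The same top-N table can also be sketched with `dplyr` verbs instead of `table()` and `order()`. The data frame below is made up, and the `toy_` names are hypothetical.

```{r}
library(dplyr)

# Hypothetical job titles
toy_jobs <- data.frame(job_title = c("Data Engineer", "Data Engineer", "Data Engineer",
                                     "Data Scientist", "Data Scientist", "Data Analyst"))

toy_top_jobs <- toy_jobs %>%
  count(job_title, sort = TRUE, name = "frequency") %>%   # Counting and sorting in one step
  mutate(percentage = 100 * frequency / sum(frequency)) %>% # Share of all rows
  slice_head(n = 2)                                         # Keeping the top 2 titles

toy_top_jobs
```

Computing the percentage before `slice_head()` keeps it relative to the whole dataset instead of only the selected rows, which matches the approach used above.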

##### **4.1.3 Bivariate Analysis**

###### **4.1.3.1 Salary Vs. Company Location**

In this analysis, we'll explore the comparative average salaries across various countries where data employees are employed, focusing on the `company_location` variable. Using this information, we aim to visualize and understand the differences in average salaries by country. The approach involves calculating the mean salary for each country listed in the dataset and presenting this information on a world map. This visualization will provide a clearer perspective on the regional discrepancies in salaries within the data industry.

```{r}
# Creating a dataframe with company locations and average salary for each country
salary_location_df <- data %>%
  group_by(company_location) %>%
  summarise(avg_salary = mean(salary_in_eur, na.rm = TRUE)) %>% # Calculating the average salary
  arrange(desc(avg_salary))  # Sorting by average salary in descending order

# Getting world map data
world <- ne_countries(scale = "medium", returnclass = "sf")

# Merging company_location_df with world map data
map_data <- merge(world, salary_location_df, by.x = "name", by.y = "company_location", all.x = TRUE)

# Plotting
ggplot() +
  geom_sf(data = map_data, aes(fill = avg_salary)) +
  scale_fill_gradient(low = "#5bc8af", high = "#202060") +
  labs(title="Average Salary Vs. Company Location") +
  theme_minimal()

```

This map indicates that average salaries for data employees differ considerably between countries. Let's dig deeper to see which are the top 10 countries with the highest average salaries.

```{r}
# Selecting the top 10 countries
top_10_countries <- head(salary_location_df, 10)

# Creating a bar plot for top 10 countries with highest average salaries
ggplot(top_10_countries, aes(x = reorder(company_location, -avg_salary), y = avg_salary)) +
  geom_bar(stat = "identity", fill = "#5bc8af") +
  labs(title = "Top 10 Countries with Highest Average Salaries",
       x = "Company Location", y = "Average Salary") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Puerto Rico seems to offer the highest salaries among these top 10 countries. The United States, Russia, and Canada follow closely behind.
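When reading a ranking of averages, it can help to see how many observations each average is based on, since a group with very few rows can rank high by chance. The sketch below reports the group size next to the mean. The data and the country labels `"A"` and `"B"` are made up for illustration.

```{r}
library(dplyr)

# Hypothetical salaries: country A has a single observation
toy_pay <- data.frame(
  company_location = c("A", "B", "B", "B"),
  salary_in_eur = c(150000, 90000, 100000, 110000)
)

toy_avg <- toy_pay %>%
  group_by(company_location) %>%
  summarise(avg_salary = mean(salary_in_eur), n = n()) %>% # Mean and group size
  arrange(desc(avg_salary))

toy_avg
```

Here country A ranks first, but its average rests on one row only, which is worth keeping in mind when interpreting such a ranking.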

###### **4.1.3.2 Salary Vs. Job Title**

In this section, we will see the top 10 job titles by average salary.

```{r}

# Creating a dataframe with job titles and their average salary
salary_jobtitle_df <- data %>%
  group_by(job_title) %>%
  summarise(avg_salary = mean(salary_in_eur, na.rm = TRUE))  

# Sorting by average salary in descending order to get the top salaries at the top
salary_jobtitle_df <- salary_jobtitle_df[order(-salary_jobtitle_df$avg_salary),]

# Selecting the top 10 job titles
top_10_jobtitles <- head(salary_jobtitle_df, 10)

# Creating a bar plot for the top 10 job titles
ggplot(top_10_jobtitles, aes(x = reorder(job_title, -avg_salary), y = avg_salary)) +
  geom_bar(stat = "identity", fill = "#5bc8af") +
  labs(title = "Top 10 Job Titles by Average Salary", x = "Job Title" , y = "Average Salary" ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = scales::comma, breaks = pretty_breaks(n = 10))  # Formatting the Y-axis labels
```

We can see that the job title with the highest average salary is Data Science Tech Lead, followed by Cloud Data Architect, Data Lead, Data Analytics Lead, and Principal Data Scientist.

###### **4.1.3.3 Salary Vs. Experience Level**

```{r}
# Reordering experience_level from entry level to executive
data$experience_level <- factor(data$experience_level, levels = c("entry level", "mid-level", "senior", "executive"))

# Creating boxplot for salaries across experience levels with custom colors
ggplot(data, aes(x = experience_level, y = salary_in_eur, fill = experience_level)) +
  geom_boxplot() +
  labs(title = "Salaries across Experience Levels", x = "Experience Level", y = "Salary in EUR") +
  scale_fill_manual(values = custom_palette) +  # Adding my favorite colors
  theme_minimal() +
  scale_y_continuous(labels = scales::comma, breaks = pretty_breaks(n = 10))
```

The boxplot demonstrates variations in salary distributions across experience levels. The executive category exhibits the highest median salary among all categories, followed by the senior level. In contrast, the entry-level category displays the lowest median salary, notably less than €55,000 per year. This suggests a significant disparity in salary distributions across different experience levels, with higher experience levels generally commanding higher salaries.
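The medians that a boxplot draws can also be checked numerically. A minimal sketch with `tapply()` on made-up salaries (`toy_exp` and `toy_sal` are hypothetical vectors, not columns of the dataset):

```{r}
# Hypothetical experience levels, ordered from entry level to executive
toy_exp <- factor(
  c("entry level", "entry level", "mid-level", "senior", "senior", "senior", "executive"),
  levels = c("entry level", "mid-level", "senior", "executive")
)
toy_sal <- c(40000, 50000, 80000, 110000, 130000, 150000, 180000)

# Median salary per experience level, in the order of the factor levels
toy_medians <- tapply(toy_sal, toy_exp, median)
toy_medians
```

Because the grouping variable is a factor with explicit levels, the result follows the entry-to-executive order instead of alphabetical order.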

We can also overlay the boxplot on a violin plot to see both the distribution of the data and its summary statistics.

```{r}
# Calculating the sample size for each experience level
sample_size <- data %>% 
  group_by(experience_level) %>%
  summarize(num = n())

# Joining the datasets based on the experience_level column
data_with_size <- left_join(data, sample_size, by = "experience_level") %>%
  mutate(myaxis = paste0(experience_level, "\n", "n=", num), # Combining experience level and sample size into one label
         # Keeping the labels in experience-level order instead of alphabetical order
         myaxis = factor(myaxis, levels = unique(myaxis[order(experience_level)])))

# Plot
ggplot(data_with_size, aes(x = myaxis, y = salary_in_eur, fill = experience_level)) +
  geom_violin(width = 0.7, position = position_dodge(width = 0.6)) +
  geom_boxplot(width = 0.1, color = "white", alpha = 0.2, position = position_dodge(width = 0.6)) +
  scale_fill_manual(values = custom_palette) +  # Adding my favorite colors
  theme_minimal() +
  ggtitle("Salaries across Experience Levels") +
  xlab("") +
  ylab("Salary in EUR")+
  scale_y_continuous(labels = scales::comma, breaks = pretty_breaks(n = 10))
```

#### **4.2 Further Analysis**

In this section, I will focus on entry level roles in Europe and answer these questions:

-   Which job titles have the highest average salary for entry level roles in Europe?
-   Which country offers the highest average salary for entry level roles in Europe?

##### **4.2.1 Entry Level Salaries Vs. Job Titles \| IN EUROPE \|**

```{r}

# Creating a vector of European countries
european_countries <- c("Spain", "France", "Germany", "Italy", "United Kingdom", "Netherlands", "Sweden", "Norway", "Finland", "Denmark", "Belgium", "Austria", "Greece", "Switzerland", "Portugal", "Ireland", "Czech Republic", "Romania", "Hungary", "Poland", "Slovakia", "Croatia", "Bulgaria", "Estonia", "Lithuania", "Latvia", "Slovenia", "Luxembourg", "Malta", "Cyprus")

# Filtering data for entry-level roles in Europe
entry_level_europe <- data[data$experience_level == "entry level" & data$company_location %in% european_countries, ]

# Calculating average salary for each job title in entry-level roles in Europe
avg_salary_entry_level_europe <- aggregate(salary_in_eur ~ job_title, entry_level_europe, FUN = mean)

# Creating a word cloud
wordcloud(words = avg_salary_entry_level_europe$job_title, 
          freq = avg_salary_entry_level_europe$salary_in_eur, 
          min.freq = 1,
          max.words = 50, 
          random.order = FALSE, 
          colors = custom_palette,
          scale = c(2, 0.5))

```

This word cloud highlights that, for entry-level roles in Europe, job titles such as AI Developer, Applied Data Scientist, and Research Engineer have higher average salaries than job titles such as Computer Vision Software Engineer, Computer Vision Engineer, and ML Engineer. To make things clearer, let's display the top 10 job titles.

```{r}
# Sorting by average salary in descending order
sorted_data_europe <- avg_salary_entry_level_europe[order(-avg_salary_entry_level_europe$salary_in_eur), ]

# Selecting top 10 job titles for entry level in Europe
top_10_jobtitles_entry_europe <- head(sorted_data_europe, 10)

# Creating the bar plot
ggplot(top_10_jobtitles_entry_europe, aes(x = reorder(job_title, salary_in_eur), y = salary_in_eur)) +
  geom_bar(stat = "identity", fill = "#5bc8af") +
  labs(title = "Top 10 Job Titles with Highest Avg Salary (Entry Level in Europe)", x = "Job Title", y = "Average Salary") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()  # To display the bars horizontally

```

AI Developers seem to have the highest average entry-level salaries in Europe among data-related roles, followed by Research Engineers and Applied Data Scientists.
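The filtering and aggregation steps used in this subsection can be sketched on a small made-up data frame. All values and `toy_` names below are hypothetical; the column names mirror those of the dataset.

```{r}
# Hypothetical roles
toy_roles <- data.frame(
  job_title = c("AI Developer", "AI Developer", "Data Analyst", "Data Analyst", "Data Analyst"),
  experience_level = c("entry level", "entry level", "entry level", "senior", "entry level"),
  company_location = c("Germany", "France", "Spain", "Spain", "United States"),
  salary_in_eur = c(60000, 80000, 40000, 90000, 70000)
)
toy_europe <- c("Germany", "France", "Spain")

# Keeping entry-level rows whose company is located in one of the listed countries
toy_entry_eu <- toy_roles[toy_roles$experience_level == "entry level" &
                            toy_roles$company_location %in% toy_europe, ]

# Average salary per job title, sorted from highest to lowest
toy_avg_eu <- aggregate(salary_in_eur ~ job_title, toy_entry_eu, FUN = mean)
toy_avg_eu <- toy_avg_eu[order(-toy_avg_eu$salary_in_eur), ]
toy_avg_eu
```

The senior row and the row located outside the country list are dropped before averaging, so they do not affect the result.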

##### **4.2.2 Entry Level Salaries Vs. Company Location \| IN EUROPE \|**

```{r}

# Calculating average salary for each country
avg_salary_entry_level_europe <- entry_level_europe %>%
  group_by(company_location) %>%
  summarise(avg_salary = mean(salary_in_eur, na.rm = TRUE)) %>%
  arrange(desc(avg_salary))  # Sorting by average salary in descending order

# Getting world map data for Europe
europe_map <- ne_countries(scale = "medium", continent = "Europe", returnclass = "sf")

# Merge salary data with the map data for Europe
map_data_europe <- merge(europe_map, avg_salary_entry_level_europe, by.x = "name", by.y = "company_location", all.x = TRUE)

# Plotting map of Europe showing average salaries for entry level
ggplot() +
  geom_sf(data = map_data_europe, aes(fill = avg_salary)) +
  scale_fill_gradient(low = "#5bc8af", high = "#202060") +
  labs(title = "Average Salary for Entry Level in European Countries") +
  theme_minimal()

```

To gain a clearer perspective on our findings, let's visualize the data on a bar plot, focusing solely on the top 10 countries exhibiting the highest average salaries.

```{r}
# Sorting the data by average salary in descending order
top_10_countries_eu <- head(avg_salary_entry_level_europe[order(-avg_salary_entry_level_europe$avg_salary), ], 10)

# Creating a bar plot for the top 10 European countries with the highest average salaries
ggplot(top_10_countries_eu, aes(x = reorder(company_location, avg_salary), y = avg_salary)) +
  geom_bar(stat = "identity", fill = "#5bc8af") +
  labs(title = "Top 10 European Countries with Highest Average Salaries (Entry Level)", x = "Country", y = "Average Salary") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

```

The bar plot reveals a notable disparity in average salaries among European countries. Sweden appears to offer the highest average salary for entry-level roles, followed by Belgium and Germany. There's a substantial variation in salaries, with the top countries showcasing significantly higher averages compared to those at the bottom of the list, such as Luxembourg and Denmark.

### **5. Conclusion**

Throughout this project, we took a close look at how salaries differ across jobs, experience levels, and countries worldwide. Our exploration highlighted significant disparities: certain roles tend to yield notably higher or lower earnings, and this discrepancy amplifies with experience, with senior and executive positions commanding higher salaries. The geographical element also plays a pivotal role, showcasing substantial variations in earnings across different countries. Using various visualization techniques such as box plots, violin plots, bar plots, and maps helped us see these differences clearly. This study underlines the substantial impact of job type, experience level, and location on income. Looking forward, deeper investigations into the underlying factors behind these disparities, such as industry specifics or skill demands, could provide valuable insights. Additionally, studying salary fluctuations over time might offer further perspectives on these earning differentials. Ultimately, this analysis reveals that where one works, the job they do, and their experience significantly influence their earnings.

Please check the `Dashboard` file to see the dashboard I created using flexdashboard.