---
title: "A Zoom Into Data-Related Jobs"
output:
html_notebook:
toc: true
toc_depth: 6
toc_float: true
css: styles.css
---
**Name**: Samaher Brahem
**Course**: Coding for DS & DM - R Module
**Professor**: Giancarlo Manzi
### **1. Introduction**
This study analyzes salary differences among Data Scientists and other data-related roles across countries. Using a dataset that covers several employment types, experience levels, and company sizes, it applies statistical summaries and visualizations in R to identify salary disparities and trends, and to show how compensation for data-related roles varies around the world.
I used [**GitHub**](https://github.com/SamaherUNIMI/DSE_R_Project/tree/main) as a tool for version control.
### **2. Data Used**
The *Salary of Data Scientists* dataset will be used for this analysis. It is made available on [Kaggle](https://www.kaggle.com/datasets/piyushborhade/salary-of-data-scientists) by the user, Piyush Borhade.
#### **2.1 Features of the dataset**:
- `work_year`: The year the salary was paid.
- `experience_level`: The level of work experience in the job during the year.
- `employment_type`: The type of employment for the role.
- `job_title`: The role held during the year.
- `salary`: The total gross salary amount paid.
- `salary_currency`: The currency of the salary paid, as an ISO 4217 currency code.
- `salary_in_usd`: The salary converted to USD.
- `employee_residence`: The employee's primary country of residence during the work year, as an ISO 3166 country code.
- `remote_ratio`: The overall share of work done remotely.
- `company_location`: The country of the employer's headquarters or contracting branch.
- `company_size`: The average number of people who worked for the company during the year.
#### **2.2 Limits of the dataset**:
##### **2.2.1 No Shared Sources**:
The information about how the data was collected and who collected it isn't clear. This lack of clarity raises doubts about how dependable the data is.
##### **2.2.2 Geographical Bias**:
The dataset primarily comprises data from the United States, resulting in an imbalance where other countries are underrepresented. This skew might limit the depth and accuracy of analyses, potentially leading to an overemphasis on American-centric insights and a lack of comprehensive understanding regarding other regions or global perspectives.
### **3. Data Processing**
#### **3.1 Installing packages and opening libraries:**
For the purpose of this study, I installed and loaded the following libraries:
```{r}
library(tidyverse)
library(ggplot2)
library(dplyr)
library(sf)
library(rnaturalearth)
library(scales)
library(viridis)
library(wordcloud)
```
#### **3.2 Importing the dataset:**
```{r}
# Importing the dataset from a local path in my computer
data <- read.csv2("C:/Users/Samaher/Documents/DSE_R_Project/data.csv")
# Converting it into a tibble (as_tibble() replaces the deprecated as.tibble())
data <- as_tibble(data)
```
#### **3.3 Previewing the dataset:**
```{r}
# Understanding the structure of the dataset
str(data)
```
```{r}
# Displaying a sample of the data
head(data)
```
Now, let's check the dimensions of the dataset.
```{r}
# Checking the dimensions of the dataset
dim(data)
```
The dataset has 3753 rows and 12 columns. This means we have 12 variables and 3753 observations.
#### **3.4 Cleaning and formatting the dataset:**
##### **3.4.1 Experience Level:**
There are 4 experience levels in this dataset:
- EN = Entry Level
- MI = Mid-Level
- SE = Senior
- EX = Executive
```{r}
# Checking how many experience levels are included in this dataset
table(data$experience_level)
```
Let's make those levels understandable by anyone who reads them!
```{r}
# Replacing abbreviations with full descriptions in the 'experience_level' column
data$experience_level <- ifelse(data$experience_level == "EN", "entry level",
ifelse(data$experience_level == "EX", "executive",
ifelse(data$experience_level == "MI", "mid-level",
ifelse(data$experience_level == "SE", "senior", data$experience_level))))
# Checking the updated 'experience_level' column
table(data$experience_level)
```
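As a possible alternative to the nested `ifelse()` calls above, the same replacement can be sketched with a named lookup vector. This is only an illustration: the `codes` vector and the name `experience_labels` are hypothetical and not part of the notebook.

```r
# Hypothetical sample of raw codes (not taken from the dataset)
codes <- c("SE", "MI", "EN", "EX", "SE")

# Lookup table: names are the abbreviations, values are the full labels
experience_labels <- c(EN = "entry level", MI = "mid-level",
                       SE = "senior", EX = "executive")

# Index the lookup by code; unname() drops the codes carried along as names
labels <- unname(experience_labels[codes])

# Keep the original value wherever a code is not in the lookup,
# mirroring the fallback branch of the nested ifelse()
labels <- ifelse(is.na(labels), codes, labels)
print(labels)
```

The lookup keeps every code and its label on one line each, which may be easier to extend than a chain of nested conditions.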
##### **3.4.2 Employment Type:**
There are 4 employment types in this dataset:
- FT = full-time
- PT = part-time
- CT = contract
- FL = freelance
```{r}
table(data$employment_type)
```
I will replace those abbreviations with the full word for better readability.
```{r}
# Replacing abbreviations with full descriptions in the 'employment_type' column
data$employment_type <- ifelse(data$employment_type == "CT", "contract",
ifelse(data$employment_type == "FL", "freelance",
ifelse(data$employment_type == "FT", "full-time",
ifelse(data$employment_type == "PT", "part-time", data$employment_type))))
# Checking the updated 'employment_type' column
table(data$employment_type)
```
##### **3.4.3 Salaries:**
There are three salary-related columns: `salary`, `salary_currency`, and `salary_in_usd`.
Salaries were collected in local currencies: the amount is stored in the `salary` column and its currency in the `salary_currency` column. All salaries were then converted to USD and stored in `salary_in_usd`.
```{r}
table(data$salary_currency)
```
For the purpose of this study, it is more convenient to have the salaries in EUR. I will therefore convert the values in `salary_in_usd` to EUR and rename the column to `salary_in_eur`. I will also remove the `salary` and `salary_currency` columns since they are no longer needed.
The exchange rate for the conversion from USD to EUR is as follows:
1 United States Dollar equals 0.92 Euro
P.S. The date of this conversion is: Dec 16, 09:12 UTC
```{r}
# Given exchange rate
exchange_rate <- 0.92
# Converting USD salaries to EUR
data$salary_in_usd <- data$salary_in_usd * exchange_rate
# Changing column name to better reflect this change
names(data)[names(data) == "salary_in_usd"] <- "salary_in_eur"
```
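To make the arithmetic of the conversion concrete, here is a minimal sketch on made-up numbers. The helper `usd_to_eur()` and the sample salaries are hypothetical; only the rate of 0.92 comes from the text above.

```r
# Rate quoted in the text: 1 USD = 0.92 EUR
exchange_rate <- 0.92

# Hypothetical helper, vectorised over salaries
usd_to_eur <- function(usd, rate = exchange_rate) usd * rate

# Hypothetical USD salaries for illustration
salaries_usd <- c(50000, 100000, 150000)
salaries_eur <- usd_to_eur(salaries_usd)
print(salaries_eur)
```

Because a single fixed rate is applied to every row, the EUR figures are the USD figures scaled by a constant, so rankings and relative differences between salaries are unchanged by the conversion.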
```{r}
# Removing the salary and salary_currency columns
data <- subset(data, select = -c(salary, salary_currency))
```
##### **3.4.4 Countries:**
###### **3.4.4.1 Company Location:**
The column `company_location` records the countries where the companies are established. The countries are stored as two-letter country codes. While some codes are widely known, others are less familiar.
```{r}
table(data$company_location)
```
To ensure clarity, I decided to replace the abbreviations with the full country names.
```{r}
# Defining a mapping of country codes to country names for company locations
cl_country_mapping <- c(
AE = "United Arab Emirates",
AL = "Albania",
AM = "Armenia",
AR = "Argentina",
AS = "American Samoa",
AT = "Austria",
AU = "Australia",
BA = "Bosnia and Herzegovina",
BE = "Belgium",
BO = "Bolivia",
BR = "Brazil",
BS = "Bahamas",
CA = "Canada",
CF = "Central African Republic",
CH = "Switzerland",
CL = "Chile",
CN = "China",
CO = "Colombia",
CR = "Costa Rica",
CZ = "Czech Republic",
DE = "Germany",
DK = "Denmark",
DZ = "Algeria",
EE = "Estonia",
EG = "Egypt",
ES = "Spain",
FI = "Finland",
FR = "France",
GB = "United Kingdom",
GH = "Ghana",
GR = "Greece",
HK = "Hong Kong",
HN = "Honduras",
HR = "Croatia",
HU = "Hungary",
ID = "Indonesia",
IE = "Ireland",
IN = "India",
IQ = "Iraq",
IR = "Iran",
IT = "Italy",
JP = "Japan",
KE = "Kenya",
LT = "Lithuania",
LU = "Luxembourg",
LV = "Latvia",
MA = "Morocco",
MD = "Moldova",
MK = "North Macedonia",
MT = "Malta",
MX = "Mexico",
MY = "Malaysia",
NG = "Nigeria",
NL = "Netherlands",
NZ = "New Zealand",
PH = "Philippines",
PK = "Pakistan",
PL = "Poland",
PR = "Puerto Rico",
PT = "Portugal",
RO = "Romania",
RU = "Russia",
SE = "Sweden",
SG = "Singapore",
SI = "Slovenia",
SK = "Slovakia",
TH = "Thailand",
TR = "Turkey",
UA = "Ukraine",
US = "United States",
VN = "Vietnam"
)
# Replacing values in the 'company_location' column using the mapping created above
data$company_location <- cl_country_mapping[data$company_location]
```
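One property of this kind of lookup is worth keeping in mind: when the codes are character strings, a code that is missing from the mapping comes back as `NA`. The sketch below shows how such codes could be listed before replacing the column. The lookup here is deliberately incomplete and the codes are hypothetical.

```r
# Deliberately incomplete lookup and hypothetical codes, for illustration only
country_lookup <- c(US = "United States", DE = "Germany", IT = "Italy")
codes <- c("US", "DE", "XX", "IT", "US")

# Codes absent from the lookup come back as NA
mapped <- unname(country_lookup[codes])

# List the codes that would be lost by the replacement
unmapped_codes <- setdiff(unique(codes), names(country_lookup))
print(unmapped_codes)
```

On the real data, the equivalent check would compare `unique(data$company_location)` with `names(cl_country_mapping)` before the replacement is applied.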
###### **3.4.4.2 Employee Residence:**
Now, we'll repeat the same process for the `employee_residence` column.
```{r}
#Checking which countries are included
table(data$employee_residence)
```
```{r}
# Defining a mapping of country codes to country names for employee residence
er_country_mapping <- c(
AE = "United Arab Emirates",
AM = "Armenia",
AR = "Argentina",
AS = "American Samoa",
AT = "Austria",
AU = "Australia",
BA = "Bosnia and Herzegovina",
BE = "Belgium",
BG = "Bulgaria",
BO = "Bolivia",
BR = "Brazil",
CA = "Canada",
CF = "Central African Republic",
CH = "Switzerland",
CL = "Chile",
CN = "China",
CO = "Colombia",
CR = "Costa Rica",
CY = "Cyprus",
CZ = "Czech Republic",
DE = "Germany",
DK = "Denmark",
DO = "Dominican Republic",
DZ = "Algeria",
EE = "Estonia",
EG = "Egypt",
ES = "Spain",
FI = "Finland",
FR = "France",
GB = "United Kingdom",
GH = "Ghana",
GR = "Greece",
HK = "Hong Kong",
HN = "Honduras",
HR = "Croatia",
HU = "Hungary",
ID = "Indonesia",
IE = "Ireland",
IN = "India",
IQ = "Iraq",
IR = "Iran",
IT = "Italy",
JE = "Jersey",
JP = "Japan",
KE = "Kenya",
KW = "Kuwait",
LT = "Lithuania",
LU = "Luxembourg",
LV = "Latvia",
MA = "Morocco",
MD = "Moldova",
MK = "North Macedonia",
MT = "Malta",
MX = "Mexico",
MY = "Malaysia",
NG = "Nigeria",
NL = "Netherlands",
NZ = "New Zealand",
PH = "Philippines",
PK = "Pakistan",
PL = "Poland",
PR = "Puerto Rico",
PT = "Portugal",
RO = "Romania",
RS = "Serbia",
RU = "Russia",
SE = "Sweden",
SG = "Singapore",
SI = "Slovenia",
SK = "Slovakia",
TH = "Thailand",
TN = "Tunisia",
TR = "Turkey",
UA = "Ukraine",
US = "United States",
UZ = "Uzbekistan",
VN = "Vietnam"
)
# Replacing values in the 'employee_residence' column using the mapping created above
data$employee_residence <- er_country_mapping[data$employee_residence]
```
##### **3.4.5 Checking for Missing Values:**
After cleaning and formatting the data, I want to make sure there are no missing values.
```{r}
# Making sure there are no missing values
any(is.na(data))
```
The outcome is `FALSE`, so there are no missing values to deal with. Let's move to the fun part!
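`any(is.na(data))` gives a single yes/no answer. If it had returned `TRUE`, a per-column count would show where the gaps are. A minimal sketch on a hypothetical data frame (the `toy` values are invented):

```r
# Hypothetical data frame with two missing values
toy <- data.frame(
  job_title = c("Data Analyst", NA, "Data Engineer"),
  salary_in_eur = c(50000, 62000, NA),
  remote_ratio = c(0, 100, 50)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Row positions that contain at least one missing value
incomplete_rows <- which(!complete.cases(toy))
print(incomplete_rows)
```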
### **4. Data Analysis**
#### **4.1 Exploratory Data Analysis (EDA)**
##### **4.1.1 Overview and Summary Statistics**
In this section, we will be examining the dataset's basic characteristics and computing summary statistics for better comprehension.
###### **4.1.1.1 Overview**
Let's check the dimensions and structure of the dataset again after the data processing phase.
```{r}
dim(data)
```
The dataset now has 3753 rows (observations) and 10 columns (variables).
Let's check the columns' names and the structure of the processed dataset.
```{r}
names(data)
```
```{r}
str(data)
```
###### **4.1.1.2 Summary Statistics**
In this section, let's calculate summary statistics (minimum, quartiles, median, mean, and maximum) for the numerical columns (`salary_in_eur`, `remote_ratio`).
```{r}
# Checking the summary statistics for the salary variable
summary(data$salary_in_eur)
```
These statistics provide insights into the distribution of total gross salaries within the dataset. The minimum salary recorded is €4,721, while the maximum is €414,000. The median salary of €124,200 indicates that half of the recorded total gross salaries fall below this value and the other half above it. The mean salary (€126,499) is slightly higher than the median, indicating potential influence from higher salaries within the dataset. The interquartile range (IQR), between the 25th and 75th percentiles, spans from €87,400 to €161,000, encapsulating the middle 50% of the total gross salary data.
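`summary()` does not report the standard deviation or the width of the interquartile range, so they can be computed separately. The sketch below uses a small hypothetical salary vector, plus the two quartiles quoted above, to show the calls involved.

```r
# Hypothetical salaries in EUR, not taken from the dataset
salaries <- c(40000, 60000, 80000, 100000, 220000)

salary_mean   <- mean(salaries)    # pulled upward by the largest value
salary_median <- median(salaries)
salary_sd     <- sd(salaries)      # sample standard deviation (n - 1 denominator)
salary_iqr    <- IQR(salaries)     # 75th percentile minus 25th percentile

# Width of the interquartile range from the quartiles quoted in the text
reported_iqr <- 161000 - 87400
print(reported_iqr)
```

In the toy vector the mean exceeds the median because of the single large value, which is the same pattern the summary statistics show for `salary_in_eur`.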
```{r}
# Checking the summary statistics for the remote ratio variable
summary(data$remote_ratio)
```
The median and the 1st quartile are both 0, so at least half of the records correspond to fully on-site positions. However, 25% of the records involve entirely remote positions. The dataset therefore mixes roles ranging from no remote work to completely remote setups.
##### **4.1.2 Univariate Analysis**
###### **4.1.2.1 Numerical Variables**
**A. Salary Distribution**
```{r}
ggplot(data, aes(x = salary_in_eur)) +
geom_density(fill = "#5bc8af", color = "#202060") +
labs(title = "Density Plot of Salary in EUR", x = "Salary in EUR", y = "Density") +
scale_x_continuous(labels = scales::comma, breaks = seq(0, 500000, by = 100000)) # Modifying the x-axis scale to display breaks at intervals of 100,000 with comma formatting
```
The density plot for the `salary_in_eur` variable is consistent with the summary statistics. It shows a right-skewed distribution: most salaries are concentrated toward the lower end of the range, and the density decreases gradually as salaries increase. The long right tail reflects a small number of much higher salaries.
**B. Remote Ratio Distribution**
```{r}
ggplot(data, aes(x = remote_ratio)) +
geom_density(fill = "#5bc8af", color = "#202060") +
labs(title = "Density Plot of Remote Ratio", x = "Remote Ratio", y = "Density")
```
The density plot shows a bimodal distribution, with peaks at both ends (0% and 100%). A large number of individuals work either fully on-site or fully remotely, while a relatively small portion has a moderate remote ratio of around 50%.
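If `remote_ratio` only takes a few distinct values (assumed here to be 0, 50, and 100, which I have not verified against the data), a frequency table gives the exact shares that the density curve only suggests. A sketch on hypothetical values:

```r
# Hypothetical remote ratios; assumes the variable only takes 0, 50 and 100
remote_ratio <- c(0, 0, 100, 50, 0, 100, 0, 100)

# Count each distinct value, then convert the counts to percentages
counts <- table(remote_ratio)
shares <- round(100 * prop.table(counts), 1)
print(shares)
```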
###### **4.1.2.2 Categorical Variables**
**A. Company Size**
```{r}
# Creating a table of company sizes and their frequency
company_sizes_freq <- table(data$company_size)
# Converting the table to a data frame
company_sizes_df <- data.frame(company_size = names(company_sizes_freq), frequency = as.numeric(company_sizes_freq))
# Defining a custom color palette (this step is optional, but, they're my favorite colors)
custom_palette <- c("#5bc8af","#202060", "#b030b0","#6c91bf", "#602080")
# Creating the pie chart for company sizes with percentages
ggplot(company_sizes_df, aes(x = "", y = frequency, fill = company_size)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
labs(title = "Company Size Distribution", fill = "Company Size") +
theme_void() +
scale_fill_manual(values = custom_palette) +
geom_text(aes(label = paste0(round((frequency/sum(frequency)) * 100), "%")),
position = position_stack(vjust = 0.5), color = "white" )
```
The data is composed mainly of medium-sized companies, which account for 84% of the records. Large companies come second with 12%, and small companies form the smallest category with only 4%.
**B. Work Year**
Let's do the same for the `work_year` variable.
```{r}
# Creating a table of work years and their frequency
work_years_freq <- table(data$work_year)
# Converting the table to a data frame
work_years_df <- data.frame(work_year = names(work_years_freq), frequency = as.numeric(work_years_freq))
# Creating the pie chart for work years with percentages
ggplot(work_years_df, aes(x = "", y = frequency, fill = work_year)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
labs(title = "Work Year Distribution", fill = "Work Year") +
theme_void() +
scale_fill_manual(values = custom_palette) +
geom_text(aes(label = paste0(round((frequency/sum(frequency)) * 100), "%")),
position = position_stack(vjust = 0.5), color = "white" )
```
This pie chart depicts the dominance of 2023 and 2022 in the dataset compared to 2021 and 2020. 2023 has the most significant portion with 48%. 2022 follows with 44%. 2021 and 2020 have a relatively small representation in this data with the percentages 6% and 2% respectively.
**C. Experience Level**
```{r}
# Creating a table of experience levels and their frequency
experience_levels_freq <- table(data$experience_level)
# Converting the table to a data frame
experience_levels_df <- data.frame(experience_level = names(experience_levels_freq), frequency = as.numeric(experience_levels_freq))
# Creating the pie chart for experience levels with percentages
ggplot(experience_levels_df, aes(x = "", y = frequency, fill = experience_level)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
labs(title = "Experience Level Distribution", fill = "Experience Level") +
theme_void() +
scale_fill_manual(values = custom_palette) +
geom_text(aes(label = paste0(round((frequency/sum(frequency)) * 100), "%")),
position = position_stack(vjust = 0.5), color = "white" )
```
The pie chart shows that most individuals, around 67%, are in senior positions. About 21% are in mid-level roles, while entry-level positions make up 9%. Executive roles are the smallest group, at just 3% of the dataset.
**D. Employment Type**
```{r}
# Creating a table of employment types and their frequency
employment_types_freq <- table(data$employment_type)
# Converting the table to a data frame
employment_types_df <- data.frame(employment_type = names(employment_types_freq), frequency = as.numeric(employment_types_freq))
# Creating the pie chart for employment types with percentages
ggplot(employment_types_df, aes(x = "", y = frequency, fill = employment_type)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
labs(title = "Employment Type Distribution", fill = "Employment Type") +
theme_void() +
scale_fill_manual(values = custom_palette) +
geom_text(aes(label = paste0(round((frequency/sum(frequency)) * 100), "%")),
position = position_stack(vjust = 0.5), color = "white" )
```
The majority of individuals, constituting approximately 99%, are categorized as "Full-time" employees. Now, let's dig a bit deeper to see the percentages of the other categories.
```{r}
# Calculating percentages
employment_types_df$percentage <- round((employment_types_df$frequency / sum(employment_types_df$frequency)) * 100, 2)
employment_types_df
```
This clarifies that part-time roles come in second with 0.45%. Lastly, we have both contract and freelance roles with 0.27% each.
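The four pie charts above follow the same steps: count the categories, compute the percentages, and draw the chart. As a sketch, those steps could be wrapped in one helper. The function name `plot_category_pie()` and the sample values are hypothetical, and `geom_col()` is used here as a shorthand for `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Hypothetical helper: pie chart of the share of each category in a vector
plot_category_pie <- function(values, title, legend_title, palette) {
  counts <- table(values)
  df <- data.frame(category = names(counts), frequency = as.numeric(counts))
  # Percentage label for each slice
  df$label <- paste0(round(100 * df$frequency / sum(df$frequency)), "%")

  ggplot(df, aes(x = "", y = frequency, fill = category)) +
    geom_col(width = 1, color = "white") +
    coord_polar("y", start = 0) +
    geom_text(aes(label = label),
              position = position_stack(vjust = 0.5), color = "white") +
    scale_fill_manual(values = palette) +
    labs(title = title, fill = legend_title) +
    theme_void()
}

# Usage with hypothetical company sizes
sizes <- c("M", "M", "L", "S", "M", "L")
pie <- plot_category_pie(sizes, "Company Size Distribution", "Company Size",
                         c("#5bc8af", "#202060", "#b030b0"))
```

With the real data, a call such as `plot_category_pie(data$work_year, "Work Year Distribution", "Work Year", custom_palette)` would be expected to reproduce the corresponding chart.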
**E. Company Location**
In this section, we will explore where the companies present in this dataset have their headquarters. For this purpose, I chose to visualize the countries on a map. Instead of using a linear scale, I will use a logarithmic scale for the frequency values. The logarithmic scaling helps highlight the distribution more evenly across the map, emphasizing the presence of multiple countries while downplaying the overwhelming frequency of the United States, enabling a clearer visual comparison of the dataset's geographic spread.
```{r}
# Creating a df that has the company locations along with their freq
company_location_df <- data %>%
group_by(company_location) %>%
summarise(frequency = n())
# Getting world map data
world <- ne_countries(scale = "medium", returnclass = "sf")
# Merging company_location_df with world map data
map_data <- merge(world, company_location_df, by.x = "name", by.y = "company_location", all.x = TRUE)
# Plotting
ggplot() +
geom_sf(data = map_data, aes(fill = log(frequency + 1))) +
scale_fill_gradient(low = "#5bc8af", high = "#202060") +
labs(title="Company Location Distribution") +
theme_minimal()
```
The United States stands out as the most prominent location, reflecting a high occurrence frequency, followed by Canada, Spain, and several other countries that also have notable but comparatively lower representation.
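The effect of the `log(frequency + 1)` transform used for the map can be illustrated on a few made-up counts with one dominant value. The numbers below are hypothetical and only meant to show how the transform compresses the scale.

```r
# Hypothetical country counts with one dominant value
frequency <- c(3000, 90, 40, 5, 1)

# Position of each count on a linear colour scale (relative to the maximum)
linear_share <- frequency / max(frequency)

# Position of each count after the log(frequency + 1) transform
log_share <- log(frequency + 1) / max(log(frequency + 1))

print(round(data.frame(frequency, linear_share, log_share), 3))
```

On the linear scale the small counts are almost indistinguishable from zero, whereas after the transform they are spread over a visible part of the colour range. The ordering of the countries is preserved, since the logarithm is monotonic.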
**F. Employee Residence**
Let's do the same for `employee_residence`.
```{r}
# Creating a data frame with employee residence locations and their frequency
employee_residence_df <- data %>%
group_by(employee_residence) %>%
summarise(frequency = n())
# Getting world map data
world <- ne_countries(scale = "medium", returnclass = "sf")
# Merging employee_residence_df with world map data
map_data_residence <- merge(world, employee_residence_df, by.x = "name", by.y = "employee_residence", all.x = TRUE)
# Plotting
ggplot() +
geom_sf(data = map_data_residence, aes(fill = log(frequency + 1))) +
scale_fill_gradient(low = "#5bc8af", high = "#202060") +
labs(title="Employee Residence Distribution") +
theme_minimal()
```
Looking at the distribution, it's apparent that the United States has the highest frequency, indicated by the darkest shade on the map. It's followed by Canada, Germany, the United Kingdom, and France, showing relatively high frequencies as well. The lighter shades across various countries denote lower frequencies. Countries such as Tunisia, Algeria, Iraq, Kuwait, American Samoa, Armenia, Bulgaria, Cyprus, Dominican Republic, Iran, Moldova, North Macedonia, Slovakia, and Uzbekistan have the lightest shades, representing the lowest frequency of employee residences.
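Both maps rely on `merge()` matching country names by exact string equality, so a country whose spelling differs between the dataset and the map layer would silently stay unfilled. Whether that happens depends on the version of the map data. The sketch below shows one way to list such mismatches; the `map_names` spellings are hypothetical examples, not values read from the map layer.

```r
# Country names as produced by the mappings above (subset)
dataset_names <- c("United States", "Czech Republic", "Germany",
                   "Bosnia and Herzegovina")

# Hypothetical spellings in a map layer's name column, for illustration only
map_names <- c("United States of America", "Czechia", "Germany",
               "Bosnia and Herz.")

# Names present in the dataset but absent from the map layer
unmatched <- setdiff(dataset_names, map_names)
print(unmatched)
```

On the real objects, the same check would be `setdiff(employee_residence_df$employee_residence, world$name)`; an empty result means every country in the dataset found a match.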
**G. Job Titles**
In this section, we will look at the 10 most frequent job titles in this dataset.
```{r}
# Creating a table of job titles and their frequency
job_titles_freq <- table(data$job_title)
# Converting the table to a data frame and sorting by frequency
job_titles_df <- data.frame(job_title = names(job_titles_freq), frequency = as.numeric(job_titles_freq))
job_titles_df <- job_titles_df[order(job_titles_df$frequency, decreasing = TRUE), ]
# Calculating percentages
job_titles_df$percentage <- (job_titles_df$frequency / sum(job_titles_df$frequency)) * 100
# Selecting the top 10 most occurring job titles
top_10_jobs <- head(job_titles_df, 10)
# Creating a bar plot for the top 10 job titles with percentages on the y-axis
ggplot(top_10_jobs, aes(x = reorder(job_title, -frequency), y = percentage)) +
geom_bar(stat = "identity", fill = "#5bc8af") +
labs(title = "Top 10 Most Occurring Job Titles", x = "Job Titles", y = "Percentage") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
The most dominant roles include "Data Engineer," "Data Scientist," and "Data Analyst," constituting a significant portion of the dataset's job titles. Additionally, positions like "Machine Learning Engineer," "Analytics Engineer," "Data Architect," and roles related to research, such as "Research Scientist" and "Research Engineer," also hold notable but comparatively smaller percentages.
##### **4.1.3 Bivariate Analysis**
###### **4.1.3.1 Salary Vs. Company Location**
In this analysis, we'll explore the comparative average salaries across various countries where data employees are employed, focusing on the `company_location` variable. Using this information, we aim to visualize and understand the differences in average salaries by country. The approach involves calculating the mean salary for each country listed in the dataset and presenting this information on a world map. This visualization will provide a clearer perspective on the regional discrepancies in salaries within the data industry.
```{r}
# Creating a dataframe with company locations and average salary for each country
salary_location_df <- data %>%
group_by(company_location) %>%
summarise(avg_salary = mean(salary_in_eur, na.rm = TRUE)) %>% # Calculating the average salary
arrange(desc(avg_salary)) # Sorting by average salary in descending order
# Getting world map data
world <- ne_countries(scale = "medium", returnclass = "sf")
# Merging salary_location_df with world map data
map_data <- merge(world, salary_location_df, by.x = "name", by.y = "company_location", all.x = TRUE)
# Plotting
ggplot() +
geom_sf(data = map_data, aes(fill = avg_salary)) +
scale_fill_gradient(low = "#5bc8af", high = "#202060") +
labs(title="Average Salary Vs. Company Location") +
theme_minimal()
```
This map indicates that average salaries for data employees differ considerably between countries. Let's dig deeper to see which 10 countries have the highest average salaries.
```{r}
# Selecting the top 10 countries
top_10_countries <- head(salary_location_df, 10)
# Creating a bar plot for top 10 countries with highest average salaries
ggplot(top_10_countries, aes(x = reorder(company_location, -avg_salary), y = avg_salary)) +
geom_bar(stat = "identity", fill = "#5bc8af") +
labs(title = "Top 10 Countries with Highest Average Salaries",
x = "Company Location", y = "Average Salary") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Puerto Rico seems to offer the highest salaries among these top 10 countries. The United States, Russia, and Canada follow closely behind.
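Given the geographical imbalance noted in section 2.2.2, a country average may rest on very few records, so it can help to report the number of observations next to each mean. A minimal sketch on hypothetical data (the country labels and salaries are invented):

```r
# Hypothetical records: country "A" has a single observation
toy <- data.frame(
  company_location = c("A", "B", "B", "B", "C", "C"),
  salary_in_eur = c(150000, 90000, 110000, 100000, 60000, 80000)
)

# Mean salary and number of records per country (both ordered by country)
avg_salary <- tapply(toy$salary_in_eur, toy$company_location, mean)
n_obs <- table(toy$company_location)

summary_df <- data.frame(
  company_location = names(avg_salary),
  avg_salary = as.numeric(avg_salary),
  n = as.integer(n_obs)
)

# Highest averages first
summary_df <- summary_df[order(-summary_df$avg_salary), ]
print(summary_df)
```

In this toy example the top-ranked country owes its position to one record, which is the kind of case a count column makes visible.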
###### **4.1.3.2 Salary Vs. Job Title**
In this section, we will see the top 10 job titles by average salary.
```{r}
# Creating a dataframe with job titles and their average salary
salary_jobtitle_df <- data %>%
group_by(job_title) %>%
summarise(avg_salary = mean(salary_in_eur, na.rm = TRUE))
# Sorting by average salary in descending order to get the top salaries at the top
salary_jobtitle_df <- salary_jobtitle_df[order(-salary_jobtitle_df$avg_salary),]
# Selecting the top 10 job titles
top_10_jobtitles <- head(salary_jobtitle_df, 10)
# Creating a bar plot for the top 10 job titles
ggplot(top_10_jobtitles, aes(x = reorder(job_title, -avg_salary), y = avg_salary)) +
geom_bar(stat = "identity", fill = "#5bc8af") +
labs(title = "Top 10 Job Titles by Average Salary", x = "Job Title" , y = "Average Salary" ) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_y_continuous(labels = scales::comma, breaks = pretty_breaks(n = 10)) # Formatting the Y-axis labels
```
We can see that the job title with the highest average salary is Data Science Tech Lead, followed by other job titles such as Cloud Data Architect, Data Lead, Data Analytics Lead, and Principal Data Scientist.
###### **4.1.3.3 Salary Vs. Experience Level**
```{r}
# Reordering experience_level from entry level to executive
data$experience_level <- factor(data$experience_level, levels = c("entry level", "mid-level", "senior", "executive"))
# Creating boxplot for salaries across experience levels with custom colors
ggplot(data, aes(x = experience_level, y = salary_in_eur, fill = experience_level)) +
geom_boxplot() +
labs(title = "Salaries across Experience Levels", x = "Experience Level", y = "Salary in EUR") +
scale_fill_manual(values = custom_palette) + # Adding my favorite colors
theme_minimal() +
scale_y_continuous(labels = scales::comma, breaks = pretty_breaks(n = 10))
```
The boxplot demonstrates variations in salary distributions across experience levels. The executive category exhibits the highest median salary among all categories, followed by the senior level. In contrast, the entry-level category displays the lowest median salary, notably less than €55,000 per year. This suggests a significant disparity in salary distributions across different experience levels, with higher experience levels generally commanding higher salaries.
We can also overlay the boxplot on a violin plot to see both the distribution of the data and its summary statistics.
```{r}
# Calculate sample size for each experience level
sample_size <- data %>%
group_by(experience_level) %>%
summarize(num = n())
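# Joining the datasets based on the experience_level column
data_with_size <- left_join(data, sample_size, by = "experience_level") %>%
mutate(myaxis = paste0(experience_level, "\n", "n=", num), # Creating a new column 'myaxis' combining experience level and sample size
myaxis = reorder(myaxis, as.integer(experience_level))) # Keeping the entry-to-executive order, which paste0() would otherwise turn alphabetical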
# Plot
ggplot(data_with_size, aes(x = myaxis, y = salary_in_eur, fill = experience_level)) +
geom_violin(width = 0.7, position = position_dodge(width = 0.6)) +
geom_boxplot(width = 0.1, color = "white", alpha = 0.2, position = position_dodge(width = 0.6)) +
scale_fill_manual(values = custom_palette) + # Adding my favorite colors
theme_minimal() +
ggtitle("Salaries across Experience Levels") +
xlab("") +
ylab("Salary in EUR")+
scale_y_continuous(labels = scales::comma, breaks = pretty_breaks(n = 10))
```
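One detail worth checking in the violin plot: axis labels built with `paste0()` are plain character strings, so ggplot2 orders them alphabetically unless the order is set explicitly. The sketch below shows the effect and one possible way to restore the factor order with `reorder()`; the sample values are hypothetical.

```r
# Hypothetical experience levels with the intended order
levels_in_order <- c("entry level", "mid-level", "senior", "executive")
experience <- factor(c("senior", "entry level", "executive", "mid-level", "senior"),
                     levels = levels_in_order)

# Sample size for the level of each record
n_per_level <- table(experience)
n_for_record <- as.integer(n_per_level[as.character(experience)])

# paste0() returns plain character, so the factor order is lost
axis_label <- paste0(experience, "\n", "n=", n_for_record)
alphabetical <- levels(factor(axis_label))

# reorder() sorts the labels by the integer code of the original factor
restored <- levels(reorder(axis_label, as.integer(experience)))
print(restored)
```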
#### **4.2 Further Analysis**
In this section, I will focus on entry level roles in Europe and answer these questions:
- Which job titles have the highest average salary for entry level roles in Europe?
- Which country offers the highest average salary for entry level roles in Europe?
##### **4.2.1 Entry Level Salaries Vs. Job Titles \| IN EUROPE \|**
```{r}
# Creating a vector of European countries
european_countries <- c("Spain", "France", "Germany", "Italy", "United Kingdom", "Netherlands", "Sweden", "Norway", "Finland", "Denmark", "Belgium", "Austria", "Greece", "Switzerland", "Portugal", "Ireland", "Czech Republic", "Romania", "Hungary", "Poland", "Slovakia", "Croatia", "Bulgaria", "Estonia", "Lithuania", "Latvia", "Slovenia", "Luxembourg", "Malta", "Cyprus")
# Filtering data for entry-level roles in Europe
entry_level_europe <- data[data$experience_level == "entry level" & data$company_location %in% european_countries, ]
# Calculating average salary for each job title in entry-level roles in Europe
avg_salary_entry_level_europe <- aggregate(salary_in_eur ~ job_title, entry_level_europe, FUN = mean)
# Creating a word cloud
wordcloud(words = avg_salary_entry_level_europe$job_title,
freq = avg_salary_entry_level_europe$salary_in_eur,
min.freq = 1,
max.words = 50,
random.order = FALSE,
colors = custom_palette,
scale = c(2, 0.5))
```
This word cloud suggests that, for entry-level roles in Europe, job titles such as AI Developer, Applied Data Scientist, and Research Engineer have higher average salaries than titles such as Computer Vision Software Engineer, Computer Vision Engineer, and ML Engineer. To make this clearer, let's display the top 10 job titles.
```{r}
# Sorting by average salary in descending order
sorted_data_europe <- avg_salary_entry_level_europe[order(-avg_salary_entry_level_europe$salary_in_eur), ]
# Selecting top 10 job titles for entry level in Europe
top_10_jobtitles_entry_europe <- head(sorted_data_europe, 10)
# Creating the bar plot
ggplot(top_10_jobtitles_entry_europe, aes(x = reorder(job_title, salary_in_eur), y = salary_in_eur)) +
geom_bar(stat = "identity", fill = "#5bc8af") +
labs(title = "Top 10 Job Titles with Highest Avg Salary (Entry Level in Europe)", x = "Job Title", y = "Average Salary") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip() # To display the bars horizontally
```
AI Developers appear to have the highest average entry-level salary among data-related roles in Europe, followed by Research Engineers and Applied Data Scientists.
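An average can rest on very few records, so it may help to view each job title's mean next to the number of salaries behind it. The chunk below is a rough sketch under that assumption; `avg_salary_with_counts` is a hypothetical helper, and it expects the `job_title` and `salary_in_eur` columns used above.

```{r}
# Hypothetical helper: average salary and number of records per job title.
# Assumes columns `job_title` and `salary_in_eur`, as used in this notebook.
avg_salary_with_counts <- function(df) {
  # Mean salary per job title, ignoring missing values
  avg <- tapply(df$salary_in_eur, df$job_title, mean, na.rm = TRUE)
  # Count of non-missing salaries per job title
  n <- tapply(!is.na(df$salary_in_eur), df$job_title, sum)
  out <- data.frame(
    job_title = names(avg),
    avg_salary = as.numeric(avg),
    n = as.integer(n[names(avg)])
  )
  # Highest average first
  out <- out[order(-out$avg_salary), ]
  rownames(out) <- NULL
  out
}
```

For example, `head(avg_salary_with_counts(entry_level_europe), 10)` would show the same top 10 together with how many salaries each average is based on.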
##### **4.2.2 Entry Level Salaries Vs. Company Location \| IN EUROPE \|**
```{r}
# Calculating average salary for each country
avg_salary_entry_level_europe <- entry_level_europe %>%
group_by(company_location) %>%
summarise(avg_salary = mean(salary_in_eur, na.rm = TRUE)) %>%
arrange(desc(avg_salary)) # Sorting by average salary in descending order
# Getting world map data for Europe
europe_map <- ne_countries(scale = "medium", continent = "Europe", returnclass = "sf")
# Merge salary data with the map data for Europe
map_data_europe <- merge(europe_map, avg_salary_entry_level_europe, by.x = "name", by.y = "company_location", all.x = TRUE)
# Plotting map of Europe showing average salaries for entry level
ggplot() +
geom_sf(data = map_data_europe, aes(fill = avg_salary)) +
scale_fill_gradient(low = "#5bc8af", high = "#202060") +
labs(title = "Average Salary for Entry Level in European Countries") +
theme_minimal()
```
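Because the merge above matches on country names, a country spelled differently in the salary data and in the map data would silently end up with a missing `avg_salary` on the map. Whether any names actually differ depends on the installed Natural Earth data, so the following is only a sketch of how one might check; `unmatched_countries` is a hypothetical helper name.

```{r}
# Hypothetical helper: country names present in the salary table
# but absent from the map data (these would not be coloured on the map).
unmatched_countries <- function(salary_names, map_names) {
  sort(setdiff(unique(salary_names), unique(map_names)))
}
```

Running `unmatched_countries(avg_salary_entry_level_europe$company_location, europe_map$name)` returns an empty character vector when every country in the salary table has a match on the map.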
To gain a clearer perspective on our findings, let's visualize the data on a bar plot, focusing solely on the top 10 countries exhibiting the highest average salaries.
```{r}
# Sorting the data by average salary in descending order
top_10_countries_eu <- head(avg_salary_entry_level_europe[order(-avg_salary_entry_level_europe$avg_salary), ], 10)
# Creating a bar plot for the top 10 European countries with the highest average salaries
ggplot(top_10_countries_eu, aes(x = reorder(company_location, avg_salary), y = avg_salary)) +
geom_bar(stat = "identity", fill = "#5bc8af") +
labs(title = "Top 10 European Countries with Highest Average Salaries (Entry Level)", x = "Country", y = "Average Salary") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip()
```
The bar plot reveals a notable disparity in average entry-level salaries among European countries. Sweden appears to offer the highest average salary for entry-level roles, followed by Belgium and Germany. Salaries vary substantially, with the top countries showing markedly higher averages than those at the bottom of the list, such as Luxembourg and Denmark.
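The size of this gap can be expressed as a number rather than read off the bars. The chunk below is a small sketch, assuming a numeric vector of country averages such as the `avg_salary` column computed above; `salary_spread` is a hypothetical helper name.

```{r}
# Hypothetical helper: spread of a vector of average salaries.
# Returns the minimum, maximum, absolute range, and max/min ratio.
salary_spread <- function(avg_salaries) {
  x <- avg_salaries[!is.na(avg_salaries)]  # drop missing averages
  c(min = min(x), max = max(x), range = max(x) - min(x), ratio = max(x) / min(x))
}
```

For instance, `salary_spread(top_10_countries_eu$avg_salary)` gives the ratio between the highest and lowest average within the top 10.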
### **5. Conclusion**
Throughout this project, we took a close look at how salaries differ across job titles, experience levels, and countries worldwide. The exploration highlighted clear disparities: certain roles tend to pay notably more or less than others, and pay rises with experience, with senior and executive positions commanding the highest salaries. Geography also plays an important role, with average earnings varying substantially between countries. Box plots, violin plots, bar plots, and maps helped make these differences visible. Looking forward, deeper investigation into the factors behind these disparities, such as industry specifics or skill demands, could provide valuable insights, and studying how salaries change over time might add further perspective. Overall, this analysis indicates that where one works, the job one does, and one's experience all strongly influence earnings.
Please check the `Dashboard` file to see the dashboard I created using flexdashboard.