BUS5DWR Data Wrangling and R Assignment 3
1. The provided data files contain demographics and employment specific data of survey respondents in two separate files. Import the data from the given data files into two separate data frames and combine the created data frames to form a single data frame and explore the summary. (5 marks)
2. Explore the combined data frame, which consists of all the data, for quality assessment and preparation. (15 marks)
a. What potential data quality issues did you find in the merged dataset?
b. What steps did you take to clean the data, address quality issues, ensure consistency, and prepare it for analysis?
Deliverable: Summarize your observations on the dataset's quality, outlining the data quality issues identified and the approaches used to address them. Include relevant screenshots from your analysis to support your findings and the corrections made. Present these details in the summary report. (150-200 words).
NOTE: In general, data can contain various issues. For this question, it is sufficient to highlight several main issues and correct the most relevant ones based on your expertise, with justifications. You may rename columns if needed.
3. Using appropriate data visualization techniques, comment on the relationship between job role and salary among survey respondents. Analyse how salary levels vary across different roles and discuss your observations in the summary report with supporting evidence from your analysis. (100-200 words) (10 marks)
4. Evaluate how years of experience influence salary levels in the IT industry. Are there specific experience ranges or job roles where the impact of experience on salary is more significant or unexpected? Analyse the data to identify whether any such patterns are present and explain your findings. (max 250 words) (10 marks)
5. For new entrants into the IT industry, perform a critical analysis to gain insights into the job market in the EU region based on key factors highlighted in the survey respondent data. Select at least four relevant attributes from the data and analyse their impact using appropriate visualisations. Provide recommendations based on your data-driven analysis to help newcomers understand the job market and navigate it effectively. (max 500 words) (20 marks).
The data are read with a technique suited to each format: the CSV dataset is read directly, and the ‘readxl’ library is used to read the XLSX dataset. The two datasets are then merged into a single data frame, ‘final_data’. The summary describes the components of this data frame (Kopra et al., 2024).
Figure 2: Structure of final merged data
(Source: R-Studio)
The structure of the final combined data frame is shown in this figure. It lists the components of the overall data structure, which holds the data from both datasets.
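The import and merge step described above can be sketched as follows. This is a minimal illustration only: the two small data frames stand in for the imported files, and the key column ‘ID’, the other column names, and the file paths in the comments are assumptions, not the names used in the actual datasets.

```r
# Hypothetical stand-ins for the two imported files. The column names
# (ID, Age, Gender, Position, Salary) are assumptions for illustration.
demographics <- data.frame(
  ID = c(1, 2, 3),
  Age = c(29, NA, 41),
  Gender = c("Male", "", "Female"),
  stringsAsFactors = FALSE
)
employment <- data.frame(
  ID = c(1, 2, 3),
  Position = c("ML Engineer", "Data Scientist", NA),
  Salary = c(85000, 70000, 62000),
  stringsAsFactors = FALSE
)

# With the real files, the imports might look like this (placeholder paths):
# demographics <- read.csv("demographics.csv", stringsAsFactors = FALSE)
# employment   <- readxl::read_excel("employment.xlsx")

# Join on the shared key, keeping respondents that appear in either file.
final_data <- merge(demographics, employment, by = "ID", all = TRUE)

str(final_data)      # structure of the combined data frame
summary(final_data)  # per-column summary
```

Using all = TRUE keeps unmatched respondents from both files, so any gaps show up as ‘NA’ values instead of being dropped silently.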
a. Data Quality Issue
After merging, the main data quality issue is that the final dataset contains many blank entries as well as ‘NA’ values, which need to be handled before further analysis (Pavía and Romero, 2023).
b. Data cleaning approach
Figure 3: Check for ‘NA’ values
(Source: R-Studio)
The ‘NA’ check identifies the columns that contain ‘NA’ values. It shows that most columns contain them: for example, Age has 27 ‘NA’ values and Position has 6.
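A per-column ‘NA’ count of this kind can be obtained in base R as sketched below. The toy data frame and its values are assumptions for illustration and do not reproduce the counts reported above.

```r
# Toy data frame; values are illustrative only.
final_data <- data.frame(
  Age = c(29, NA, 41, NA),
  Position = c("ML Engineer", NA, "Data Scientist", "QA Engineer"),
  Salary = c(85000, 70000, NA, 62000),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(final_data))
print(na_counts)

# Names of the columns that contain at least one NA.
cols_with_na <- names(na_counts[na_counts > 0])
print(cols_with_na)
```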
Figure 4: ‘NA’ handling and removal
(Source: R-Studio)
The ‘NA’ values are handled by replacing them with ‘unknown’ in categorical columns and with 0 in numeric columns.
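One way to apply this replacement rule to every column is sketched below, assuming the categorical columns are stored as character vectors. The toy data are illustrative.

```r
# Toy data frame; values are illustrative only.
final_data <- data.frame(
  Age = c(29, NA, 41),
  Position = c("ML Engineer", NA, "Data Scientist"),
  stringsAsFactors = FALSE
)

# Replace NA with 0 in numeric columns and with "unknown" elsewhere
# (this assumes non-numeric columns are character, not factor).
for (col in names(final_data)) {
  missing <- is.na(final_data[[col]])
  if (is.numeric(final_data[[col]])) {
    final_data[[col]][missing] <- 0
  } else {
    final_data[[col]][missing] <- "unknown"
  }
}

print(final_data)
```

A 0 placeholder in a numeric column is included in any mean computed later, which is worth keeping in mind when averages are compared.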
Figure 5: Blank value checking and removal
(Source: R-Studio)
The blank check looks for empty values in a column. Here it is used to find the blanks in the ‘Gender’ column, which are then replaced with ‘unknown’ to clean that part of the data.
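A minimal sketch of such a blank check on the ‘Gender’ column follows. Treating whitespace-only entries as blank is an assumption made here, and the toy values are illustrative.

```r
# Toy column; values are illustrative only.
final_data <- data.frame(
  Gender = c("Male", "", "Female", "  "),
  stringsAsFactors = FALSE
)

# Treat empty strings and whitespace-only strings as blank
# (the whitespace handling is an assumption, not taken from the report).
is_blank <- !is.na(final_data$Gender) & trimws(final_data$Gender) == ""
print(sum(is_blank))  # number of blank entries found

# Replace the blanks with the "unknown" label.
final_data$Gender[is_blank] <- "unknown"
print(final_data$Gender)
```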
Figure 6: Data conversion
(Source: R-Studio)
The conversion step turns a character (string) column into a numeric one. In this case, ‘TotalYearsOfExperince’ is converted to numeric data.
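The conversion can be sketched with as.numeric(), as below. The toy values are assumptions. Entries that cannot be parsed as numbers become ‘NA’, so a follow-up ‘NA’ check may be needed.

```r
# Toy column stored as text; values are illustrative only.
final_data <- data.frame(
  TotalYearsOfExperince = c("5", "12", "unknown", "1.5"),
  stringsAsFactors = FALSE
)

# as.numeric() turns unparseable entries (such as "unknown") into NA and
# emits a warning; the warning is suppressed here because it is expected.
final_data$TotalYearsOfExperince <- suppressWarnings(
  as.numeric(final_data$TotalYearsOfExperince)
)

str(final_data)
```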
The data are filtered to keep only the rows where ‘unknown’ values are not present. The top 10 roles are then used to determine the average salary of each role.
Figure 8: Plot of average salary for various roles
(Source: R-Studio)
The plot shows the relationship between job role and average salary. Two variables are used: ‘Position’, which represents the job role, and the average salary calculated for each position. The result is a bar plot with job roles on the x-axis and average salary on the y-axis. According to the plot, the highest average salary is found for ‘ML Engineer’.
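The filtering, averaging, and plotting steps can be sketched in base R as follows. The report does not state how the top 10 roles were chosen, so ranking by frequency is an assumption here, as are the ‘Salary’ column name and the toy values.

```r
# Toy data; column names and values are illustrative assumptions.
final_data <- data.frame(
  Position = c("ML Engineer", "ML Engineer", "Data Scientist",
               "Data Scientist", "QA Engineer", "unknown"),
  Salary = c(90000, 80000, 70000, 74000, 50000, 65000),
  stringsAsFactors = FALSE
)

# Drop rows where the role is the "unknown" placeholder.
known <- final_data[final_data$Position != "unknown", ]

# Keep the 10 most frequent roles (assumed ranking criterion).
role_counts <- sort(table(known$Position), decreasing = TRUE)
top_roles <- names(head(role_counts, 10))
top_data <- known[known$Position %in% top_roles, ]

# Average salary per role, sorted from highest to lowest.
avg_salary <- aggregate(Salary ~ Position, data = top_data, FUN = mean)
avg_salary <- avg_salary[order(avg_salary$Salary, decreasing = TRUE), ]
print(avg_salary)

# Bar plot with job roles on the x-axis and average salary on the y-axis.
barplot(avg_salary$Salary, names.arg = avg_salary$Position,
        las = 2, ylab = "Average salary")
```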
The relationship between salary and experience is then evaluated. Salary is assessed across different levels of work experience, which are grouped into the ranges 0-2, 3-5, 6-10, 11-20, and 20+ years. A regression model is used to evaluate the relationship between salary and experience.
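Grouping years of experience into these ranges can be sketched with cut(), as below. The break points are an assumption chosen to match the range labels, and the toy values are illustrative.

```r
# Toy experience values in years; illustrative only.
experience <- c(0, 2, 3, 5, 6, 10, 11, 20, 25)

# cut() uses right-closed intervals by default, so the bins are
# (-Inf, 2], (2, 5], (5, 10], (10, 20], (20, Inf].
experience_range <- cut(
  experience,
  breaks = c(-Inf, 2, 5, 10, 20, Inf),
  labels = c("0-2", "3-5", "6-10", "11-20", "20+")
)

print(table(experience_range))
```

With these breaks a fractional value such as 2.5 years falls into the 3-5 bin. Because the labels are passed in order, the resulting factor levels are also ordered from the lowest range to the highest, which keeps later plots in a natural sequence.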
Figure 10: Scatter plot of salary against experience
(Source: R-Studio)
The graph shows no clear trend between salary and years of experience: most salaries are low, and a single very high salary stretches the horizontal scale (MacAvaney et al., 2021). This indicates that there is little evidence that salary increases with experience, although further analysis could be carried out to look for any correlation.
Figure 11: Box plot of salary distribution by experience range
(Source: R-Studio)
The box plot shows the distribution of salary across the experience ranges 0-2, 3-5, 6-10, 11-20, and 20+ years. Here, the 0-2 years range shows higher salary distribution values than the other ranges.
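A box plot of this kind, together with the median salary per range as a numeric check, can be sketched as follows. The column names ‘ExperienceRange’ and ‘Salary’ and the toy values are assumptions, and only three ranges are used to keep the example short.

```r
# Toy data; column names and values are illustrative assumptions.
final_data <- data.frame(
  ExperienceRange = factor(
    c("0-2", "0-2", "3-5", "3-5", "6-10", "6-10"),
    levels = c("0-2", "3-5", "6-10")
  ),
  Salary = c(40000, 50000, 55000, 65000, 70000, 90000)
)

# One box per experience range.
boxplot(Salary ~ ExperienceRange, data = final_data,
        xlab = "Experience range (years)", ylab = "Salary")

# Median salary per range, which corresponds to the centre line of each box.
median_by_range <- tapply(final_data$Salary, final_data$ExperienceRange, median)
print(median_by_range)
```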
Figure 12: Summary of the linear regression model
(Source: R-Studio)
The linear regression results indicate that salary is not significantly associated with experience: the p-value for experience is high (0.450) and the R-squared value is very low (0.0005).
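The p-value and R-squared value quoted above are the kind of figures that can be read from a fitted lm() model, as sketched below. The data are simulated so that salary is unrelated to experience by construction. They are not the survey data, and the numbers produced will differ from those reported.

```r
# Simulated data (not the survey data): salary is generated independently
# of experience, so no real relationship exists by construction.
set.seed(1)
experience <- runif(200, min = 0, max = 30)
salary <- rnorm(200, mean = 70000, sd = 15000)

# Simple linear regression of salary on experience.
model <- lm(salary ~ experience)
model_summary <- summary(model)

# p-value of the experience coefficient and the model's R-squared.
p_value <- coef(model_summary)["experience", "Pr(>|t|)"]
r_squared <- model_summary$r.squared

print(p_value)
print(r_squared)
```

For a regression with a single predictor, R-squared equals the squared Pearson correlation between the two variables, so a value close to zero means experience explains almost none of the variation in salary.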
The first factor is job role, which is relevant to determining salary. The top 10 positions are selected and rows with ‘unknown’ values are filtered out to clean the data. The average salary for each job role is then plotted.
Figure 14: Plot for factor 1 (Job Role on Salary)
(Source: R-Studio)
According to the plot of job role against salary, the highest salary value is found for ML Engineer, at around 11972988.
Figure 15: Code execution for factor 2 (Years of experience on Salary)
(Source: R-Studio)
The next factor is years of experience, which has an effect on salary. Here, too, the data are filtered to the top 10 values of years of experience, which are then used to determine the salary values (Jaimovitch-López et al., 2023).
Figure 16: Plot for factor 2 (Years of experience on Salary)
(Source: R-Studio)
According to the final result, the highest percentage, approximately 15.6%, is found for 10 years of experience. This level of experience therefore has the highest salary share.
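The report does not state how this percentage was calculated. One possible reading, that each experience level's percentage is its share of the total salary, is sketched below. This interpretation, the column names, and the toy values are all assumptions.

```r
# Toy data; column names and values are illustrative assumptions.
final_data <- data.frame(
  Experience = c(5, 5, 10, 10, 3),
  Salary = c(50000, 50000, 100000, 100000, 100000)
)

# Total salary per experience level, then each level's percentage share
# of the overall total (an assumed definition of the percentage).
total_by_exp <- tapply(final_data$Salary, final_data$Experience, sum)
share_pct <- round(100 * total_by_exp / sum(total_by_exp), 1)
print(share_pct)

# Experience level with the largest share.
top_level <- names(share_pct)[which.max(share_pct)]
print(top_level)
```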
Figure 17: Code execution for factor 3 (Main programming language on Salary)
(Source: R-Studio)
The third factor is the main programming language, which also has an effect on salary. Here, the data are filtered to the top 10 programming languages and rows with ‘unknown’ values are removed.
Figure 18: Plot for factor 3 (Main programming language on Salary)
(Source: R-Studio)
According to the plot, the highest salary is found for the ‘Scala’ programming language, with a salary value of 85146.
Figure 19: Code execution for factor 4 (Employment Status on Salary)
(Source: R-Studio)
The fourth and final factor is employment status, which also has an effect on the salary distribution. Here, too, rows with ‘unknown’ values are filtered out and the top 10 categories are kept.
Figure 20: Plot for factor 4 (Employment Status on Salary)
(Source: R-Studio)
The plot shows that the highest salary value is found for Full-time employment, at 491369.
Jaimovitch-López, G., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., & Ramírez-Quintana, M. J. (2023). Can language models automate data wrangling? Machine Learning, 112(6), 2053-2082. https://doi.org/10.1007/s10994-022-06259-9
Kopra, J., Tikka, S., Heinäniemi, M., López-Pernas, S., & Saqr, M. (2024). An R approach to data cleaning and wrangling for education research. In Learning Analytics Methods and Tutorials: A Practical Guide Using R (pp. 95-119). Cham: Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-54464-4
MacAvaney, S., Yates, A., Feldman, S., Downey, D., Cohan, A., & Goharian, N. (2021, July). Simplified data wrangling with ir_datasets. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 2429-2436). https://doi.org/10.1145/3404835.3463254
Pavía, J. M., & Romero, R. (2023). Data wrangling, computational burden, automation, robustness and accuracy in ecological inference forecasting of R×C tables. SORT, 47, 151-186. https://doi.org/10.57645/20.8080.02.4