BUS5DWR Data Wrangling and R Assignment 3 Sample

Questions:

1. The provided data files contain demographics and employment specific data of survey respondents in two separate files. Import the data from the given data files into two separate data frames and combine the created data frames to form a single data frame and explore the summary. (5 marks)

2. Explore the combined data frame, which consists of all the data, for quality assessment and preparation. (15 marks)

a. What potential data quality issues did you find in the merged dataset?

b. What steps did you take to clean the data, address quality issues, ensure consistency, and prepare it for analysis?

Deliverable: Summarize your observations on the dataset's quality, outlining the data quality issues identified and the approaches used to address them. Include relevant screenshots from your analysis to support your findings and the corrections made. Present these details in the summary report. (150-200 words).

NOTE: In general, data can contain various issues. For this question, it is sufficient to highlight several main issues and correct the most relevant ones based on your expertise, with justifications. You may rename columns if needed.

3. Using appropriate data visualization techniques, comment on the relationship between job role and salary among survey respondents. Analyse how salary levels vary across different roles and discuss your observations in the summary report with supporting evidence from your analysis. (100-200 words) (10 marks)

4. Evaluate how years of experience influence salary levels in the IT industry. Are there specific experience ranges or job roles where the impact of experience on salary is more significant or unexpected? Analyse the data to identify whether any such patterns are present and explain your findings. (max 250 words) (10 marks)

5. For new entrants into the IT industry, perform a critical analysis to gain insights into the job market in the EU region based on key factors highlighted in the survey responder data. Select at least four relevant attributes from the data and analyse their impact using appropriate visualisations. Provide recommendations based on your data driven analysis to help newcomers understand the job market and navigate it effectively. (max 500 words) (20 marks).

Solution

Question 1

Figure 1: Reading the two data files
(Source: R-Studio)

Both data files are imported into R: one is a CSV file and the other is an XLSX file, which is read with the ‘readxl’ library. The two resulting data frames are then merged into a single data frame, ‘final_data’, and its summary describes the contents of each column (Kopra et al., 2024).

Figure 2: Structure of final merged data
(Source: R-Studio)

This output shows the structure of the final merged data frame, which contains the columns from both source datasets.
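The merge step described above can be sketched in base R as follows. This is a minimal illustration, not the code from the screenshots: the two inline data frames stand in for the imported files (which would normally come from `read.csv()` and `readxl::read_excel()`), and the key column `ID` and the sample values are assumptions.

```r
# Hypothetical stand-ins for the two imported files. The column names other
# than Age, Gender, and Position are assumptions made for illustration.
demographics <- data.frame(
  ID = c(1, 2, 3),
  Age = c(28, NA, 35),
  Gender = c("Male", "Female", ""),
  stringsAsFactors = FALSE
)
employment <- data.frame(
  ID = c(1, 2, 3),
  Position = c("ML Engineer", "Data Scientist", NA),
  Salary = c(90000, 70000, 65000),
  stringsAsFactors = FALSE
)

# Join on the shared key so that each respondent becomes a single row.
final_data <- merge(demographics, employment, by = "ID")

# Inspect the combined data frame.
summary(final_data)
str(final_data)
```

By default `merge()` keeps only the keys found in both data frames; passing `all = TRUE` would keep unmatched respondents as well, with `NA` in the missing columns.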

Question 2

a. Data quality issues

After merging, the main data quality issues are the many blank entries and ‘NA’ values in the final dataset, which need to be handled before further analysis (Pavía & Romero, 2023).

b. Data cleaning approach

Figure 3: Checking for ‘NA’ values
(Source: R-Studio)

An ‘NA’ check is run to identify the columns that contain missing values. Most columns contain some: for example, Age has 27 ‘NA’ values and Position has 6.

Figure 4: Handling ‘NA’ values
(Source: R-Studio)

The ‘NA’ values are handled by replacing them with ‘unknown’ in categorical columns and with 0 in numeric columns.
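A minimal base R sketch of the ‘NA’ check and replacement might look like the following. The small data frame and its values are made up for illustration, and the loop is one possible way to apply the rule, not necessarily the approach shown in the figures.

```r
# Hypothetical sample with missing values in a numeric and a character column.
final_data <- data.frame(
  Age = c(28, NA, 35, NA),
  Position = c("ML Engineer", NA, "Data Scientist", "QA Engineer"),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
na_counts <- colSums(is.na(final_data))
print(na_counts)

# Replace NA with "unknown" in character columns and with 0 in numeric columns.
for (col in names(final_data)) {
  missing <- is.na(final_data[[col]])
  if (is.character(final_data[[col]])) {
    final_data[[col]][missing] <- "unknown"
  } else if (is.numeric(final_data[[col]])) {
    final_data[[col]][missing] <- 0
  }
}
```

A 0 placeholder in a numeric column takes part in any later mean or regression unless those rows are filtered out first.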

Figure 5: Checking for and replacing blank values
(Source: R-Studio)

A blank check is used to find empty values in a column. Here it identifies blanks in the ‘Gender’ column, which are replaced with ‘unknown’ to clean the data.

Figure 6: Data conversion
(Source: R-Studio)

A type conversion turns a character (string) column into a numeric one. Here, ‘TotalYearsOfExperince’ is converted to numeric.
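The blank replacement and the type conversion can be illustrated together with a short base R sketch. The sample values are assumptions, and treating whitespace-only strings as blanks is a choice made here rather than something stated in the report.

```r
# Hypothetical sample: blanks in Gender and experience stored as text.
final_data <- data.frame(
  Gender = c("Male", "", "Female", "  "),
  TotalYearsOfExperince = c("5", "10", "1.5", "n/a"),
  stringsAsFactors = FALSE
)

# Treat empty or whitespace-only strings as blanks and label them "unknown".
is_blank <- trimws(final_data$Gender) == ""
final_data$Gender[is_blank] <- "unknown"

# Convert the text column to numeric. Entries that cannot be parsed become NA,
# so the coercion warning is suppressed here.
final_data$TotalYearsOfExperince <- suppressWarnings(
  as.numeric(final_data$TotalYearsOfExperince)
)
```

Because unparseable entries turn into `NA`, it can be worth re-running the ‘NA’ check after the conversion.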

Question 3

Figure 7: Filtering the data and computing the average salary by role
(Source: R-Studio)

The data are filtered to keep only the rows that do not contain ‘unknown’ values. The top 10 roles are then used to compute the average salary for each role.

Figure 8: Plot of average salary by role
(Source: R-Studio)

A bar plot shows the relationship between job role and average salary. It uses two variables: ‘Position’, which gives the job role, and the average salary calculated for each position. Job roles are on the x-axis and average salary is on the y-axis. According to the plot, ‘ML Engineer’ has the highest average salary.
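The filtering and averaging behind such a plot could be sketched in base R as below. The sample rows and the `Salary` column name are assumptions, and "top 10" is interpreted here as the ten most frequent roles, which the report does not state explicitly.

```r
# Hypothetical sample of roles and salaries.
final_data <- data.frame(
  Position = c("ML Engineer", "ML Engineer", "Data Scientist",
               "unknown", "QA Engineer", "Data Scientist"),
  Salary = c(90000, 110000, 70000, 50000, 55000, 80000),
  stringsAsFactors = FALSE
)

# Drop rows whose role was filled in as "unknown" during cleaning.
known <- final_data[final_data$Position != "unknown", ]

# Keep the most frequent roles (up to 10; this small sample has fewer).
role_counts <- sort(table(known$Position), decreasing = TRUE)
top_roles <- names(head(role_counts, 10))
known <- known[known$Position %in% top_roles, ]

# Average salary per role, sorted from highest to lowest.
avg_salary <- aggregate(Salary ~ Position, data = known, FUN = mean)
avg_salary <- avg_salary[order(-avg_salary$Salary), ]
print(avg_salary)
```

A bar chart of the result can then be drawn with, for example, `barplot(avg_salary$Salary, names.arg = avg_salary$Position, las = 2)`.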

Question 4

Figure 9: Analysis of the relationship between salary and experience
(Source: R-Studio)

The relationship between salary and experience is assessed by evaluating salary across different levels of work experience. Experience is also grouped into the ranges 0-2, 3-5, 6-10, 11-20, and 20+ years. A linear regression model is used to evaluate the relationship between salary and experience.
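One way to build the experience ranges named above is base R's `cut()`. This is a sketch under assumptions: the sample values are invented, and the exact boundary handling used in the report is not shown in the text.

```r
# Hypothetical years-of-experience values.
years <- c(0, 2, 3, 5, 6, 10, 11, 20, 25)

# Bin the values into the ranges used in the report. With right = TRUE each
# interval includes its upper bound, so 2 falls in "0-2" and 20 in "11-20".
experience_range <- cut(
  years,
  breaks = c(-Inf, 2, 5, 10, 20, Inf),
  labels = c("0-2", "3-5", "6-10", "11-20", "20+"),
  right = TRUE
)

# Number of respondents in each range.
table(experience_range)
```

With these breaks a fractional value such as 2.5 is placed in "3-5". The resulting factor can be passed directly to `boxplot()` with a formula such as `Salary ~ experience_range`.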

Figure 10: Scatter plot of salary against experience
(Source: R-Studio)

The scatter plot shows no clear linear relationship between salary and years of experience: most salaries are low, and a single very high salary stretches the horizontal scale (MacAvaney et al., 2021). This gives little support for the idea that salary rises with experience, although further analysis could be carried out to look for a correlation.

Figure 11: Box plot of salary distribution by experience range
(Source: R-Studio)

The box plot shows the distribution of salary for the experience ranges 0-2, 3-5, 6-10, 11-20, and 20+ years. The 0-2 years range shows higher salary distribution values than the other ranges.

Figure 12: Summary of the linear regression model
(Source: R-Studio)

The linear regression results indicate that salary is not significantly related to experience: the p-value for experience is high (0.450) and the R-squared value is low (0.0005).
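For reference, the p-value and R-squared quoted above are the kind of values that can be read from a fitted `lm()` model as sketched below. The data here are invented, so the numbers this produces are not the ones reported in Figure 12.

```r
# Hypothetical experience (years) and salary values.
experience <- c(1, 2, 3, 5, 8, 10, 12, 15, 20, 25)
salary <- c(60, 75, 58, 80, 62, 90, 55, 70, 65, 72) * 1000

# Fit a simple linear regression of salary on experience.
model <- lm(salary ~ experience)
model_summary <- summary(model)

# p-value of the experience coefficient and the model's R-squared.
p_value <- coef(model_summary)["experience", "Pr(>|t|)"]
r_squared <- model_summary$r.squared

print(p_value)
print(r_squared)
```

A large p-value together with an R-squared near zero means the fitted line explains almost none of the variation in salary, which is how the figures in the report are interpreted.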

Question 5

Figure 13: Analysis of factor 1 (job role and salary)
(Source: R-Studio)

The first factor is job role, which is relevant in determining salary. The top 10 positions are selected and rows with ‘unknown’ values are filtered out to clean the data. A plot is then produced of the average salary for the different job roles.

Figure 14: Plot for factor 1 (job role and salary)
(Source: R-Studio)

According to the plot of job role against salary, the highest value is for ML Engineer, at around 11972988.

Figure 15: Execution of factor 2 (Year of experience on Salary)
(Source: R-Studio)

The second factor is years of experience, which is expected to influence salary. The data are filtered to the top 10 values of years of experience, which are then used to examine salary (Jaimovitch-López et al., 2023).

Figure 16: Plot of executed factor 2 (Year of experience on Salary)
(Source: R-Studio)

According to the final result, the highest percentage is for 10 years of experience, at approximately 15.6%. This level of experience therefore has the highest salary value.

Figure 17: Execution of factor 3 (Main programming Language on Salary)
(Source: R-Studio)

The third factor is the main programming language, which also affects salary. The data are filtered to the top 10 programming languages and rows with ‘unknown’ values are removed.

Figure 18: Plot of executed factor 3 (Main programming Language on Salary)
(Source: R-Studio)

According to the plot, the highest salary is for the ‘Scala’ programming language, with a value of 85146.

Figure 19: Execution of factor 4 (Employment Status on Salary)
(Source: R-Studio)

The fourth and final factor is employment status, which also affects the salary distribution. Rows with ‘unknown’ values are filtered out and the top 10 categories are kept.

Figure 20: Plot for factor 4 (employment status and salary)
(Source: R-Studio)

The plot shows that the highest salary value is for Full-time employment, at 491369.

References

Jaimovitch-López, G., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., & Ramírez-Quintana, M. J. (2023). Can language models automate data wrangling? Machine Learning, 112(6), 2053-2082. https://doi.org/10.1007/s10994-022-06259-9

Kopra, J., Tikka, S., Heinäniemi, M., López-Pernas, S., & Saqr, M. (2024). An R approach to data cleaning and wrangling for education research. In Learning Analytics Methods and Tutorials: A Practical Guide Using R (pp. 95-119). Cham: Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-54464-4

MacAvaney, S., Yates, A., Feldman, S., Downey, D., Cohan, A., & Goharian, N. (2021, July). Simplified data wrangling with ir_datasets. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 2429-2436). https://doi.org/10.1145/3404835.3463254

Pavía, J. M., & Romero, R. (2023). Data wrangling, computational burden, automation, robustness and accuracy in ecological inference forecasting of R×C tables. SORT, 47, 151-186. https://doi.org/10.57645/20.8080.02.4
