This code assessment is a coding challenge for Data Engineering roles. Its goals are to:
- Evaluate your coding abilities and software engineering skills
- Assess your technical experience
- Understand how you design a solution
- Set up a technical conversation once the code submission and review are done
In this challenge, we will use a dataset of current job postings hosted by the City of New York, which is available on the City of New York's official jobs site (https://www1.nyc.gov/jobs/index.page).
- Internal postings available to city employees and external postings available to the public are included.
- The data is accessible in the CSV file nyc-jobs.csv.
This challenge is composed of the following steps (illustrative PySpark sketches for several of them follow this list):
- Provide a detailed analysis of the source data: column values (e.g., numerical vs. character), categorical columns, etc.
- List of KPIs to be resolved:
- What is the number of job postings per category (top 10)?
- What is the salary distribution per job category?
- Is there any correlation between a higher degree and the salary?
- Which job posting has the highest salary per agency?
- What is the average salary of job postings per agency for the last 2 years?
- What are the highest-paid skills in the US market?
- Create functions to process your dataset (cleaning, column pre-processing, data wrangling, transformation, etc.)
- Apply at least 3 feature engineering techniques
- Remove features based on your exploration/profiling
- Store your processed data into a target file
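As a starting point for the analysis and KPI steps above, here is a minimal PySpark sketch. It assumes the dataset is already loaded into a DataFrame named df (as in the provided notebook) and that the CSV exposes a column named "Job Category"; adjust the names to the actual schema.

```python
from pyspark.sql import functions as F

# Profiling: schema (numerical vs. character columns) and summary statistics.
df.printSchema()
df.describe().show()

# Distinct counts help flag categorical columns (low cardinality).
# Note: this runs one Spark job per column, which is fine for exploration.
for c in df.columns:
    print(c, df.select(c).distinct().count())

# KPI: number of job postings per category (top 10).
# The column name "Job Category" is an assumption about the CSV schema.
(df.groupBy("Job Category")
   .agg(F.count("*").alias("postings"))
   .orderBy(F.desc("postings"))
   .limit(10)
   .show(truncate=False))
```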
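The cleaning and storage steps could be wrapped in small, testable functions along these lines; this is only a sketch, and the salary column names, output path, and Parquet format are assumptions.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def clean_postings(df: DataFrame) -> DataFrame:
    """Drop duplicate rows, trim string columns, and cast the salary
    bounds to double ("Salary Range From"/"To" are assumed names)."""
    out = df.dropDuplicates()
    for c, dtype in out.dtypes:
        if dtype == "string":
            out = out.withColumn(c, F.trim(F.col(c)))
    return (out
            .withColumn("Salary Range From", F.col("Salary Range From").cast("double"))
            .withColumn("Salary Range To", F.col("Salary Range To").cast("double")))

def write_target(df: DataFrame, path: str = "output/nyc_jobs_processed.parquet") -> None:
    """Store the processed data into a target file (path/format are assumptions)."""
    df.write.mode("overwrite").parquet(path)
```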
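For the feature engineering step, here is a sketch of three possible techniques (a derived numeric feature, binning, and one-hot encoding). It assumes the salary columns have already been cast to double by a cleaning step and that a "Posting Type" column exists.

```python
from pyspark.ml.feature import Bucketizer, OneHotEncoder, StringIndexer
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def engineer_features(df: DataFrame) -> DataFrame:
    # Technique 1: derived feature, the midpoint of the posted salary range.
    df = df.withColumn(
        "salary_mid",
        (F.col("Salary Range From") + F.col("Salary Range To")) / 2)

    # Technique 2: binning, bucketing the midpoint into salary bands.
    bucketizer = Bucketizer(
        splits=[0.0, 50000.0, 100000.0, 150000.0, float("inf")],
        inputCol="salary_mid", outputCol="salary_band",
        handleInvalid="keep")
    df = bucketizer.transform(df)

    # Technique 3: categorical encoding, index then one-hot encode.
    indexer = StringIndexer(inputCol="Posting Type", outputCol="posting_type_idx",
                            handleInvalid="keep")
    df = indexer.fit(df).transform(df)
    encoder = OneHotEncoder(inputCols=["posting_type_idx"],
                            outputCols=["posting_type_vec"])
    return encoder.fit(df).transform(df)
```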
The expected deliverables are:
- The above-mentioned challenge
- Test cases: you need to create and run your own test cases for every function you build.
- Code Comments
- If any deployment is to be done, please document the proposed deployment steps in MyDocument.md
- If your code needs to be triggered or scheduled, please suggest your approach in MyDocument.md
Guidelines:
- PySpark should be used
- Feel free to use any libraries (you can use pip install, if needed)
- Use a visualization library to present your analysis results
- Please document any learnings, challenges, considerations, or assumptions in MyDocument.md
- Except for visualizations, keep everything as PySpark DataFrames (avoid converting PySpark DataFrames to pandas DataFrames); a sketch of this pattern follows this list
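One way to respect the "stay in PySpark" guideline is to aggregate in Spark and convert only the small result for plotting. A minimal sketch, assuming matplotlib, a loaded DataFrame df, and the "Job Category" column name:

```python
import matplotlib.pyplot as plt
from pyspark.sql import functions as F

# Aggregate in Spark; only the tiny 10-row result is converted to pandas.
top10 = (df.groupBy("Job Category")
           .count()
           .orderBy(F.desc("count"))
           .limit(10)
           .toPandas())

top10.plot.barh(x="Job Category", y="count", legend=False)
plt.title("Top 10 job categories by number of postings")
plt.tight_layout()
plt.show()
```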
To get started:
- Use your own GitHub account to fork this repository.
- Follow the installation instructions in INSTALL.md
- Open the Jupyter notebook home page using a link similar to the one highlighted in the picture
- Open the notebook (assesment_notebook.ipynb).
- Run all the cells
- After creating the Spark session, you should be able to see an application named pyspark-assesment running in the Spark Master UI, as shown below; a rough sketch of creating such a session follows.
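For orientation, the session the notebook creates will look roughly like this sketch; the master URL below is an assumption, so use whatever your INSTALL.md setup configures:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://spark-master:7077")  # assumed URL; use your cluster's master
         .appName("pyspark-assesment")         # the name shown in the Spark Master UI
         .getOrCreate())
```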
By now you should have everything needed set up on your own machine:
- A Spark cluster with a master and 2 worker nodes.
- A notebook to start working in, with the basic libraries imported (please feel free to import any new library, if required).
- The dataset required for your activity, already read in.
- A sample function within the notebook arriving at a basic KPI (distinct_values.py).
- A sample test case for testing that function (test_distinct_values.py).
- Test cases runnable using 'make test'.
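Mirroring the shipped test_distinct_values.py, a pytest-style test for one of your own functions could look like this sketch; clean_postings is the hypothetical cleaning function sketched earlier, and the module name is an assumption.

```python
import pytest
from pyspark.sql import SparkSession

from jobs_processing import clean_postings  # hypothetical module from the sketch above

@pytest.fixture(scope="session")
def spark():
    # A small local session is enough for unit tests.
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

def test_clean_postings_trims_and_casts(spark):
    df = spark.createDataFrame(
        [("  Analyst ", "50000", "70000")],
        ["Business Title", "Salary Range From", "Salary Range To"])
    row = clean_postings(df).first()
    assert row["Business Title"] == "Analyst"
    assert row["Salary Range From"] == 50000.0
    assert row["Salary Range To"] == 70000.0
```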
Once the activity is completed:
- Commit and push your code to your own Git repository.
- Please do NOT raise a Pull Request to this repo.
- Make your GitHub repo as private. (GitHub -> Settings -> General -> Danger Zone -> Change repository visibility -> Change Visibility -> Make Private)
- Share access with projectforyou20@gmail.com (GitHub -> Settings -> Collaborators -> Manage Access -> Add People -> Add the email address projectforyou20@gmail.com)
- Share the GitHub repository URL with the hiring team.
We will review your work and get back to you!