Welcome to the repository for the Data Owls - Data Engineering Cohort Workshop! This repository contains all the materials, exercises, and resources you'll need to complete the cohort and gain hands-on experience in data engineering. I will be combining different sources with my own explanations so that you can follow along. Links to those sources are provided at the end of each section. There will also be video tutorials, chosen from YouTube videos that we find easy to follow.
This repository includes:
- Workshop Materials: Step-by-step guides, slides, and explanations covering key concepts in data engineering such as data pipelines, databases, ETL processes, and cloud tools.
- Project Templates: Templates and boilerplate code for building scalable data engineering projects. You can fork this repository to customize and implement your solutions.
- Collaborative Learning: Fork the repository, make changes, and submit pull requests to collaborate with your peers. Share insights, improvements, and solutions as we learn together.
Feel free to explore, contribute, and make the most out of this resource!
The purpose of this cohort is to guide participants into data engineering. I know there are a lot of sources out there, but this resource brings together multiple sources that I believe are easier to understand, even for a beginner.
This cohort is divided into four weeks. Each week focuses on a specific area that is important in the field of data engineering. But first, you might ask, "What the hell is Data Engineering?"
According to Coursera, data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale. So if you are part of an organization that handles big data (Facebook, for example), you need to make sure the data is in a highly usable state by the time it reaches others, such as data scientists and analysts.
So what do we mean by highly usable? When we collect data, we cannot simply process it immediately. You need to convert that raw data into a usable format first (e.g., from CSV to Postgres).
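To make the CSV-to-database idea concrete, here is a minimal sketch in Python. It is only an illustration under a few assumptions: the sample data, the `trips` table, and the `load_csv` helper are all made up for this example, and it uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs without any setup. With a real Postgres instance the steps would look similar, but you would connect through a Postgres driver instead.

```python
import csv
import io
import sqlite3

# Hypothetical raw data; in practice this would be a CSV file on disk.
RAW_CSV = """trip_id,passenger_count,fare_amount
1,2,12.50
2,1,7.00
3,,15.25
"""


def load_csv(conn, csv_text):
    """Parse raw CSV text, cast each field to a proper type, and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips ("
        "trip_id INTEGER PRIMARY KEY, passenger_count INTEGER, fare_amount REAL)"
    )
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        rows.append((
            int(record["trip_id"]),
            # An empty CSV field becomes NULL instead of an empty string.
            int(record["passenger_count"]) if record["passenger_count"] else None,
            float(record["fare_amount"]),
        ))
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# sqlite3 in-memory database used here as a stand-in for Postgres (assumption).
conn = sqlite3.connect(":memory:")
inserted = load_csv(conn, RAW_CSV)
total_fare = conn.execute("SELECT SUM(fare_amount) FROM trips").fetchone()[0]
print(inserted, total_fare)
```

In the CSV everything is just text, including the missing passenger count. Once the rows are in a table with typed columns, analysts can query them directly (for example, summing the fares) without cleaning the file again.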
In a cohort held in 2024 by DataTalks, emphasis was put on Containerization and Infrastructure as Code (IaC). So in the first week, you will dive deep into Docker, Postgres with Docker, Terraform, and Azure (or GCP).
In the first session, we will tackle Containerization and IaC. The workshop will also cover the steps for installing Docker and running Postgres locally with Docker.