The Data Engineering Zoomcamp began on January 16, 2023. It offers free instruction on the most widely used technologies and tools in data engineering and the cloud. In week 1, participants learn how to build a data pipeline that retrieves and ingests data, using Docker, Postgres, Docker-compose, Terraform, Google Cloud, and a Google VM.

Architecture diagram

Docker + Postgres

Introduction to Docker

Docker is an open-source platform that allows developers to easily create, deploy, and run applications in containers. Containers are lightweight, portable, and self-sufficient environments that allow applications to run consistently across different environments.

Introduction to Postgres

Postgres is a versatile database that is designed for transactional purposes rather than analytics. Despite this, it is powerful and sometimes employed as a data warehouse solution.

Docker commands

# Run the hello-world test image
docker run hello-world
# -it: run in interactive mode
docker run -it ubuntu bash

# Start a Python 3.9 container
docker run -it python:3.9
# Override the entrypoint to get a bash shell instead of the Python REPL
docker run -it --entrypoint=bash python:3.9

# Build an image called taxi-ingest:v001 from the Dockerfile in the current directory
docker build -t taxi-ingest:v001 .

# Run the test:pandas image (built the same way from an earlier exercise)
docker run -it test:pandas
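The test:pandas image in the last command comes from an earlier exercise and is built the same way as taxi-ingest:v001. A minimal Dockerfile for it might look like the sketch below; pipeline.py is a hypothetical placeholder for any small script that imports pandas:

FROM python:3.9

RUN pip install pandas

WORKDIR /app
# pipeline.py is a hypothetical test script that imports pandas
COPY pipeline.py pipeline.py

ENTRYPOINT [ "python", "pipeline.py" ]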

Ingesting data into the database

Once you have created the Dockerfile below, you can build the ingestion image by running docker build -t taxi-ingest:v001 . and then run it with docker run -it taxi-ingest:v001, passing the database connection details as arguments (the full invocation is shown in the Docker Network section). Dockerfile:

FROM python:3.9.1

# wget is used to download the dataset; update the package index first
RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2

WORKDIR /app
COPY ingest_data.py ingest_data.py

ENTRYPOINT [ "python", "ingest_data.py" ]

By running the build and run commands above, the dataset will be downloaded and loaded, ready for use in the Postgres container.
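The ingest_data.py script copied into the image is not reproduced in this post. Below is a minimal sketch of what it might look like, assuming the CLI flags used in the Docker Network section later (--user, --password, --host, --port, --db, --table_name, --url); the chunked loading is one reasonable approach rather than the exact course code:

import argparse

import pandas as pd
from sqlalchemy import create_engine


def main():
    # Flag names match the docker run invocation in the Docker Network section
    parser = argparse.ArgumentParser(description="Ingest CSV data into Postgres")
    parser.add_argument("--user")
    parser.add_argument("--password")
    parser.add_argument("--host")
    parser.add_argument("--port")
    parser.add_argument("--db")
    parser.add_argument("--table_name")
    parser.add_argument("--url", help="URL of the CSV file to download and ingest")
    args = parser.parse_args()

    engine = create_engine(
        f"postgresql://{args.user}:{args.password}@{args.host}:{args.port}/{args.db}"
    )

    # Read and write the CSV in chunks so large files do not exhaust memory
    first = True
    for chunk in pd.read_csv(args.url, chunksize=100_000):
        chunk.to_sql(args.table_name, engine,
                     if_exists="replace" if first else "append", index=False)
        first = False


if __name__ == "__main__":
    main()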

Running Postgres locally with Docker

docker run -it \
 -e POSTGRES_USER="root" \
 -e POSTGRES_PASSWORD="root" \
 -e POSTGRES_DB="ny_taxi" \
 -v "$(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data"\
 -p 5432:5432 \
 postgres:13

To interact with Postgres from the command line, we can install pgcli.

# Install pgcli with pip
pip install pgcli
# Connect to the Postgres database
pgcli -h localhost -p 5432 -u root -d ny_taxi
# List the tables (inside pgcli)
\dt
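After the data has been ingested (see the Docker Network section below for the ingestion command), a quick sanity check at the pgcli prompt might look like:

-- Count the rows in the ingested table (run at the pgcli prompt)
SELECT count(1) FROM green_taxi_trips;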

pgAdmin

It’s not convenient to use pgcli for data exploration and querying. Instead, we will use pgAdmin, the standard graphical tool for Postgres, which we can also run with Docker. However, by default this container can’t reach the Postgres container, since each container runs in its own network namespace; we need to link them through a shared Docker network.

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  dpage/pgadmin4

pgAdmin UI

Docker Network

# To link the database with pgAdmin, create a network
# and attach both containers to it
docker network create pgnetwork

docker run -it \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="root" \
  -e POSTGRES_DB="ny_taxi" \
  -v "$(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data"\
  -p 5432:5432 \
  --name pgdatabase \
  --network pgnetwork \
  postgres:13

docker run -it \
  -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
  -e PGADMIN_DEFAULT_PASSWORD="root" \
  -p 8080:80 \
  --name pgadmin \
  --network pgnetwork \
  dpage/pgadmin4

# Ingest the data: run the taxi-ingest image on the same network
# (URL is a shell variable holding the dataset's download link)
docker run -it \
    --network=pgnetwork \
    taxi-ingest:v001 \
        --user=root \
        --password=root \
        --host=pgdatabase \
        --port=5432 \
        --db=ny_taxi \
        --table_name=green_taxi_trips \
        --url=${URL}

It works, but we need to keep two terminal tabs running, create the network manually, and handle a handful of other details. Let’s use Docker Compose, which takes care of all that.

Docker-compose

Docker Compose is a powerful tool that makes it easy to define and run multi-container Docker applications, simplifying development, testing, and deployment. It lets developers define all the services and dependencies of an application in a single file, and then start, stop, and manage those services with simple commands. Docker Compose also allows networks and volumes to be shared between services, and it automatically creates a default network for the application, so services can reach each other by service name without any manual network setup. The docker-compose.yml file is a YAML file that defines the services, networks, and volumes the application needs.

# Start the application in "detached" mode: the containers run in the background and the terminal is free for other commands.
docker-compose up -d
# Stop and remove the containers
docker-compose down

docker-compose.yml

services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw"
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8080:80"

Terraform + Google Cloud Platform

Google Account

Step 1: Create a new Google account and claim the free trial credit of 300€.

Step 2: Create a new project.

Step 3: Create a service account

  • Assign the Viewer role.

Role assignment

  • Create and download the key.

Install the Google Cloud SDK

# Run the install script, then initialize the gcloud CLI
~ ./google-cloud-sdk/install.sh
Welcome to the Google Cloud CLI!
~ gcloud init

# Point GOOGLE_APPLICATION_CREDENTIALS at the service-account key downloaded earlier
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token/session, and verify authentication
gcloud auth application-default login
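To confirm that authentication worked, two standard gcloud sanity checks:

# List credentialed accounts and show the active project
gcloud auth list
gcloud config list project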

Terraform

Terraform is an open-source infrastructure-as-code tool that enables you to safely and predictably create, change, and improve infrastructure. First of all, follow the instructions to install Terraform. The main commands are listed below; a minimal configuration sketch follows the list.

Commands:

  1. terraform init:
    • Initializes and configures the backend, installs plugins/providers, and checks out an existing configuration from version control.
  2. terraform plan:
    • Matches/previews local changes against the remote state and proposes an execution plan.
  3. terraform apply:
    • Asks for approval of the proposed plan and applies the changes to the cloud.
  4. terraform destroy:
    • Removes your stack from the cloud.
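As a concrete starting point, a minimal main.tf for this setup might look like the sketch below; the project ID, region, and bucket name are placeholders, and the real configuration will typically define more resources and variables:

terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

provider "google" {
  project = "<your-project-id>"   # placeholder: your GCP project ID
  region  = "europe-west1"        # placeholder region
}

# A Cloud Storage bucket to serve as the data lake
resource "google_storage_bucket" "data_lake" {
  name          = "<your-unique-bucket-name>"  # bucket names must be globally unique
  location      = "EU"
  force_destroy = true
}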

Setting up the VM

Step 1: Generate an SSH key, following the documentation. Then work through the steps below on the VM.

# 1. Install anaconda
wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh

bash Anaconda3-2022.10-Linux-x86_64.sh

# 2. Configure VS Code to access the cloud VM
#    (create and edit ~/.ssh/config; see the example after this block)
cd ~/.ssh
touch config
code config
ssh data-engineering-demo
 
# 3. Install Docker on the VM
(base) xia@data-engineering-demo:~$ sudo apt-get update

(base) xia@data-engineering-demo:~$ sudo apt-get install docker.io

# 4. Clone the course repo
(base) xia@data-engineering-demo:~$ git clone https://github.com/DataTalksClub/data-engineering-zoomcamp.git

# 5. Run Docker without sudo / manage Docker as a non-root user
#    https://docs.docker.com/engine/install/linux-postinstall/
(base) xia@data-engineering-demo:~$ sudo groupadd docker
groupadd: group 'docker' already exists
(base) xia@data-engineering-demo:~$ sudo usermod -aG docker $USER
(base) xia@data-engineering-demo:~$ newgrp docker
(base) xia@data-engineering-demo:~$ docker run hello-world

# 6. SFTP the Google credentials to the VM
(base) ➜  sftp data-engineering-demo
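For reference, the ~/.ssh/config entry behind the data-engineering-demo alias used above could look like the following; the external IP and key path are placeholders for your own values:

# ~/.ssh/config (hypothetical values)
Host data-engineering-demo
    HostName <external-ip-of-your-vm>   # the VM's external IP from the GCP console
    User xia                            # the username your SSH key was generated for
    IdentityFile ~/.ssh/gcp             # path to the private key from step 1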


❤️ Please visit my home page for more content. I look forward to connecting with you via LinkedIn.