Analytics Engineering Basics

Analytics engineers sit between data engineers and data analysts. They bring good software engineering practices to the work of data analysts and data scientists.

Some key skills of analytics engineers are:

  • SQL mastery, data transformation, data warehousing, data modelling
  • Data orchestration, business intelligence tools, version control, communication
  • Programming, CI/CD, data engineering principles, security & governance

Tools for analytics engineers:

  • Data loading
  • Data storing -> cloud data warehouses like Snowflake, BigQuery, Redshift
  • Data modelling -> dbt or Dataform
  • Data presentation -> Data Studio, Looker, Tableau

Data Modelling Concepts

A data model:

  • Is the foundation for building data warehouse (DWH) and BI solutions
  • Primarily acts as a blueprint
  • Helps build conceptual, logical and physical models and define relationships

ETL vs ELT


| | ETL | ELT |
| --- | --- | --- |
| Volume | Small/medium data volumes | Large data volumes |
| Analysis time | Fast analysis | Slower than ETL |
| Data type | Prefers structured data | Structured & unstructured data |
| Cost | High cost | Low cost |
| Maintenance | High maintenance due to on-premise solutions | Low maintenance due to cloud solutions |
| Availability | Only the data required for reporting/analysis is loaded | Everything can be accessed from the data lake |

Data warehouse design methodologies

  • Inmon
  • Kimball
  • Data Vault
  • OBT (One Big Table)

Kimball’s Dimensional Modelling

The Kimball methodology was created by Ralph Kimball. It starts by identifying key business processes and requirements (bottom-up). It prioritizes user understandability and query performance over non-redundant data (3NF). The dimensional model uses a denormalised star-schema design, built on fact and dimension tables.

Fact tables vs dimension tables

| Fact tables | Dimension tables |
| --- | --- |
| Measurements, metrics or facts | Provide context to a business process |
| Correspond to a business process | Correspond to a business entity |
| “Verbs” | “Nouns” |

In a star schema, the fact table is surrounded by dimension tables.

Example star schema

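As a sketch, a query against a star schema joins the central fact table to the dimension tables around it (the table and column names here are illustrative):

```sql
-- Star-schema query: one fact table joined to two dimension tables
select
    d.month,
    z.zone_name,
    count(*) as trips,
    sum(f.total_amount) as revenue
from fact_trips f
join dim_date d
    on f.pickup_date_key = d.date_key
join dim_zones z
    on f.pickup_zone_key = z.zone_key
group by 1, 2
```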

Kimball architecture

The Kimball architecture consists of three areas:

  • Staging area: contains the raw data; not meant to be exposed to everyone
  • Processing area: where the data modelling happens
  • Presentation area: the final presentation of the data, exposed to business users

Introduction to dbt

dbt stands for data build tool. dbt is a transformation tool that allows anyone who knows SQL to deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation.

dbt Cloud vs dbt Core

| dbt Cloud | dbt Core |
| --- | --- |
| Web-based IDE to develop, run and test a dbt project | Builds and runs a dbt project (.sql and .yml files) |
| Jobs orchestration | Includes SQL compilation logic, macros and database adapters |
| Logging and alerting | Includes a CLI interface to run dbt commands locally |
| Integrated documentation | Open source and free to use |
| Free for individuals (one developer seat) | |

Start a new dbt project in dbt Cloud

Once you have signed in, the first step is to create a project by following these steps:

  1. Give your project a name and choose a warehouse; a development environment is created automatically.

  2. For the BigQuery connection, upload the service account key as a JSON file.

  3. Set the data location to the same location as your BigQuery dataset.

  4. Test the connection.

  5. Link the dbt project to a GitHub repository:

  • Create a GitHub repository to deploy the project.
  • Go to “Profile” -> Linked accounts -> choose GitHub -> select the repository

After all these steps, your project is ready.

In the development environment, initialise the project, then run dbt run to execute the example models dbt provides.

Jinja and macros

In dbt, you can combine SQL with Jinja, a templating language, which gives you much more flexibility when writing your SQL code: you can use if statements, for loops, and variables.
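For instance, a minimal sketch of Jinja control flow inside a model (the column and model names are illustrative):

```sql
-- A for loop that generates one aggregate per listed column
select
    {% for col in ['passenger_count', 'trip_distance', 'total_amount'] %}
    max({{ col }}) as max_{{ col }}{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_green_tripdata') }}
```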

Macros in Jinja are similar to functions in other programming languages and can be reused across different models. A macro returns code rather than a value. You define macros in .sql files under the macros folder. For example:

 {#
    This macro returns the description of the payment_type 
#}

{% macro get_payment_type_description(payment_type) -%}

    case {{ payment_type }}
        when 1 then 'Credit card'
        when 2 then 'Cash'
        when 3 then 'No charge'
        when 4 then 'Dispute'
        when 5 then 'Unknown'
        when 6 then 'Voided trip'
    end

{%- endmacro %}

Then we can call the macro in our dbt models.
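For example, a minimal sketch of a model that calls it (the source name is an assumption for illustration):

```sql
-- Calls the macro; it compiles into the case statement defined above
select
    payment_type,
    {{ get_payment_type_description('payment_type') }} as payment_type_description
from {{ source('staging', 'green_tripdata') }}
```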

Packages

Like libraries in other programming languages, we can use packages such as dbt_utils instead of writing every macro ourselves.

Packages are declared in the packages.yml file and installed by running dbt deps.
An example packages.yml file:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.8.0

Running dbt deps creates a dbt_packages folder containing the installed packages.
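For example, a sketch using the surrogate_key macro from dbt_utils (the column and source names are illustrative; later dbt_utils releases renamed this macro generate_surrogate_key):

```sql
-- Builds a hashed surrogate key from two columns
select
    {{ dbt_utils.surrogate_key(['vendorid', 'lpep_pickup_datetime']) }} as tripid,
    *
from {{ source('staging', 'green_tripdata') }}
```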

Variables

Variables are useful for defining values that can be used across the project. You can define variables either in the dbt_project.yml file or on the command line.

Define variables in the dbt_project.yml file:

vars:
  payment_type_values: [1, 2, 3, 4, 5, 6]

Reference a variable in a model with the var() function, which takes the variable name and an optional default:

{{ var('is_test_run', default=true) }}
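For example, a model could filter on the project variable defined above (a sketch; the model name is illustrative):

```sql
-- Keeps only rows whose payment_type appears in the project variable
select *
from {{ ref('stg_green_tripdata') }}
where payment_type in ({{ var('payment_type_values') | join(', ') }})
```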

On the command line, pass or override variables with the --vars flag:

dbt run --select stg_green_tripdata --vars '{is_test_run: false}'
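A common pattern is to pair this with a model that limits rows during development, so a production run passes is_test_run: false to process the full data:

```sql
-- Limits the data while developing; disabled when is_test_run is false
select *
from {{ source('staging', 'green_tripdata') }}
{% if var('is_test_run', default=true) %}
limit 100
{% endif %}
```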

Tests and Documentation of dbt models

Tests

Tests are defined on a column in a .yml file. dbt provides built-in tests to check whether column values are unique, not null, among a set of accepted values, or a foreign key to another table (relationships).

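A minimal sketch of such a .yml file using the four built-in tests (the model and column names are illustrative):

```yaml
version: 2

models:
  - name: stg_green_tripdata
    columns:
      - name: tripid
        tests:
          - unique
          - not_null
      - name: payment_type
        tests:
          - accepted_values:
              values: [1, 2, 3, 4, 5, 6]
              quote: false
      - name: pickup_locationid
        tests:
          - relationships:
              to: ref('taxi_zone_lookup')
              field: locationid
```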

Documentation

dbt can generate documentation for your project and render it as a website, with information about both your project and your data warehouse. The documentation can be hosted in dbt Cloud.

Deployment

Deployment means running dbt models in production by setting up a system that executes scheduled dbt jobs automatically.

Step 1: Create a production environment and choose the deployment type.

Step 2: Create a job. You can set up triggers with cron or a schedule, and add the commands the job should run, such as dbt build.

Step 3: View the job's detailed run history.

Visualization with Google Data Studio

Google Data Studio is a free, easy-to-use tool for visualising your data. The first step is to import data from your BigQuery tables into Data Studio. I will not cover how to build a dashboard here.


❤️ Please visit my home page for more content. I look forward to connecting with you via LinkedIn.
