What is Data Warehouse

A data warehouse is a type of data management system that is designed to enable and support business intelligence activities. It aggregates data from different sources into a single, central data store. It is Online Analytical Processing(OLAP) system. Data is structured in form of dimensional model.
The most popular architecture is three-tier data warehouse architecture, which includes bottom tier, middle tier and top tier.

three tier model

OLTP vs OLAP

OLTP(online transaction processing) OLAP(online analytical processing)
Storage type Row store Column Store
Function Operational BI, Decision making, Analysis
Data source Current data, Operational Data, Day to Day OLTP, third-party systems, IOT
Query Types Simple, Multiple joins, return few records Complex, less joins, return large records
Speed Very fast transaction type slow to medium depends on complexity of query
Database Design Normalized (Many tables), relational De-normalized (few tables), star/snowflake schema
Users Operational users Analysts/data scientist/data engineers

Introduction of BigQuery

When a company starts its journey of data warehouse or data analytics, a huge chunk of time is spent in creating a data warehouse, maintaining it, using bigquery can get rid of the pain. BigQuery is

  • Serverless data warehouse
  • Software as well as infrastructure including scalability and high-availability
  • Built-in features like
    • Machine learning
    • Geospatial analysis
    • Business intelligence
  • BigQuery maximizes flexibility by separating the compute engine that analyzes your data from your storage

bq intro

Bigquery generally caches data, you can disable it in query settings ->cache preference. sreenshot bq

Bigquery provides a lot of public data. We can search it and explore the data. public data

BigQuery Pricing

  • On demand pricing E.G. $5 per TB. The first 1TB per month is free.
  • Flat rate pricing Based on number of pre requested slots. Monthly cost $2000, 100 number of slots

Create external tables using query

-- Create external table referring to gcs path
CREATE OR REPLACE EXTERNAL TABLE `data-engineering-demo-375721.dezoomcamp.external_fvh_rides`
OPTIONS (
 format = 'CSV',
 uris = ['gs://dtc_data_lake_data-engineering-demo-375721/data-engineering-demo/data/fhv/fhv_tripdata_2019-*.csv.gz']
);

It takes only a few seconds to create the external table. External table size 0B, long-term storage size 0B, But it can’t detect the rows or the table size. Because the table is not in bigquery. It is in an external system such as google cloud storage.

external table

Partitioning and Clustering

Partition is a huge advantage so that Big Query doesn’t need to run over the whole table. For example, when most of the queries are based on date und use it as a filter. You can partition a table based on Integer column, or time-unit column, or ingestion time.

partition table

Clustering tables means sort by clustered columns. In this way it can also improve query performance and reduce query costs.

clustering table

-- Create a partitioned table from external table
CREATE OR REPLACE TABLE `data-engineering-demo-375721.dezoomcamp.fvh_rides_partitioned`
PARTITION BY
 DATE(pickup_datetime) AS
SELECT * FROM `data-engineering-demo-375721.dezoomcamp.external_fvh_rides`;


-- Creating a partition and cluster table
CREATE OR REPLACE TABLE `data-engineering-demo-375721.dezoomcamp.fvh_rides_partitoned_clustered`
PARTITION BY DATE(pickup_datetime)
CLUSTER BY affiliated_base_number AS
SELECT * FROM `data-engineering-demo-375721.dezoomcamp.external_fvh_rides`;

BigQuery best practice

Cost reduction
  • Avoid SELECT *
  • Price your queries before running them
  • Use clustered or partitioned tables
  • Use streaming inserts with caution
  • Materialize query results in stages
Query performance
  • Filter on partitioned columns
  • Denormalizing data
  • Use nested or repeated columns
  • Use external data sources appropriately
  • Don’t use it, in case u want a high query performance
  • Reduce data before using a JOIN
  • Do not treat WITH clauses as prepared statements
  • Avoid oversharding tables
  • Avoid JavaScript user-defined functions
  • Use approximate aggregation functions (HyperLogLog++)
  • Order Last, for query operations to maximize performance
  • Optimize your join patterns
    • As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size.

Machine Learning in BigQuery

Data Analysts and managers can use BigQuery to build ML models and it requires no knowledge of python or Java. It’s not necessary to export data into a different system.

ML in BigQuery is free for the 10GB per month of data storage and the 1TB per month of queries processed.

Steps to build a ML steps:

  1. Select relevant columns (features)
  2. Create a regression or classification model you selected
  3. Evaluate the model
  4. Predict the results with the model
  5. If necessary, fine tuning the hyperparameters.
-- CREATE A ML TABLE WITH APPROPRIATE TYPE
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_ml` (
`passenger_count` INTEGER,
`trip_distance` FLOAT64,
`PULocationID` STRING,
`DOLocationID` STRING,
`payment_type` STRING,
`fare_amount` FLOAT64,
`tolls_amount` FLOAT64,
`tip_amount` FLOAT64
) AS (
SELECT passenger_count, trip_distance, cast(PULocationID AS STRING), CAST(DOLocationID AS STRING),
CAST(payment_type AS STRING), fare_amount, tolls_amount, tip_amount
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitoned` WHERE fare_amount != 0
);


-- CREATE MODEL WITH DEFAULT SETTING
CREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_model`
OPTIONS
(model_type='linear_reg',
input_label_cols=['tip_amount'],
DATA_SPLIT_METHOD='AUTO_SPLIT') AS
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL;


-- CHECK FEATURES
SELECT * FROM ML.FEATURE_INFO(MODEL `taxi-rides-ny.nytaxi.tip_model`);


-- EVALUATE THE MODEL
SELECT
*
FROM
ML.EVALUATE(MODEL `taxi-rides-ny.nytaxi.tip_model`,
(
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL
));


-- PREDICT THE MODEL
SELECT
*
FROM
ML.PREDICT(MODEL `taxi-rides-ny.nytaxi.tip_model`,
(
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL
));


-- PREDICT AND EXPLAIN
SELECT
*
FROM
ML.EXPLAIN_PREDICT(MODEL `taxi-rides-ny.nytaxi.tip_model`,
(
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL
), STRUCT(3 as top_k_features));


-- HYPER PARAM TUNNING
CREATE OR REPLACE MODEL `taxi-rides-ny.nytaxi.tip_hyperparam_model`
OPTIONS
(model_type='linear_reg',
input_label_cols=['tip_amount'],
DATA_SPLIT_METHOD='AUTO_SPLIT',
num_trials=5,
max_parallel_trials=2,
l1_reg=hparam_range(0, 20),
l2_reg=hparam_candidates([0, 0.1, 1, 10])) AS
SELECT
*
FROM
`taxi-rides-ny.nytaxi.yellow_tripdata_ml`
WHERE
tip_amount IS NOT NULL;

Reference:

❤️ Please visit my home page for more contents. I look forward to connecting with you via LinkedIn.

❤️ You might like Week 2 data engineering zoomcamp notes: Prefect.