SQL Data Pipelines

A practical guide to setting up a SQL data pipeline with dbt and GitLab CI

Posted by Allan Situma on July 30, 2020 · 7 mins read
"Every day, three times per second, we produce the equivalent of the amount of data that the Library of Congress has in its entire print collection, right? But most of it is like cat videos on YouTube or 13-year-olds exchanging text messages about the next Twilight movie." – Nate Silver

Data pipelines

For data-driven organizations, one of the most critical operations is the efficient flow of data from source systems to target systems. Useful analysis can only begin once data is available to its end users in a format that is easy to use and access.

Moving data from one place to another, especially in large volumes, can be a daunting task: data can be corrupted along the way, it can arrive slowly because of latency bottlenecks, or data sources and destinations may conflict with one another.

With these problems in mind, there has been a need for software that eliminates the manual steps from the data flow process by automating the extraction, loading, transformation, testing, and documentation of data for further analysis. Together, this software makes up a data pipeline that readies data for its end users.

Data pipelines can handle the following types of processing.

  • Stream processing - processing data almost instantaneously as it streams from source to target destination, e.g. IoT data
  • Batch processing - processing large volumes of data at once, at specified intervals of time

This post focuses on a theoretical introduction to setting up a simple batch-processing data pipeline with SQL. The follow-up posts listed below will walk through a practical example of setting this up using the tools introduced here.

  • Part 2: Setting up dbt (data build tool), PostgreSQL and GitLab CI
  • Part 3: Data modelling, documentation and testing
  • Part 4: Version control and continuous integration
  • Part 5: Visualization with Tableau

The problem

Take the example of a company that depends primarily on SQL for data analysis. Since its early days, it has used an open-source BI tool that lets analysts write ad hoc queries and visualize results quickly. As the company scales, the team grows with it, and soon they may run into the following problems:
  • Analysts have to write new queries from scratch for each problem
  • There is no testing of the SQL queries
  • It is hard to tell which queries are used where
  • A single change can cause dependent queries to fail
  • Many metrics and dimensions are left undefined
  • The SQL queries have to be updated manually

The solution

These are typical growing pains for most startups. The initial team usually consists of just one person who builds all the SQL models and visualizations. As the team grows, it becomes hard to know which models changed, who changed what, or where a change happened. This can cause significant issues with the correctness and reliability of data.

Building a SQL pipeline can sort out some of these growing pains. It allows the team to build reusable datasets and focus on generating insights instead of copy-pasting old SQL and building the same queries over and over again.

A simple SQL pipeline

There are many tools on the market; in this blog we will look at how dbt (data build tool), in combination with GitLab CI, can be used to build a simple SQL pipeline that solves the above problems.

dbt (data build tool)

dbt, or data build tool, is an open-source command-line tool that enables analysts to do the data transformation work that happens after data is loaded into the data warehouse.

It sits at the T end of the 'ELT' stack, i.e. the transformation that is carried out within the database. dbt takes code, compiles it to SQL, and runs it against the data warehouse. The SQL code is stored in .sql files called models; each model is a single SELECT statement that defines the resulting dataset after compilation.
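As a quick illustration, a minimal dbt model might look like the sketch below. The model and column names (`customer_orders`, `stg_orders`, and so on) are made up for illustration; `ref()` is the dbt function that references another model and records the dependency between the two.

```sql
-- models/customer_orders.sql (hypothetical model name)
-- A dbt model is just a SELECT statement. When you execute `dbt run`,
-- dbt compiles the Jinja, resolves ref() to the actual relation in the
-- warehouse, and materializes the result as a view or table.
select
    customer_id,
    count(order_id)  as order_count,
    sum(order_total) as lifetime_value
from {{ ref('stg_orders') }}
group by customer_id
```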

By applying software engineering principles, dbt enables analysts to do the following:

Documentation

dbt has a built-in feature that allows analysts to document details about their models. dbt auto-generates this documentation (via `dbt docs generate`) as a static HTML site that can be viewed by users of the data. The documentation can be hosted on S3, GitHub, or GitLab.

Version control SQL models

Since the SQL statements are stored in files, version control can be applied to manage changes to them. A CI/CD process can be added on top of this so that the datasets in the database are continuously updated whenever a model changes.

Automated Testing

dbt allows one to make assertions about the outcomes of the SQL statements written inside dbt. There are schema tests, which test properties of the columns in a table, and custom tests, which let you test the resulting dataset produced by a model.
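For example, a custom test is itself a SELECT statement that returns the rows violating an assertion; when you execute `dbt test`, the test passes only if the query returns no rows. A minimal sketch, reusing the hypothetical `customer_orders` model from above:

```sql
-- tests/assert_no_negative_lifetime_value.sql (hypothetical test name)
-- Any rows returned by this query are treated as failures,
-- so we select exactly the rows that break the assertion.
select *
from {{ ref('customer_orders') }}
where lifetime_value < 0
```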

Package Management

dbt ships with a package manager that allows analysts to write and share reusable dbt code. This comes in handy especially for common commercial data sources, e.g. Stripe, where analysts can install an existing package and have working models without deep dbt knowledge.
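To sketch what this looks like in practice: once a package such as dbt_utils is declared in the project's packages.yml and pulled in with `dbt deps`, its macros can be called directly inside a model. The model below is hypothetical, and the exact macro signature can vary between dbt_utils versions.

```sql
-- models/orders_keyed.sql (hypothetical model using the dbt_utils package)
-- surrogate_key() is a dbt_utils macro that hashes the given columns
-- into a single key column; the column list here is an assumption.
select
    {{ dbt_utils.surrogate_key(['customer_id', 'order_date']) }} as order_key,
    customer_id,
    order_date
from {{ ref('stg_orders') }}
```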

GitLab CI

GitLab markets itself as a complete DevOps platform delivered as a single application. With GitLab, one can plan projects, manage source code, host static websites, run CI/CD pipelines, and more.

For the dbt use case, we are most interested in source code management, GitLab's ability to host a static website, and the CI/CD pipeline feature. These will enable automation of the dbt compilation as well as hosting of the dbt documentation.

The pipeline setup

We have seen how the capabilities of dbt and GitLab can solve our problem: dbt will be used for creating, documenting, and managing data transformation models, while GitLab will be used for version control, for automating this process, and for hosting the dbt documentation.

The follow-up posts listed below will walk through a practical example of setting all of this up:

  • Part 2: Setting up dbt (data build tool), PostgreSQL and GitLab CI
  • Part 3: Data modelling, documentation and testing
  • Part 4: Version control and continuous integration
  • Part 5: Visualization with Tableau
