Hey! I’m extremely glad that you’re here. If you like what I’m doing or want to give constructive feedback, I would love to hear that! Please don’t hesitate to reach out to me via LinkedIn.
Have a great read!
Upskilling or learning a new tool?
Have you ever wondered what the next step in your data/software engineering career could be? Why not skip yet another shiny new tool and instead polish the skills you already have?
SQL is a programming language that virtually every software engineer, data engineer, data scientist, or machine learning engineer has come across. Its entry barrier is pretty low, the syntax is largely self-explanatory, and you can write complex queries without using any sophisticated constructs. The result might not be the most elegant query ever, but at least it would work, right?
Apart from staples like SELECT, FROM, WHERE, ORDER BY, and GROUP BY, there is a plethora of commands and concepts that can streamline plenty of your day-to-day work. Compared to the SQL basics, some of them are much more challenging to understand, but it’s worth the effort. Chances are that your queries can be rewritten using these ideas so they meet two major goals:
they are better optimized and compute much faster,
they are more readable for you and other users.
Bear in mind that business analysts and other (even less technical) stakeholders of the queries’ outcomes may look into the queries themselves. Therefore, the second goal shouldn’t be neglected.
One last note before diving into the concepts. This blog is primarily dedicated to Data Engineers and related fields, so please remember that the ideas below lean in that direction. Once you are familiar with them, you can dive deeper into more transactional database topics, such as cursors, garbage collection, triggers, or transaction isolation levels and phantom reads.
Let’s jump into five advanced SQL concepts that every Data Engineer should know!
Window functions
If you wonder what single step could take your SQL skills to the next level, window functions are the answer. Even though they may resemble data grouping, they are a distinct and powerful feature.
A window function calculates a specified aggregation (e.g., SUM or AVG) over a set of rows related to the current row. This relation has to be defined and can be thought of as a grouping. The key difference is that applying a window function does not collapse each group into a single row (as GROUP BY would). All rows stay intact, and a new column is added with the corresponding result of the window function computation.
One could interpret window functions as joining grouping results with the original table rows.
The window function syntax isn’t that hard to learn, but there are many caveats and components, so see the links at the end of the article for a thorough explanation. When learning, remember to review all parts of this feature:
OVER
PARTITION BY
ORDER BY
ROWS BETWEEN
QUALIFY
You can find a simple use case below. To see each customer’s total expenses next to each individual transaction, you could execute something like this:
SELECT
customer_id
, amount
, SUM(amount) OVER (PARTITION BY customer_id) AS total_amount
FROM transactions
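As a hedged sketch of how this behaves, the query above can be run locally with SQLite (version 3.25+ supports window functions) through Python’s built-in sqlite3 module. The table name and columns match the article’s example; the sample data and the extra running-total column (adding ORDER BY and a ROWS BETWEEN frame inside OVER) are illustrative assumptions:

```python
import sqlite3

# In-memory SQLite database; SQLite 3.25+ supports window functions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (customer_id INTEGER, amount REAL);
    INSERT INTO transactions VALUES (1, 10.0), (1, 20.0), (2, 5.0);
""")

# total_amount: the article's per-customer sum, repeated on every row.
# running_total: adding ORDER BY turns the aggregate into a running sum,
# where each row sees only rows up to itself within its partition.
rows = conn.execute("""
    SELECT
        customer_id,
        amount,
        SUM(amount) OVER (PARTITION BY customer_id) AS total_amount,
        SUM(amount) OVER (
            PARTITION BY customer_id
            ORDER BY amount
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS running_total
    FROM transactions
""").fetchall()

for row in rows:
    print(row)
```

Note how every input row survives: three transactions in, three rows out, each annotated with its group’s total.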
UNNESTing
Using SQL does not always mean dealing only with fully structured data. Various engines support semi-structured and unstructured data as well. Let’s focus on the former here.
Generally, rows contain only values of atomic types. However, there might be cases where some values should be linked together, e.g., the items of a shopping basket. Thankfully, some engines allow items to be aggregated into arrays.
That's where the UNNEST clause comes into play. It flattens the specified array and returns a table with each array element as a separate row. Why would we do such an operation when we already have the items aggregated in an array? It is the only option if we want to process each element independently. The same applies when we want to process the arrays of a specific group, e.g., a particular client or geographic location.
Although the concept is natural to understand, the syntax can sometimes be tricky, especially when other structures, like STRUCT objects, are involved. Nevertheless, it is well worth being familiar with this expression. Below you can find an example of unnesting an array field in BigQuery:
SELECT
event_name
, event_timestamp
, user_id
, event_param
FROM `events_table`,
UNNEST(event_params) AS event_param
WHERE event_name = "level_complete"
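BigQuery’s UNNEST can’t be run locally, but SQLite offers an analogous table-valued function, json_each, which flattens a JSON array into one row per element. The sketch below mimics the BigQuery query’s shape under that assumption (table and data are invented, and it assumes a SQLite build with the JSON1 functions, which Python’s bundled SQLite normally includes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_name TEXT, event_params TEXT);
    -- The array is stored as a JSON string; json_each will flatten it.
    INSERT INTO events VALUES
        ('level_complete', '["level_5", "hard_mode"]'),
        ('session_start',  '["organic"]');
""")

# json_each plays the role of UNNEST here: the implicit cross join
# produces one output row per array element of the matching event.
rows = conn.execute("""
    SELECT e.event_name, p.value AS event_param
    FROM events AS e, json_each(e.event_params) AS p
    WHERE e.event_name = 'level_complete'
""").fetchall()

print(rows)
```

The comma between the table and the flattening function is the same correlated-join pattern the BigQuery example uses.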
CTEs and recursive CTEs
Common Table Expressions (CTEs) are crucial if you want to write readable and optimized code. They allow you to extract your subqueries into named temporary result sets and treat them as data sources in the FROM clause.
The syntax is pretty simple:
WITH cte_name1 AS (
…
), cte_name2 AS (
SELECT … FROM cte_name …
)
SELECT * FROM cte_name2 …
The CTEs list can go on, simplifying your complex query even further.
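To make the skeleton concrete, here is a hedged sketch run in SQLite via Python, with an invented transactions table: the first CTE aggregates per customer, and the second reads the first like a regular table and filters it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (customer_id INTEGER, amount REAL);
    INSERT INTO transactions VALUES (1, 10.0), (1, 20.0), (2, 5.0);
""")

# customer_totals aggregates; big_spenders filters the previous CTE.
# Each step names an intermediate result instead of nesting subqueries.
rows = conn.execute("""
    WITH customer_totals AS (
        SELECT customer_id, SUM(amount) AS total
        FROM transactions
        GROUP BY customer_id
    ), big_spenders AS (
        SELECT customer_id, total
        FROM customer_totals
        WHERE total > 10
    )
    SELECT * FROM big_spenders
""").fetchall()

print(rows)
```

The same logic written as nested subqueries would be harder to scan top-to-bottom; the CTE chain reads like a pipeline.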
Common Table Expressions are widespread and well known to many developers. What is omitted in the vast majority of cases, though, is that it is also feasible to define recursive CTEs. It is not a feature you would use daily, but when it is necessary, it can save a lot of headaches.
The only distinctions between regular and recursive CTEs are the RECURSIVE keyword at the beginning of the definition and the stop condition to avoid an infinite recursive loop. The latter is solved by filtering on a recursion-level attribute.
Below you can find a sample recursive CTE in action (note: DATENAME is SQL Server syntax, and SQL Server actually omits the RECURSIVE keyword, while engines like PostgreSQL require it). You should also see the links at the end of the article for excellent sources to dive deeper into this topic:
WITH RECURSIVE cte_weekdays AS (
SELECT
0 AS day_num,
DATENAME(DW, 0) AS weekday
UNION ALL
SELECT
day_num + 1,
DATENAME(DW, day_num + 1)
FROM
cte_weekdays
WHERE day_num < 6
)
SELECT weekday
FROM cte_weekdays;
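A runnable equivalent, as a sketch: SQLite also supports WITH RECURSIVE but has no DATENAME, so the version below derives the day-of-week number with strftime instead. The recursion shape is the same, and the WHERE day_num < 6 stop condition caps the output at seven rows (the anchor date 2022-10-10, a Monday, is an arbitrary choice):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Anchor member: day_num = 0. Recursive member: day_num + 1, reading
# from the CTE itself. The WHERE clause is the stop condition.
rows = conn.execute("""
    WITH RECURSIVE cte_weekdays AS (
        SELECT 0 AS day_num
        UNION ALL
        SELECT day_num + 1
        FROM cte_weekdays
        WHERE day_num < 6
    )
    SELECT day_num,
           strftime('%w', '2022-10-10', day_num || ' days') AS dow
    FROM cte_weekdays
""").fetchall()

print(rows)
```

Without the stop condition the recursive member would keep feeding rows back into the CTE forever, which is exactly the infinite loop the article warns about.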
Time travel
It is fair to say that querying a table returns some subset of its current state. That is an oversimplification, as transactions and phantom reads make this topic messier than it looks. Nevertheless, let’s assume that you don’t perform any transaction locks and that querying a table returns the present state of the data.
What if you want to see the table’s content from yesterday or a week ago? What if the production table was unexpectedly dropped?
Going back in time, either to a specified timestamp or by a determined offset, helps solve this kind of problem. Traditionally, those issues were addressed by taking periodic snapshots of the data. That isn’t the finest approach, though. The biggest caveats are that it duplicates big chunks of data and complicates querying past table states dynamically.
Modern data platforms, like Snowflake or GCP BigQuery, tackle this challenge differently. By storing each row’s history (within a defined retention period), they make it possible to fetch its state at a given point in time. Expressions like AT, BEFORE, or UNDROP make time travel doable. See the Snowflake examples below:
SELECT *
FROM prod_schema.customers
AT(TIMESTAMP => 'Fri, 01 May 2015 16:20:00 -0700'::timestamp_tz);
---
UNDROP TABLE prod_schema.customers;
Indexing, partitioning, sharding
Last but not least, let’s dive into a concept not directly associated with everyday coding. Advanced data processing means being fluent in numerous tools, like Snowflake or dbt, but also being aware of various optimization techniques.
Query simplification is at the top of the list, but data storage optimization should be right behind it. Multiple methods have been developed over time, and picking the best combination is a remarkably challenging task (closely tied to the available hardware). Below you can find three fundamental techniques that should be considered first.
Indexing
Indexing means creating objects called indexes. They originate from the oldest relational databases. Behind the curtains, they are data structures (in most scenarios, B-trees) that allow for more efficient data searching, insertion, and deletion.
Therefore, if certain columns tend to be used as filtering keys, it is a good idea to include them as index criteria. It is a double-edged sword, though: too many indexes on a table can slow down DML operations, since every write must also update the indexes. Use them wisely!
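You can watch a query planner switch strategies once an index appears. The sketch below (invented table and index names, SQLite via Python) uses EXPLAIN QUERY PLAN before and after indexing a filtering column; the exact wording of the plan text varies between SQLite versions, so treat the printed strings as illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, country TEXT);
    INSERT INTO customers VALUES (1, 'US'), (2, 'PL'), (3, 'US');
""")

query = "SELECT * FROM customers WHERE country = 'US'"

# Without an index the engine must scan the whole table.
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Index the frequent filtering key, as suggested above.
conn.execute("CREATE INDEX idx_customers_country ON customers(country)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Last column of each plan row is the human-readable strategy,
# e.g. 'SCAN customers' vs. 'SEARCH customers USING INDEX ...'.
print(before[0][-1])
print(after[0][-1])
```

The same experiment on a table with millions of rows is where the full-scan vs. index-search difference turns into real wall-clock time.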
Partitioning
Partitioning is a must when discussing distributed data storage. Along with sharding, it is a slightly different flavor of storage optimization. It means spreading the data into different locations (directories) on the same machine. As you may notice, it doesn’t have to be applied only to tabular data; files of any format can be placed into separate folders, depending on the grouping.
For instance, the grouping could be a geographical location (country=US). Moreover, who said it has to be only one level of hierarchy? A common practice is to partition data by date, dividing it into year, month, and day directories (year=2022/month=10/day=14). Because related data ends up next to each other in separate folders, this enables powerful filtering (partition pruning) and efficient processing.
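The directory layout above can be sketched in a few lines. This hypothetical example writes files into Hive-style year=/month=/day= folders and then "prunes" by reading only one subtree instead of the whole dataset (file names and contents are made up):

```python
from pathlib import Path
import tempfile

# Hypothetical sketch: lay records out in Hive-style partition
# directories, mirroring year=2022/month=10/day=14 from the article.
root = Path(tempfile.mkdtemp())

records = [
    ("2022-10-14", "a.csv"),
    ("2022-10-15", "b.csv"),
    ("2021-01-01", "c.csv"),
]
for date, name in records:
    year, month, day = date.split("-")
    part_dir = root / f"year={year}" / f"month={month}" / f"day={day}"
    part_dir.mkdir(parents=True, exist_ok=True)
    (part_dir / name).write_text("dummy data")

# Partition pruning: to answer a "year 2022 only" query, touch just
# the year=2022 subtree instead of scanning every file under root.
pruned = sorted(p.name for p in (root / "year=2022").rglob("*.csv"))
print(pruned)
```

Engines like Spark, Hive, or BigQuery do essentially this at scale: the partition values live in the path, so irrelevant directories are never opened.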
Sharding
Sharding is another data distribution technique, but in contrast to partitioning, it means spreading data across different machines, e.g., the nodes of a cluster.
Unfortunately, the number of machines is always limited, as is each machine’s capacity. Therefore, when talking about distributed data systems, sharding and partitioning are often discussed together to find the best solution within the available budget and hardware.
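One common way to decide which machine holds which row is hash-based sharding. The sketch below is a simplified, hypothetical routing function (shard count and key names are invented): a stable hash of the key, modulo the number of shards, so the same key always lands on the same machine:

```python
import hashlib

# Hypothetical sketch: route each key to one of NUM_SHARDS machines.
NUM_SHARDS = 4

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    # A stable cryptographic hash (unlike Python's salted hash()),
    # so routing survives process restarts and different machines.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Which shards received traffic from 1000 customer keys:
shards = {shard_for(f"customer_{i}") for i in range(1000)}
print(sorted(shards))
```

Real systems often refine this with consistent hashing so that adding a machine moves only a fraction of the keys, but the modulo version shows the core idea.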
Outro
In this article, we discussed five advanced SQL topics that developers should consider researching. It is only a few meters below the surface, while the iceberg goes much deeper. Learning the above concepts is only the start of the journey; there is much more to discover. It may seem daunting at the beginning, but it will streamline your SQL coding a lot!
Not enough? Explore for more!
Window Function in SQLite, incl. great diagrams: https://www.sqlite.org/windowfunctions.html
Arrays in GCP BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
UNNESTing: https://count.co/sql-resources/bigquery-standard-sql/unnest
Inverted indexes - another use case for later UNNEST usage: https://www.geeksforgeeks.org/inverted-index/
Recursive CTEs: https://learnsql.com/blog/sql-recursive-cte/
Recursive CTEs, different syntax: https://www.sqlservertutorial.net/sql-server-basics/sql-server-recursive-cte/
Time travel in Snowflake: https://docs.snowflake.com/en/user-guide/data-time-travel.html
Introduction to B-Trees: https://www.geeksforgeeks.org/introduction-of-b-tree-2/