Data lineage: a pillar of data observability

Published: September 25, 2024

Big data environments keep getting bigger with the growth of distributed computing, IoT, microservices, and the ever-increasing number of ways to store data in the cloud. In this scenario, keeping track of all data movement without a data lineage process is almost impossible.

Think about your data systems, or the last project you worked on: could you tell which data a given column in a source database ends up generating on the dashboard at the end of a pipeline? In this article, we will look at tools and approaches that improve our ability to observe how data flows through the environment, making the whole organization more sustainable.

Introduction to Data Lineage

According to the book “Observability Engineering: Achieving Production Excellence”, the term “observability” was introduced by the engineer Rudolf E. Kálmán in 1960. Since then, it has come to mean different things depending on the community. For applications, we can easily find references describing three pillars:

  • Logs: application logs, infrastructure logs, server logs, etc.
  • Metrics: quality metrics, infrastructure health, alarms, etc.
  • Traces: HTTP requests, microservices, database calls, etc.


In Data Engineering, however, we can expand traces into two further topics: data lineage and metadata.

That is because traces may not provide the same level of insight into data flow that lineage tracking can, and they cannot deliver the metadata that tells us about the data itself without having to open and read it, something the sheer data volume would make impossible. That is where this method fits best.

What is Data Lineage?

Still within the scope of observability, in the article “Lectures on Controllability and Observability”, from Stanford University’s Department of Operations Research, Kálmán states that controllability relates to a system’s inputs, while observability relates to its outputs.

With this statement, he defines observability as how well we can measure or infer the internal states of a system from its outputs. This traditional definition is commonly used in mechanical and control engineering, where systems are designed with an expected final state.
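To make this definition concrete (a brief aside; the formalism below is standard linear-systems theory, added here for illustration rather than taken from the article), Kálmán’s criterion says the internal state can be reconstructed from the outputs exactly when the observability matrix has full rank:

```latex
% State-space system: internal state x, input u, output y.
\dot{x} = Ax + Bu, \qquad y = Cx
% Observable iff the observability matrix has full rank n:
\mathcal{O} = \begin{pmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{pmatrix},
\qquad \operatorname{rank}(\mathcal{O}) = n
```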

Data lineage makes this kind of observation possible: it is essentially the capability to keep track of data as it is processed through the organization’s environment. Typically, a source table in our dataset is cleaned, processed, enriched with other information, and so on. The final destination, though, is usually the same for all companies: answering business questions through tools such as dashboards, indicators, and reports. To represent all of this, we use a diagram or flowchart.
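As a minimal illustration (all table and dashboard names below are hypothetical), such a flowchart is just a directed graph, and answering “what did this source generate?” is a walk through it:

```python
# A minimal sketch of table-level lineage as a directed graph.
# All node names are hypothetical examples.
lineage = {
    "postgres.public.orders": ["s3://data-lake/raw/orders/"],
    "s3://data-lake/raw/orders/": ["redshift.analytics.fct_orders"],
    "redshift.analytics.fct_orders": ["dashboard.sales_overview"],
}

def downstream(node, graph):
    """Yield every asset derived, directly or indirectly, from `node`."""
    for child in graph.get(node, []):
        yield child
        yield from downstream(child, graph)

# Everything ultimately generated from the source table:
print(list(downstream("postgres.public.orders", lineage)))
```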

The problem data lineage came to solve is the difficulty of keeping track of data relationships at the table, or even column, level. It has become very common for organizations to turn their Data Lakes into Data Swamps and set back years of work; or, as the business grows, data quality degrades, and a company that was advancing toward becoming data-driven ends up with very expensive storage that cannot answer its needs.

How to apply a Data Lineage process: a practical scenario

Let’s picture the following scenario: 

In this architecture, we have a simple Data Lake: a PostgreSQL database as the source, AWS Glue jobs for ETL, AWS Step Functions for orchestration, Amazon Redshift as the Data Warehouse, an AWS Glue Crawler for data discovery, the Glue Data Catalog for cataloging, and Amazon Athena for ad hoc queries.

In the context of data lineage, what metadata do we want to collect? Starting with the foundation, the outputs of our system should be able to answer:

  • Where is the data coming from?
  • Which transformations were made on the data?
  • Where is the data going?

To answer these questions, let’s start by identifying the sources and destinations.

With our data flow identified, we need to understand how the data is treated at each step. To create your data lineage, you can follow the steps below (a small sketch of the result comes after the list):

  1. Identify the sources; in this case, a table in the PostgreSQL database.
  2. Understand the architecture to know where the data is going:
    a) In this case, to the Data Lake in an AWS S3 bucket;
    b) Then to the Data Warehouse in Amazon Redshift.
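A minimal sketch of what these findings could look like once registered; the service, table, and job names are assumptions based on the scenario above, not a real deployment:

```python
# Table-level lineage edges for the scenario: each edge records where the
# data comes from, where it goes, and which component moved it.
# All names are illustrative assumptions.
edges = [
    {
        "source": "postgres.sales.orders",
        "destination": "s3://data-lake/raw/orders/",
        "moved_by": "aws_glue_job:ingest_orders",
    },
    {
        "source": "s3://data-lake/raw/orders/",
        "destination": "redshift.analytics.orders",
        "moved_by": "aws_glue_job:load_orders_to_dw",
    },
]

# "Where is the data coming from?" and "Where is it going?" become
# queryable facts instead of tribal knowledge:
for edge in edges:
    print(f'{edge["source"]} -> {edge["destination"]} (via {edge["moved_by"]})')
```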

Now we have two important pieces of information:

  • Where is the data coming from? ✅
  • Which transformations were made on the data?
  • Where is the data going? ✅

 

Now, how can we illustrate or register the transformations applied to the data? There are many ways, but two are the most objective.

  • Register the process unit and the version of the code that processed the data:
    • Process unit
    • Code version
    • Date of processing

  • Register the column-level lineage to understand which data generated which:
    • Source column
    • Destination column

Process unit

An approach that includes the processing unit, the compute that runs the transformation, in the flowchart (a small sketch follows the pros and cons below).

Pros

  • Easy to discover where to fix a problem identified in the data lineage interface.
  • Aggregates information about the processing steps in the flowchart.

Cons

  • Need to go into the code to know which transformations were made.
  • Not all tools can get that information easily.
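As a minimal sketch of this approach (the field names and values are illustrative assumptions, not a standard schema), each run of a processing unit could emit a record like this:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProcessUnitLineage:
    """One lineage record per execution of a processing unit."""
    process_unit: str        # e.g. the Glue job that ran
    code_version: str        # e.g. a git commit hash
    inputs: list             # upstream tables/paths
    outputs: list            # downstream tables/paths
    processed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

record = ProcessUnitLineage(
    process_unit="aws_glue_job:enrich_orders",   # hypothetical job name
    code_version="9f2c1ab",                      # hypothetical commit hash
    inputs=["s3://data-lake/raw/orders/"],
    outputs=["s3://data-lake/curated/orders/"],
)
print(record)
```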

Column level lineage

An approach that shows column-level relationships on the chart to convey which transformations were applied to the data (a small sketch follows the pros and cons below).

Pros

  • Knowing the source data and the result in the destination table, we can easily infer the transformation that was applied.
  • Direct information about aggregations and renamed columns.
  • More tools can read this information automatically.

Cons

  • Does not show how the transformation was implemented.
  • With big aggregations, the flowchart can become complex.
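As a minimal sketch of this approach (column names and expressions are hypothetical), column-level lineage can be recorded as mappings from source columns to destination columns:

```python
# Each record maps source columns to the destination column they generate,
# optionally with the expression that produced it. All names are illustrative.
columns_lineage = [
    {
        "from_columns": ["postgres.sales.orders.amount",
                         "postgres.sales.orders.tax"],
        "to_column": "redshift.analytics.orders.total_amount",
        "expression": "amount + tax",
    },
    {
        "from_columns": ["postgres.sales.orders.created_at"],
        "to_column": "redshift.analytics.orders.order_date",
        "expression": "CAST(created_at AS DATE)",
    },
]

# Answering "which source data generated this column?":
for col in columns_lineage:
    print(f'{col["to_column"]} <- {col["from_columns"]} ({col["expression"]})')
```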

 

Now we have all the information we need.

  • Where is the data coming from? ✅
  • Which transformations were made on the data? ✅
  • Where is the data going? ✅

Data Lineage with OpenMetadata

Now that we know how data lineage works, its foundations, and what it tries to solve, how can we collect this kind of metadata in real life? There are still few tools that can perform this kind of operation, but one of them has particularly interesting features: OpenMetadata.

OpenMetadata is an open source project that can automatically collect information from many sources. With it, we can build our own data observability platform and integrate it with data sources, pipelines, ML tools, visualization tools, etc.

In the web interface, we have ready-made connectors to collect metadata from our sources; for some of them, such as Amazon Redshift, OpenMetadata can generate lineage automatically.
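Lineage can also be registered programmatically through OpenMetadata’s Lineage API. The snippet below is a rough sketch only: the PUT /v1/lineage endpoint and the edge payload follow the documented API shape, but the host, token, and entity IDs are placeholders, and field names should be checked against your version’s documentation:

```python
import requests

OM_HOST = "http://localhost:8585/api"              # assumed local instance
HEADERS = {"Authorization": "Bearer <jwt-token>"}  # placeholder token

# Adds a lineage edge between two entities already cataloged in
# OpenMetadata. The UUIDs are placeholders; look them up via the API first.
edge = {
    "edge": {
        "fromEntity": {"id": "<source-table-uuid>", "type": "table"},
        "toEntity": {"id": "<destination-table-uuid>", "type": "table"},
    }
}

response = requests.put(f"{OM_HOST}/v1/lineage", json=edge, headers=HEADERS)
response.raise_for_status()
```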

If we take a closer look, we will see that this tool uses the column-level approach to data lineage.

Lineage is one of many observability and governance features in OpenMetadata. There is a sandbox at https://open-metadata.org/ that you can try.

Start your evolution into a Data-Driven business

That is a foundational approach to data lineage and why it is a pillar of data observability. Even though it is still an underexplored field, we can see that with little effort we can obtain very important information about our data environment.

That effort can save a lot of time and money spent understanding our organization’s data tools. We can now easily keep track of how the environment is growing and, more importantly, get information faster to keep our business on a healthy path toward being data-driven.

This kind of organizational self-knowledge helps us make better decisions about our business, so let’s take action to observe our data better.

Want to know more about how we help our customers stay informed about their environment, improve the health of their data, and grow faster? Our team is ready to help with your challenges and to build a better future for your business.

References

  • Book: Majors, C., Fong-Jones, L., & Miranda, G. (2022). Observability Engineering: Achieving Production Excellence (1st ed.). O’Reilly Media, Inc.
  • Book: Gorelik, A. (2019). The Enterprise Big Data Lake. O’Reilly Media, Inc.
  • Article: Kalman, R. E. (1970). Lectures on Controllability and Observability. Stanford University, Department of Operations Research.
  • Tool: OpenMetadata: https://open-metadata.org/
  • Blog post (original, in Portuguese): Jonatas Delatorre, Data Engineering Leader at e-Core: https://medium.com/@jonatasdelatorre/linhagem-de-dados-um-pilar-da-observabilidade-419cc9e160e9
