Data fundamentals on AWS: Data Pipelines

Carlos Barroso
Sep 14, 2021

Updated: Jul 11, 2024

As our world becomes more and more data-oriented, we need good conceptual frameworks for communicating our ideas and making sound design decisions. This blog post describes a mental model of a data processing pipeline with two main goals: to establish a common language and set of concepts, and to help us design data pipeline architectures that follow best practices. From a practical standpoint, this framework will guide you in designing and evaluating any data pipeline, from a simple batch job to a petabyte-scale stream processing architecture. We will develop a real example alongside the explanation for teaching purposes.

Use case

As with anything we want to learn, a good, practical example is the way to go. Below is a simple set of requirements for a data pipeline; we will develop the concepts alongside its implementation:

I want to upload a CSV file to a given S3 bucket (Capture stage).
I want an automatic calculation to be performed on the input data (Ingest stage).
I want to build a new CSV file containing the result of that calculation and each ID from the original CSV (Transform stage).
I want to store the new CSV file in an S3 bucket (Store stage).
I want to be able to consume the CSV over HTTP (Profit stage).

These steps are chained together, and the first one triggers all the others.
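One way this chaining could begin, assuming the upload bucket sends S3 event notifications to a handler (for example an AWS Lambda function; the post does not name the service), is for the handler to read the bucket and object key from the event. This is only a sketch: the function name `extract_uploaded_objects` and the sample bucket and key are hypothetical.

```python
from urllib.parse import unquote_plus


def extract_uploaded_objects(event):
    """Return (bucket, key) pairs for every record in an S3 event notification."""
    uploads = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications (spaces become "+").
        key = unquote_plus(s3_info["object"]["key"])
        uploads.append((bucket, key))
    return uploads


# Hypothetical event, trimmed to the fields used above.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-input-bucket"},
                "object": {"key": "uploads/daily+report.csv"},
            }
        }
    ]
}

print(extract_uploaded_objects(sample_event))
```

Each returned pair identifies a file that the later stages can then read and process.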

5 stages of a data pipeline

To simplify our understanding of data management pipelines, we define 5 distinct, consecutive stages, going from raw data to getting actual value from it. DISCLAIMER: This is a humongous simplification for educational purposes. Please do not take it as a literal implementation guide.

We can think of these steps as sequential, each one producing the input of the next.
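The idea of each stage feeding the next can be sketched as plain function composition. The stage functions below are hypothetical stand-ins that do trivial work; only the chaining in `run_pipeline` is the point.

```python
from functools import reduce


def run_pipeline(stages, raw_data):
    """Feed raw_data through each stage in order; each output is the next input."""
    return reduce(lambda data, stage: stage(data), stages, raw_data)


# Hypothetical stand-ins for the five stages, each a plain function.
def capture(text):
    return text.strip()


def ingest(text):
    return [line for line in text.splitlines() if line]


def transform(lines):
    return [line.upper() for line in lines]


def store(lines):
    return {"stored": lines}


def profit(storage):
    return len(storage["stored"])


stages = [capture, ingest, transform, store, profit]
result = run_pipeline(stages, "a\n\nb\n")
print(result)
```

Because every stage is just a function of the previous output, stages can be swapped, dropped, or reordered by editing the `stages` list.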

Capture stage

The pipeline's life starts as soon as the data is produced. We want some artifact sitting right next to the data production process to capture the data and send it to the ingestion process of our pipeline. At this stage, you face challenges such as avoiding data loss, transmitting the data efficiently, securing it, and authenticating it.
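As one illustration of the authentication challenge, the capturing side could attach an HMAC signature to each payload so the ingestion side can check that the data came from a trusted producer and was not altered in transit. This is a minimal sketch that assumes both sides share a secret key; the function names and the sample payload are hypothetical.

```python
import hashlib
import hmac


def sign_payload(payload: bytes, secret: bytes) -> str:
    """Return a hex HMAC-SHA256 signature for the payload."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_payload(payload: bytes, signature: str, secret: bytes) -> bool:
    """Check that the signature matches the payload under the shared secret."""
    expected = sign_payload(payload, secret)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, signature)


secret = b"shared-secret-for-illustration-only"
payload = b"id,value\n1,10\n"
signature = sign_payload(payload, secret)

print(verify_payload(payload, signature, secret))
print(verify_payload(b"id,value\n1,99\n", signature, secret))
```

In a real deployment the secret would come from a secrets store rather than the source code.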

Ingest

The ingestion process bridges the data between the producing agent and the location where the data will continue down the pipeline. It receives the data from the Capture process and optionally sorts, cleans, enriches, discards, or relays some or all of the data. It can receive data in batches (as files, for example) or as a stream.
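A batch version of the cleaning, discarding, and sorting described above might look like the sketch below. It assumes the input CSV has `id` and `value` columns, which the post does not specify; the function name `ingest_rows` is also hypothetical.

```python
import csv
import io


def ingest_rows(csv_text):
    """Clean, discard, and sort rows from CSV text with assumed columns id,value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for row in reader:
        # Missing fields come back as None, so fall back to an empty string.
        row_id = (row.get("id") or "").strip()
        value = (row.get("value") or "").strip()
        if not row_id or not value:
            continue  # discard incomplete rows
        cleaned.append({"id": row_id, "value": value})
    # IDs are compared as strings here; numeric IDs would need a numeric key.
    return sorted(cleaned, key=lambda r: r["id"])


raw = "id,value\n b , 2 \n,5\na,1\nc,\n"
print(ingest_rows(raw))
```

Keeping the checks cheap at this stage leaves the heavier work for the Transform stage.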

As you work more and more with data pipelines you discover that all these stages can be combined, reordered, or skipped completely in the name of efficiency, security, or simple operational convenience.

Transform

This process transforms the data for its different uses. For example, you may transform the data for use in analytics and also convert it to a format optimized for storage. This is normally where you run the most expensive sanitization and transformation work, because the data has already been cleaned (if that was done at ingestion) and more computing power is available.
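For the use case in this post, the Transform stage builds a new CSV holding each original ID and a calculated result. The post does not say what the calculation is, so the sketch below uses doubling as a placeholder; the `id` and `value` column names and the function name `transform_rows` are likewise assumptions.

```python
import csv
import io


def transform_rows(rows):
    """Build CSV text with one id,result line per input row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "result"])
    for row in rows:
        # Placeholder calculation: the real one is not given in the post.
        result = float(row["value"]) * 2
        writer.writerow([row["id"], result])
    return out.getvalue()


rows = [{"id": "a", "value": "1"}, {"id": "b", "value": "2.5"}]
print(transform_rows(rows))
```

Swapping in the real calculation only requires changing the line that computes `result`.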

Storage

When the data is ready, it is stored in a durable medium for consumption by the value-generating processes. The data can be copied multiple times and sent to different storage media according to its expected uses, such as a data lake for long-term storage, a data warehouse for analytical processing, or even SQL and NoSQL databases.
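To illustrate storing the same data in more than one medium, the sketch below writes the result CSV to a directory (standing in for a data lake or S3 bucket) and also loads it into a SQLite table (standing in for a SQL database). Both stand-ins, the `results` table, and the function name `store_result` are assumptions made for a runnable local example.

```python
import csv
import io
import sqlite3
import tempfile
from pathlib import Path


def store_result(csv_text, lake_dir, connection):
    """Keep a file copy of the CSV and load its rows into a SQL table."""
    # Long-term copy: keep the file as-is, as an object store would.
    lake_path = Path(lake_dir) / "result.csv"
    lake_path.write_text(csv_text, encoding="utf-8")

    # Queryable copy: load the rows into a table.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    connection.execute(
        "CREATE TABLE IF NOT EXISTS results (id TEXT PRIMARY KEY, result REAL)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO results (id, result) VALUES (?, ?)",
        [(r["id"], float(r["result"])) for r in rows],
    )
    connection.commit()
    return lake_path


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(":memory:")
    path = store_result("id,result\na,2.0\nb,5.0\n", tmp, conn)
    stored_rows = conn.execute(
        "SELECT id, result FROM results ORDER BY id"
    ).fetchall()
    lake_copy = path.read_text(encoding="utf-8")
    conn.close()

print(stored_rows)
```

The file copy preserves the data exactly as produced, while the table copy supports queries.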

Profiting from data

In this step, you start getting value from your data. From simple analytics to the most sophisticated deep learning techniques, the possibilities are endless.
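At the simple end of that range, getting value can be as small as summarizing the stored results. The sketch below assumes a result CSV with `id` and `result` columns; the function name `summarize_results` is hypothetical.

```python
import csv
import io
import statistics


def summarize_results(csv_text):
    """Return the count, mean, and maximum of the result column."""
    values = [
        float(row["result"]) for row in csv.DictReader(io.StringIO(csv_text))
    ]
    if not values:
        return {"count": 0, "mean": None, "max": None}
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


summary = summarize_results("id,result\na,2.0\nb,5.0\n")
print(summary)
```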

Moving forward

There are many variations on this path, such as repeated or looping steps, additional processes, and more. We will explain these variations with real-world examples in upcoming posts in this series.

For you, the one trying to profit from your data, the most important step is the last one. Our DataOps teams at Teracloud can take care of all the previous steps with the utmost efficiency, quality, and security.

Carlos Barroso

Senior MLOps Engineer

teracloud.io