In the first article in this series, we talked about the Data Pipeline. A Data Pipeline is the process of taking data from a source and getting it ready for consumption by end users. This blog explains the different data pipeline components.

A Data Pipeline consists of the following components, as shown in the diagram below:

Data Pipeline Components

Origin

Origin is where the data is sourced from. Data can be sourced from transaction systems, social media, IoT sensors, application APIs, legacy servers, data warehouses, and analytic data stores. The number of data sources, and how frequently the data needs to be available to data consumers, is increasing every day. Mapping data consumers' expectations against the data sources is essential to deciding the kind of data pipeline treatment given to the data from the source (Origin).
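
To make this concrete, here is a minimal sketch in Python of cataloguing origins along with how often consumers expect fresh data from each; the source names and refresh intervals are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class Origin:
    """Illustrative description of a data source and how often consumers need it."""
    name: str
    kind: str             # e.g. "transactional", "api", "iot", "file"
    refresh_minutes: int  # how frequently consumers expect new data

# Hypothetical catalog of origins; real names and frequencies would come
# from mapping consumer expectations against each source.
origins = [
    Origin(name="claims_db", kind="transactional", refresh_minutes=60),
    Origin(name="web_clicks", kind="api", refresh_minutes=5),
    Origin(name="plant_sensors", kind="iot", refresh_minutes=1),
]

for o in origins:
    print(f"{o.name}: refresh every {o.refresh_minutes} min ({o.kind})")
```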

Destination

This is where the data is consumed. Thus, it should be the starting point when designing a data pipeline. Data consumer requirements define the data source type, frequency, and transformation. However, the data consumer requirements also bring to light the constraints at the source (Origin). For example, in a large medical insurance company, the medical claims representative needs to see claims as they are being generated. However, the claims processing system takes some time (< 1 min) to move the data to the mainframe system, and it takes some more time for the data to move through the channel to the medical claims representative. The latency thus introduced changes the pipeline delivery SLA from real-time to near real-time. The final design therefore needs to balance data consumer requirements with the capacity and capability of the Origin (data sources).
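
As a rough illustration of the claims example, the sketch below adds up assumed per-hop delays and compares the total against what the consumer expects; the numbers are hypothetical, not measurements:

```python
# A minimal sketch of the claims example above: add up the delays introduced
# at each hop and compare the total against the consumer's expectation.
hops_seconds = {
    "claims_system_to_mainframe": 55,   # the "< 1 min" quoted above
    "mainframe_to_channel": 30,
    "channel_to_representative": 20,
}

total_latency = sum(hops_seconds.values())
sla = "real-time" if total_latency <= 5 else "near real-time"
print(f"End-to-end latency: {total_latency}s -> deliverable SLA: {sla}")
```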

Dataflow

Dataflow is the set of steps that must be orchestrated to deliver the data from Origin to Destination. Dataflow steps can be sequential and/or parallel.

Parallel Dataflow
Sequential Dataflow

Dataflow steps can pass the output of one step as the input to the next, from Origin to Destination, in a sequential manner; this is the most common Dataflow during the initial phases of Data Engineering. As data consumption requirements mature, there is a need for an intermediate data store while maintaining the original sequential flow, and this leads to parallel dataflows.
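
The sketch below illustrates the difference, assuming a Python pipeline with toy step functions: a sequential flow feeds each step into the next, while a parallel flow writes the extracted data to two targets at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy steps; in a real pipeline each function would move or transform data.
def extract():
    return "raw rows"

def load_staging(rows):
    return f"staged({rows})"

def load_lake(rows):
    return f"lake({rows})"

# Sequential dataflow: each step feeds the next, Origin to Destination.
rows = extract()
staged = load_staging(rows)

# Parallel dataflow: once extracted, the data is written to the staging
# area and to an intermediate data store at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(load_staging, rows), pool.submit(load_lake, rows)]
    print([f.result() for f in futures])
```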

Data Storage

Data Storage is an intermediary step that enhances data movement. There are many choices for an intermediate data store, and the choice needs to be based on data volume and query complexity. With the advent of new data storage options, it is important to choose a data store that supports the required structure, format, and retention duration, and that is fault tolerant and has disaster recovery.
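
As a simple illustration, the sketch below uses SQLite (from the Python standard library) as a stand-in intermediate store; any real choice would be driven by the volume, query, retention, and recovery requirements above.

```python
import sqlite3

# A minimal sketch of an intermediate data store using SQLite (stdlib).
# The table and file names are illustrative.
conn = sqlite3.connect("staging.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS stg_orders (order_id INTEGER, amount REAL, loaded_at TEXT)"
)
conn.execute(
    "INSERT INTO stg_orders VALUES (?, ?, datetime('now'))", (101, 49.99)
)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0], "rows staged")
conn.close()
```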

Processing

This part of the data pipeline covers ingestion, persistence, transformation, and delivery. Each of the processing steps has different methods and considerations, which we will talk about in future blogs.
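
The sketch below shows the four stages as plain Python functions chained together; the record shapes and stage bodies are placeholders for the real ingestion, persistence, transformation, and delivery logic.

```python
# A minimal sketch of the four processing stages named above.
def ingest():
    return [{"id": 1, "amount": "12.50"}, {"id": 2, "amount": "7.25"}]

def persist(records):
    # e.g. append to the staging store; here the data just passes through
    return records

def transform(records):
    # cast types and add any derived fields
    return [{"id": r["id"], "amount": float(r["amount"])} for r in records]

def deliver(records):
    # e.g. load a reporting table or publish to consumers
    print(f"delivered {len(records)} records")

deliver(transform(persist(ingest())))
```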

Workflow

Workflow defines and manages the data pipeline steps. Workflow ensures that the dependencies for each data pipeline step are satisfied before starting that step. Workflow needs to manage both the individual steps and the overall job to achieve the results required at the Destination.
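
A minimal way to picture this in Python is a dependency graph in which a step only runs once everything it depends on has finished; the step names below are illustrative, and a real workflow engine would execute the steps rather than print them.

```python
from graphlib import TopologicalSorter

# Each step lists the steps it depends on; the workflow only starts a step
# once its dependencies have finished.
dependencies = {
    "extract": set(),
    "load_staging": {"extract"},
    "transform": {"load_staging"},
    "load_reporting": {"transform"},
    "refresh_dashboard": {"load_reporting"},
}

for step in TopologicalSorter(dependencies).static_order():
    print(f"running {step}")  # a real engine would execute the step here
```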

Monitoring

Monitoring is observing the data as it moves through the data pipeline to ensure the quality of the data at the Destination.
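
The sketch below shows the kind of checks monitoring might run as data lands at the Destination; the thresholds and record layout are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Illustrative checks: row counts, null rates, and freshness.
rows = [
    {"claim_id": 1, "amount": 120.0, "loaded_at": datetime.now()},
    {"claim_id": 2, "amount": None,  "loaded_at": datetime.now()},
]

row_count = len(rows)
null_rate = sum(r["amount"] is None for r in rows) / max(row_count, 1)
stale = any(datetime.now() - r["loaded_at"] > timedelta(hours=1) for r in rows)

assert row_count > 0, "no rows delivered"
if null_rate > 0.1:
    print(f"warning: {null_rate:.0%} of amounts are null")
if stale:
    print("warning: data older than the 1-hour freshness window")
```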

The above components are one way to break down the Data Pipeline process. Data pipelines are a key component of Data Engineering, and we will talk about them in detail in future blogs. In the next blog, we will cover the types of data pipelines and the use cases they support.

At Napa Analytics we take pride in helping clients with their Data Engineering needs and would love to hear from you on your Data Engineering needs/challenges. Please drop us an email at rnemani@napanalaytics.com

A Data Pipeline is a sequence of steps that delivers consumable data to end users. Why do we need a sequence of steps? In today's world, data comes from diverse sources in different formats, and it is the job of a data engineer to make consumable data available to various consumers. Automated orchestration of these steps is the gist of a data pipeline.

This blog intends to talk about data pipelines. Parts of the data process are simple, and other parts are complex. Let us start by looking at a typical data process, as shown in the figure below:

Process to deliver consumable data

The data from the sources is extracted as-is and stored in the staging area. As seen in the above image, the number of data sources is finite; in the example, they are a mainframe, a cloud API, a database, and a text file. The data from the sources is stored in the staging area as one table for each source table. Changes to the source data structures and the addition of new data sources are not that frequent, so the data pipelines for extract and load are fairly standard. We at Napa Analytics have built a Python data framework that reduces the time and effort of creating these pipelines.
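
The sketch below illustrates this standard extract-and-load pattern in Python with pandas: each source table is copied as-is into a staging table of the same name. In-memory SQLite databases stand in for the real sources and staging area purely so the example is self-contained; it is not our framework itself.

```python
import sqlite3
import pandas as pd

# Stand-ins for a real source system and staging area.
source = sqlite3.connect(":memory:")
staging = sqlite3.connect(":memory:")

source.executescript(
    "CREATE TABLE customers (id INTEGER, name TEXT);"
    "INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');"
    "CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);"
    "INSERT INTO orders VALUES (10, 1, 99.0), (11, 2, 45.0);"
)

# One staging table per source table, loaded as-is.
for table in ["customers", "orders"]:
    df = pd.read_sql(f"SELECT * FROM {table}", source)
    df.to_sql(f"stg_{table}", staging, if_exists="replace", index=False)
    print(f"stg_{table}: {len(df)} rows loaded")
```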

Data Pipelines

A typical data process can be considered a combination of more than one data pipeline. The above diagram shows a typical set of data pipelines. For simplicity, we have considered only one data source (a normal situation would have more than one). Dividing the data process into multiple pipelines ensures proper maintenance and the ability to perform timely quality checks. Data quality is one of the main concerns as the number of data sources increases.

Let us discuss the different pipelines:

  1. Data Pipeline # 1 – This is a straight move from the Mainframe data source into staging. Staging has a one-to-one mapping to the tables in the source database. We could go one step further and de-normalize the data, i.e., flatten the data from multiple tables into one flat table, which can then be used to extract relevant information (see the sketch after this list).
  2. Data Pipelines # 2 – 5 – These are the pipelines specific to user requirements. For example, one consumer can be a MicroStrategy user who requires the table structure in a format that feeds pre-defined reports. Another consumer can be a data analyst who is interested in dimensions and facts for aggregate analysis and reporting. Each consumer, with their specific needs, would require a specific data pipeline.
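
The sketch below illustrates both ideas with pandas: the flattening step from pipeline # 1 and a consumer-specific aggregate of the kind pipelines # 2 – 5 would produce. Table and column names are made up for the example.

```python
import pandas as pd

# Hypothetical staging tables.
stg_customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Acme", "Globex"]})
stg_orders = pd.DataFrame(
    {"order_id": [10, 11, 12], "customer_id": [1, 2, 1], "amount": [99.0, 45.0, 30.0]}
)

# Pipeline # 1 extension: flatten the staging tables into one wide table.
flat = stg_orders.merge(stg_customers, on="customer_id", how="left")

# Pipeline # 2 - 5 style: an aggregate shaped for a specific consumer,
# e.g. revenue per customer for a reporting tool.
revenue_per_customer = flat.groupby("name", as_index=False)["amount"].sum()
print(revenue_per_customer)
```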

Data pipelines that deliver data in a consumable format are thus sets of commands orchestrated to perform tasks in either a sequential or a parallel fashion. The initial extract and load can be considered a generic pipeline, with the transformation pipelines gaining complexity based on consumer requirements.

The author is a data engineering expert and co-founder of Napa Analytics. Napa Analytics is working on a data framework that equips our clients with tools that reduce the effort (time and knowledge) needed to create and maintain data pipelines.

Data exists in one of three states: Data at Rest, Data in Motion, and Data in Use. Understanding these three states lays the foundation for extracting value from data to support business operations. This blog introduces the concepts and lays the basis for the upcoming data engineering blogs in the series.

Transactional systems such as point of sale (POS) and enterprise resource planning (ERP) systems generate and store data in a database or on a mainframe. Website clicks and social media data are other sources of data. Using those two examples of data types, we can define the three states of data.

States of Data

Data at Rest

Data at rest is data generated by a transactional system. Data analytics and business intelligence teams use data at rest to extract value. Data at rest resides on hard drives in the company's network or in cloud storage governed by security policies, so it needs to be secured. Best practice favors encrypting data at rest and disabling access from external devices such as USB sticks or external hard drives. For a long time, data at rest has been the primary source of business intelligence. Even today, most data engineering tasks still use data at rest. One reason for the prevalence of data at rest is the existence of older systems that still provide value. Another reason is that transactional systems contribute around 60% of the data sources for analytics, business intelligence, and algorithms.
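
As a small illustration of encryption at rest, the sketch below encrypts a record before it is written and decrypts it for an authorized reader. It assumes the third-party cryptography package is installed, and it sidesteps key management, which in practice belongs in a secrets manager.

```python
from cryptography.fernet import Fernet

# Key management is deliberately simplified; a real key lives in a secrets manager.
key = Fernet.generate_key()
fernet = Fernet(key)

record = b"claim_id=1,amount=120.00"
encrypted = fernet.encrypt(record)     # what gets written to disk
decrypted = fernet.decrypt(encrypted)  # what an authorized reader sees

assert decrypted == record
print("stored bytes are unreadable without the key:", encrypted[:20], "...")
```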

Data in Motion and Data in Use

When transactions happen in real time, we generate Data in Motion. The shelf life of the value of data in motion is limited, so to ensure end users get that value as soon as possible, data analytics teams need to make the data available for consumption quickly. For example, a social media mention may only hold value for about a day; it follows that if end users are to get any value from it, data engineers need to extract it in less than 24 hours.
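
A minimal sketch of that shelf-life idea: filter out records that have aged past an assumed 24-hour window of value.

```python
from datetime import datetime, timedelta, timezone

# Keep only social media mentions still within their (assumed) 24-hour window.
now = datetime.now(timezone.utc)
mentions = [
    {"text": "great service!", "posted_at": now - timedelta(hours=3)},
    {"text": "old promo",      "posted_at": now - timedelta(days=2)},
]

fresh = [m for m in mentions if now - m["posted_at"] <= timedelta(hours=24)]
print(f"{len(fresh)} of {len(mentions)} mentions still carry value")
```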

Once data is processed and available for consumption, it gives rise to Data in Use. As the name suggests, data in use is not static or passive; instead, it is actively moving through an IT system. Some examples of data in use include data being processed in the CPU, a database, or RAM.

Now that we have identified the different data states, we will talk about extracting the data from the sources in the next set of blogs. We will also discuss how to load it into a storage layer and transform the data for consumption.