DataIngestion Archives - Napa Analytics

A Data Pipeline is a set of steps that Extract, Load and Transform data for consumption by the end user. As part of the blog series on Data Pipelines, we spoke about Data Ingestion and the different open-source and commercial players. In this blog we will talk about the different data ingestion methods.

The need for Data Ingestion types

In an earlier blog, we spoke about the different Data Pipeline types and how the need for data defined the data pipeline. The approaches to data ingestion we are about to explore result from how quickly the end user needs the data to be available for consumption. The Data Ingestion methods are:

Batch
Real-time
Lambda Architecture

Batch Data Ingestion method

As the name suggests, the data is extracted from the source and moved to the destination at a specified time. The ingestion process could run once a day or multiple times a day at predetermined times. This is the preferred and most widely used ingestion method.
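A scheduled batch run can be sketched as follows. This is a minimal illustration rather than a production design: it assumes a hypothetical `events` table at the source and a hypothetical `staged_events` table at the destination, uses SQLite only so the example is self-contained, and tracks progress with a simple "last id seen" watermark.

```python
import sqlite3


def batch_ingest(source, staging, last_id=0):
    """Copy every source row with id > last_id into staging in one run.

    `events` and `staged_events` are hypothetical table names used only
    for this illustration.
    """
    rows = source.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    staging.executemany(
        "INSERT INTO staged_events (id, payload) VALUES (?, ?)", rows
    )
    staging.commit()
    # Return the new watermark so the next scheduled run resumes from it.
    return rows[-1][0] if rows else last_id
```

A scheduler (cron, for example) would call `batch_ingest` at the predetermined times and pass in the watermark returned by the previous run, so each run moves only the rows that arrived since the last one.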

Real-time Data Ingestion Methods

Data ingestion in real-time, also known as streaming ingestion, is the continuous ingestion of data from a streaming source. A streaming source can be social media feeds/listens or data from IoT devices. In this method, data is retrieved as it is generated and then stored in the data lake.
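The contrast with batch ingestion can be sketched in a few lines. The example below is an assumption-laden illustration: `sensor_stream` is a made-up stand-in for an IoT source, a plain Python list stands in for the data lake, and `max_records` exists only so the demonstration terminates. A real setup would typically read from a streaming platform instead.

```python
import itertools
from typing import Iterable, Iterator


def sensor_stream() -> Iterator[dict]:
    """Hypothetical IoT source: yields one reading at a time, without end."""
    for seq in itertools.count(1):
        yield {"seq": seq, "temperature_c": 20.0 + (seq % 5)}


def stream_ingest(source: Iterable[dict], sink: list, max_records: int) -> int:
    """Store each record as soon as it is produced.

    `sink` stands in for the data lake; `max_records` bounds the loop
    so that this sketch stops instead of running forever.
    """
    ingested = 0
    for record in itertools.islice(source, max_records):
        sink.append(record)  # written immediately, no waiting for a schedule
        ingested += 1
    return ingested
```

The key difference from the batch approach is that there is no schedule: each record is handled the moment the source emits it.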

Lambda architecture-based Data Ingestion Method

Lambda architecture is a data ingestion setup that combines the real-time and batch methods. This setup consists of batch, serving, and speed layers. The first two layers index data in batches, while the speed layer instantaneously indexes the data to make it available for consumption. Together, the layers ensure that data is available for consumption with low latency.
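How the three layers cooperate can be sketched with a toy event counter. This is a simplified illustration under stated assumptions, not a reference implementation: the class name `LambdaCounts` and its methods are invented for this example, and in-memory counters stand in for what would normally be separate storage and processing systems.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda setup that counts events per key (illustrative names only)."""

    def __init__(self):
        self.master = []             # append-only record of every event
        self.batch_view = Counter()  # rebuilt periodically by the batch layer
        self.speed_view = Counter()  # updated per event by the speed layer

    def ingest(self, key):
        """Speed layer: make the event visible immediately."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch layer: recompute the view from all data, then reset the
        speed view because the batch view now covers those events."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key):
        """Serving side: merge the batch view with the recent events."""
        return self.batch_view[key] + self.speed_view[key]
```

For example, after ingesting `"a"` twice, `query("a")` returns 2 both before and after `run_batch()`; the batch run only changes which view holds the count, not the answer the consumer sees.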

Summary

Data Ingestion is the first step of the ELT process and has different methods of extracting data from the sources. The data consumption needs of the data users define the data ingestion method as either batch, real-time, or lambda architecture. In the next blog, as part of the Data Pipeline series, we will talk about the Data Storage layer.

In the previous blog series, we defined data pipelines, the types of data pipelines, and the data pipeline components. We identified the three main pieces of a data pipeline: Extract, Load, and Transform. In this blog, we focus on “Extract”, also referred to as Data Ingestion.

Data Ingestion

Data Ingestion is the movement of data from different data sources to a storage destination for further processing/analysis.

In the past, most of the data sources were structured, so data ingestion was a simple matter of connecting over JDBC/ODBC and extracting the data. With the increase in the number and variety of data sources, data ingestion has become complex. Fortunately, there are many open-source and commercial tools that take away the complexity and make it easier to extract data from a wide variety of data sources.
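The difference between the two situations can be sketched as follows. This is an illustration under assumptions: Python's built-in `sqlite3` module stands in for a JDBC/ODBC connection, the `customers` table and the `id`/`name` fields are hypothetical, and JSON lines stand in for just one of the many newer source formats.

```python
import json
import sqlite3


def extract_structured(conn):
    """Structured source: connect, run a query, read rows with a known schema."""
    cur = conn.execute("SELECT id, name FROM customers ORDER BY id")
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur]


def extract_semi_structured(lines):
    """Semi-structured source: each JSON document may omit or add fields,
    so the extractor has to pick out and default the ones it needs."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        records.append({"id": doc.get("id"), "name": doc.get("name")})
    return records
```

The structured case relies on the database to guarantee the shape of every row, while the semi-structured case pushes that work onto the extractor. Each additional source format needs similar handling, which is the complexity that ingestion tools aim to absorb.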

Data Ingestion tools

Data Ingestion tools are software products that gather and transfer structured, semi-structured, and unstructured data from the source to a staging layer. The tools provide connectivity to diverse data sources, automate data movement, and monitor the movement. There are two categories of Data Ingestion tools:

Open-source tools – Apache HDFS (Hadoop Distributed File System) is an open-source tool for storing large amounts of data. With its increased use, there is a multitude of Apache open-source projects for ingesting, loading, and transforming data into HDFS and cloud storage. These tools are free to use and have a large community of developers who add to and support the feature sets.
Commercial tools – These come from software companies that have been in the data space and have evolved to provide the connectivity, security, UI/UX, and ease of use needed to ingest data from different data sources.

The figure below shows a subset of open-source and commercial solutions for data ingestion.

Prevalence of Open-source tools

Most ELT data pipelines use open-source tools for data ingestion. Open-source tools initially supported Apache Hadoop; later, Sqoop and Flume were added to extract data from structured data sources and from log and event data, respectively. More connectors were added as the number and variety of data sources increased. In addition, the need for real-time data led to open-source tools such as Apache Kafka, Apache Samza, and Apache NiFi.

Commercial tools usage

These tools cater to users who do not have the depth of technical expertise that is required by the open-source tools. These tools provide drag-and-drop functionality and the risk management that is needed by large firms.

Which is better: open-source or commercial tools?

Commercial tools, with their functionality, UI/UX, and support, are welcome in most major organizations. However, open-source tools are catching up in functionality and UI/UX. Organizations are noticing the improvements in open-source tools and the broad community support, and their development teams are moving more towards open source.

Summary

Data Ingestion takes data from different data sources and loads it into a staging layer. Both open-source and commercial tools are available for data ingestion. Even though commercial tools provide the support and ease of use, open-source tools are catching up and are becoming important players for Data Ingestion.