Data ingestion
Key Takeaways:
- Data ingestion is the process of taking data in from a source and putting it into a data warehouse.
- Because multiple tools and resources rely on data, data ingestion is a cornerstone process for your customer data infrastructure.
- Automation, data governance policies, and a scaling plan are best practices to keep your data ingestion process running smoothly.
What is data ingestion?
Data ingestion is the process of taking data from a source and putting it into a destination.
When using data for business or marketing purposes, organizations start by gathering data from a collection tool and putting it into a data warehouse. The process of moving that data from collection into storage is called data ingestion.
Data ingestion can be a manual process, where someone takes the data and enters it into the data warehouse by hand, but it’s most often done with data infrastructure tools or software designed to automate the process.
Types of data ingestion
Data can be ingested into a data warehouse through a variety of tools and methods, but it will typically follow one of these approaches:
- Batching - Data ingestion tools collect the data into a batch of multiple entries, which are processed together at a specified time, usually once per day. Batch ingestion requires fewer resources, as the batches can run off-hours when nobody else is using the data warehouse. However, it also means the data is updated much more slowly.
- Real-time processing - Data ingestion tools process the data as soon as it is collected. This is a more resource-intensive form of data ingestion, but it provides instant access to the data, so users are always working from the latest information.
- Micro-batching - This form of data ingestion sits between real-time ingestion and batching. Data is collected into small batches (such as 10 or 20 entries), which are processed as soon as enough data has accumulated (see the sketch after this list).
- Lambda architecture processing - Lambda architecture is a data architecture style that includes one ingestion pipeline for real-time data and another for batched data. In addition to the data layers for those two pipelines, a third layer balances the two ingestion styles, reconciling the real-time and batch views. This gives you the benefits of both approaches, but it requires building your data architecture around the Lambda pattern.
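As a rough illustration of the micro-batching approach above, here's a minimal Python sketch. The batch size and the `load_batch` function are hypothetical stand-ins for whatever loader your pipeline actually uses.

```python
# Minimal micro-batching sketch: buffer incoming records and flush
# them to the warehouse once the batch reaches a threshold.

BATCH_SIZE = 20  # flush after this many records (tune to your workload)
buffer = []

def load_batch(records):
    # Hypothetical stand-in: a real loader would bulk-insert into
    # the warehouse via a connector or vendor SDK.
    print(f"Loading {len(records)} records")

def ingest(record):
    """Collect records into small batches and flush when full."""
    buffer.append(record)
    if len(buffer) >= BATCH_SIZE:
        load_batch(buffer)
        buffer.clear()
```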
The data ingestion process
Because organizations use different data infrastructure processes and tools, the actual data ingestion process may look different from company to company. However, it should generally include the following stages:
Data collection
First, you need data to ingest. This means you need collection methods and tools, such as web forms, surveys, customer relationship management (CRM) tools, your existing data lakes or databases, and any other data sources. Before setting up a data ingestion process, take some time to discover and identify the data sources you need to ingest.
Data acquisition
This stage connects the various data sources to the main warehouse. Many tools offer integration solutions or export capabilities, so it can be helpful to speak with the vendor or consult their documentation for more details on connecting their tools to the data warehouse.
This stage also includes data onboarding, the process of connecting offline data sources to your data ingestion pipeline.
Data validation
Before adding any data to your warehouse, verify that it is accurate and consistent. For example, the pipeline should verify that data is unique, check whether an entry is already in the database, and confirm that formatted fields (such as dates) came through in the expected format. This helps keep data quality high and can help prepare the data for the next step.
The validation step can also help prevent some web security issues by blocking attacks like SQL injection or cross-site scripting (XSS), in which an attacker injects malicious payloads that get interpreted not as data but as code (by your database in the case of SQL injection, or by a user's browser in the case of XSS).
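As a simple illustration of these checks, here's a minimal Python sketch; the field names and rules are hypothetical and would depend on your own data model.

```python
import re
from datetime import datetime

def validate_entry(entry, seen_ids):
    """Run basic validation checks before an entry is loaded.

    Returns a list of problems; an empty list means the entry passed."""
    problems = []

    # Uniqueness: reject entries that are already in the database.
    if entry.get("id") in seen_ids:
        problems.append("duplicate id")

    # Completeness: required fields must be present and non-empty.
    for field in ("id", "email", "signup_date"):
        if not entry.get(field):
            problems.append(f"missing field: {field}")

    # Format: dates must parse and emails must look like emails.
    try:
        datetime.strptime(entry.get("signup_date") or "", "%m/%d/%Y")
    except ValueError:
        problems.append("signup_date not in MM/DD/YYYY format")
    if entry.get("email") and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", entry["email"]):
        problems.append("invalid email")

    return problems
```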
Data hygiene
Even after the data has been validated, it may still be in a different format than the tools that consume it need. For example, an organization may gather data from a source that records dates as a written-out month, day, and year instead of MM/DD/YYYY. Data transformation is a series of steps that adjust the data so that all entries are consistent. This process may require conversations with the people who will use the data to ensure everything collected, validated, and transformed is in a usable format.
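As a small illustration of this kind of transformation, the Python sketch below normalizes dates from a few source formats into MM/DD/YYYY; the list of input formats is hypothetical and would be extended to match your actual sources.

```python
from datetime import datetime

# Hypothetical source formats: "January 5, 2024", "2024-01-05", "01/05/2024"
SOURCE_FORMATS = ["%B %d, %Y", "%Y-%m-%d", "%m/%d/%Y"]

def normalize_date(raw):
    """Convert a date string from any known source format to MM/DD/YYYY."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(normalize_date("January 5, 2024"))  # -> 01/05/2024
```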
Data loading
The final step of the data ingestion process is taking the transformed and validated data and putting it into the warehouse. After it is loaded, the data is ready to be used by other tools.
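Here's a minimal sketch of the loading step, using sqlite3 as a runnable stand-in for a real warehouse; the table and column names are hypothetical, and a production pipeline would use the warehouse's own connector or bulk-load API instead.

```python
import sqlite3

def load_entries(entries):
    """Load validated, transformed entries into the warehouse table."""
    conn = sqlite3.connect("warehouse.db")  # stand-in for a real warehouse
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(id TEXT PRIMARY KEY, email TEXT, signup_date TEXT)"
    )
    # Upsert each entry; entries are dicts keyed by column name.
    conn.executemany(
        "INSERT OR REPLACE INTO customers (id, email, signup_date) "
        "VALUES (:id, :email, :signup_date)",
        entries,
    )
    conn.commit()
    conn.close()

load_entries([{"id": "c1", "email": "a@example.com", "signup_date": "01/05/2024"}])
```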
Data ingestion vs. data integration
Data integration is the process of taking data from multiple sources and bringing them into the same data warehouse or similar storage resource.
By comparison, data ingestion is the process of taking data from any source system and bringing it into a data warehouse. You can ingest data from multiple sources, but the ingestion process happens once for each of those sources.
Depending on how the data pipeline is constructed, an organization may first ingest data and then integrate it into the data warehouse from multiple ingestion points. Data can also be integrated in batches or in real time.
Data ingestion vs. ETL vs. ELT
Extract, transform, and load (ETL) and extract, load, transform (ELT) pipelines are data pipelines that extract (take in data), and then either transform it (convert it into usable formats) and load it (store it in the data warehouse), or load and then transform it. These processes are most frequently used to prepare data for business intelligence (BI) and analytics purposes.
ETL is the older pipeline pattern, and it was necessary for older technologies running on on-premises hardware: to reduce latency, those databases needed data normalized into a reporting-ready state before loading. ELT pipelines came later and take advantage of cloud data warehouses and greater computing power, which makes the pipeline cheaper and less likely to break in transit while preserving access to the raw data. Another advantage of ELT pipelines is easier verification: the raw data in your warehouse should match the source system exactly, since no transformations occur until after the data is loaded.
In these pipelines, the "extract" step is the data acquisition stage, and the transformation and loading steps follow in whichever order the organization uses (ETL or ELT). However, many data ingestion processes include additional steps to prepare the data for various purposes.
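To make the ELT pattern concrete, here's a minimal sketch; sqlite3 stands in for a cloud warehouse so the example runs, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # stand-in for a cloud warehouse

# Load: copy source rows into a raw table untouched, so the warehouse
# copy matches the source system exactly.
conn.execute("CREATE TABLE IF NOT EXISTS raw_orders (id TEXT, amount TEXT, ordered_at TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("o1", "19.99", "2024-01-05"), ("o2", "5.00", "2024-01-06")],
)

# Transform: derive a cleaned, typed table inside the warehouse,
# leaving raw_orders intact for auditing and reprocessing.
conn.execute("DROP TABLE IF EXISTS orders_clean")
conn.execute(
    """CREATE TABLE orders_clean AS
       SELECT id, CAST(amount AS REAL) AS amount, DATE(ordered_at) AS ordered_at
       FROM raw_orders"""
)
conn.commit()
conn.close()
```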
Benefits of data ingestion
The primary benefit of data ingestion is that it moves data into a data warehouse, creating centralized intelligence across the business for analytics or activation. If the data is not ingested into the warehouse, it remains siloed in the collection tool, unavailable for combining with other sources, such as the behavioral data used for personalization, segmentation, and customer experience.
There are benefits to configuring a data ingestion process beyond data availability, including:
- Data automation - Some forms of data, such as offline data, may require a manual process to collect and use. Adding that data to your data ingestion processes can reduce manual collection and input, freeing people up to do other tasks.
- Data insights - Ingesting data into a data warehouse gives your BI tools access to it, surfacing insights that can drive revenue growth, such as understanding the relationship between call center wait times and net promoter scores (NPS).
- Data uniformity - Not all data collectors will gather the same data or gather it in the same format. The transformation process ensures that all the data looks similar, so teams are always on the same page. Knowing the data format (such as names being in the "First name Last name" format) can help an organization better plan and prepare the new tools it brings on board to activate that data.
- Data flexibility - Different data tools will pull different pieces of data out of your overall data set. Ideally, all of an organization's tools should be able to work from the same data set, which means some tools may collect more data than they need. For example, data sets may include customer names, but some data activation tools work only with anonymized data, so they pseudonymize or strip names from the records, even when they draw on the same data set as your CRM.
Data ingestion use cases
Because the purpose of data ingestion is to get your data from its collection points into a warehouse, the use cases will be as varied as the data sources. For example:
- Unifying data in a warehouse - All of your data collectors can feed and unify data in a data warehouse, creating a single source of truth for customer data for use in downstream destinations like ad platforms or email marketing tools. Ingesting data into a warehouse may also involve identity resolution to consolidate multiple pieces of data about the same users.
- Transaction data - Point of sale and checkout systems collect data about purchasing habits, which your ingestion tools can gather to provide insights that teams can use to balance inventory.
- Social media - Social media platforms collect data about post engagement; ingesting that data lets you aggregate content and analyze sentiment for use in marketing campaigns.
- Machine learning (ML) and AI tools - ML and AI tools require data for the training sets they use to generate content. Ingesting clean, usable data and sending it to those tools can help refine training models.
Data ingestion challenges
Data ingestion is a cornerstone piece in an organization’s data infrastructure and pipeline. While it is a straightforward concept, there are some challenges that can create huge downstream impacts if they are not addressed or at least planned for. Here are some of the most common issues:
Data capture and tool compatibility
While most organizations likely already have some tools and collectors in place, they shouldn't take for granted that these can capture all the data their teams need. Some of the data collection tools in use may turn out to be incompatible with the organization's data ingestion or data warehouse tools.
To address these issues, work with vendors to configure tools properly, or consider finding new tools that are compatible.
Data governance and schema drift
When you begin collecting data, define a model for how that data will be stored and formatted. The data model defines how you will organize the data and metadata, such as all the data your tools will collect for each customer, and how it will all be cataloged so your other tools can work with the data. This is part of the data governance process, which also includes making sure your data is stored securely and in accordance with applicable data laws.
Schema drift occurs when a source schema changes and the changes are not reflected across all tools or processes that use the data model. This can create situations where tools are ingesting data that does not get stored properly, or collectors stop working because they don’t know how to store the data anymore.
The best way to address these issues is to communicate and define a data contract, which lays out clear processes for data model changes and for alerting the relevant teams about updates to data governance policies.
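As a rough illustration, a pipeline can catch drift by comparing each incoming record against the expected data model before loading; the schema and field names in this Python sketch are hypothetical.

```python
# Expected fields per the data model; hypothetical for illustration.
EXPECTED_SCHEMA = {"id", "email", "signup_date"}

def check_schema(record):
    """Return the fields missing from and unexpected in a record."""
    fields = set(record)
    return EXPECTED_SCHEMA - fields, fields - EXPECTED_SCHEMA

record = {"id": "c1", "email": "a@example.com", "plan": "pro"}
missing, unexpected = check_schema(record)
if missing or unexpected:
    # Per the data contract, a real pipeline would alert the owning
    # team here instead of just printing.
    print(f"Schema drift: missing={missing}, unexpected={unexpected}")
```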
Data hygiene and quality
Ingested data can be incomplete or disorganized. For example, an organization may find its data warehouse contains incomplete entries for customers, outdated or duplicated entries, or even old records from a legacy system.
The best way to address these issues is with quality control checks and automation. Periodically check that the data being ingested still matches the data model, that you're still getting the data you need, and that all the tools are working properly. Wherever possible, automate these checks and the ingestion processes; this minimizes human error while freeing up resources. However, it's also important to create and publish policies for reporting errors in the database so they can be addressed before they create issues or cause schema drift.
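These checks can be scripted and run on a schedule. Below is a minimal sketch, assuming the hypothetical customers table from the earlier loading example and using sqlite3 as the warehouse stand-in.

```python
import sqlite3

def run_quality_checks(db_path="warehouse.db"):
    """Scan the warehouse for duplicate and incomplete customer rows."""
    conn = sqlite3.connect(db_path)
    issues = []

    # Duplicated customer ids.
    dupes = conn.execute(
        "SELECT id FROM customers GROUP BY id HAVING COUNT(*) > 1"
    ).fetchall()
    if dupes:
        issues.append(f"{len(dupes)} duplicated customer ids")

    # Incomplete entries: customers missing an email.
    incomplete = conn.execute(
        "SELECT COUNT(*) FROM customers WHERE email IS NULL OR email = ''"
    ).fetchone()[0]
    if incomplete:
        issues.append(f"{incomplete} customers missing an email")

    conn.close()
    return issues  # in practice, report these per your error policies
```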
Scaling and cost
As a company grows and begins to take in more data, it will need more storage for that data. That also means building new pipelines to bring that data in. This will increase the costs and resources needed to run your data pipelines and infrastructure.
While these increases may be inevitable, organizations can mitigate some of their effects by planning to scale up in the future and by determining how to store historical data effectively. Techniques for this include aggregating historical data, moving raw data out of the data warehouse into cloud "cold" storage buckets, and implementing incremental processing during transformation so that you process only new data.
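Incremental processing typically relies on a high-water mark. Here's a minimal sketch, assuming each row carries an ISO-8601 `updated_at` timestamp; the local state file is a hypothetical stand-in for wherever your pipeline keeps its watermark.

```python
import os

WATERMARK_FILE = "last_processed_at.txt"  # hypothetical state location

def read_watermark():
    if os.path.exists(WATERMARK_FILE):
        with open(WATERMARK_FILE) as f:
            return f.read().strip()
    return "1970-01-01T00:00:00"  # first run processes everything

def write_watermark(value):
    with open(WATERMARK_FILE, "w") as f:
        f.write(value)

def incremental_run(rows):
    """Transform only rows newer than the stored watermark."""
    watermark = read_watermark()
    # ISO-8601 timestamps sort correctly as strings.
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    if new_rows:
        # ... transform and load new_rows here ...
        write_watermark(max(r["updated_at"] for r in new_rows))
    return new_rows
```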
Find data storage solutions that scale with the business requirements and data ingestion tools that can work with as wide a variety of formats as possible. Remember that these plans may change, so organizations should be prepared to manage those changes.
What teams and roles are involved with data ingestion?
Any conversation about data ingestion should involve the teams that create and capture data. This includes data and analytics teams, website or web application teams, information technology (IT), product and infrastructure, customer service, finance, and marketing. The infrastructure, data, and IT teams will likely know what data ingestion tools are already in use at an organization.
When building a data ingestion process, ask the teams involved to provide details about the data that their customer interactions generate and how other systems handle, store, and access that data.
Data ingestion tools and platforms
There are a variety of open-source and commercial tools designed to streamline and manage the data ingestion process. These tools include:
- Fivetran
- Meltano
- Talend Data Fabric
- Apache NiFi
- Amazon Kinesis
- IBM Cloud Pak for Data
- SnapLogic
- Oracle Data Integrator
- Airbyte
- Matillion
- Keboola