Instead of an abstract description, here’s a scenario: the CEO wants to know how much money your business could save by purchasing materials in bulk and distributing them to your various locations.
You need to be able to determine how to charge back any unused materials to different business units.
This likely requires pulling data from your ERP system, your supply chain system, potentially third-party vendors, and data describing your internal business structure. In years past, some companies may have tried to create this report within Excel, with multiple business analysts and engineers contributing to data extraction and manipulation.
Data engineers allow an organization to efficiently and effectively collect data from various sources, generally storing that data in a data lake or in several Kafka topics. Once the data has been collected from each system, a data engineer can determine how to optimally join the data sets.
With that in place, data engineers can build data pipelines that allow data to flow out of the source systems. The results of these pipelines are then stored in a separate location, generally in a highly available format that various business intelligence tools can query.
Data engineers are also responsible for ensuring that these data pipelines have correct inputs and outputs. This frequently involves data reconciliation or additional data pipelines to validate against the source systems. Data engineers also have to ensure that data pipelines flow continuously and keep information up to date, utilizing various monitoring tools and site reliability engineering (SRE) practices.
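To make the reconciliation idea concrete, here is a minimal sketch of a validation step that compares record counts between a source system and the pipeline’s output. The table name and the DB-API-style connections that expose execute() are assumptions for illustration, not a prescribed interface.

```python
import logging

def reconcile_row_counts(source_conn, target_conn, table: str) -> bool:
    """Compare record counts between a source system and the pipeline's output.

    A mismatch doesn't say exactly what went wrong, but it flags that the
    pipeline's inputs and outputs no longer agree and need investigation.
    """
    # Both connections are assumed to expose a DB-API-style execute().
    source_count = source_conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    target_count = target_conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    if source_count != target_count:
        logging.warning("Reconciliation failed for %s: source=%d, target=%d",
                        table, source_count, target_count)
        return False
    return True
```

A check like this would typically run as its own scheduled pipeline, alongside the monitoring and SRE practices mentioned above.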
In short, data engineers add value by automating and optimizing complex systems, transforming data into an accessible and usable business asset.
ELT and ETL
Data pipelines come in different flavors, and it’s the role of the data engineer to know which strategy to use and why.
The two most common strategies center on the order in which data is extracted, loaded, and transformed: ELT and ETL. Data always has to be extracted from a source first, but what should happen next is not as simple.
The ELT use case is commonly seen within data lake architectures or systems that need raw extracted data from multiple sources. It allows various downstream processes and systems to work from the same extraction. If you are joining data from a variety of systems and sources, it’s beneficial to co-locate that data in one place before performing transformations.
PRO TIP: Generally speaking, an ELT-type workflow really is an ELT-L process, where the transformed data is then loaded into another location, such as Snowflake, AWS Redshift, or Hadoop, for consumption.
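As a rough illustration of the ELT ordering, the sketch below lands raw extracts first and transforms them afterward, so multiple downstream processes can reuse the same extraction. The pandas usage, storage paths, join key, and ELT-L output location are all hypothetical.

```python
import pandas as pd

RAW_ZONE = "s3://data-lake/raw/"          # hypothetical data lake locations
CURATED_ZONE = "s3://data-lake/curated/"

def elt_pipeline(erp: pd.DataFrame, supply_chain: pd.DataFrame) -> pd.DataFrame:
    """Load raw extracts first, then transform (the extract step has already run)."""
    # Load: land the raw, untransformed extracts in the data lake so any
    # downstream process can reuse the same extraction.
    erp.to_parquet(RAW_ZONE + "erp.parquet", index=False)
    supply_chain.to_parquet(RAW_ZONE + "supply_chain.parquet", index=False)

    # Transform: join and reshape once the raw data is co-located, then load
    # the result again (the "ELT-L" step) for consumption by BI tools.
    curated = erp.merge(supply_chain, on="purchase_order_id", how="left")
    curated.to_parquet(CURATED_ZONE + "bulk_purchasing.parquet", index=False)
    return curated
```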
In contrast, an ETL (extract, transform, load) process puts the heavy compute involved with transformation before loading the result into a file system, database, or data warehouse. This style often isn’t as performant as an ELT process, because data for each batch or stream is often required from dependent or related systems. On each execution, you would have to re-query data from those systems, adding extra load to them and additional time waiting for the data to be available.
However, in cases where simple transformations are being applied to a single source of data, ETL can be more appropriate, as it reduces the complexity of your system, potentially at the cost of data enablement.
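For contrast, a minimal ETL sketch (again with hypothetical column names and a hypothetical warehouse path) transforms a single extract in flight and loads only the finished result, which is simpler but leaves no raw copy behind for other consumers.

```python
import pandas as pd

WAREHOUSE_PATH = "s3://warehouse/cleaned_purchases.parquet"  # hypothetical target

def etl_pipeline(purchases: pd.DataFrame) -> pd.DataFrame:
    """Transform a single extract before loading; no raw copy is retained."""
    # Transform: a simple cleanup applied to one source of data.
    transformed = (
        purchases
        .dropna(subset=["purchase_order_id"])
        .assign(total_cost=lambda df: df["unit_cost"] * df["quantity"])
    )

    # Load: only the finished result lands in the warehouse, so any new
    # consumer of the raw data must re-query the source system.
    transformed.to_parquet(WAREHOUSE_PATH, index=False)
    return transformed
```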
The general recommendation is to use ELT processes, when possible, to increase data performance, availability, and enablement.
Performance
For a data engineer, it’s not enough for data to be correct and available; it must also be delivered performantly. When processing gigabytes, terabytes, or even petabytes of data, processes and checks must be put in place to ensure that data meets service level agreements (SLAs) and adds value to the business as quickly as possible.
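One lightweight way to make an SLA verifiable is to time each pipeline run and flag any run that exceeds the agreed window. The sketch below is illustrative: the 15-minute threshold and the logging-based alert are assumptions, and a real setup would typically emit metrics or page on-call instead.

```python
import logging
import time
from contextlib import contextmanager

SLA_SECONDS = 15 * 60  # assumed SLA: the pipeline must finish within 15 minutes

@contextmanager
def sla_timer(pipeline_name: str, sla_seconds: float = SLA_SECONDS):
    """Time a pipeline run and log a warning if it breaches its SLA."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        if elapsed > sla_seconds:
            logging.warning("%s breached its SLA: %.1fs > %.1fs",
                            pipeline_name, elapsed, sla_seconds)

# Usage (run_pipeline is a placeholder for your actual pipeline entry point):
# with sla_timer("pricing_pipeline"):
#     run_pipeline()
```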
Example
Imagine your company is an airline that prices flights based on inputs from a variety of different systems. If your price is too high, customers will book with other airlines. If your price is too low, your profit margins take a hit.
Suddenly, there’s a blockage in the Suez Canal, and freighters hauling oil cannot make it out of Saudi Arabia, disrupting the global supply chain and driving the price of oil and gas up. Commercial airplanes use a lot of fuel, to the tune of almost 20 billion gallons a year. This is going to dramatically affect the cost to operate your business and should be reflected as fast as possible in your pricing.
In order for this to happen, data engineers have to design and implement data pipelines that are efficient and performant.
Continuous Integration and Continuous Delivery
Code is never a “set it and forget it” type of solution. Data governance requirements, tooling, best practices, security procedures, and business requirements are constantly changing and adapting; your production environment should be able to keep pace.
This means that deployments need to be automated and verifiable. Older styles of software deployment frequently amounted to running a build, copying the result onto a production server, and performing a manual “smoke test” to see whether the application worked as expected.
This does not scale and introduces risk to your business.
If you’re testing live in a production environment, any bugs or issues you missed in testing (or any environment-specific influences on your code) will result in a poor customer experience, because those errors are presented directly to the end user. The best practice for promoting code is to put automated processes in place that verify the code works as expected in different scenarios. This is frequently done with unit and integration tests.
Unit tests verify that an individual piece of code, given a set of inputs, produces the expected outputs, independently of the other code that uses it. They are valuable for verifying complex logic in isolation and for providing proof that the code behaves as expected.
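For example, a unit test suite for a small pricing helper might look like the following pytest-style sketch. The function, its surcharge rule, and the thresholds are hypothetical and exist only to show the shape of the tests.

```python
import pytest

def apply_fuel_surcharge(base_fare: float, fuel_price_per_gallon: float) -> float:
    """Hypothetical pricing helper: add a flat surcharge when fuel is expensive."""
    if base_fare < 0 or fuel_price_per_gallon < 0:
        raise ValueError("inputs must be non-negative")
    surcharge = 25.0 if fuel_price_per_gallon > 3.0 else 0.0
    return base_fare + surcharge

def test_surcharge_applied_when_fuel_is_expensive():
    assert apply_fuel_surcharge(100.0, 3.50) == 125.0

def test_no_surcharge_when_fuel_is_cheap():
    assert apply_fuel_surcharge(100.0, 2.00) == 100.0

def test_negative_inputs_are_rejected():
    with pytest.raises(ValueError):
        apply_fuel_surcharge(-1.0, 2.00)
```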
One level up from unit testing is integration testing, which verifies that pieces of code work together and produce the expected outputs for a given set of inputs. This is often the more critical layer of testing, as it confirms that systems integrate with each other as expected.
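An integration test, by contrast, exercises several pieces together. In the sketch below, an in-memory SQLite database stands in for the real warehouse, and the load and read functions are hypothetical pipeline steps; the point is that the test verifies they work end to end.

```python
import sqlite3

def load_prices(conn: sqlite3.Connection, prices: list) -> None:
    """Hypothetical load step: write priced routes into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS prices (route TEXT, fare REAL)")
    conn.executemany("INSERT INTO prices VALUES (?, ?)", prices)
    conn.commit()

def read_prices(conn: sqlite3.Connection) -> list:
    """Hypothetical read step used by the pricing service."""
    return conn.execute("SELECT route, fare FROM prices ORDER BY route").fetchall()

def test_prices_round_trip_through_the_warehouse():
    conn = sqlite3.connect(":memory:")  # stands in for the real database
    load_prices(conn, [("JFK-LHR", 525.0), ("SFO-NRT", 780.0)])
    assert read_prices(conn) == [("JFK-LHR", 525.0), ("SFO-NRT", 780.0)]
```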
By combining unit tests and integration tests with modern deployment strategies such as blue-green deployments, the probability that new code impacts your customers and your business is significantly reduced. Everything is validated against the established tests before changes are promoted to an environment.
Disaster Recovery
Many businesses focus on providing as much value to their customers as quickly as possible, but it’s also critical to ensure that you have a plan in the event of a system failure. While many companies rely heavily on cloud providers to minimize downtime and guarantee SLAs, failure will inevitably happen. This means that systems must be designed to tolerate a critical system failure.
Disaster recovery in data engineering is generally measured by two metrics:
- Recovery Time Objective (RTO)
- Recovery Point Objective (RPO)
In a disaster recovery scenario, businesses need standards in place to understand the impact on their customers and how long their systems will be unavailable. Data engineers are responsible for putting processes in place to ensure that data pipelines, databases, and data warehouses meet these metrics.
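As a hedged sketch of how these metrics translate into day-to-day checks, a data engineer might monitor replication lag against the agreed RPO and measure time to restore against the RTO. The five-minute and one-hour targets below are assumptions, not standards.

```python
from datetime import datetime, timezone

RPO_SECONDS = 5 * 60    # assumed target: lose no more than 5 minutes of data
RTO_SECONDS = 60 * 60   # assumed target: restore service within 1 hour

def rpo_at_risk(last_replicated_at: datetime) -> bool:
    """True if replication lag already exceeds the recovery point objective."""
    lag = (datetime.now(timezone.utc) - last_replicated_at).total_seconds()
    return lag > RPO_SECONDS

def rto_breached(outage_started_at: datetime, restored_at: datetime) -> bool:
    """True if the time taken to restore service exceeded the recovery time objective."""
    downtime = (restored_at - outage_started_at).total_seconds()
    return downtime > RTO_SECONDS
```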
Example
Imagine your company is an airline and you need to provide customers with the ability to book flights, but suddenly your data center explodes. Your business has established a data sync process to replicate data to another data center, but that process was interrupted and data loss has occurred. You need to re-establish the primary database for your application suite from the replicated database. The RPO represents how much data is lost in the cutover, and the RTO represents how long customers are unable to book flights.
Data engineers frequently have to evaluate, design, and implement systems to minimize impact to customers in the event of failure.