Data engineers are often called the backbone of data science and engineering. The label undersells them: far from being second-class citizens, they usually hold up a business’s entire data infrastructure. Modern businesses are built and scaled by analyzing data, building predictive models, and shipping software, but none of that is possible without good data to begin with. That’s where data engineers come in.
Data engineers are responsible for ingesting data in a way that is reliable, timely, and accurate. In more technical terms, the responsibilities of data engineers encompass a set of operations aimed at developing interfaces and processes for the flow and access of information.
There’s a cultural shorthand that’s applicable here. You’ve probably heard it: garbage in, garbage out. A data analyst or data scientist’s ability to derive value is limited by the quality of the data itself. The data engineer ensures the quality of that data. Since no company wants to generate insights based on garbage data, data engineering is one of the fastest-growing jobs, with 50% year-over-year growth in 2019.
This article is going to cover everything that you need to know about the role of a data engineer, including its typical roles and responsibilities, how it differs from a data scientist, the required skills of a data engineer, and some interview questions.
Roles and Responsibilities
Regardless of the domain they’re working in (finance, social media, etc.), data engineers have common responsibilities that combine the knowledge and skills of information technology, engineering, and databases.
- Data Pipeline Testing/Maintenance: Data engineers test the performance and efficiency of each part of a system or model during the initial phase. Moreover, they collaborate with the testing team to make sure everything is running smoothly.
- Designing Data Platform Architecture: At its core, the data engineering role is about designing the architecture of a data platform.
- Provide Data-Access Tools: In the majority of cases, data-access tools are not needed, as data scientists can use data lakes to fetch or extract data from storage. But if the enterprise requires business intelligence (BI) for analysts and other users, data engineers set up the tools to create visuals, observe data, and generate reports.
- Development of Data Instances: Data engineers use their coding skills to build, customize, and oversee integration tools, warehouses, databases, and analytical systems.
- Handle Data and Metadata: Data stored in a warehouse may be structured or unstructured, and storage systems also contain metadata. A data engineer is responsible for managing all of the data stored there and for structuring it properly through database management systems (DBMS).
- Monitoring the Stability of Pipelines: Managing the overall performance, efficiency, and stability of the system is essential, as is cleaning the warehouse at regular intervals. The automated parts of the system also need to be tracked and upgraded as the data and requirements change.
The responsibilities of a data engineer differ according to their team size, platform size, the complexity of the projects, and seniority level. In some enterprises, the roles of a data engineer and data scientist may be much more detailed and granular.
Difference Between a Data Engineer and a Data Scientist
There is a major overlap between data scientists and data engineers when it comes to skills and responsibilities, but their primary difference is their focus. The common role of a data engineer is to build infrastructure and architecture for data generation and cleaning. Data scientists, meanwhile, perform advanced statistical analysis on that generated data.
Data scientists constantly interact with the data infrastructure built and maintained by data engineers, but they do not primarily engage in building and maintaining that infrastructure. Rather, they are internal clients, responsible for conducting market and business operations research to determine trends and relationships. To do this, data scientists require sophisticated machines and mechanisms to communicate with and act upon data.
Conversely, data engineers complement data scientists and support analysts by providing them with infrastructure and tools that can be leveraged to deliver strategic solutions to business problems. They build scalable, highly-performing infrastructure for:
- providing business insights from unstructured data
- executing analytical projects with a focus on gathering, analyzing, and visualizing data
- developing real-time analytical solutions
In short, data scientists require the support of data engineers. While data scientists work with advanced analytical tools such as Hadoop, statistical modeling, SPSS, and Python/R, data engineers are focused on the tools that support them. For instance, a data engineer’s tools may consist of SQL, NoSQL, MySQL, Cassandra, and other data infrastructure services.
Required Skills of a Data Engineer

A data engineer’s skills map directly to their responsibilities. Skill sets differ from role to role, since data engineers do a vast array of things, but the skills below are ones most data engineers need to master.
In the majority of cases, data engineers use particular tools and software to design and develop data storage. They’re responsible for storing both structured and unstructured data, which need to be connected to analytical interfaces.
SQL databases are the tools data engineers most commonly use to facilitate these transactions; SQL is the standard language for developing and managing relational database systems. Other tools such as Redshift, Talend, and Informatica are used to develop large distributed data stores (NoSQL), build cloud warehouses, or deploy data into data platforms.
NoSQL databases, on the other hand, are non-tabular and come in different types based on their data model, such as graph or document stores. Data engineers must know how to work with database management systems (DBMS) — the software that provides an interface to databases for data storage and retrieval.
The data model is an important part of the data engineering pipeline. Data modeling is the process of converting a sophisticated software design into a diagram that anyone can comprehend, using text and symbols to represent the flow of data. During the analysis and design phase, data models are built to make sure that the requirements of a new application are completely understood. These models can also be revised later in the data lifecycle to justify data designs that were developed by programmers on an ad-hoc basis.
A data engineer should have knowledge of entity-relationship modeling, along with a clear understanding of normalization and denormalization tradeoffs. Moreover, they should be familiar with dimensional modeling and other related concepts.
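As a small illustration of the normalization/denormalization tradeoff, here is a sketch using Python’s built-in sqlite3 module with made-up table names: the normalized design stores each customer’s name once and joins at read time, while the denormalized alternative repeats the name on every order row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Normalized: customer details live in exactly one place.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    total       REAL NOT NULL
);
-- Denormalized alternative: repeat the name on every order row,
-- trading update anomalies for fewer joins at read time.
CREATE TABLE orders_denorm (
    order_id      INTEGER PRIMARY KEY,
    customer_name TEXT NOT NULL,
    total         REAL NOT NULL
);
""")

cur.execute("INSERT INTO customers VALUES (1, 'Ada')")
cur.execute("INSERT INTO orders VALUES (10, 1, 42.0)")

# Reading the normalized design requires a join.
row = cur.execute("""
    SELECT c.name, o.total
    FROM orders o JOIN customers c USING (customer_id)
""").fetchone()
print(row)  # ('Ada', 42.0)
```

Which shape is right depends on the workload: write-heavy systems usually favor the normalized form, while read-heavy analytical tables often denormalize.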
Data engineers and data scientists often work hand in hand, which is why a strong understanding of algorithms, data transformation methods, and data modeling is required to work with data platforms. Data engineers are responsible for developing ETL (extract, transform, load) processes, analytical tools, and storage tools and software. Thus, expertise with existing ETL and BI solutions is a core requirement.
ETL refers to how data is taken (extracted) from a data source, converted (transformed) into a format that can be easily analyzed, and stored (loaded) into a data warehouse. ETL tools typically use batch processing to help users examine data relevant to a particular business problem: they extract data from various sources, apply rules based on the business requirements, and then load the converted data into a database or BI platform where anyone in the organization can use and view it.
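The extract/transform/load steps above can be sketched in miniature. This is a toy illustration, not a production pipeline: the source lines, business rules, and table are invented, and the “warehouse” is an in-memory SQLite database.

```python
import sqlite3

# Hypothetical source: raw, semi-structured log lines.
raw_events = [
    "2024-01-05|signup|Alice",
    "2024-01-05|purchase|Bob",
    "bad line",  # malformed rows are common in real sources
]

def extract(lines):
    """Extract: pull records from the source, skipping malformed rows."""
    for line in lines:
        parts = line.split("|")
        if len(parts) == 3:
            yield parts

def transform(records):
    """Transform: apply business rules -- here, normalize casing."""
    for day, event, user in records:
        yield (day, event.upper(), user.lower())

def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (day TEXT, event TEXT, user_name TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(raw_events)), conn)
loaded = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(loaded)  # 2  (the malformed line was dropped during extraction)
```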
Data engineers focus mostly on data modeling and data architecture, but a basic knowledge of algorithms and data structure is also needed for understanding the big picture of the enterprise’s overall data function, along with defining checkpoints and end goals for the business problem.
Of particular importance is the data engineer’s ability to develop inexpensive methods for the transfer of large amounts of data. If you’re responsible for a database with potentially millions (let alone billions) of records, it’s important to find the most efficient solution, since at that scale, differences of milliseconds add up quickly and even minor errors can mean an hours-long difference in computation.
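One common efficiency technique here is batching: moving rows in fixed-size chunks with one transaction per chunk, instead of committing row by row. A minimal sketch, assuming an in-memory SQLite database and simulated source data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")

def chunked(iterable, size):
    """Yield fixed-size batches so memory use stays bounded."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

rows = ((i, i * 0.5) for i in range(100_000))  # simulated large source
for batch in chunked(rows, 10_000):
    # One transaction per batch: far fewer commits than row-at-a-time inserts.
    with conn:
        conn.executemany("INSERT INTO readings VALUES (?, ?)", batch)

total = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(total)  # 100000
```

At real scale the same idea shows up as bulk-load commands, partitioned copies, and tuned batch sizes; the chunking pattern is the transferable part.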
Data Engineering Interview Questions
In this section, we’re going to highlight three of the most common types of questions for the data engineering role from our bank of real interview questions on Interview Query: Database Design, ETL, and SQL. In addition to these highlighted types, you can also expect questions concerning algorithms and sometimes Python added into the mix.
Database Design

This type of question involves abstractly thinking through the requirements of a given system. For instance, you might be asked to develop a database schema for storing customer address information for a large company. Such a question might sound simple: you create the relevant fields for address, city, state, and zip code, and the database is ready to go.
But then the interviewer tells you that the company wants to keep track of their customers’ moving history. Now you’ve added the dimension of time to the database and things will have to be radically restructured. In theory, your database(s) have to be able to show that one customer has lived where another customer is currently living.
This type of question is often difficult to answer because of its scope relative to the time you have to answer the question. You may have a lot of great potential ideas, but only fifteen minutes to answer. So how do you narrow the scope of a system design question? One way is to ask the interviewer clarifying questions to shrink the question’s scope.
For instance, in our address example above, you might ask the interviewer why the company wants to track its customers’ moving history. If they’re trying to do location-based analytics, you can design your database one way. If they’re trying to track changes in purchase habits based on how recently a customer has moved, you can design it another. The important thing is to clarify, so that you can spend your time in the interview as effectively as possible.
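One possible design for the moving-history variant (an assumption for illustration, not the only correct answer) is an address-history table with validity dates, so a move closes out one row and opens another. Sketched with Python’s sqlite3 and invented column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE customer_addresses (
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    street      TEXT NOT NULL,
    city        TEXT NOT NULL,
    state       TEXT NOT NULL,
    zip_code    TEXT NOT NULL,
    valid_from  TEXT NOT NULL,   -- date the customer moved in
    valid_to    TEXT             -- NULL means current address
);
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO customer_addresses VALUES (1, ?, ?, ?, ?, ?, ?)",
    [
        ("1 Old Rd", "Austin", "TX", "78701", "2019-01-01", "2022-06-30"),
        ("9 New St", "Denver", "CO", "80202", "2022-07-01", None),
    ],
)

# The current address is the row whose validity is still open.
current = conn.execute(
    "SELECT city FROM customer_addresses "
    "WHERE customer_id = 1 AND valid_to IS NULL"
).fetchone()
print(current)  # ('Denver',)
```

Because old rows are never overwritten, the history query the interviewer asked for (who lived where another customer now lives) becomes a self-join on street/city/zip across date ranges.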
Here’s a sample interview question from our question bank on Interview Query. This question came up in an interview at Dropbox:
Click Data Schema
How would you create a schema to represent client click data on the web?
Here’s a hint:
These types of questions are more architecture-based and are generally given to test experience with developing databases, setting up architectures, and, in this case, representing client-side tracking in the form of clicks.
What exactly does click data on the web mean? Any button click, scroll, or other interaction with the client interface (in this case, desktop) would be represented in a schema for the end user to query. This does not, however, include client views.
A simple but effective schema design would be to first represent each action with a specific label: assign each click event a name or label describing its specific action.
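That labeling idea can be sketched as a single events table in which every click is one row tagged with an action name. The table and column names here are hypothetical, not part of the official solution:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE click_events (
    event_id   INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    action     TEXT NOT NULL,      -- e.g. 'signup_button_click'
    page_url   TEXT NOT NULL,
    created_at TEXT NOT NULL       -- event timestamp
)
""")
conn.executemany(
    "INSERT INTO click_events (user_id, action, page_url, created_at) "
    "VALUES (?, ?, ?, ?)",
    [
        (7, "signup_button_click", "/home", "2024-01-05T10:00:00"),
        (7, "share_button_click",  "/docs", "2024-01-05T10:02:00"),
    ],
)

# Analysts can then aggregate by action label.
counts = conn.execute(
    "SELECT action, COUNT(*) FROM click_events GROUP BY action ORDER BY action"
).fetchall()
print(counts)  # [('share_button_click', 1), ('signup_button_click', 1)]
```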
View the full solution to this question on Interview Query.
For more information on the database design interview, check out Database Design Interview Questions today.
ETL (Extract, Transform, and Load)
ETL processes constitute a large portion of a data engineer’s responsibilities on the job. Essentially, we want to be able to take data from one place (Extract), change it in some way (Transform), and then put that data somewhere else (Load). This type of question will assess your familiarity with the challenges native to ETL processes in order to determine whether you’re a good fit for the data engineering role.
For example, you might be asked how you would perform an ETL process that seeks to transform an unstructured data source into a structured relational database. During this portion of the interview, you’ll want to be more liberal with your questions: what sort of data are you going to be moving? How often will the ETL process be performed? Once you have a more nuanced understanding of the requirements of the question, answering using ETL fundamentals is a pretty straightforward task.
Here’s a sample question from our bank of real interview questions on Interview Query. This question was originally asked by Amazon:
Modifying a billion rows
Let's say you have a table with a billion rows.
How would you add a column inserting data from the original source without affecting the user experience?
Here’s a hint:
Before jumping into the question we should remember to clarify a few details that we can potentially get out of the interviewer. It helps to ask questions to understand and show that you can think holistically about the problem. Rushing too fast into a solution is a red flag for many interviewers.
Given the problem statement, let's clarify.
1. What is the potential impact of downtime?
We always have to remember to get more context out of the question. A table with a billion rows could be powering an entire company, in which case downtime would cost thousands of dollars in sales, or it could be an offline analytics events table whose downtime only affects internal employees.
It's crucial to discuss with the interviewer what the potential effects of downtime would be over seconds, minutes, and hours. Figuring out the impact is pertinent to determining our strategy going forward.
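Once the impact is understood, one widely used strategy (sketched here with a small SQLite table standing in for the billion-row one) is to add the column as nullable, which is cheap in most engines, and then backfill it in small keyed batches with a commit between batches, so readers are never locked out for long:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO big (payload) VALUES (?)", [("x",)] * 10_000)

# Step 1: adding a nullable column is typically metadata-only work.
conn.execute("ALTER TABLE big ADD COLUMN source_value TEXT")

# Step 2: backfill in keyed batches, committing between them.
BATCH = 1_000
last_id = 0
while True:
    with conn:  # one short transaction per batch
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM big WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH),
        )]
        if not ids:
            break
        conn.executemany(
            "UPDATE big SET source_value = 'filled' WHERE id = ?",
            [(i,) for i in ids],
        )
        last_id = ids[-1]

remaining = conn.execute(
    "SELECT COUNT(*) FROM big WHERE source_value IS NULL"
).fetchone()[0]
print(remaining)  # 0 -- every row has been backfilled
```

In the interview, the batch size and commit cadence would be tuned against the downtime tolerance you clarified in step 1.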
You can view the full solution to this ETL question on Interview Query.
SQL

As mentioned above, SQL is the most commonly used query language for data engineers. A SQL question will generally test a candidate’s basic (or advanced) competency in the language. In on-site interviews, SQL questions often involve whiteboarding and will not involve directly interacting with a code interpreter.
The reason for this trend in the SQL interview is that SQL queries involving large amounts of data are computationally expensive, so there is a high cost associated with incorrect SQL queries on the job. In other words, it’s very important to be able to get your SQL right on the very first try, since getting it wrong means potentially hours of wasted time.
While a data engineer may be pulling queries of a different nature than their coworkers’, companies still want to know that the person they’re putting in charge of data flow understands how to formulate a correct, efficient query. You may also want to brush up on your Data Definition Language while you’re practicing SQL, too, since you never know when an interviewer is going to hit you with a curveball. (Do you know how to make your primary key auto-increment?)
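On that curveball: the syntax for auto-incrementing primary keys varies by engine. A quick sketch using SQLite (MySQL spells it AUTO_INCREMENT; Postgres uses SERIAL or IDENTITY columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In SQLite, INTEGER PRIMARY KEY AUTOINCREMENT assigns ids automatically.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)
conn.execute("INSERT INTO users (name) VALUES ('ada')")
conn.execute("INSERT INTO users (name) VALUES ('lin')")

rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)  # [(1, 'ada'), (2, 'lin')] -- ids were never supplied explicitly
```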
Here’s a SQL question from our bank of real interview questions on Interview Query. This question was asked in an interview at Facebook:
You're given a table that represents search results from searches on Facebook. The query column is the search term, position column represents each position the search result came in, and the rating column represents the human rating of the search result from 1 to 5 where 5 is high relevance and 1 is low relevance.
1. Write a query to compute a metric to measure the quality of the search results for each query.
2. You want to be able to compute a metric that measures the precision of the ranking system based on position. For example, if the results for dog and cat are....
| query | result_id | position | rating | notes |
| --- | --- | --- | --- | --- |
| dog | 1000 | 1 | 2 | picture of hotdog |
| cat | 123 | 1 | 4 | picture of cat |
...we would rank 'cat' as having a better search result ranking precision than 'dog' based on the correct sorting by rating.
Write a query to create a metric that can validate and rank the queries by their search result precision. Round the metric (avg_rating column) to 2 decimal places.
Here’s a hint:
The ratings of each search result (1 to 5 where 5 is high relevance and 1 is low relevance) could be used to compute a metric that measures the quality of the search results for each query.
For example, if the search query for 'tiger' has 5s for each result, then that would average to be a perfect result.
Therefore, taking the average of the ratings seems like a good way to measure the quality of the search results.
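That averaging approach translates directly into SQL. The sketch below runs against a tiny invented dataset in SQLite; the result_id column name is an assumption about the table’s schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search_results "
    "(query TEXT, result_id INTEGER, position INTEGER, rating INTEGER)"
)
conn.executemany(
    "INSERT INTO search_results VALUES (?, ?, ?, ?)",
    [
        ("dog", 1000, 1, 2),
        ("dog", 998, 2, 4),
        ("cat", 123, 1, 4),
        ("cat", 124, 2, 5),
    ],
)

# Average rating per query, rounded to 2 decimal places, best first.
rows = conn.execute("""
    SELECT query, ROUND(AVG(rating), 2) AS avg_rating
    FROM search_results
    GROUP BY query
    ORDER BY avg_rating DESC
""").fetchall()
print(rows)  # [('cat', 4.5), ('dog', 3.0)]
```

Note this first metric ignores position entirely; part 2 of the question asks for a position-aware precision metric, which is where the follow-up work lies.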
Solve this interview question in our on-site SQL Editor on Interview Query.
The Bottom Line
Over the last few years, demand for data engineers has grown exponentially. Companies of all sizes are actively searching for data engineers to address their data challenges, and the field is far from saturated. The skill set is valuable, and those who gain some experience will have the opportunity to earn well. Even if you don’t yet have the right amount of experience, a relevant certification can help you land the job.
Thanks for Reading
We hope that you found the article helpful. If you did, subscribe to the Interview Query blog to receive weekly updates and new articles to keep you up-to-date on the state of the data science profession and loaded with tips and tricks to keep you one step ahead of the competition.
If you're currently in the process of interviewing for a job in the data science field, or just looking to brush up on your DS skills, consider signing up for a free account on Interview Query for access to our question bank filled with real interview questions from companies like Google, Facebook, Amazon, and more. Our questions are sorted by question type, position, and difficulty to make sure that you're practicing exactly what you need to practice to level up your skills and ace your next interview.
Finally, if you're looking for an all-inclusive starter course for data science, look no further than the Data Science Course on Interview Query. Covering subjects ranging from Product Intuition to Machine Learning to SQL, our course will bring you up to date on the things you need to know to become a data scientist.
Take the next step in your data science career. Try Interview Query today.