Data modeling is an integral step in database design; it seeks to create high-level representations of how data flows within a specific database or project.
Typically, modeling is kept both relatively simple and visualization-heavy (often in flowchart form). The aim is to help companies optimize their databases for a particular purpose while assisting data engineers in creating databases that uphold a high degree of security and data integrity.
Data modeling is a field that relies on intuition, logical reasoning, and the ability to survey your project from a bird's-eye view. Most data modeling interviews, especially scenario-based questions, focus on assessing your ability to predict and detect edge cases, then create guardrails for those cases while balancing efficiency.
Here is what you should expect in a data modeling interview: theory questions, scenario-based questions, and SQL questions.
Demonstrating a strong grasp of the underlying theory and concepts of data modeling and design does not, by itself, make you the standout applicant. Companies treat that knowledge as the bare minimum when deciding whether someone is worth inviting back for a second round of interviews.
As such, studying conceptual data modeling material is integral to earning callbacks, but it cannot be the end of your preparation. Below is a list of theory-based data modeling questions:
1. What is data modeling?
Data modeling is a visual representation of data flow, policies, internal information system architecture, and how they each relate to other objects, structures, and data points.
Data modeling revolves around the needs of the company, typically expanding and contracting as initiatives start and finish. Moreover, the methods of modeling data follow specific rules in order to properly assess and accurately design database systems, which are then implemented down to the last detail.
2. What are the four different types of data modeling?
3. What are the steps of data modeling?
Data modeling, as an approach, requires extensive and rigorous planning in order to convey business requirements accurately and give stakeholders a clear picture of how data flows.
Although approaches differ, most data modeling methods tend to trace the following steps:
4. What is denormalization in data modeling and how does it differ from normalization?
Normalization and denormalization are database modeling methods that sit on opposite sides of the "time vs. space" trade-off. If you have some background working with algorithms, you have likely encountered Big O notation, which expresses an algorithm's time and space complexity.
Typically, time and space have an inverse relationship: allowing an algorithm to use more memory tends to make it run faster. How that trade-off plays out for your particular workload determines whether normalization or denormalization is the better fit.
Normalization focuses on removing redundant data to maximize space efficiency. At big data scale, those savings can range from minuscule to massive (and can even reduce operating costs).
However, because the data is split across more tables, queries typically require more joins and run slower. Denormalization deprioritizes space in order to speed up queries and joins: by allowing controlled duplication of data, it makes queries and joins simpler and faster to execute.
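To make the contrast concrete, here is a minimal sketch using hypothetical Customers and Orders tables (the names and columns are illustrative, not taken from any particular system). The normalized design stores customer details once and reaches them through a join; the denormalized design copies them onto every order so reads need no join at all:

-- Normalized: customer details are stored once and referenced by key.
CREATE TABLE Customers (
    CustomerID INT NOT NULL,
    CustomerName VARCHAR(255) NOT NULL,
    City VARCHAR(255),
    PRIMARY KEY (CustomerID)
);

CREATE TABLE Orders (
    OrderID INT NOT NULL,
    CustomerID INT NOT NULL,
    OrderDate DATE NOT NULL,
    PRIMARY KEY (OrderID),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

-- Reading an order's customer details requires a join.
SELECT o.OrderID, c.CustomerName, c.City
FROM Orders o
JOIN Customers c ON c.CustomerID = o.CustomerID;

-- Denormalized: customer details are duplicated on every order,
-- trading extra storage (and update effort) for join-free reads.
CREATE TABLE OrdersDenormalized (
    OrderID INT NOT NULL,
    CustomerID INT NOT NULL,
    CustomerName VARCHAR(255) NOT NULL,
    City VARCHAR(255),
    OrderDate DATE NOT NULL,
    PRIMARY KEY (OrderID)
);

SELECT OrderID, CustomerName, City
FROM OrdersDenormalized;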
5. What is data governance, and why is it important?
Data governance is a set of self-determined rules and policies for handling your database system. It is an integral part of data modeling and, in a sense, defines your database’s philosophy and approach.
An important thing to remember about data governance is that these so-called "self-determined" rules should not be made in a vacuum: they should still comply with industry standards and remain compatible with the other technologies you plan to use over the database's life cycle.
Data governance also provides structure to the policies that determine how data is handled throughout its life cycle.
Conceptual theory is essential, but the bread and butter of an interview is the interviewer's chance to check your practical skills.
This section tackles your ability to think quickly and generate practical, efficient, and intelligent solutions to the following problems.
One great way to approach scenario-based questions is to ask your interviewer for the specifics of the situation. We do this in order to create a data model that aligns perfectly with their business goals and requirements; the more we know about the problem, the better the solution can be designed.
We can always chase the theoretically optimal design, but sometimes the best solution is relative to the context. Moreover, the technically "best" option may come with overhead costs the business does not need.
Below, we list vital questions to ask your interviewer to have a better view of your project:
This question is crucial because the response gives you a good grasp of which design decisions to make. Equally important, it helps determine the degree of normalization to employ within the data model.
Knowing how fast and how frequently we need data analytics helps us identify how much emphasis to put on our OLAP design. This is a case where the technically "best" option may not be the right solution for the problem at hand.
Building the swiftest possible pipeline is achievable but may not be necessary at all. If we haphazardly choose the fastest solution without considering the business needs, we may go over budget and over schedule, because complex, fast ETL pipelines require high upfront costs and ongoing overhead to put into action.
For example, if the requirement is a relatively simple report once a quarter, other solutions will likely suffice.
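As a rough sketch of that point, a quarterly report can often be served by a single scheduled aggregate query rather than a real-time pipeline. The Sales table and its columns below are hypothetical, and DATE_TRUNC is PostgreSQL-style syntax:

-- A once-a-quarter revenue summary: one aggregate query, run on a schedule,
-- is often sufficient; no low-latency ETL pipeline is required.
SELECT
    DATE_TRUNC('quarter', OrderDate) AS SalesQuarter,
    SUM(OrderTotal) AS Revenue,
    COUNT(*) AS OrderCount
FROM Sales
GROUP BY DATE_TRUNC('quarter', OrderDate)
ORDER BY SalesQuarter;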
Here, we can help design schemas optimized purely for querying data rather than modifying it. Or, if user input is integral to the solution, we can reach a compromise and design schemas that handle both reading and writing data.
For example, if we need fast performance in looking up data, we can create a data model that tends towards a denormalized approach. However, if we modify, insert, or delete data frequently, we may want a more normalized system.
Datasets can give you essential insights into the data you will be modeling and help you tailor the model to business needs. In addition, having a clear-cut dataset can help you visualize your data model.
If the interviewer does not give out a specific dataset, go ahead and try to look one up online, or check out this article which contains 83 free datasets for your project.
Lastly, to formulate and visualize your data models, you will need to create both the desired dimensions and their respective hierarchies based on your data and specifications.
Other questions you should be asking include the following:
Data modeling is critical to building a database from the ground up. More often than not, interviewers will also ask questions that require knowledge of SQL concepts.
Despite NoSQL systems slowly gaining adoption (e.g., MongoDB and ScyllaDB), SQL remains king when building robust relational database management systems (RDBMS).
Below are key SQL questions that will likely appear in your data modeling interviews:
A primary key in SQL is a constraint that limits the occurrence of a value within a table to exactly once. Moreover, NULL values are not allowed. In effect, the primary key constraint combines the NOT NULL and UNIQUE constraints.
You can create a table with a single field as a primary key using the following code:
CREATE TABLE Employees (
    ID INT NOT NULL,
    FirstName VARCHAR(255) NOT NULL,
    LastName VARCHAR(255),
    PRIMARY KEY (ID)
);
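For illustration, attempting to insert two rows with the same ID into this table fails, since the primary key enforces uniqueness (the sample values are arbitrary):

INSERT INTO Employees (ID, FirstName, LastName) VALUES (1, 'Ada', 'Lovelace');

-- This second insert is rejected with a primary key (unique) violation,
-- because an employee with ID = 1 already exists.
INSERT INTO Employees (ID, FirstName, LastName) VALUES (1, 'Grace', 'Hopper');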
To enforce a primary key on two fields, do the following:
CREATE TABLE Employees (
    ID INT NOT NULL,
    FirstName VARCHAR(255) NOT NULL,
    LastName VARCHAR(255),
    CONSTRAINT PK_Employees PRIMARY KEY (ID, FirstName)
);
In SQL, as in data modeling more broadly, the concept of relations is crucial. From reviewing table relationships in SQL to mapping the connections between elements and their attributes in a data model, relations play a considerable part in simplifying these interactions into a unified schema.
Relations are typically represented in the following forms:
- One-to-one: each row in one table corresponds to at most one row in another.
- One-to-many: a row in one table can be linked to many rows in another, such as one customer with many orders.
- Many-to-many: rows in each table can be linked to many rows in the other, usually through a junction table.
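As a brief sketch of the many-to-many case, hypothetical Students and Courses tables can be linked through a junction table whose composite primary key mirrors the two-column key shown above:

CREATE TABLE Students (
    StudentID INT NOT NULL,
    StudentName VARCHAR(255) NOT NULL,
    PRIMARY KEY (StudentID)
);

CREATE TABLE Courses (
    CourseID INT NOT NULL,
    CourseName VARCHAR(255) NOT NULL,
    PRIMARY KEY (CourseID)
);

-- Junction table: each row links one student to one course, so a student
-- can take many courses and a course can have many students.
CREATE TABLE Enrollments (
    StudentID INT NOT NULL,
    CourseID INT NOT NULL,
    PRIMARY KEY (StudentID, CourseID),
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID),
    FOREIGN KEY (CourseID) REFERENCES Courses(CourseID)
);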
OLAP and OLTP are often confused because of their similar names. Nevertheless, they are systems with opposite philosophies. OLAP and OLTP are not mutually exclusive; rather, they work together to create an efficient database architecture.
OLAP stands for OnLine Analytical Processing, while OLTP stands for OnLine Transaction Processing.
OLTP deals with fast systems: processing transactions, recording and capturing each detail, and enforcing built-in data guards that ensure data integrity. Because OLTP works in a high-volume, fast-paced environment, it needs to process and store information rapidly while minimizing heavy processing.
A real-life application of OLTP occurs during online shopping, when a database records the user ID, order information, and other relevant data. OLTP quickly generates accurate records and avoids conflicting writes to time-sensitive information, such as in-store stock counts.
On the other hand, an OLAP system uses the information captured by the OLTP system to generate insights that drive business decisions. While OLTP is built for speed, OLAP, with its analytics-oriented approach, takes more time to produce accurate insights from the data warehouse.
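To illustrate the contrast, here is a hedged sketch using hypothetical Orders and Inventory tables (PostgreSQL-style transaction syntax). The OLTP side is a small, fast transaction that records a single purchase; the OLAP side is a heavier analytical query over historical data, typically run against a data warehouse rather than the live system:

-- OLTP-style workload: a short transaction recording one purchase.
BEGIN;
INSERT INTO Orders (OrderID, CustomerID, ProductID, Quantity, OrderDate)
VALUES (1001, 42, 7, 2, CURRENT_DATE);
UPDATE Inventory
SET StockCount = StockCount - 2
WHERE ProductID = 7;
COMMIT;

-- OLAP-style workload: an aggregate query over a year of order history.
SELECT ProductID,
       SUM(Quantity) AS UnitsSold
FROM Orders
WHERE OrderDate >= DATE '2023-01-01'
GROUP BY ProductID
ORDER BY UnitsSold DESC;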