Data Modeling Interview Questions

Overview

Data modeling is an integral step in database design that seeks to create high-level representations of data flows within a specific database or project.

Typically, modeling is kept both relatively simple and visualization-heavy (often in a flowchart form). The aim is to help companies optimize their databases for a particular purpose while assisting data engineers in creating databases that incorporate high degrees of security and data integrity.

Data Modeling Interview Questions: What to Expect

What can I expect from a data modeling interview?

Data modeling is a field that relies on intuition, logical reasoning, and the ability to survey your project from a bird's-eye view. Most data modeling interview questions, especially scenario-based ones, focus on assessing your ability to predict and detect edge cases and then create guard rails for those cases while balancing efficiency.

These are the things that you should expect in a data modeling interview:

  • Scenario-based questions that tackle planning and risk assessment.
  • Theory-based questions that cover the basics.
  • A good number of interviewers will ask you about data governance.
  • Questions that require translating solutions into SQL queries.
  • Typically, all of these questions are high-level (they involve a lot of abstraction).

Data Modeling Interview Questions: Theoretical Concepts

An interviewee who shows a strong understanding of the underlying theory and concepts of data modeling and design is not automatically seen as the best applicant. Instead, companies treat that knowledge as the bare minimum when deciding whether someone is worth inviting back to the second round of interviews.

As such, studying conceptual data modeling material is essential for improving your callback rate, but it cannot be the end of your preparation. Below is a list of theory-based data modeling questions:

Data Modeling Basics

1. What is data modeling?

Data modeling is the practice of creating a visual representation of data flows, policies, and internal information system architecture, and of how each element relates to other objects, structures, and data points.

Data modeling revolves around the needs of the company, typically expanding and contracting as initiatives start and finish. Moreover, the methods of modeling data follow specific rules in order to properly assess and accurately design database systems, which are then implemented down to the last detail.

2. What are the four different types of data modeling?

  • The hierarchical data model derives its architecture from the shape of a tree, and data is retrieved by traversing the tree from its root. The relationships inside this model are one-to-many. As such, this model is known to be relatively inefficient compared to more modern alternatives.
  • The E-R (entity-relationship) data model focuses on the relationships between entities and excels as a visual model. That visualization is also effective at depicting the database design objectives.
  • A relational model uses data tables to collect individual elements into relations (see the sketch after this list). Developed by IBM researcher E. F. Codd, relational models rarely touch on the physical properties of the data storage technology being used.
  • Finally, the network model uses a graph-like approach of nodes and edges. Unlike the hierarchical model, it allows a record to participate in multiple relationships, giving it the flexibility to represent one-to-many, many-to-one, many-to-many, and one-to-one relationships.
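Of these four, the relational model is the one expressed most directly in SQL: each relation is a table, and each row of the table is a tuple. Below is a minimal, hypothetical sketch (the table, columns, and data are made up for illustration):

-- A relation is just a table; the model says nothing about how rows are physically stored.
CREATE TABLE Books (
    BookID int NOT NULL,
    Title VARCHAR(255) NOT NULL,
    Author VARCHAR(255),
    PRIMARY KEY (BookID)
);

-- Each row is one tuple of the relation.
INSERT INTO Books (BookID, Title, Author)
VALUES (1, 'Database Design Basics', 'A. Author');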

3. What are the steps of data modeling?

Data modeling, as an approach, requires extensive and rigorous planning in order to both properly convey business requirements and provide stakeholders with a clear path as to how data flows.

Despite differing approaches toward data modeling, most methods tend to trace the following steps:

  1. Identifying and exploring entities - in this stage, we will determine the entities, their definitions, and critical properties. For example, for a “student” entity, there will be properties such as “ID”, “First Name”, “Last Name”, “Year”, and more.
  2. Identifying entity relationships - now that we have structured our key entities and their properties, we connect these entities based on their designated relationships. For example, we could assign the entity “Teacher” and connect it with a “Student” (see the sketch after this list).
  3. Mapping attributes to entities toward business requirements - at this point, we will be evaluating and mapping entities in regard to their ability to fulfill a business requirement. For example, at a learning institute, are relationships mapped thoroughly to help generate a fully interactive database?
  4. Determining the degree of normalization - while it may seem like normalization is a binary approach, wherein either you apply it to your data or not, normalization is most often applied only up to a certain degree. In this stage of data modeling, it is crucial to find the balance between efficiency and speed versus the resources you have available.
  5. Iterate and finalize - now that you have made your data model, it is best to start again and build the entire model from the beginning, this time with a contextual view of how to approach your data modeling project.
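As a brief sketch of steps 1 and 2 (the table and column names below are illustrative assumptions, not a prescribed schema), the “Student” and “Teacher” entities could be captured and connected like this:

CREATE TABLE Teachers (
    TeacherID int NOT NULL,
    FirstName VARCHAR(255) NOT NULL,
    LastName VARCHAR(255),
    PRIMARY KEY (TeacherID)
);

CREATE TABLE Students (
    StudentID int NOT NULL,
    FirstName VARCHAR(255) NOT NULL,
    LastName VARCHAR(255),
    Year int,
    -- Step 2: relate each student to a teacher (a hypothetical homeroom assignment)
    HomeroomTeacherID int,
    PRIMARY KEY (StudentID),
    FOREIGN KEY (HomeroomTeacherID) REFERENCES Teachers(TeacherID)
);

The later steps (mapping to business requirements, choosing a degree of normalization, iterating) then refine this starting point.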

Technical Data Modeling Questions

4. What is denormalization in data modeling and how does it differ from normalization?

Normalization and denormalization are database modeling methods that counteract each other within the “time vs. space” complexity debate. If you have a bit of a background working with algorithms, it is likely that you have encountered the Big O notation, by which an algorithm’s time and space complexity can be expressed.

Typically, time and space have an inverse relationship: allowing an algorithm to use more memory can often make it run faster. Which side of that trade-off matters more for a given problem then points toward either normalization or denormalization.

Normalization focuses on removing redundant data to maximize space efficiency. In big data, these space savings can go from minuscule to massive (even potentially reducing one’s operating costs).

However, because of the reduction in space, queries are typically slower. Denormalization deprioritizes space in order to speed up queries and joins: it deliberately duplicates data so that queries need fewer joins, which makes them simpler and faster to write and run.
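As a small, hypothetical sketch of the trade-off (table and column names are assumptions for illustration), a normalized design stores each customer's name once and joins on demand, while a denormalized design duplicates the name on every order to avoid the join:

-- Normalized: customer details live in one place; reads require a join.
CREATE TABLE Customers (
    CustomerID int NOT NULL,
    CustomerName VARCHAR(255) NOT NULL,
    PRIMARY KEY (CustomerID)
);

CREATE TABLE Orders (
    OrderID int NOT NULL,
    CustomerID int NOT NULL,
    OrderTotal DECIMAL(10, 2),
    PRIMARY KEY (OrderID),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

-- Denormalized: the customer name is duplicated on every order,
-- so reporting queries need no join but the table uses more space.
CREATE TABLE OrdersDenormalized (
    OrderID int NOT NULL,
    CustomerID int NOT NULL,
    CustomerName VARCHAR(255) NOT NULL,
    OrderTotal DECIMAL(10, 2),
    PRIMARY KEY (OrderID)
);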

5. What is data governance, and why is it important?

Data governance is a set of self-determined rules and policies for handling your database system. It is an integral part of data modeling and, in a sense, defines your database’s philosophy and approach.

An important thing to remember about data governance is that these so-called “self-determined” rules should not be made in a vacuum: they should remain compliant with industry standards and compatible with the other technologies you plan to use during the life cycle of the database (a short example of governance expressed in SQL follows the list below).

Data governance also provides structure to the policies that take care of how data:

  • Is handled
  • Is secured
  • Is kept private
  • Is disposed of
  • Maintains integrity
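One concrete way these policies surface in a database is through roles and privileges. The sketch below is a minimal, hypothetical example (the role and table names are assumptions, and exact syntax varies by database engine):

-- Minimal tables for the sketch.
CREATE TABLE SalesReport (ReportID int NOT NULL PRIMARY KEY);
CREATE TABLE Orders (OrderID int NOT NULL PRIMARY KEY);

-- Analysts get read-only access to reporting data.
CREATE ROLE analyst;
GRANT SELECT ON SalesReport TO analyst;

-- The application role may read and write orders, but analysts may not.
CREATE ROLE app_writer;
GRANT SELECT, INSERT, UPDATE ON Orders TO app_writer;

-- When a policy changes or a dataset is retired, access is revoked.
REVOKE SELECT ON SalesReport FROM analyst;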

Data Modeling Interview Questions: Scenario-Based Questions

Conceptual theory is essential, but the bread and butter of an interview is the interviewer's chance to check your practical skills.

This section will tackle your ability to think quickly and generate practical, efficient, and intelligent solutions to the following problems.

Case Question: How can we design a data warehouse for an online shop? Keep in mind that the online shop may grow at an exponential rate, so scalability is something we should be keeping in mind throughout.

One great way to approach scenario-based questions is to ask your interviewer for the specifics of the situation. We do this in order to create a data model that aligns perfectly with their business goals and requirements; the more we know about the problem, the better the solution can be designed.

We can always optimize further, but the best solution is relative to the context. A technically “optimal” design may also come with overhead costs the business does not actually need.

Below, we list vital questions to ask your interviewer to get a better view of the project:

  • What is the latency requirement for this data warehouse?

This question is crucial as one can use the response to get a good grasp on which design decisions to make. But, equally important, it can help us determine the degree of normalization to employ within our data model.

  • How fast do we need our analytics?

Knowing how fast and how frequently we need data analytics can help us identify how much emphasis we should put on our OLAP design. This is a case where the fastest possible solution may not be the right one for the problem at hand.

Generating the swiftest pipeline is possible but may not be necessary at all. If we haphazardly choose the fastest solution without considering the business needs, we may go over budget and over schedule, as complex, fast ETL pipelines require high upfront costs and ongoing overhead to put into action.

For example, if the requirement is a relatively simple report once a quarter, other solutions will likely suffice.

  • What do user queries look like? Am I only pulling data, or am I also pushing it?

Here, we can design schemas that are optimized for only querying data rather than modifying it. Or, if user input is integral to the solution, we can reach a compromise and generate schemas that handle both pulling and pushing data.

For example, if we need fast performance in looking up data, we can create a data model that tends towards a denormalized approach. However, if we modify, insert, or delete data frequently, we may want a more normalized system.

  • Ask for dataset examples.

Example datasets can give you essential insights into the data you will be modeling and help you optimize the model according to business needs. In addition, having a clear-cut dataset can help you visualize your data model.

If the interviewer does not give out a specific dataset, go ahead and try to look up a comparable public dataset online to practice against.

  • Identify the dimensions and hierarchies.

Lastly, to formulate and visualize your data models, you will need to create both the desired dimensions and their respective hierarchies based on your data and specifications.
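For the online shop, a common way to express dimensions and hierarchies in the warehouse is a star schema: a central fact table that references dimension tables whose columns encode the hierarchy levels. The sketch below is one possible, hypothetical layout (all table and column names are assumptions):

-- Dimension tables: hierarchy levels become columns (day -> month -> quarter -> year).
CREATE TABLE DimDate (
    DateKey int NOT NULL,
    FullDate DATE NOT NULL,
    Month int,
    Quarter int,
    Year int,
    PRIMARY KEY (DateKey)
);

CREATE TABLE DimProduct (
    ProductKey int NOT NULL,
    ProductName VARCHAR(255) NOT NULL,
    Category VARCHAR(100), -- product -> category hierarchy
    PRIMARY KEY (ProductKey)
);

-- Fact table: one row per sale, organized around the dimensions for fast analytics.
CREATE TABLE FactSales (
    SaleID int NOT NULL,
    DateKey int NOT NULL,
    ProductKey int NOT NULL,
    Quantity int,
    SaleAmount DECIMAL(10, 2),
    PRIMARY KEY (SaleID),
    FOREIGN KEY (DateKey) REFERENCES DimDate(DateKey),
    FOREIGN KEY (ProductKey) REFERENCES DimProduct(ProductKey)
);

Because new dimensions (customers, promotions, warehouses) can be added without reworking the fact table, this layout also helps with the scalability requirement in the prompt.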

Other questions you should be asking include the following:

  • What architecture should I use?
  • Should I build the architecture from scratch?
  • How much information is going through and from the database in a specified time frame?
  • Are event-based or batch-based architectures better for my model?

Data Modeling Interview Questions: SQL-Based Questions

Data modeling is critical to building a designated database from the ground up. More often than not, interviewers will also ask questions that require knowledge of SQL concepts.

Despite NoSQL systems slowly gaining adoption (e.g., MongoDB and ScyllaDB), SQL remains king when creating robust relational database management systems (RDBMSs).

Below are key SQL questions that will likely appear in your data modeling interviews:

1. What is a Primary Key in SQL? (from “How Long Does It Take To Learn SQL”)

A primary key in SQL is a self-imposed restriction, or constraint, that limits the occurrence of a value within a table to once only, so that it uniquely identifies each row. Moreover, NULL values are not allowed. The primary key constraint combines the NOT NULL and UNIQUE constraints.

You can create a table with a single field as a primary key using the following code:

CREATE TABLE Employees (
    ID int NOT NULL,
    FirstName VARCHAR(255) NOT NULL,
    LastName VARCHAR(255),
    PRIMARY KEY (ID)
);

To enforce a composite primary key across two columns, do the following:

CREATE TABLE Employees (
    ID int NOT NULL,
    FirstName VARCHAR(255) NOT NULL,
    LastName VARCHAR(255),
    CONSTRAINT PK_Employees PRIMARY KEY (ID, FirstName)
);

2. What are the types of relationships in SQL?

In SQL, as well as in data modeling, the concept of relationships is crucial. From reviewing table relationships in SQL to the connections between elements and their attributes in data modeling, relationships play a considerable part in simplifying these interactions into a unified schema.

Relationships take the following forms (a short SQL sketch follows the list):

  • Self-referencing relationships: occur when a record in a table relates to another record in the same table (for example, an employee whose manager is also an employee).
  • One-to-one: a relationship wherein a record in one table is linked to exactly one record in another table, and that record is linked back only to it.
  • One-to-many and many-to-one: a prevalent relationship wherein one record (call it record A) is linked with many records in another table, and each of those records is connected exclusively to record A (or vice versa).
  • Many-to-many: the records from both sides have many relationships to other records.
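A minimal sketch of how these relationships typically appear in table definitions (all table names are hypothetical, and the parent tables are stripped down so the example is self-contained):

-- Minimal parent tables for the sketch.
CREATE TABLE Users (UserID int NOT NULL PRIMARY KEY);
CREATE TABLE Customers (CustomerID int NOT NULL PRIMARY KEY);
CREATE TABLE Students (StudentID int NOT NULL PRIMARY KEY);
CREATE TABLE Courses (CourseID int NOT NULL PRIMARY KEY);

-- One-to-one: a UNIQUE foreign key means each user has at most one profile.
CREATE TABLE UserProfiles (
    ProfileID int NOT NULL,
    UserID int NOT NULL UNIQUE,
    Bio VARCHAR(255),
    PRIMARY KEY (ProfileID),
    FOREIGN KEY (UserID) REFERENCES Users(UserID)
);

-- One-to-many: many orders reference one customer.
CREATE TABLE Orders (
    OrderID int NOT NULL,
    CustomerID int NOT NULL,
    PRIMARY KEY (OrderID),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

-- Many-to-many: a junction table links students and courses.
CREATE TABLE Enrollments (
    StudentID int NOT NULL,
    CourseID int NOT NULL,
    PRIMARY KEY (StudentID, CourseID),
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID),
    FOREIGN KEY (CourseID) REFERENCES Courses(CourseID)
);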

3. OLAP vs. OLTP: What’s the Difference?

There is a tendency to use OLAP and OLTP interchangeably due to their similar names. Nevertheless, OLAP and OLTP are systems with opposite design philosophies. They are not mutually exclusive but rather work together to create an efficient database architecture.

OLAP stands for OnLine Analytical Processing, while OLTP stands for OnLine Transaction Processing.

OLTP concerns itself with fast systems: processing transactions, recording and capturing each detail, and working with built-in safeguards that ensure data integrity. Because OLTP works in a high-volume and fast-paced environment, it needs to be able to process and store information rapidly while minimizing heavy processing.

A real-life application of OLTP occurs during online shopping when a database records the user ID, order information, and other relevant data. OLTP quickly generates accurate records and avoids colliding data (i.e., time-sensitive information such as in-store item stock amounts).

On the other hand, an OLAP system utilizes the information from the OLTP system to generate insights that drive business decisions. While OLTP is swift, OLAP, due to its analytics-oriented approach, takes more time to create accurate insights from the data warehouse.
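As a small, hypothetical sketch of the split (the Orders table is defined here for illustration, and the star-schema tables are assumed to exist as sketched in the scenario section above), the OLTP side records individual transactions while the OLAP side aggregates history for analysis:

-- Hypothetical OLTP orders table.
CREATE TABLE Orders (
    OrderID int NOT NULL,
    UserID int NOT NULL,
    ItemID int NOT NULL,
    Quantity int,
    OrderTime TIMESTAMP,
    PRIMARY KEY (OrderID)
);

-- OLTP: record a single order quickly and atomically.
INSERT INTO Orders (OrderID, UserID, ItemID, Quantity, OrderTime)
VALUES (1001, 42, 7, 2, '2024-03-15 10:30:00');

-- OLAP: scan warehouse history to answer an analytical question,
-- e.g., monthly revenue per product category.
SELECT d.Year, d.Month, p.Category, SUM(f.SaleAmount) AS Revenue
FROM FactSales AS f
JOIN DimDate AS d ON f.DateKey = d.DateKey
JOIN DimProduct AS p ON f.ProductKey = p.ProductKey
GROUP BY d.Year, d.Month, p.Category;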