Top 35+ Most Asked Data Modeling Interview Questions and Answers
1) What do you understand by a data model?
A data model is a set of data elements together with the relationships among them and the properties of the real-world entities they represent. Data models consist of entities: the objects and concepts whose data we want to track, each typically stored as a table in a database. For example, customers, products, manufacturers, buyers, and sellers are all potential entities. Each entity has attributes, the details that users want to track. For example, a customer's name is an attribute.
2) What is data modeling?
Data modeling is the process of creating data models for the data to be stored in a database. A data model is a conceptual representation of data objects, the associations between them, and the rules that govern them; it also represents how the data flows. In other words, data modeling produces a simplified diagram that shows data elements as text and symbols.
3) What are the different types of data models?
There are mainly three types of data models:
- Conceptual data model: a high-level view of the entities and their relationships, independent of any technology.
- Logical data model: adds attributes, keys, and relationship details, but is still independent of a specific DBMS.
- Physical data model: describes how the model is implemented in a specific database, including tables, columns, data types, and indexes.
4) What do you understand by the fact and fact table in data modeling?
In data modeling, a fact represents quantitative data. For example, the net amount due is a fact. A fact table contains such numerical measures along with foreign keys that reference the dimension tables.
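The fact/dimension split above can be sketched with Python's built-in sqlite3 module. All table and column names here are illustrative, not from any particular warehouse: the fact table stores a numeric measure (the net amount due) plus a foreign key into a dimension table.

```python
import sqlite3

# Minimal star-schema sketch: a fact table holds numeric measures plus a
# foreign key into a dimension table. All names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT)""")
conn.execute("""CREATE TABLE fact_sales (
    sale_id        INTEGER PRIMARY KEY,
    product_id     INTEGER REFERENCES dim_product(product_id),
    net_amount_due REAL)""")  # the numeric measure (a "fact")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
conn.execute("INSERT INTO fact_sales VALUES (100, 1, 249.99)")
# Analytical queries join the fact table to its dimensions.
row = conn.execute("""SELECT p.product_name, f.net_amount_due
                      FROM fact_sales f
                      JOIN dim_product p USING (product_id)""").fetchone()
print(row)  # ('Widget', 249.99)
```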
5) What is a Table?
A table is a structure used to store data in the form of rows and columns. Columns are also known as fields and are used to show data in vertical alignment. Rows are also called records or tuples and represent data's horizontal alignment.
6) What are the several design schema used in data modeling?
There are mainly two design schemas used in data modeling:
- Star schema: a central fact table connected directly to denormalized dimension tables.
- Snowflake schema: a fact table connected to dimension tables that are further normalized into sub-dimension tables.
Note: The star schema is a special case of the snowflake schema and is generally more effective for handling simpler queries.
7) What do you understand by database normalization?
Database normalization is the process of structuring and designing a database to reduce data redundancy and improve data integrity without losing information. It usually applies to a relational database according to a series of so-called normal forms. Edgar F. Codd first proposed database normalization as part of his relational model.
8) What is the main usage of database normalization?
Following are the main motives of database normalization:
- To reduce data redundancy (duplicate copies of the same data).
- To avoid insertion, update, and deletion anomalies.
- To improve data integrity and consistency.
- To make the database structure easier to maintain and extend.
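The anomaly-avoidance motive can be illustrated with a small sqlite3 sketch (table and column names are made up for the example). Once customer details live in their own table instead of being repeated on every order row, an update touches exactly one row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized design (roughly 3NF): customer facts are stored once,
# and orders refer to the customer by key instead of repeating them.
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customers(customer_id),
                     amount REAL);
INSERT INTO customers VALUES (1, 'Asha', 'Pune');
INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 75.0);
""")
# Changing the city touches exactly one row -- no update anomaly.
conn.execute("UPDATE customers SET city = 'Mumbai' WHERE customer_id = 1")
cities = conn.execute("""SELECT DISTINCT c.city FROM orders o
                         JOIN customers c USING (customer_id)""").fetchall()
print(cities)  # every order now consistently sees the new city
```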
9) What is denormalization in a database? / What do you understand by Data Denormalization? What are its advantages and disadvantages?
Data denormalization is a technique applied to a previously normalized database to increase performance. Redundant copies of data are added to one or more tables, which helps avoid costly joins in a relational database. In other words, denormalization improves the read performance of a database at the cost of some write performance, by adding redundant copies of data or by grouping it.
Advantages of Denormalization:
- Faster read and query performance because fewer joins are required.
- Simpler queries for reporting and analytics.
- Pre-computed or duplicated values avoid repeated computation at query time.
Disadvantages of Denormalization:
- Redundant data increases storage requirements.
- Inserts and updates become slower and more complex because redundant copies must be kept in sync.
- Greater risk of data inconsistency if the copies diverge.
Note: We must remember that denormalization does not mean not doing normalization. It is an optimization technique applied after doing normalization.
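As a minimal sqlite3 sketch of the trade-off (illustrative schema): the customer's name is copied into the orders table, so reads need no join, but every write must now keep the redundant copy in sync.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER,
                     customer_name TEXT,  -- redundant copy: avoids a join on reads
                     amount REAL);
INSERT INTO customers VALUES (1, 'Asha');
INSERT INTO orders VALUES (10, 1, 'Asha', 50.0);
""")
# The read path needs no join now...
name = conn.execute(
    "SELECT customer_name FROM orders WHERE order_id = 10").fetchone()[0]
# ...but a rename must update BOTH tables, the cost of denormalization.
conn.execute("UPDATE customers SET name = 'Asha K' WHERE customer_id = 1")
conn.execute("UPDATE orders SET customer_name = 'Asha K' WHERE customer_id = 1")
```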
10) When should we use denormalization?
Following are some situations when we have to use denormalization:
- When performance-critical queries require joining many tables and the join cost is unacceptable.
- When the workload is read-heavy (for example, reporting and data-warehouse workloads) and writes are rare.
- When maintaining pre-computed aggregates or summary tables is cheaper than recomputing them on every query.
11) What do you understand by dimension and attribute?
Dimensions are used to represent qualitative data. For example, product, class, plan, etc., are dimensions. On the other hand, an attribute is a value that a dimension contains. A dimension table has textual or descriptive attributes. For example, the product category and product name are two attributes of the product dimension table.
12) What do you understand by data sparsity?
Data sparsity is a term used to describe the phenomenon of not having enough observed data in a dataset. It indicates how much data you actually have for each entity or dimension of the model; a highly sparse model has many empty or missing values.
13) What is the primary key? / What is a primary key constraint?
The primary key, or primary key constraint, is a column or group of columns that uniquely identifies every row in the table. The primary key constraint is imposed on the column data to prevent null and duplicate values: a primary key value must never be null. A table can have only one primary key.
For example, Social security number, bank account number, bank routing number, phone number, Aadhar number, etc.
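A quick sqlite3 sketch of the constraint in action (illustrative table): inserting a second row with the same key value raises an integrity error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE accounts (
    account_no TEXT PRIMARY KEY,  -- uniquely identifies each row
    holder     TEXT)""")
conn.execute("INSERT INTO accounts VALUES ('AC-1', 'Asha')")
try:
    # Duplicate primary key value: the database rejects it.
    conn.execute("INSERT INTO accounts VALUES ('AC-1', 'Ravi')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```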
14) What is a foreign key? / What is a foreign key constraint?
A foreign key is a column or group of columns used to link parent and child tables. The parent table has a primary key, and a foreign key constraint is imposed on the corresponding column(s) in the child table. Every foreign key value in the child table must match a primary key value in the parent table.
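In sqlite3 this looks as follows (illustrative schema; note that SQLite only enforces foreign keys when `PRAGMA foreign_keys` is switched on for the connection):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY,
                        dept_id INTEGER REFERENCES departments(dept_id));
INSERT INTO departments VALUES (1, 'Sales');
""")
conn.execute("INSERT INTO employees VALUES (100, 1)")  # parent row exists: OK
try:
    # No department 99 in the parent table, so the child row is rejected.
    conn.execute("INSERT INTO employees VALUES (101, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)  # True
```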
15) What is a composite primary key? / What is a composite primary key constraint?
A composite primary key is a primary key made up of more than one column; the combination of values across those columns must be unique. The constraint imposed on such a group of columns is called a composite primary key constraint.
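A common case is a junction table such as enrollments, where neither student nor course alone is unique, but the pair is. A minimal sqlite3 sketch (illustrative names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The pair (student_id, course_id) together forms the primary key.
conn.execute("""CREATE TABLE enrollments (
    student_id INTEGER,
    course_id  INTEGER,
    grade      TEXT,
    PRIMARY KEY (student_id, course_id))""")
conn.execute("INSERT INTO enrollments VALUES (1, 101, 'A')")
conn.execute("INSERT INTO enrollments VALUES (1, 102, 'B')")  # new pair: OK
try:
    # Same (student, course) pair again: rejected by the composite key.
    conn.execute("INSERT INTO enrollments VALUES (1, 101, 'C')")
    pair_rejected = False
except sqlite3.IntegrityError:
    pair_rejected = True
print(pair_rejected)  # True
```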
16) What is a composite foreign key? / What is a composite foreign key constraint?
A composite foreign key is a foreign key consisting of a group of columns that together reference a composite primary key (or unique key) in the parent table. The constraint imposed on that group of columns is called a composite foreign key constraint.
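Building on the composite-key idea, a child table can reference the whole column pair. A sqlite3 sketch with made-up names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for FK enforcement in SQLite
conn.executescript("""
CREATE TABLE enrollments (student_id INTEGER, course_id INTEGER,
    PRIMARY KEY (student_id, course_id));
CREATE TABLE submissions (submission_id INTEGER PRIMARY KEY,
    student_id INTEGER, course_id INTEGER,
    -- the two columns together reference the parent's composite key
    FOREIGN KEY (student_id, course_id)
        REFERENCES enrollments(student_id, course_id));
INSERT INTO enrollments VALUES (1, 101);
""")
conn.execute("INSERT INTO submissions VALUES (1, 1, 101)")  # matching pair: OK
try:
    # (1, 999) is not an enrolled pair, so the insert fails.
    conn.execute("INSERT INTO submissions VALUES (2, 1, 999)")
    composite_fk_rejected = False
except sqlite3.IntegrityError:
    composite_fk_rejected = True
print(composite_fk_rejected)  # True
```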
17) What do you understand by data mart?
A data mart is a condensed version of a data warehouse, designed to be used by a specific department, unit, or group of users in an organization, for example, marketing, sales, HR, or finance.
18) What do you understand by surrogate key? What are the benefits of using the surrogate key?
A surrogate key is a unique key generated by the database for an entity in the client's business or an object within the database. It is used when natural keys cannot serve as a unique primary key for a table. In that case, the data modeler or architect adds a surrogate (helping) key to the table in the logical data model (LDM); this is why surrogate keys are also known as helping keys. A surrogate key is a substitute for a natural key.
Following are some benefits of using surrogate keys:
- They are simple integers, so joins and indexes are compact and fast.
- They are immune to changes in business (natural) key values.
- They make it possible to track slowly changing dimensions, since the same natural key can appear in multiple versions.
- They give every table a uniform key format.
19) What are the different types of normalization used in Data Modeling?
In Data Modeling, the following five types of normalization are generally used:
- First normal form (1NF)
- Second normal form (2NF)
- Third normal form (3NF)
- Boyce-Codd normal form (BCNF)
- Fourth normal form (4NF)
20) What are the important types of relationships in a data model?
There are three types of relationships in a data model:
- One-to-one: an instance of one entity is related to at most one instance of another (for example, a person and their passport).
- One-to-many: an instance of one entity is related to many instances of another (for example, a customer and their orders).
- Many-to-many: many instances of each entity relate to many instances of the other (for example, students and courses), usually implemented with a junction table.
21) What is forward data engineering?
Forward data engineering or forward engineering is used to automatically generate or translate a logical model into a physical model.
22) What do you understand by discrete and continuous data?
Discrete data is a type of data that is finite and countable, taking only distinct, separate values; for example, gender, telephone numbers, or identity numbers. On the other hand, continuous data changes in a continuous and ordered manner; for example, age, time, etc.
23) What do you understand by an identifying relationship in DBMS?
In DBMS, an identifying relationship is a relationship between two entities in which an instance of a child entity is identified through its association with a parent entity. The identifying relationship specifies that the child entity is dependent on the parent entity for its identity and cannot exist without it. Generally, parent and child tables are present in a data model and are connected by a relationship line.
24) What is the full form of PDaP?
In DBMS, PDaP stands for Praedico Data Platform. It is a data cube that stores data as a summary, organized so that users can report on it with ease. Its biggest advantage is that this summarized, cube-style storage helps users analyze data quickly.
25) What do you mean by a non-identifying relationship?
A non-identifying relationship is a relationship between two entities in which an instance of the child entity is not identified through its association with a parent entity. In this case, the child entity is not dependent on the parent entity and can exist without it. In a data model, this relationship is drawn as a dotted line connecting the two tables.
26) What is Business Intelligence, and what is its usage?
Business Intelligence (BI) is a set of processes, architectures, and technologies that convert raw data into meaningful information that can be beneficial and profitable for a business. It is a suite of software and services that transforms data into actionable intelligence and knowledge. The biggest advantage of Business Intelligence is that it helps executives, managers, and workers make informed business decisions.
27) What is metadata? What are its different types?
Metadata is data that provides information about other data, that is, "data about data." It describes the data itself rather than its content, such as the text of a message or the image itself, and shows what type of data is stored in the database system.
Following is a list of several distinct types of metadata:
Descriptive metadata: Descriptive metadata provides descriptive information about a resource. It is mainly used for discovery and identification. Its main elements are title, abstract, author, keywords, etc.
Administrative metadata: Administrative metadata is used to provide information to manage a resource, like a resource type, permissions, and when and how it was created.
Structural metadata: The structural metadata specifies data containers and indicates how compound objects are put together. It also describes the types, versions, relationships, and other characteristics of digital materials. For example, how pages are ordered to form chapters.
Reference metadata: The reference metadata provides information about the contents and quality of statistical data.
Statistical metadata: Statistical metadata describes processes that collect, process, or produce statistical data. It is also called process data.
Legal metadata: The legal metadata provides information about the creator, copyright holder, and public licensing.
28) What is Microsoft Sequence Clustering algorithm?
The Microsoft Sequence Clustering algorithm is a unique algorithm that combines sequence analysis with clustering. It analyzes sequences of events, such as paths that are similar or related to each other, identifies the most common sequences, and then performs clustering to group similar sequences together.
29) What is an analysis service in Data Modeling?
Analysis Services is a Microsoft Azure product used in Data Modeling. It is a fully managed platform as a service (PaaS) that provides enterprise-grade data models in the cloud and a combined view of the data used in data mining or OLAP. Analysis Services uses advanced mashup and modeling features to combine data from multiple sources, define metrics, and secure data in a single, trusted tabular semantic data model. Its biggest advantage is that it gives users an easier and faster way to perform ad hoc data analysis using tools such as Power BI and Excel.
30) What is data mart? What are the key features of a data mart?
A data mart is a subset of a data warehouse. It focuses on a specific part of the business, department, or subject area and provides specific data to a defined group of users within an organization. It is the best solution for a specific business area because it lets users quickly access important data without searching through the entire data warehouse. Large organizations typically maintain data marts for specific departments such as finance, sales, and marketing.
Key features of a data mart:
- It holds a focused subset of the warehouse for one subject area or department.
- It is smaller, cheaper, and faster to build than a full data warehouse.
- It is easier for its target group of users to access and query.
- It improves query response time because users search a smaller dataset.
31) What do you understand by the time series algorithm?
The Microsoft Time Series algorithm provides an optimized set of algorithms for forecasting continuous values, such as product sales over time. Unlike other Microsoft algorithms, such as decision trees, it does not require additional columns of new information as input to predict a trend: the time series model can predict trends based only on the original dataset used to create the model. It also lets us add new data to the model when making a prediction and automatically incorporates that data into the trend analysis.
32) What is a data warehouse, and what is data warehousing?
A data warehouse is a repository of electronically stored data of an organization extracted from operational systems and made available for ad-hoc queries and scheduled reporting. It is a data management system designed to enable and support business intelligence activities, such as analytics.
The main purpose of a data warehouse is to perform queries and analyze the data. It contains a large amount of historical data usually derived from various sources such as application log files and transaction applications. It centralizes and consolidates a large amount of data from multiple sources. Its analytical capabilities allow an organization to derive valuable business insights from their data and help in decision-making. It contains valuable data that data scientists and business analysts can use to improve and enhance the business. Because of these capabilities, a data warehouse is called a "single source of truth" for an organization.
Data warehousing is a process for collecting and managing data derived from various sources such as application log files and transaction applications.
Data warehousing is mainly used in BI systems built for data analysis and reporting. It collects and analyzes data from multiple sources, allowing an organization to derive valuable business insights from its data and support decision-making. This is very useful for data scientists and business analysts who want to improve and enhance the business.
33) What are the key features of a data warehouse?
Following are the key features of a data warehouse:
- Subject-oriented: organized around major subjects such as customers, sales, or products.
- Integrated: consolidates data from multiple heterogeneous sources into a consistent format.
- Time-variant: stores historical data so that trends over time can be analyzed.
- Non-volatile: data is loaded and read, but not updated or deleted in normal operation.
34) What is Bitmap Indexing? What is the requirement of Bitmap Indexing? Explain with an example.
Bitmap Indexing is a special type of database indexing that uses bitmaps (bit arrays) and answers queries by executing bitwise operations on them. It is mainly used in huge databases for columns of low cardinality that are frequently used in queries.
Requirement of Bitmap Indexing:
Let's see an example to understand the requirement of Bitmap Indexing clearly. Suppose a company has an employee table with columns like EmpNo, EmpName, Job, New_Emp, and Salary. Employees are hired once a year, so the table is updated rarely and remains static most of the time, but its columns are frequently used in queries to retrieve data, for example, the number of female employees in the company. In this case, we need an indexing method that returns results extremely quickly, and traditional file organization methods are not fast enough for such low-cardinality lookups. So we use a better method of storing and retrieving data: Bitmap Indexing.
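The bitwise-query idea can be demonstrated with a toy bitmap index in plain Python (the data and column names are made up): one bit vector per distinct value of a low-cardinality column, with Python ints serving as the bit arrays.

```python
# A toy bitmap index: one bit array per distinct value of a low-cardinality
# column. Queries become bitwise AND/OR over Python ints used as bit vectors.
rows = [  # (EmpName, Gender, New_Emp) -- illustrative data
    ("Asha", "F", "Yes"), ("Ravi", "M", "No"),
    ("Meena", "F", "No"), ("Kiran", "M", "Yes"),
]

def build_bitmap(values):
    """Map each distinct value to an int whose bit i is set when row i matches."""
    bitmaps = {}
    for i, v in enumerate(values):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
    return bitmaps

gender_idx = build_bitmap(r[1] for r in rows)
new_emp_idx = build_bitmap(r[2] for r in rows)

# "Female AND new employee" is one bitwise AND, then a population count.
match = gender_idx["F"] & new_emp_idx["Yes"]
count = bin(match).count("1")
print(count)  # 1 (only Asha matches both conditions)
```

Real databases compress these bitmaps and update them on writes, which is why the technique suits mostly-static tables like the one in the example.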
35) What is the key difference between a Data Mart and a Data Warehouse?
As we know, both a Data Mart and a Data Warehouse are used to store data. The main difference is that a Data Warehouse is data-oriented, while a Data Mart is project-oriented. Key differences between them:
- Scope: a data warehouse covers the entire organization; a data mart focuses on one department or subject area.
- Size: data warehouses are typically very large; data marts are much smaller.
- Sources: a warehouse integrates many heterogeneous sources; a mart usually draws from only a few sources, often the warehouse itself.
- Build time: a warehouse takes much longer to design and build than a mart.
36) What is Junk Dimension in Data Warehousing?
In Data Warehousing, a Junk Dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables. It combines two or more related low-cardinality attributes into one dimension. These attributes are usually text fields or flags, for example, non-generic comments or simple yes/no or true/false indicators. By combining these indicator fields into a single dimension, we only need to build one dimension table, and we reduce the number of fields in the fact table, which also decreases the fact table's size.
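A minimal sketch of building a junk dimension in Python (illustrative flag names): every combination of two yes/no indicators becomes one row, so a fact table would carry a single junk key instead of two flag columns.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per combination of the two flags; facts reference junk_key.
conn.execute("""CREATE TABLE junk_dim (
    junk_key   INTEGER PRIMARY KEY,
    is_gift    TEXT,
    is_express TEXT)""")
combos = list(itertools.product(["Y", "N"], repeat=2))
for key, (gift, express) in enumerate(combos, start=1):
    conn.execute("INSERT INTO junk_dim VALUES (?, ?, ?)", (key, gift, express))
n = conn.execute("SELECT COUNT(*) FROM junk_dim").fetchone()[0]
print(n)  # 4 rows: one per flag combination
```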