Pandas (the Python Data Analysis Library) is a powerful tool for data manipulation, cleaning, and preparation in Python. Pandas is built around a data structure called the "DataFrame," a two-dimensional, labeled structure well suited to representing tabular data.
With its built-in functions and methods, Pandas makes data cleanup and manipulation easier by building on NumPy arrays.
Below is a list of Pandas interview questions, sorted based on their difficulty.
Pandas is the go-to Python library for data manipulation from various sources, including file formats (.csv, .xlsx, .json), structured query language (SQL) databases, and websites, especially when tabular data is involved.
Professionals who handle tabular data are expected to have a strong understanding of Pandas. See below for the job positions that require or value Pandas experience:
Pandas plays an important role in helping clean and manage tabular data and allows data scientists to do more with less code. Moreover, because many of its operations are vectorized and implemented in optimized compiled code (via NumPy), Pandas typically processes data faster than equivalent pure-Python loops.
At its core, Pandas is a tool that combines ease of use with enough power to deliver timely data analysis.
For the following Pandas interview questions, we utilize Jupyter Notebook.
These easy Pandas interview questions can help you assess your level of understanding. Interviewers usually ask them to identify candidates who can proceed to the next step of the interview process.
Pandas is a Python library designed for loading .csv, .xlsx, and other tabular data sources into Python as DataFrames, which your scripts can then inspect and modify.
Although other data structures, such as NumPy arrays, are better suited for purely numerical data, Pandas gives the user far more control when a mix of data types is involved, especially strings. It is also useful for presenting data more readably than raw arrays.
The Pandas library is built around two primary data structures: the DataFrame and the Series.
The DataFrame is a two-dimensional, mutable data structure with rows and columns as its axes. As mentioned earlier, a DataFrame can hold multiple data types, and each cell is identified by its row and its column.
Meanwhile, the Series is the one-dimensional counterpart to the DataFrame: a single labeled column of values.
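As a quick illustration (the values below are made up), a Series is a single labeled column, while a DataFrame is a collection of such columns sharing the same row index:
import pandas as pd

# A Series: one-dimensional, labeled data
ages = pd.Series([21, 34, 29], name="Age")

# A DataFrame: two-dimensional, with rows and columns of mixed types
people = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara"],
    "Age": [21, 34, 29]
})

print(ages)
print(people)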
Pandas is a great library due to the following key features:
One of the most significant features of Pandas is the ability to load an entire .csv file, SQL table, or other tabular data source into a DataFrame variable. How exactly can you achieve this?
The general syntax is typically similar to the following:
import pandas as pd
myvar = pd.read_<data type>('path')
For a more specific example, see the following:
import pandas as pd
myvar = pd.read_csv('data/test.csv')
Building on the earlier code, you can display the DataFrame using Python's built-in print() function (or by evaluating it at the end of a notebook cell). However, there are ways to control how much of it is shown.
The default behavior prints the DataFrame's head (first five rows) and tail (last five rows) when the DataFrame is large:
import pandas as pd
df = pd.read_<data type>('path')
df
You can also print the first n rows (indexes 0 to n - 1) using the head() function:
import pandas as pd
df = pd.read_<data type>('path')
df.head(n)
You can do the same from the end, printing the last n rows, using the tail() function:
import pandas as pd
df = pd.read_<data type>('path')
df.tail(n)
Note: In a Jupyter Notebook, an expression left on the last line of a cell (such as df above) is displayed automatically, even without calling print().
To access a specific element inside a DataFrame, you can combine the loc[] and at[] accessors: loc[] selects the row by its index label, and at[] then retrieves the value at a specific column label.
The syntax for this is as follows:
import pandas as pd
myvar = pd.read_<data type>('path')
specificElement = myvar.loc[index].at['Column Name']
Notice that loc[] takes the row's index label (an integer here), whereas at[] takes the column label as a string.
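To make this concrete, here is a minimal sketch; the DataFrame and labels below are made up purely for illustration:
import pandas as pd

# Made-up DataFrame with a default integer index
df = pd.DataFrame({"Name": ["Ana", "Ben"], "Score": [88, 92]})

# loc[1] selects the row with index label 1; at["Score"] reads that row's "Score" value
value = df.loc[1].at["Score"]
print(value)  # 92

# The same element can also be read in one step with df.at[1, "Score"]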
When you try to print a DataFrame using Pandas' default setup, you may observe that Pandas prints only the head and the tail (meaning the first and last five rows).
You can try to bypass this using the tail() and head() functions, passing the number of rows as the parameter. But you will soon realize that instead of displaying the entire DataFrame, Pandas will display the following:
This is not necessarily the result you want. As such, you can remove the print limitations using the following code:
pd.set_option("display.max_rows", None, "display.max_columns", None)
All you need to do is print the DataFrame, and you should see the whole thing. Use the following code as an example:
import pandas as pd
df = pd.read_csv('file_path')
pd.set_option("display.max_rows", None, "display.max_columns", None)
df
The following are some Intermediate Pandas interview questions:
merge() and concat() are the most commonly used functions for combining two datasets (Series or DataFrames). Though they may seem similar, they, along with join() and append(), all serve different purposes.
merge() performs database-style joins. When two datasets share a common column, merge is usually the best option.
merge() has two required parameters, the left and right datasets, and four variants selected with the how parameter: inner, outer, left, and right.
Inner, left, and right merges can all discard rows whose keys do not appear in the other dataset, so they put you at risk of losing data. An outer merge, by contrast, includes all data from both datasets.
Use the following code as an example, given a dataset:
import pandas as pd
df = pd.read_csv('file_path')
dataFrameOne = pd.DataFrame({"Segment" : ['Mobile', 'Web', 'University'], "Answer" : ['Instagram', 'Facebook', 'Snapchat']})
dataFrameTwo = pd.DataFrame({"Count" : [2559, 32, 0], "Answer" : ['Instagram', 'Facebook', 'Snapchat']})
With dataFrameOne:
With dataFrameTwo:
You can use the following code to merge:
merged = pd.merge(dataFrameOne, dataFrameTwo)
Which will give you:
As you can see, Pandas automatically aligns the two frames on the shared column and merges matching rows. If you instead concatenate datasets that contain a column with the same name and values, concat() simply stitches the frames together side by side.
Using the following code:
conc = pd.concat([dataFrameOne, dataFrameTwo], axis = "columns")
You will get the following result:
Even though the "Answer" columns hold the same values, the column appears twice in the combined dataset due to the nature of concatenation.
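To see the four merge variants side by side, here is a small sketch using made-up frames whose keys only partially overlap; the how parameter selects the variant:
import pandas as pd

# Made-up frames whose "Answer" keys only partially overlap
left = pd.DataFrame({"Segment": ["Mobile", "Web"], "Answer": ["Instagram", "Facebook"]})
right = pd.DataFrame({"Answer": ["Facebook", "Snapchat"], "Count": [32, 0]})

print(pd.merge(left, right, how="inner"))  # only "Facebook" (the shared key) survives
print(pd.merge(left, right, how="outer"))  # all three answers kept; missing cells become NaN
print(pd.merge(left, right, how="left"))   # keeps every row of the left frame
print(pd.merge(left, right, how="right"))  # keeps every row of the right frame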
Pandas provides one function for including or excluding columns of specific data types in a DataFrame.
The select_dtypes() function returns the columns whose data types match (or do not match) those specified in its include or exclude parameters.
For example, given the specific DataFrame:
Using the following code:
df.select_dtypes(include='int64')
You get the following output:
This results in a DataFrame with only the columns of the datatype ‘int64’. If you use the exclude keyword argument instead of include, however, you get the following result:
Using the following code:
df.select_dtypes(exclude='int64')
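For a self-contained version (the columns below are made up for illustration), you can see include and exclude side by side:
import pandas as pd

# Made-up DataFrame mixing object, int64, and float64 columns
df = pd.DataFrame({
    "Answer": ["Instagram", "Facebook"],
    "Count": [2559, 32],
    "Share": [0.91, 0.09]
})

print(df.select_dtypes(include="int64"))               # only the Count column
print(df.select_dtypes(exclude="int64"))               # Answer and Share
print(df.select_dtypes(include=["int64", "float64"]))  # Count and Share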
A pivot table in Pandas is a valuable tool that allows you to summarize and aggregate data. To create one, use the pivot_table() function, which reshapes and aggregates data into a compact form.
Given the previous DataFrame:
You can create a pivot table that aggregates all the answers and their frequencies and links them to their specific demographic (using the segment description column) with the following code:
pvt = df.pivot_table('Count', index = 'Segment Description', columns = 'Answer')
Which will result in this aggregated table:
As you can see, this table gives more insight into the answers per demographic and lets you form quick hypotheses at a glance. While pivot tables should not be the sole basis of your conclusions, they provide a useful bird's-eye view of the data.
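If you want a runnable example without the survey file, here is a minimal sketch with made-up rows that mirror the column names used above:
import pandas as pd

# Made-up survey data for illustration
df = pd.DataFrame({
    "Segment Description": ["Mobile", "Mobile", "Web", "Web"],
    "Answer": ["Instagram", "Snapchat", "Instagram", "Facebook"],
    "Count": [2559, 0, 1200, 32]
})

# Segments become the rows, answers become the columns, counts are aggregated
pvt = df.pivot_table("Count", index="Segment Description", columns="Answer", aggfunc="sum")
print(pvt)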
In Pandas, melt() is an integral tool for reshaping DataFrames from wide to long format. It is most often used to unpivot data into a machine-friendly layout that is easier to pass to libraries other than Pandas.
Below is our demonstration of how melt() works, given a dataset df:
You can use the general syntax of melt(), which is:
var = pd.melt(<DataFrame>, id_vars=['<identifier column>'], value_vars=['<column to unpivot>'])
And perform your own melt() using the following code:
melted = pd.melt(df, id_vars = ['Segment Description'], value_vars = ['Answer'])
melted
This displays the following:
Missing data is a far too common problem data scientists, analysts, and engineers face daily. Thankfully, Pandas is a great tool for cleaning datasets and allowing you to eliminate or supplement missing data from your DataFrame.
One of the fundamental things to know about missing data is how Pandas represents it. Typically, missing data in Pandas is represented by NaN (not a number, a floating-point value) and None (a Python singleton object).
To check for missing data, you can use the isnull() and notnull() functions. On the other hand, if you want to fill in missing data, you can use the following functions:
fillna() allows you to fill null values, either by replacing every instance with a specific value or by using the method parameter to propagate the last observed valid value. This method is incredibly flexible for handling missing data.
replace() allows you to replace all null values with a specific value.
dropna() allows you to drop rows or columns that contain null values.
interpolate() allows you to fill null values using interpolation methods. A short sketch of these functions is shown below.
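Here is a minimal, self-contained sketch (with made-up values containing NaN) demonstrating these functions:
import pandas as pd
import numpy as np

# Made-up data with missing values
df = pd.DataFrame({"Income": [100.0, np.nan, 300.0, np.nan]})

print(df["Income"].isnull())   # True where values are missing
print(df.fillna(0))            # replace every NaN with 0
print(df.ffill())              # propagate the last observed valid value forward
print(df.dropna())             # drop rows containing NaN
print(df.interpolate())        # fill NaN by linear interpolation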
The groupby() function allows you to create clusters using a specific identifier. Given the previous DataFrame:
You can use the groupby() function to group the DataFrame as per their answer using the following code:
gp = df.groupby(["Answer"]).sum()
You can then view the first few rows of the grouped result by typing the following:
gp.head()
Or you can also type the following:
gp.sum()
To get the following result:
You are given a month's electricity consumption data (kWh) in 15-minute intervals for a commercial facility located in San Diego, CA. These data can be read into a Pandas DataFrame using the pre-coded _get_data function in the cell below. Each interval is labeled with the time it began, and all intervals are 15 minutes long. Your task is to add logic to the two stubbed functions in the cell below: the first is called energy_charge and the second is called demand_charge.
An example pricing scheme (electricity tariff) is outlined below. Your functions will need to use the electricity consumption data and the pricing scheme to calculate the energy and demand charges. Feel free to import any other third-party packages you think you will need, and to write as many extra helper functions as you need in the same cell.
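Since the example tariff itself is not reproduced here, the following is only a minimal sketch of how the two stubs might look, assuming a flat energy rate in $/kWh and a single demand rate in $/kW applied to the monthly peak; the rate values and the kwh column name are assumptions for illustration, not part of the original problem:
import pandas as pd

ENERGY_RATE = 0.30   # assumed flat $/kWh rate; replace with the actual tariff
DEMAND_RATE = 25.00  # assumed $/kW rate applied to the monthly peak demand

def energy_charge(df: pd.DataFrame) -> float:
    # Energy charge: total kWh consumed over the month times the energy rate
    return df["kwh"].sum() * ENERGY_RATE

def demand_charge(df: pd.DataFrame) -> float:
    # Each interval is 15 minutes, so average demand in kW is kWh * 4;
    # the demand charge is based on the highest interval in the month
    peak_kw = (df["kwh"] * 4).max()
    return peak_kw * DEMAND_RATE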
For the succeeding Pandas interview questions, use the given dataset, df, to perform the instructed operations. The dataset contains students, their state of origin, their university name, and their guardian’s monthly income.
The DataFrame to be used is the following:
This problem is quite straightforward and requires you to find all unique instances of universities. In other words, you should ignore the frequency of appearance and list each unique university only once.
Using the general unique method:
df[<column>].unique()
You can obtain an array containing all the schools, with each listed only once. So, using the code:
uniq = df['School Name'].unique()
You get the following output:
array(['NYU', 'UCLA', 'MIT', 'Brown', 'Arizona State', 'Colombia U.',
'Penn State'], dtype=object)
To drop rows containing a specific value, given a DataFrame df, use the following syntax:
df = df[df.<column> != '<value to be removed>']
So given the DataFrame, if you were to remove all students from New York, you should do the following:
df = df[df.State != 'New York']
Which should result in the following output:
Using the concepts from earlier, you can remove students above a certain income threshold without sorting.
Using the code:
ecclass = df[df.Income <= 200000]
You can also use the code:
df.query('Income <= 200000', inplace = True)
Doing so will provide the following output:
To sort, use the sort_values() function. You can apply this to the DataFrame using the following code:
sorts = df.sort_values(by = 'Income', ascending = True)
#sorts income in ascending order
Which will result in the following output:
And if you use this code:
sorts = df.sort_values(by = 'Income', ascending = False)
#sorts income in descending order
You’ll get this result:
Let's say you want to build a naive recommender. You're given two tables: one called friends, with a user_id and a friend_id column representing each user's friends, and another called page_likes, with a user_id and a page_id column representing the pages each user liked.
Example:
Input:
friends table

Column | Type
---|---
user_id | INTEGER
friend_id | INTEGER

page_likes table

Column | Type
---|---
user_id | INTEGER
page_id | INTEGER

Output:

Column | Type
---|---
user_id | INTEGER
page_id | INTEGER
num_friend_likes | INTEGER
Write a query, using Pandas, to create a metric that recommends pages for each user based on their friends' liked pages.
Note: It shouldn’t recommend pages that the user already likes.
import pandas as pd

def liked_pages(friends: pd.DataFrame, page_likes: pd.DataFrame) -> pd.DataFrame:
    recommendations = (
        friends
        # Pair every user with the pages their friends like
        .merge(
            page_likes,
            left_on=['friend_id'],
            right_on=['user_id'],
            how='inner',
            suffixes=['', '_y'])
        # Count how many friends like each candidate page
        .groupby(['user_id', 'page_id'], as_index=False)
        .agg(num_friend_likes=('user_id_y', 'count'))
        # Flag pages the user already likes, then filter them out
        .merge(
            page_likes.assign(liked_page=1),
            left_on=['user_id', 'page_id'],
            right_on=['user_id', 'page_id'],
            how='left')
        .query('liked_page != 1')
        .drop(columns='liked_page')
    )
    return recommendations
For this solution, start with an inner merge, which keeps only the rows where friend_id (from friends) matches user_id (from page_likes), as shown here:
.merge(
page_likes,
left_on=['friend_id'],
right_on=['user_id'],
how='inner',
suffixes=['', '_y'])
This join pairs each user with every page their friends like; group the result by user_id and page_id accordingly.
Afterward, use the agg() function to count how many friends like each page (num_friend_likes), which identifies the strongest pages to recommend. However, some of these candidates may be pages the user already likes.
To account for this, perform a left merge against page_likes (flagged with liked_page) and drop the redundant rows (i.e., pages already liked by the user):
.merge(
page_likes.assign(liked_page=1),
left_on=['user_id', 'page_id'],
right_on=['user_id', 'page_id'],
how='left')
.query('liked_page != 1')
.drop(columns='liked_page')
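As a quick sanity check, here is a usage sketch of the liked_pages function above with made-up data:
import pandas as pd

friends = pd.DataFrame({"user_id": [1, 1, 2], "friend_id": [2, 3, 3]})
page_likes = pd.DataFrame({"user_id": [2, 3, 3, 1], "page_id": [10, 10, 20, 10]})

# Page 20 is recommended to users 1 and 2 (their friend, user 3, likes it and they
# do not yet), while page 10 is never recommended because they already like it
print(liked_pages(friends, page_likes))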
Let’s say that you work for a B2B SaaS company that has been around for three years.
The company has two revenue lines (product_type): the first is labeled as “service” and is a consulting-type model where clients are serviced at an hourly rate for a one-off project. The second revenue line is “software”, which clients can purchase on an ongoing subscription basis.
Given a table of payment data, write a query to calculate the average revenue per client.
Note: Round the result to two decimals.
Example:
Input:
payments table

Column | Type
---|---
id | INTEGER
user_id | INTEGER
created_at | DATETIME
amount_per_unit | FLOAT
quantity | INTEGER
product_type | VARCHAR

Output:

Column | Type
---|---
average_lifetime_revenue | FLOAT
You can use the following solution as a guide if you get stuck:
import pandas as pd

def average_revenue_per_customer(payments: pd.DataFrame) -> pd.DataFrame:
    # Revenue per payment row = unit price * quantity
    payments['revenue'] = payments.amount_per_unit * payments.quantity
    # Total revenue divided by the number of unique clients, rounded to two decimals
    average = round(payments.revenue.sum() / len(payments.user_id.unique()), 2)
    return pd.DataFrame({'average_lifetime_revenue': [average]})
Let's break this code down into manageable chunks. To generate the revenue for each payment, multiply amount_per_unit by quantity:
payments['revenue'] = payments.amount_per_unit * payments.quantity
You can then calculate the average by taking the sum of the revenues and dividing it by the number of unique users (using the len function, which returns the number of items in an object), rounding the result to two decimals as the problem requires.
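As a quick check, here is a small made-up payments table: $300 of total revenue across two unique clients should yield 150.00:
import pandas as pd

payments = pd.DataFrame({
    "user_id": [1, 1, 2],
    "amount_per_unit": [50.0, 25.0, 100.0],
    "quantity": [2, 4, 1],
    "product_type": ["service", "software", "software"]
})

# Total revenue = 50*2 + 25*4 + 100*1 = 300 across 2 unique clients -> 150.00
print(average_revenue_per_customer(payments))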