Interview Query

Stratified Split


Let’s say you work as a medical researcher.

You are given a dataframe of patient data with three columns: age, the patient's age, and smoking and cancer, which indicate whether the patient is a smoker or has cancer, respectively.

Write a function, stratified_split, that splits the dataframe into train and test sets while preserving the approximate ratios for the values in a specified column (given by a col parameter).

Note: Do not use scikit-learn.

Example:

Input:

print(df)
...
   age smoking cancer
0   25     yes    yes
1   32      no     no
2   10     yes     no
3   40     yes     no
4   75      no     no
5   80     yes     no
6   60     yes     no
7   60      no    yes
8   40     yes    yes
9   80     yes     no

Output:

def stratified_split(df, train_ratio=0.7, col='cancer') -> print(X_train)
...
   age smoking cancer
8   40     yes    yes
6   60     yes     no
7   60      no    yes
4   75      no     no
9   80     yes     no
1   32      no     no
2   10     yes     no
-----------------------
print(X_test)
...
   age smoking cancer
0   25     yes    yes
5   80     yes     no
3   40     yes     no
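
One possible approach, sketched below, is to treat each distinct value of col as its own group, shuffle each group, send roughly train_ratio of its rows to the train set, and leave the rest for the test set. This is a minimal sketch using pandas only, not a reference solution. The seed parameter and the tuple return value are assumptions added for illustration, and the code assumes the dataframe index is unique and that col has no missing values.

```python
import pandas as pd


def stratified_split(df, train_ratio=0.7, col="cancer", seed=None):
    """Split df into (train, test), keeping the value ratios of `col` roughly equal."""
    train_parts = []
    # Handle each distinct value of `col` (each stratum) separately.
    for _, group in df.groupby(col):
        # Shuffle the rows of this stratum; `seed` (hypothetical parameter,
        # not part of the original signature) makes the shuffle repeatable.
        shuffled = group.sample(frac=1, random_state=seed)
        # Take about train_ratio of the stratum for training.
        n_train = int(round(len(shuffled) * train_ratio))
        train_parts.append(shuffled.iloc[:n_train])
    train = pd.concat(train_parts)
    # Every row not selected for training goes to the test set
    # (this relies on the index being unique).
    test = df.drop(train.index)
    return train, test


df = pd.DataFrame({
    "age": [25, 32, 10, 40, 75, 80, 60, 60, 40, 80],
    "smoking": ["yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes"],
    "cancer": ["yes", "no", "no", "no", "no", "no", "no", "yes", "yes", "no"],
})

X_train, X_test = stratified_split(df, train_ratio=0.7, col="cancer", seed=0)
```

With the example dataframe, the three cancer = yes rows split 2 to 1 and the seven cancer = no rows split 5 to 2, so the train set has 7 rows and the test set has 3, as in the expected output. Which specific rows land in each set depends on the shuffle.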