Stratified Split

Let’s say you work as a medical researcher.

You are given a dataframe of patient data containing each patient's age and two columns, smoking and cancer, indicating whether the patient is a smoker and whether the patient has cancer, respectively.

Write a function, stratified_split, that splits the dataframe into train and test sets while preserving the approximate ratios of the values in a specified column (given by a col variable). Do not return the training set. Instead, return the number of rows in the training set that are in the "no" class of col.

Note: Do not use scikit-learn.

Example:

Input:


print(df)
...
   age smoking cancer
0   25     yes    yes
1   32      no     no
2   10     yes     no
3   40     yes     no
4   75      no     no
5   80     yes     no
6   60     yes     no
7   60      no    yes
8   40     yes    yes
9   80     yes     no

Output:

def stratified_split(df, train_ratio=0.7, col='cancer') -> 5

The resulting training dataframe was:

   age smoking cancer
1   32      no     no
3   40     yes     no
4   75      no     no
6   60     yes     no
7   60      no    yes
8   40     yes    yes
9   80     yes     no
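
One possible approach, sketched below with pandas only, is to sample train_ratio of the rows within each class of col and then count the "no" rows in the combined training set. This is a minimal sketch, not a reference solution: the seed parameter, the use of round() to size each class, and the assumption that the dataframe index is unique are illustrative choices that the problem statement does not specify. Which rows are selected depends on the seed, so the training set may differ from the one shown above, but the count of "no" rows does not.

```python
import pandas as pd


def stratified_split(df, train_ratio=0.7, col="cancer", seed=0):
    # `seed` is a hypothetical extra parameter, added only for reproducibility.
    parts = []
    for _, group in df.groupby(col):
        # Take train_ratio of the rows from each class so the class
        # proportions in the training set stay close to those in df.
        # round() is an assumption; note it rounds exact halves to even.
        n_train = round(len(group) * train_ratio)
        parts.append(group.sample(n=n_train, random_state=seed))
    train = pd.concat(parts).sort_index()
    # Remaining rows form the test set (assumes a unique index).
    # It is built only to complete the split and is not returned.
    test = df.drop(train.index)
    return int((train[col] == "no").sum())


# The example dataframe from the problem statement.
df = pd.DataFrame({
    "age": [25, 32, 10, 40, 75, 80, 60, 60, 40, 80],
    "smoking": ["yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes"],
    "cancer": ["yes", "no", "no", "no", "no", "no", "no", "yes", "yes", "no"],
})
result = stratified_split(df, train_ratio=0.7, col="cancer")
print(result)  # 5
```

Here the "no" class has 7 rows and the "yes" class has 3, so a 0.7 ratio keeps round(4.9) = 5 and round(2.1) = 2 rows respectively, which matches the 7-row training set in the example.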
