Stratified Split
Let’s say you work as a medical researcher.
You are given a dataframe of patient data containing the age of the patient and two columns, smoking and cancer, indicating if the patient is a smoker or has cancer, respectively.
Write a function, stratified_split, that splits the dataframe into train and test sets while preserving the approximate ratios of the values in a specified column (given by the col argument). Do not return the training set. Instead, return the number of rows in the training set that are in the "no" class of col.
Note: Do not use scikit-learn.
Example:
Input:
print(df)
...
   age smoking cancer
0   25     yes    yes
1   32      no     no
2   10     yes     no
3   40     yes     no
4   75      no     no
5   80     yes     no
6   60     yes     no
7   60      no    yes
8   40     yes    yes
9   80     yes     no
Output:
stratified_split(df, train_ratio=0.7, col='cancer') -> 5
The resulting training set was:
   age smoking cancer
1   32      no     no
3   40     yes     no
4   75      no     no
6   60     yes     no
7   60      no    yes
8   40     yes    yes
9   80     yes     no
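
One possible approach, sketched below with pandas only, is to group the rows by col, sample train_ratio of each group, and count the "no" rows that land in the training set. This is a minimal sketch rather than a reference solution. The helper split_frames and the seed parameter are hypothetical names added for illustration, and the code assumes the dataframe index is unique and that each group's training size is round(len(group) * train_ratio).

```python
import pandas as pd


def split_frames(df, train_ratio=0.7, col="cancer", seed=0):
    """Hypothetical helper: return (train, test) stratified on `col`."""
    train_parts = []
    for _, group in df.groupby(col):
        # Take the same fraction from every class so the ratios are preserved.
        n_train = int(round(len(group) * train_ratio))
        train_parts.append(group.sample(n=n_train, random_state=seed))
    train = pd.concat(train_parts).sort_index()
    # Everything not sampled for training goes to the test set
    # (assumes the index labels are unique).
    test = df.drop(train.index)
    return train, test


def stratified_split(df, train_ratio=0.7, col="cancer", seed=0):
    """Return the number of training rows whose `col` value is "no"."""
    train, _ = split_frames(df, train_ratio=train_ratio, col=col, seed=seed)
    return int((train[col] == "no").sum())


# The example dataframe from the prompt.
df = pd.DataFrame({
    "age": [25, 32, 10, 40, 75, 80, 60, 60, 40, 80],
    "smoking": ["yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes"],
    "cancer": ["yes", "no", "no", "no", "no", "no", "no", "yes", "yes", "no"],
})

print(stratified_split(df, train_ratio=0.7, col="cancer"))  # 5
```

With the example data, the "no" class has 7 rows and the "yes" class has 3, so the training set receives 5 and 2 rows respectively. The specific rows drawn depend on the seed and may differ from the training set shown above, but the count of "no" rows is 5 either way.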