Stratified Split
Let’s say you work as a medical researcher.
You are given a dataframe of patient data containing the age of the patient and two columns, smoking and cancer, indicating whether the patient is a smoker or has cancer, respectively.
Write a function, stratified_split, that splits the dataframe into train and test sets while preserving the approximate ratios for the values in a specified column (given by a col parameter).
Note: Do not use scikit-learn.
Example:
Input:
print(df)
...
   age smoking cancer
0   25     yes    yes
1   32      no     no
2   10     yes     no
3   40     yes     no
4   75      no     no
5   80     yes     no
6   60     yes     no
7   60      no    yes
8   40     yes    yes
9   80     yes     no
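For experimenting locally, the input above can be rebuilt as a pandas dataframe. This is only a convenience sketch: the values are copied from the printed table, and the default integer index is assumed to match the row labels shown.

```python
import pandas as pd

# Values copied row by row from the example table above.
df = pd.DataFrame({
    "age": [25, 32, 10, 40, 75, 80, 60, 60, 40, 80],
    "smoking": ["yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes"],
    "cancer": ["yes", "no", "no", "no", "no", "no", "no", "yes", "yes", "no"],
})
```

Three of the ten patients have cancer, so a 70/30 split that preserves this ratio should place roughly two of them in the train set and one in the test set.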
Output:
def stratified_split(df, train_ratio=0.7, col='cancer') ->
print(X_train)
...
   age smoking cancer
8   40     yes    yes
6   60     yes     no
7   60      no    yes
4   75      no     no
9   80     yes     no
1   32      no     no
2   10     yes     no
-----------------------
print(X_test)
...
   age smoking cancer
0   25     yes    yes
5   80     yes     no
3   40     yes     no
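One possible approach, sketched here with pandas only, is to group the rows by the values of col, shuffle each group, take the first train_ratio share of each group for training, and send the remaining rows to the test set. This is a minimal sketch rather than a reference solution. It assumes the dataframe has a unique index and no missing values in col, and the seed parameter is a hypothetical addition (not part of the signature in the question) that makes the shuffle reproducible.

```python
import pandas as pd


def stratified_split(df, train_ratio=0.7, col="cancer", seed=None):
    """Split df into train and test sets with similar ratios of the values in `col`.

    `seed` is a hypothetical extra parameter used only to make the shuffle repeatable.
    Assumes df has a unique index and no missing values in `col`.
    """
    train_parts = []
    for _, group in df.groupby(col):
        # Shuffle the rows of this class so the selection is random.
        shuffled = group.sample(frac=1, random_state=seed)
        # Number of rows of this class that go to the train set (rounded).
        n_train = int(round(len(shuffled) * train_ratio))
        train_parts.append(shuffled.iloc[:n_train])

    X_train = pd.concat(train_parts)
    # Every row that was not selected for training goes to the test set.
    X_test = df.drop(X_train.index)

    # Shuffle again so the rows are not ordered by class.
    X_train = X_train.sample(frac=1, random_state=seed)
    X_test = X_test.sample(frac=1, random_state=seed)
    return X_train, X_test
```

Calling X_train, X_test = stratified_split(df, train_ratio=0.7, col='cancer') on the example dataframe gives 7 training rows (2 with cancer) and 3 test rows (1 with cancer), the same sizes as the example output, although which rows land in each set depends on the shuffle. Because the count is rounded per class, very small classes can end up entirely in one of the two sets.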