For students interested in advanced statistical methods and the Python language
Source
https://github.com/fatimaAfzaal/Diabetes-Prediction-Project-Using-Random-Forest
First we define the problem and set the goal of the project: here, predicting whether a patient has diabetes from diagnostic measurements.
In this step we collect the required data. For this project the dataset comes from Kaggle (https://www.kaggle.com/datasets/mathchi/diabetes-data-set).
We also explore the data in this step to gain insight, using visualizations where they help in understanding it.
Mounted at /content/gdrive
Index | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|
0 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
763 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
764 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
765 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
766 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
767 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
768 rows × 8 columns
Total number of records: 768
Parameter are: Index(['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 8 columns):

# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | Glucose | 768 non-null | int64 |
1 | BloodPressure | 768 non-null | int64 |
2 | SkinThickness | 768 non-null | int64 |
3 | Insulin | 768 non-null | int64 |
4 | BMI | 768 non-null | float64 |
5 | DiabetesPedigreeFunction | 768 non-null | float64 |
6 | Age | 768 non-null | int64 |
7 | Outcome | 768 non-null | int64 |

dtypes: float64(2), int64(6)
memory usage: 48.1 KB
Statistic | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|
count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
mean | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
std | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
25% | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
50% | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
75% | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
max | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
Statistic | Outcome |
---|---|
count | 768.000000 |
mean | 0.348958 |
std | 0.476951 |
min | 0.000000 |
25% | 0.000000 |
50% | 0.000000 |
75% | 1.000000 |
max | 1.000000 |

Name: Outcome, dtype: float64
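The exploration outputs above are consistent with a handful of standard pandas calls. The following is a minimal sketch, not the notebook's actual code: it inlines the first five rows of the preview table so it runs without the CSV file, and the file path mentioned in the comment is a placeholder.

```python
import io

import pandas as pd

# First five rows of the preview table, inlined so the sketch is self-contained.
SAMPLE_CSV = """Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
148,72,35,0,33.6,0.627,50,1
85,66,29,0,26.6,0.351,31,0
183,64,0,0,23.3,0.672,32,1
89,66,23,94,28.1,0.167,21,0
137,40,35,168,43.1,2.288,33,1
"""

# With the real dataset this would be pd.read_csv("<path to the CSV>") (placeholder path).
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print("Total number of records:", len(df))
print("Parameters are:", df.columns)

df.info()                        # dtypes and non-null counts per column
summary = df.describe()          # count, mean, std, min, quartiles, max
print(summary)
print(df["Outcome"].describe())  # the same statistics for the target only
```

On the full dataset the same calls would report 768 rows and 8 columns, as in the outputs shown here.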
In this step we handle missing values and encode categorical data. All eight columns are already numeric, so no encoding is needed for this dataset.
False
(268, 500)
Feature | Correlation with Outcome |
---|---|
BloodPressure | 0.065068 |
SkinThickness | 0.074752 |
Insulin | 0.130548 |
DiabetesPedigreeFunction | 0.173844 |
Age | 0.238356 |
BMI | 0.292695 |
Glucose | 0.466581 |
Outcome | 1.000000 |

Name: Outcome, dtype: float64
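The three outputs above look like a missing-value check, the class counts (268 diabetic and 500 non-diabetic, matching the Outcome mean of 0.348958), and the correlation of each feature with Outcome. A minimal sketch of such checks follows; it is an assumption about how they were produced and again uses only the five preview rows. It also counts zeros per column, because the summary table shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI, which is not physiologically plausible and suggests that missing measurements are stored as 0 and so are not detected by a null check.

```python
import io

import pandas as pd

# First five rows of the preview table (not the full 768-row dataset).
SAMPLE_CSV = """Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
148,72,35,0,33.6,0.627,50,1
85,66,29,0,26.6,0.351,31,0
183,64,0,0,23.3,0.672,32,1
89,66,23,94,28.1,0.167,21,0
137,40,35,168,43.1,2.288,33,1
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# True if any cell is NaN/None; zeros do not count as missing here.
has_missing = bool(df.isnull().values.any())

# Class balance as a (diabetic, non-diabetic) pair.
n_diabetic = int((df["Outcome"] == 1).sum())
n_healthy = int((df["Outcome"] == 0).sum())

# Pearson correlation of every column with the target, weakest first.
corr_with_outcome = df.corr()["Outcome"].sort_values()

# Zeros per feature column: candidates for hidden missing values.
zero_counts = (df.drop(columns="Outcome") == 0).sum()

print(has_missing)
print((n_diabetic, n_healthy))
print(corr_with_outcome)
print(zero_counts)
```

If the zeros are treated as missing, one option is to replace them with NaN and impute, for example with the column median. Whether the original notebook did this is not shown.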
In this step we split the data into training and testing sets.
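A split of this kind can be sketched with scikit-learn's `train_test_split`. The split parameters used in the notebook are not shown. A `test_size` of 0.3 is an assumption, chosen because it yields 231 test rows out of 768, which matches the total of the confusion matrix reported later. The `random_state` value and the synthetic stand-in data are also assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the real data:
# 768 rows, 7 feature columns, 268 positive and 500 negative labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 7))
y = np.array([1] * 268 + [0] * 500)
rng.shuffle(y)

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```

Passing `stratify=y` would keep the share of diabetic cases equal in both parts, which can be useful with an imbalanced target like this one.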
We first select an algorithm that suits the problem (a random forest, per the source repository), then train the model on the training data.
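Training could look roughly like the sketch below, which fits scikit-learn's `RandomForestClassifier`. The hyperparameters (`n_estimators=100`, `random_state=42`), the split settings, and the synthetic data from `make_classification` are all assumptions for illustration; the notebook's own settings are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 768 x 7 feature matrix and binary target.
X, y = make_classification(
    n_samples=768, n_features=7, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# A random forest averages the votes of many decision trees,
# each trained on a bootstrap sample of the training rows.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predicted class (0 or 1) for every held-out row.
y_pred = model.predict(X_test)
print(y_pred[:10])
```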
Ensemble Accuracy: 0.7835497835497836
Confusion Matrix:
[[140  17]
 [ 33  41]]
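The reported accuracy can be checked directly from the confusion matrix. The sketch below assumes scikit-learn's layout, where rows are true classes and columns are predicted classes, so the matrix reads as [[TN, FP], [FN, TP]]. Under that assumption it also derives precision, recall and specificity, which the notebook does not report.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[140, 17],
               [33, 41]])

tn, fp, fn, tp = cm.ravel()
total = cm.sum()

accuracy = (tn + tp) / total     # share of all test rows classified correctly
precision = tp / (tp + fp)       # of predicted diabetic, how many really are
recall = tp / (tp + fn)          # of truly diabetic, how many were found
specificity = tn / (tn + fp)     # of truly non-diabetic, how many were found

print(f"test rows:   {total}")
print(f"accuracy:    {accuracy:.4f}")
print(f"precision:   {precision:.4f}")
print(f"recall:      {recall:.4f}")
print(f"specificity: {specificity:.4f}")
```

The accuracy comes out as 181/231, which agrees with the reported 0.7835. Recall is 41/74, about 0.55, so under this reading the model misses 33 of the 74 diabetic cases in the test set, something the accuracy figure alone does not reveal.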
Enter Glucose level: 1
Enter Blood Pressure: 1
Enter Skin Thickness: 1
Enter Insulin level: 1
Enter BMI: 11
Enter Diabetes Pedigree Function: 1
Enter Age: 1
The model predicts: No diabetes
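The interactive session above can be sketched as a small helper that turns seven values into a one-row DataFrame and maps the predicted class to a label. The function name `predict_label`, the "Diabetes" label for the positive class, and the stand-in model trained on synthetic data are all hypothetical; only the feature names and the "No diabetes" message come from the outputs in this document.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["Glucose", "BloodPressure", "SkinThickness", "Insulin",
            "BMI", "DiabetesPedigreeFunction", "Age"]


def predict_label(model, values):
    """Return a readable label for one patient; values follow FEATURES order."""
    if len(values) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} values, got {len(values)}")
    row = pd.DataFrame([values], columns=FEATURES)
    return "Diabetes" if model.predict(row)[0] == 1 else "No diabetes"


# Stand-in model fitted on synthetic data, only so the sketch runs end to end.
X, y = make_classification(n_samples=200, n_features=len(FEATURES), random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(pd.DataFrame(X, columns=FEATURES), y)

# Values taken from row 0 of the preview table.
label = predict_label(model, [148, 72, 35, 0, 33.6, 0.627, 50])
print("The model predicts:", label)
```

In an interactive notebook the values would come from `float(input(...))` prompts. Inputs such as an age of 1 lie far outside the ranges in the summary table (ages run from 21 to 81), so a range check before predicting may be worth adding.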