import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
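# "ansi" is a Windows-only codec alias; cp949 (the usual Korean Windows code page) is assumed here
bike_data = pd.read_csv("bike_usage_0.csv", encoding="cp949")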
bike_data.head()
| | Date_out | Time_out | Station_no_out | Station_out | Membership_type | Gender | Age_Group | Momentum | Station_no_in | Station_in | Date_in | Bike_no | Carbon_amount | Distance | Duration |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-10-03 | 0 | 129 | 129. 신촌역(2호선) 6번출구 옆 | 정기권 | NaN | ~10대 | 28.27 | 122 | 신성기사식당 앞 | 2019-10-03 오전 12:20:42 | SPB-15000 | 0.24 | 1050 | 5 |
1 | 2019-10-03 | 0 | 150 | 150. 서강대역 2번출구 앞 | 정기권 | NaN | 20대 | 146.46 | 2065 | 서울시여성가족재단 | 2019-10-03 오전 1:16:32 | SPB-13087 | 1.32 | 5690 | 32 |
2 | 2019-10-03 | 0 | 240 | 240. 문래역 4번출구 앞 | 정기권 | NaN | 20대 | 37.13 | 245 | 삼성생명 당산사옥 앞 | 2019-10-03 오전 12:18:21 | SPB-23229 | 0.29 | 1250 | 10 |
3 | 2019-10-03 | 0 | 623 | 623. 서울시립대 정문 앞 | 정기권 | NaN | 20대 | 134.62 | 1346 | 길음8골어린이공원 옆 | 2019-10-03 오전 1:15:39 | SPB-14181 | 1.21 | 5230 | 24 |
4 | 2019-10-03 | 0 | 633 | 633. 청량리 기업은행 앞 | 정기권 | NaN | 20대 | 85.83 | 568 | 청계8가사거리 부근 | 2019-10-03 오전 12:17:58 | SPB-15221 | 0.67 | 2890 | 11 |
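# Take the labels from the value_counts index so labels and sizes are guaranteed to line up
sizes = bike_data.Gender.value_counts()  # value_counts() skips NaN by default
labels = sizes.index.to_numpy()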
bike_data.Gender.value_counts()
M    11025
F     7219
m       14
f        4
Name: Gender, dtype: int64
labels
array(['M', 'F', 'm', 'f'], dtype=object)
colors = ["yellowgreen", "lightskyblue", 'lightcoral', 'blue','coral']
plt.pie(sizes, labels = labels, colors = colors, autopct = '%1.1f%%', startangle= 90)
plt.show()
plt.hist(bike_data.Distance, color='blue')
plt.show()
plt.hist(bike_data.Distance, color='blue', bins=1000)
plt.show()
plt.boxplot(bike_data.Distance)
plt.show()
under_5000 = bike_data[bike_data.Distance<5000]
plt.boxplot(under_5000.Distance)
plt.show()
under_5000 = bike_data[bike_data.Distance<5000]
plt.boxplot([under_5000.Distance[under_5000.Gender =="F"], under_5000.Distance[under_5000.Gender=="M"]])
plt.xticks([1,2],['Female','Male'])
plt.show()
plt.plot(bike_data['Distance'].groupby(bike_data['Date_out']).sum())
plt.show()
plt.bar(labels, height = sizes, color ='blue')
plt.show()
bike_data['Gender'].value_counts().plot(kind='bar')
plt.show()
# Reload the raw file before cleaning; cp949 is assumed in place of the Windows-only "ansi" alias
bike_data = pd.read_csv("bike_usage_0.csv", encoding="cp949")
bike_data.Gender.unique()
array([nan, 'M', 'F', 'm', 'f'], dtype=object)
bike_data.loc[bike_data.Gender.isnull(), 'Gender'] = "U"
bike_data.Gender.value_counts()
U    18333
M    11025
F     7219
m       14
f        4
Name: Gender, dtype: int64
bike_data[bike_data.Gender=="f"]
| | Date_out | Time_out | Station_no_out | Station_out | Membership_type | Gender | Age_Group | Momentum | Station_no_in | Station_in | Date_in | Bike_no | Carbon_amount | Distance | Duration |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16548 | 2019-10-04 | 13 | 205 | 205. 산업은행 앞 | 일일권 | f | 20대 | 156.29 | 1834 | 월드메르디앙 벤처센터 2차 | 2019-10-04 오후 2:18:10 | SPB-22306 | 1.76 | 7590 | 59 |
18978 | 2019-10-04 | 17 | 2252 | 2252. 하이브랜드 앞 | 정기권 | f | 20대 | 63.84 | 2243 | 서울가정법원 | 2019-10-04 오후 5:39:57 | SPB-22209 | 0.58 | 2480 | 13 |
31071 | 2019-10-05 | 15 | 152 | 152. 마포구민체육센터 앞 | 일일권 | f | 30대 | 132.07 | 106 | 합정역 7번출구 앞 | 2019-10-05 오후 5:39:38 | SPB-20461 | 1.55 | 6670 | 113 |
33075 | 2019-10-05 | 18 | 2251 | 2251. 더케이호텔 입구(양재2) | 정기권 | f | 40대 | 40.51 | 2243 | 서울가정법원 | 2019-10-05 오후 6:29:34 | SPB-05988 | 0.45 | 1930 | 11 |
bike_data.loc[bike_data.Gender == 'f','Gender'] = "F"
bike_data.loc[bike_data.Gender == 'm','Gender'] = "M"
bike_data.Gender.value_counts()
U    18333
M    11039
F     7223
Name: Gender, dtype: int64
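The two `.loc` assignments and the NaN replacement above can also be written as one chained expression. The sketch below uses a small made-up Series in place of the real `Gender` column; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for bike_data.Gender (illustrative values, not from the dataset)
gender = pd.Series(["M", "f", np.nan, "F", "m", np.nan])

# Upper-case the codes first (NaN stays NaN), then label missing values as "U"
gender_clean = gender.str.upper().fillna("U")

print(gender_clean.value_counts())
```

Upper-casing before filling keeps the step independent of which lower-case variants happen to occur in the file.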
bike_data[bike_data.Distance == 0].head(2)
| | Date_out | Time_out | Station_no_out | Station_out | Membership_type | Gender | Age_Group | Momentum | Station_no_in | Station_in | Date_in | Bike_no | Carbon_amount | Distance | Duration |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
132 | 2019-10-03 | 2 | 416 | 416. 상암월드컵파크 1단지 교차로 | 정기권 | U | 50대 | 0 | 133 | 해담는다리 | 2019-10-03 오전 3:05:25 | SPB-14937 | 0 | 0 | 12 |
164 | 2019-10-03 | 2 | 113 | 113. 홍대입구역 2번출구 앞 | 일일권 | U | 20대 | 0 | 176 | 명지대학교 도서관 | 2019-10-03 오전 2:55:19 | SPB-00842 | 0 | 0 | 27 |
bike_data.loc[bike_data.Distance == 0, 'Duration'].max()
214
bike_data = bike_data[bike_data.Distance != 0]
bike_data[bike_data.Distance == 0]
Date_out | Time_out | Station_no_out | Station_out | Membership_type | Gender | Age_Group | Momentum | Station_no_in | Station_in | Date_in | Bike_no | Carbon_amount | Distance | Duration |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
len(bike_data[bike_data.Duration == 0])
0
bike_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 35581 entries, 0 to 36594
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Date_out         35581 non-null  object
 1   Time_out         35581 non-null  int64
 2   Station_no_out   35581 non-null  int64
 3   Station_out      35581 non-null  object
 4   Membership_type  35581 non-null  object
 5   Gender           35581 non-null  object
 6   Age_Group        35581 non-null  object
 7   Momentum         35581 non-null  object
 8   Station_no_in    35581 non-null  int64
 9   Station_in       35581 non-null  object
 10  Date_in          35581 non-null  object
 11  Bike_no          35581 non-null  object
 12  Carbon_amount    35581 non-null  object
 13  Distance         35581 non-null  int64
 14  Duration         35581 non-null  int64
dtypes: int64(5), object(10)
memory usage: 4.3+ MB
pd.pivot_table(bike_data, index = 'Age_Group', columns = 'Membership_type', values = "Distance", aggfunc= np.sum)
Membership_type | 단체권 | 일일권 | 일일권(비회원) | 정기권 |
---|---|---|---|---|
Age_Group | ||||
20대 | 371870.0 | 24609660.0 | 14160.0 | 32250310.0 |
30대 | 92580.0 | 15251200.0 | NaN | 26300820.0 |
40대 | 177820.0 | 6991400.0 | NaN | 20321980.0 |
50대 | 39560.0 | 2371560.0 | NaN | 11612290.0 |
60대 | NaN | 388380.0 | NaN | 3840400.0 |
70대~ | 2450.0 | 76880.0 | NaN | 629750.0 |
~10대 | 50280.0 | 4623160.0 | NaN | 4168920.0 |
bike_pivot = pd.pivot_table(bike_data, index='Age_Group', columns='Membership_type', values='Distance', aggfunc = np.sum)
bike_pivot
Membership_type | 단체권 | 일일권 | 일일권(비회원) | 정기권 |
---|---|---|---|---|
Age_Group | ||||
20대 | 371870.0 | 24609660.0 | 14160.0 | 32250310.0 |
30대 | 92580.0 | 15251200.0 | NaN | 26300820.0 |
40대 | 177820.0 | 6991400.0 | NaN | 20321980.0 |
50대 | 39560.0 | 2371560.0 | NaN | 11612290.0 |
60대 | NaN | 388380.0 | NaN | 3840400.0 |
70대~ | 2450.0 | 76880.0 | NaN | 629750.0 |
~10대 | 50280.0 | 4623160.0 | NaN | 4168920.0 |
bike_pivot = bike_pivot.reset_index()
pd.melt(bike_pivot, id_vars='Age_Group', value_vars=['단체권','일일권','일일권(비회원)','정기권'], var_name='Membership_type', value_name='Total_Dist')
| | Age_Group | Membership_type | Total_Dist |
---|---|---|---|
0 | 20대 | 단체권 | 371870.0 |
1 | 30대 | 단체권 | 92580.0 |
2 | 40대 | 단체권 | 177820.0 |
3 | 50대 | 단체권 | 39560.0 |
4 | 60대 | 단체권 | NaN |
5 | 70대~ | 단체권 | 2450.0 |
6 | ~10대 | 단체권 | 50280.0 |
7 | 20대 | 일일권 | 24609660.0 |
8 | 30대 | 일일권 | 15251200.0 |
9 | 40대 | 일일권 | 6991400.0 |
10 | 50대 | 일일권 | 2371560.0 |
11 | 60대 | 일일권 | 388380.0 |
12 | 70대~ | 일일권 | 76880.0 |
13 | ~10대 | 일일권 | 4623160.0 |
14 | 20대 | 일일권(비회원) | 14160.0 |
15 | 30대 | 일일권(비회원) | NaN |
16 | 40대 | 일일권(비회원) | NaN |
17 | 50대 | 일일권(비회원) | NaN |
18 | 60대 | 일일권(비회원) | NaN |
19 | 70대~ | 일일권(비회원) | NaN |
20 | ~10대 | 일일권(비회원) | NaN |
21 | 20대 | 정기권 | 32250310.0 |
22 | 30대 | 정기권 | 26300820.0 |
23 | 40대 | 정기권 | 20321980.0 |
24 | 50대 | 정기권 | 11612290.0 |
25 | 60대 | 정기권 | 3840400.0 |
26 | 70대~ | 정기권 | 629750.0 |
27 | ~10대 | 정기권 | 4168920.0 |
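The pivot, `reset_index`, and `melt` sequence produces a long table of total distance per age group and membership type. A similar long table can be built directly with `groupby`, as in the sketch below. The toy frame and its category labels are made up for illustration.

```python
import pandas as pd

# Toy trips table (illustrative values, not from the dataset)
trips = pd.DataFrame({
    "Age_Group": ["20s", "20s", "30s", "30s", "30s"],
    "Membership_type": ["day", "season", "day", "day", "season"],
    "Distance": [1000, 2500, 800, 1200, 3000],
})

# Sum Distance per (Age_Group, Membership_type) pair, already in long form
total_dist = (
    trips.groupby(["Age_Group", "Membership_type"], as_index=False)["Distance"]
    .sum()
    .rename(columns={"Distance": "Total_Dist"})
)
print(total_dist)
```

One difference to keep in mind: `groupby` leaves out combinations that never occur, whereas the pivot-then-melt route keeps them as NaN rows.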
stations = pd.read_csv("stations.csv")
stations.head()
| | Gu | ID | Station | Address | Latitude | Longitude | Date | No_of_Racks |
---|---|---|---|---|---|---|---|---|
0 | 마포구 | 101 | 101. (구)합정동 주민센터 | 서울특별시 마포구 동교로8길 58 | 37.549561 | 126.905754 | 2015-09-06 23:40 | 5 |
1 | 마포구 | 102 | 102. 망원역 1번출구 앞 | 서울특별시 마포구 월드컵로 72 | 37.555649 | 126.910629 | 2015-09-06 23:42 | 20 |
2 | 마포구 | 103 | 103. 망원역 2번출구 앞 | 서울특별시 마포구 월드컵로 79 | 37.554951 | 126.910835 | 2015-09-06 23:43 | 14 |
3 | 마포구 | 104 | 104. 합정역 1번출구 앞 | 서울특별시 마포구 양화로 59 | 37.550629 | 126.914986 | 2015-09-06 23:44 | 13 |
4 | 마포구 | 105 | 105. 합정역 5번출구 앞 | 서울특별시 마포구 양화로 48 | 37.550007 | 126.914825 | 2015-09-06 23:45 | 5 |
bike_data = pd.merge(bike_data, stations, left_on='Station_no_out', right_on='ID')
bike_data.head()
| | Date_out | Time_out | Station_no_out | Station_out | Membership_type | Gender | Age_Group | Momentum | Station_no_in | Station_in | ... | Distance | Duration | Gu | ID | Station | Address | Latitude | Longitude | Date | No_of_Racks |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-10-03 | 0 | 129 | 129. 신촌역(2호선) 6번출구 옆 | 정기권 | U | ~10대 | 28.27 | 122 | 신성기사식당 앞 | ... | 1050 | 5 | 마포구 | 129 | 129. 신촌역(2호선) 6번출구 옆 | 서울특별시 마포구 신촌로 106 | 37.555054 | 126.937569 | 2015-09-07 10:37 | 15 |
1 | 2019-10-03 | 10 | 129 | 129. 신촌역(2호선) 6번출구 옆 | 정기권 | F | 20대 | 14.41 | 125 | 서강대 남문 옆 | ... | 560 | 3 | 마포구 | 129 | 129. 신촌역(2호선) 6번출구 옆 | 서울특별시 마포구 신촌로 106 | 37.555054 | 126.937569 | 2015-09-07 10:37 | 15 |
2 | 2019-10-03 | 12 | 129 | 129. 신촌역(2호선) 6번출구 옆 | 정기권 | U | 20대 | 27.75 | 125 | 서강대 남문 옆 | ... | 960 | 62 | 마포구 | 129 | 129. 신촌역(2호선) 6번출구 옆 | 서울특별시 마포구 신촌로 106 | 37.555054 | 126.937569 | 2015-09-07 10:37 | 15 |
3 | 2019-10-03 | 13 | 129 | 129. 신촌역(2호선) 6번출구 옆 | 정기권 | U | 30대 | 25.26 | 118 | 광흥창역 2번출구 앞 | ... | 1160 | 10 | 마포구 | 129 | 129. 신촌역(2호선) 6번출구 옆 | 서울특별시 마포구 신촌로 106 | 37.555054 | 126.937569 | 2015-09-07 10:37 | 15 |
4 | 2019-10-03 | 14 | 129 | 129. 신촌역(2호선) 6번출구 옆 | 정기권 | M | 20대 | 15.12 | 125 | 서강대 남문 옆 | ... | 570 | 17 | 마포구 | 129 | 129. 신촌역(2호선) 6번출구 옆 | 서울특별시 마포구 신촌로 106 | 37.555054 | 126.937569 | 2015-09-07 10:37 | 15 |
5 rows × 23 columns
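`pd.merge` defaults to an inner join, so trips whose station number has no match in the station table would be dropped without notice. A minimal sketch of how to check for that, using toy frames (station 999 is a made-up unmatched ID): `validate` asserts that each station ID is unique on the right side, and `indicator` marks unmatched rows.

```python
import pandas as pd

# Toy frames (illustrative values, not from the dataset)
trips = pd.DataFrame({"Station_no_out": [101, 102, 101, 999],
                      "Distance": [500, 700, 300, 900]})
stations_toy = pd.DataFrame({"ID": [101, 102], "Gu": ["A", "B"]})

# A left join keeps every trip; "_merge" records whether a station was found
merged = pd.merge(
    trips, stations_toy,
    left_on="Station_no_out", right_on="ID",
    how="left", validate="many_to_one", indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched), "trip(s) without a matching station")
```

If `unmatched` is empty, the inner join used in the notebook loses nothing.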
bike_data.Gu.value_counts()
영등포구    10290
마포구      9345
서초구      6543
동대문구     4797
은평구      4606
Name: Gu, dtype: int64
y_gu = bike_data[bike_data.Gu=='영등포구']
m_gu = bike_data[bike_data.Gu=='마포구']
from scipy import stats
stats.levene(y_gu.Distance, m_gu.Distance)
LeveneResult(statistic=3.5647234607192013, pvalue=0.05903430224682354)
np.mean(y_gu.Distance), np.mean(m_gu.Distance)
(4190.278911564626, 4514.426966292135)
stats.ttest_ind(y_gu.Distance, m_gu.Distance, equal_var = True)
Ttest_indResult(statistic=-4.002195758414915, pvalue=6.298774059911862e-05)
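In the cells above, Levene's test gives p ≈ 0.059, which is above 0.05, so equal variances are not rejected and `equal_var = True` is consistent with it. The sketch below ties the two steps together so the t-test variant follows from the Levene result. The samples are synthetic, and the 0.05 threshold is an assumed convention.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two districts' ride distances (not the real data)
rng = np.random.default_rng(0)
a = rng.normal(4200, 3000, size=500)
b = rng.normal(4500, 3000, size=500)

ALPHA = 0.05  # assumed significance level

# Levene's test: the null hypothesis is that both groups have equal variances
_, levene_p = stats.levene(a, b)
equal_var = bool(levene_p >= ALPHA)

# Student's t-test if variances look equal, Welch's t-test otherwise
t_stat, t_p = stats.ttest_ind(a, b, equal_var=equal_var)
print(f"levene p={levene_p:.3f}, equal_var={equal_var}, t={t_stat:.3f}, p={t_p:.4f}")
```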
s_gu = bike_data[bike_data.Gu == '서초구']
d_gu = bike_data[bike_data.Gu == '동대문구']
e_gu = bike_data[bike_data.Gu == '은평구']
from scipy import stats
stats.bartlett(y_gu.Distance, m_gu.Distance, s_gu.Distance, d_gu.Distance, e_gu.Distance)
BartlettResult(statistic=405.99591324805436, pvalue=1.4084240027307602e-86)
Null hypothesis (equal variances) rejected: equal variances cannot be assumed.
The one-way ANOVA below nevertheless proceeds as if the variances were equal.
stats.f_oneway(y_gu.Distance, m_gu.Distance, s_gu.Distance, d_gu.Distance, e_gu.Distance)
F_onewayResult(statistic=37.75546101206967, pvalue=1.4366160740892166e-31)
Null hypothesis rejected: the group means are not all equal.
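Bartlett's test rejected equal variances, while `f_oneway` assumes them. One option in that situation is a rank-based test such as Kruskal-Wallis, which does not rely on equal variances or normality. This is a swapped-in technique, not something the notebook ran, and it compares distributions rather than means. The sketch uses synthetic right-skewed samples.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed samples standing in for per-district distances
rng = np.random.default_rng(1)
groups = [rng.exponential(scale=s, size=300) for s in (4000, 4500, 5000)]

# Levene's test (median-centered by default) is less sensitive to
# non-normal data than Bartlett's test
_, levene_p = stats.levene(*groups)

# Kruskal-Wallis H-test: rank-based alternative to one-way ANOVA
h_stat, kw_p = stats.kruskal(*groups)
print(f"levene p={levene_p:.4f}, H={h_stat:.3f}, p={kw_p:.4g}")
```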
plot_data = [y_gu.Distance, m_gu.Distance, s_gu.Distance, d_gu.Distance, e_gu.Distance]
#plt.boxplot(plot_data)
plt.boxplot(plot_data, showfliers = False)
plt.show()
from statsmodels.stats.multicomp import pairwise_tukeyhsd
hsd = pairwise_tukeyhsd(bike_data.Distance, bike_data.Gu)
hsd.summary()
group1 | group2 | meandiff | p-adj | lower | upper | reject |
---|---|---|---|---|---|---|
동대문구 | 마포구 | 793.049 | 0.001 | 515.079 | 1071.019 | True |
동대문구 | 서초구 | 1210.5524 | 0.001 | 913.0775 | 1508.0272 | True |
동대문구 | 영등포구 | 468.901 | 0.001 | 195.2948 | 742.5072 | True |
동대문구 | 은평구 | 351.3685 | 0.0249 | 28.5165 | 674.2204 | True |
마포구 | 서초구 | 417.5033 | 0.001 | 165.2287 | 669.778 | True |
마포구 | 영등포구 | -324.1481 | 0.001 | -547.7807 | -100.5154 | True |
마포구 | 은평구 | -441.6805 | 0.001 | -723.4333 | -159.9278 | True |
서초구 | 영등포구 | -741.6514 | 0.001 | -989.1095 | -494.1933 | True |
서초구 | 은평구 | -859.1839 | 0.001 | -1160.1964 | -558.1714 | True |
영등포구 | 은평구 | -117.5325 | 0.7495 | -394.9809 | 159.9159 | False |
from scipy.stats import chi2_contingency
crosstab = pd.crosstab(bike_data.Age_Group, bike_data.Membership_type)
chi2_contingency(crosstab)
(1383.2239098895247,
 5.690745840063902e-283,
 18,
 array([[5.13956887e+01, 3.43679017e+03, 7.90702903e-01, 1.05780234e+04],
        [3.37559372e+01, 2.25723355e+03, 5.19322110e-01, 6.94749119e+03],
        [2.18158568e+01, 1.45880956e+03, 3.35628566e-01, 4.49003895e+03],
        [1.15089514e+01, 7.69594728e+02, 1.77060791e-01, 2.36871926e+03],
        [3.31384728e+00, 2.21594418e+02, 5.09822658e-02, 6.82040752e+02],
        [6.61308001e-01, 4.42211574e+01, 1.01739693e-02, 1.36107361e+02],
        [7.54841067e+00, 5.04756415e+02, 1.16129395e-01, 1.55357904e+03]]))
result = chi2_contingency(crosstab)
print('Chi2 Statistic :{}, p-value : {}'.format(result[0],result[1]))
Chi2 Statistic :1383.2239098895247, p-value : 5.690745840063902e-283
Null hypothesis: Age_Group and Membership_type are independent.
Null hypothesis rejected: the two variables are associated.
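The expected-frequency array above contains several cells well below 5 (the whole 일일권(비회원) column, for example), and the chi-square approximation is usually considered unreliable in that case. A minimal sketch of one way to check this and rerun the test without the sparse columns, using a made-up contingency table; the "at least 5" rule of thumb and the choice to drop columns are assumptions, not the notebook's method.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy contingency table (illustrative counts; "rare" mimics a sparse category)
table = pd.DataFrame(
    [[50, 3400, 1], [30, 2300, 0], [20, 1400, 0]],
    index=["age_a", "age_b", "age_c"],
    columns=["group", "day", "rare"],
)

chi2, p, dof, expected = chi2_contingency(table)

# Keep only columns whose expected counts are all at least 5
keep = (expected >= 5).all(axis=0)
reduced = table.loc[:, keep]

chi2_r, p_r, dof_r, _ = chi2_contingency(reduced)
print(f"kept columns: {list(reduced.columns)}, chi2={chi2_r:.3f}, p={p_r:.4f}")
```

Merging a sparse category into a related one is another common choice when dropping it would discard information.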
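# aggfunc=len counts rows, so the resulting "Distance" column holds the number of rides per Gu, not a distance
dist_by_gu = pd.pivot_table(bike_data, index='Gu', values ='Distance', aggfunc = len)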
dist_by_gu
| | Distance |
---|---|
Gu | |
동대문구 | 4797 |
마포구 | 9345 |
서초구 | 6543 |
영등포구 | 10290 |
은평구 | 4606 |
population = pd.read_csv('population_by_Gu.txt', sep='\t')
population
| | Gu | Family | Population | Male | Female | D_Total | D_Male | D_Female | F_Total | F_Male | F_Female | per_Family | over_65 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 동대문구 | 164191 | 363023 | 178490 | 184533 | 346194 | 172113 | 174081 | 16829 | 6377 | 10452 | 2.11 | 59350 |
1 | 은평구 | 207681 | 484546 | 233360 | 251186 | 480032 | 231528 | 248504 | 4514 | 1832 | 2682 | 2.31 | 80738 |
2 | 마포구 | 175023 | 385925 | 181303 | 204622 | 374035 | 176891 | 197144 | 11890 | 4412 | 7478 | 2.14 | 52429 |
3 | 영등포구 | 174806 | 400986 | 200986 | 200000 | 367678 | 182438 | 185240 | 33308 | 18548 | 14760 | 2.10 | 57872 |
4 | 서초구 | 173199 | 435107 | 208181 | 226926 | 430826 | 206039 | 224787 | 4281 | 2142 | 2139 | 2.49 | 57136 |
by_gu = pd.merge(dist_by_gu,population, on='Gu')
by_gu
| | Gu | Distance | Family | Population | Male | Female | D_Total | D_Male | D_Female | F_Total | F_Male | F_Female | per_Family | over_65 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 동대문구 | 4797 | 164191 | 363023 | 178490 | 184533 | 346194 | 172113 | 174081 | 16829 | 6377 | 10452 | 2.11 | 59350 |
1 | 마포구 | 9345 | 175023 | 385925 | 181303 | 204622 | 374035 | 176891 | 197144 | 11890 | 4412 | 7478 | 2.14 | 52429 |
2 | 서초구 | 6543 | 173199 | 435107 | 208181 | 226926 | 430826 | 206039 | 224787 | 4281 | 2142 | 2139 | 2.49 | 57136 |
3 | 영등포구 | 10290 | 174806 | 400986 | 200986 | 200000 | 367678 | 182438 | 185240 | 33308 | 18548 | 14760 | 2.10 | 57872 |
4 | 은평구 | 4606 | 207681 | 484546 | 233360 | 251186 | 480032 | 231528 | 248504 | 4514 | 1832 | 2682 | 2.31 | 80738 |
plt.scatter(by_gu.Distance, by_gu.Population)
plt.show()
stats.pearsonr(by_gu.Distance, by_gu.Population)
(-0.3547744402265295, 0.5579504252953678)
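Output: (correlation coefficient, p-value). Here `Distance` is the ride count per Gu.
Null hypothesis: there is no correlation. With p ≈ 0.558 on only five districts, it cannot be rejected.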
by_gu = pd.merge(dist_by_gu, population, on ='Gu')[['Gu','Distance','Population']]
by_gu.corr()
| | Distance | Population |
---|---|---|
Distance | 1.000000 | -0.354774 |
Population | -0.354774 | 1.000000 |
weather = pd.read_csv("weather.csv")
weather
| | date_old | date | time | temp | cum_precipitation | humidity | insolation | sunshine | wind | wind_direction | sea_lvl_pressure | pressure |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-10-03 오전 12:00:00 | 2019-10-03 | 12 | 20.0 | 23.2 | 94.1 | 3.40 | 0 | 5.5 | 351.7 | 1004.1 | 994.2 |
1 | 2019-10-03 오전 12:01:00 | 2019-10-03 | 12 | 20.1 | 0.0 | 94.1 | 0.00 | 0 | 3.7 | 348.6 | 1004.1 | 994.2 |
2 | 2019-10-03 오전 12:02:00 | 2019-10-03 | 12 | 20.0 | 0.0 | 94.1 | 0.00 | 0 | 3.6 | 346.4 | 1004.1 | 994.2 |
3 | 2019-10-03 오전 12:03:00 | 2019-10-03 | 12 | 20.0 | 0.0 | 94.1 | 0.00 | 0 | 3.1 | 349.1 | 1004.1 | 994.2 |
4 | 2019-10-03 오전 12:04:00 | 2019-10-03 | 12 | 20.0 | 0.0 | 94.0 | 0.00 | 0 | 3.4 | 335.9 | 1004.1 | 994.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4315 | 2019-10-05 오후 11:55:00 | 2019-10-05 | 23 | 14.8 | 0.6 | 63.6 | 8.24 | 8400 | 3.2 | 350.9 | 1021.3 | 1011.0 |
4316 | 2019-10-05 오후 11:56:00 | 2019-10-05 | 23 | 14.8 | 0.6 | 63.5 | 8.24 | 8400 | 2.9 | 354.3 | 1021.3 | 1011.0 |
4317 | 2019-10-05 오후 11:57:00 | 2019-10-05 | 23 | 14.8 | 0.6 | 63.7 | 8.24 | 8400 | 3.1 | 3.9 | 1021.3 | 1011.0 |
4318 | 2019-10-05 오후 11:58:00 | 2019-10-05 | 23 | 14.7 | 0.6 | 63.9 | 8.24 | 8400 | 2.3 | 10.0 | 1021.2 | 1011.0 |
4319 | 2019-10-05 오후 11:59:00 | 2019-10-05 | 23 | 14.8 | 0.6 | 64.0 | 8.24 | 8400 | 2.6 | 351.7 | 1021.3 | 1011.0 |
4320 rows × 12 columns
new_weather = pd.pivot_table(weather, index = ['date','time'],
values = ['temp', 'cum_precipitation', 'humidity', 'insolation', 'sunshine', 'wind', 'wind_direction', 'sea_lvl_pressure', 'pressure'], aggfunc = np.mean)
new_weather
| | | cum_precipitation | humidity | insolation | pressure | sea_lvl_pressure | sunshine | temp | wind | wind_direction |
---|---|---|---|---|---|---|---|---|---|---|
date | time | | | | | | | | | |
2019-10-03 | 1 | 2.361667 | 93.846667 | 0.0000 | 993.010000 | 1002.910000 | 0 | 20.016667 | 3.290000 | 178.788333 |
| | 2 | 3.353333 | 93.453333 | 0.0000 | 992.668333 | 1002.568333 | 0 | 19.908333 | 3.056667 | 333.400000 |
| | 3 | 3.930000 | 91.686667 | 0.0000 | 992.253333 | 1002.153333 | 0 | 19.923333 | 2.125000 | 330.110000 |
| | 4 | 4.423333 | 93.061667 | 0.0000 | 992.316667 | 1002.216667 | 0 | 19.928333 | 1.931667 | 251.535000 |
| | 5 | 4.500000 | 95.028333 | 0.0000 | 992.835000 | 1002.735000 | 0 | 19.871667 | 2.886667 | 236.116667 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2019-10-05 | 20 | 0.600000 | 61.381667 | 8.2400 | 1009.146667 | 1019.273333 | 8400 | 16.610000 | 3.631667 | 125.785000 |
| | 21 | 0.600000 | 62.598333 | 8.2400 | 1009.773333 | 1019.973333 | 8400 | 15.988333 | 3.686667 | 225.700000 |
| | 22 | 0.600000 | 63.560000 | 8.2400 | 1010.376667 | 1020.576667 | 8400 | 15.436667 | 3.680000 | 225.253333 |
| | 23 | 0.600000 | 63.426667 | 8.2400 | 1010.846667 | 1021.085000 | 8400 | 14.986667 | 3.401667 | 179.688333 |
| | 24 | 0.600000 | 69.291667 | 3.6575 | 1005.155000 | 1015.140000 | 1703 | 20.473333 | 2.841667 | 248.531667 |
72 rows × 9 columns
new_weather = new_weather.reset_index()
new_weather  # minute-level data aggregated to hourly: 4320 -> 72 rows
| | date | time | cum_precipitation | humidity | insolation | pressure | sea_lvl_pressure | sunshine | temp | wind | wind_direction |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-10-03 | 1 | 2.361667 | 93.846667 | 0.0000 | 993.010000 | 1002.910000 | 0 | 20.016667 | 3.290000 | 178.788333 |
1 | 2019-10-03 | 2 | 3.353333 | 93.453333 | 0.0000 | 992.668333 | 1002.568333 | 0 | 19.908333 | 3.056667 | 333.400000 |
2 | 2019-10-03 | 3 | 3.930000 | 91.686667 | 0.0000 | 992.253333 | 1002.153333 | 0 | 19.923333 | 2.125000 | 330.110000 |
3 | 2019-10-03 | 4 | 4.423333 | 93.061667 | 0.0000 | 992.316667 | 1002.216667 | 0 | 19.928333 | 1.931667 | 251.535000 |
4 | 2019-10-03 | 5 | 4.500000 | 95.028333 | 0.0000 | 992.835000 | 1002.735000 | 0 | 19.871667 | 2.886667 | 236.116667 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
67 | 2019-10-05 | 20 | 0.600000 | 61.381667 | 8.2400 | 1009.146667 | 1019.273333 | 8400 | 16.610000 | 3.631667 | 125.785000 |
68 | 2019-10-05 | 21 | 0.600000 | 62.598333 | 8.2400 | 1009.773333 | 1019.973333 | 8400 | 15.988333 | 3.686667 | 225.700000 |
69 | 2019-10-05 | 22 | 0.600000 | 63.560000 | 8.2400 | 1010.376667 | 1020.576667 | 8400 | 15.436667 | 3.680000 | 225.253333 |
70 | 2019-10-05 | 23 | 0.600000 | 63.426667 | 8.2400 | 1010.846667 | 1021.085000 | 8400 | 14.986667 | 3.401667 | 179.688333 |
71 | 2019-10-05 | 24 | 0.600000 | 69.291667 | 3.6575 | 1005.155000 | 1015.140000 | 1703 | 20.473333 | 2.841667 | 248.531667 |
72 rows × 11 columns
new_bike = pd.pivot_table(bike_data, index = ['Date_out', 'Time_out'], values = ['Distance'], aggfunc = len)
new_bike = new_bike.reset_index()
new_bike
| | Date_out | Time_out | Distance |
---|---|---|---|
0 | 2019-10-03 | 0 | 40 |
1 | 2019-10-03 | 1 | 64 |
2 | 2019-10-03 | 2 | 73 |
3 | 2019-10-03 | 3 | 78 |
4 | 2019-10-03 | 4 | 57 |
... | ... | ... | ... |
67 | 2019-10-05 | 19 | 648 |
68 | 2019-10-05 | 20 | 619 |
69 | 2019-10-05 | 21 | 664 |
70 | 2019-10-05 | 22 | 585 |
71 | 2019-10-05 | 23 | 418 |
72 rows × 3 columns
new_bike.rename(columns = {'Distance':'Count'}, inplace = True)
new_bike.columns
Index(['Date_out', 'Time_out', 'Count'], dtype='object')
bike_weather = pd.merge(new_bike, new_weather, left_on = ['Date_out', 'Time_out'], right_on = ['date','time'])
bike_weather
| | Date_out | Time_out | Count | date | time | cum_precipitation | humidity | insolation | pressure | sea_lvl_pressure | sunshine | temp | wind | wind_direction |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-10-03 | 1 | 64 | 2019-10-03 | 1 | 2.361667 | 93.846667 | 0.00 | 993.010000 | 1002.910000 | 0 | 20.016667 | 3.290000 | 178.788333 |
1 | 2019-10-03 | 2 | 73 | 2019-10-03 | 2 | 3.353333 | 93.453333 | 0.00 | 992.668333 | 1002.568333 | 0 | 19.908333 | 3.056667 | 333.400000 |
2 | 2019-10-03 | 3 | 78 | 2019-10-03 | 3 | 3.930000 | 91.686667 | 0.00 | 992.253333 | 1002.153333 | 0 | 19.923333 | 2.125000 | 330.110000 |
3 | 2019-10-03 | 4 | 57 | 2019-10-03 | 4 | 4.423333 | 93.061667 | 0.00 | 992.316667 | 1002.216667 | 0 | 19.928333 | 1.931667 | 251.535000 |
4 | 2019-10-03 | 5 | 43 | 2019-10-03 | 5 | 4.500000 | 95.028333 | 0.00 | 992.835000 | 1002.735000 | 0 | 19.871667 | 2.886667 | 236.116667 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
64 | 2019-10-05 | 19 | 648 | 2019-10-05 | 19 | 0.600000 | 60.570000 | 8.24 | 1008.340000 | 1018.440000 | 8400 | 17.171667 | 4.085000 | 146.493333 |
65 | 2019-10-05 | 20 | 619 | 2019-10-05 | 20 | 0.600000 | 61.381667 | 8.24 | 1009.146667 | 1019.273333 | 8400 | 16.610000 | 3.631667 | 125.785000 |
66 | 2019-10-05 | 21 | 664 | 2019-10-05 | 21 | 0.600000 | 62.598333 | 8.24 | 1009.773333 | 1019.973333 | 8400 | 15.988333 | 3.686667 | 225.700000 |
67 | 2019-10-05 | 22 | 585 | 2019-10-05 | 22 | 0.600000 | 63.560000 | 8.24 | 1010.376667 | 1020.576667 | 8400 | 15.436667 | 3.680000 | 225.253333 |
68 | 2019-10-05 | 23 | 418 | 2019-10-05 | 23 | 0.600000 | 63.426667 | 8.24 | 1010.846667 | 1021.085000 | 8400 | 14.986667 | 3.401667 | 179.688333 |
69 rows × 14 columns
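The merge keeps 69 of the 72 hourly rows. `Time_out` runs from 0 to 23, while the weather `time` column runs from 1 to 24, and weather rows stamped 오전 12:00 carry `time` 12, which suggests the two hour columns may follow different conventions. If so, one way to get a comparable 0 to 23 hour is to parse `date_old` directly. The sketch below assumes the timestamp layout seen in the table; the sample strings are illustrative, and `%p` parsing assumes an English-style AM/PM locale.

```python
import pandas as pd

# Sample timestamps in the layout of weather.date_old (illustrative)
date_old = pd.Series([
    "2019-10-03 오전 12:00:00",
    "2019-10-03 오전 1:16:32",
    "2019-10-04 오후 12:30:00",
    "2019-10-05 오후 11:59:00",
])

# Map the Korean markers to AM/PM so the %p directive can read them
normalized = (
    date_old.str.replace("오전", "AM", regex=False)
            .str.replace("오후", "PM", regex=False)
)
parsed = pd.to_datetime(normalized, format="%Y-%m-%d %p %I:%M:%S")

# 24-hour clock hour, 0 to 23, comparable with Time_out
hour24 = parsed.dt.hour
print(hour24.tolist())
```

Grouping the weather data on `parsed.dt.date` and `hour24` before merging would then line up with `Date_out` and `Time_out`, provided this reading of the columns is right.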
stats.linregress(bike_weather.temp, bike_weather.Count)
LinregressResult(slope=38.79375042397865, intercept=-320.1460800887602, rvalue=0.43840554053464026, pvalue=0.0001647148903170432, stderr=9.716288753605141)
slope, intercept, r_value, p_value, std_err = stats.linregress(bike_weather.temp, bike_weather.Count)
print("R-squared : %f" %r_value**2)
R-squared : 0.192199
import statsmodels.api as sm
x0 = bike_weather.temp
x1 = sm.add_constant(x0)
y = bike_weather.Count
model = sm.OLS(y, x1)
result = model.fit()
print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  Count   R-squared:                       0.192
Model:                            OLS   Adj. R-squared:                  0.180
Method:                 Least Squares   F-statistic:                     15.94
Date:                Sun, 13 Jun 2021   Prob (F-statistic):           0.000165
Time:                        09:42:49   Log-Likelihood:                -474.73
No. Observations:                  69   AIC:                             953.5
Df Residuals:                      67   BIC:                             957.9
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -320.1461    207.877     -1.540      0.128    -735.070      94.778
temp          38.7938      9.716      3.993      0.000      19.400      58.188
==============================================================================
Omnibus:                        9.887   Durbin-Watson:                   0.207
Prob(Omnibus):                  0.007   Jarque-Bera (JB):                3.162
Skew:                          -0.076   Prob(JB):                        0.206
Kurtosis:                       1.962   Cond. No.                         155.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
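R-squared: 0.192, Prob (F-statistic): 0.000165
Null hypothesis: there is no regression relationship (the slope is zero).
temp: coef 38.7938, std err 9.716, t 3.993, P>|t| 0.000, 95% CI [19.400, 58.188]. The p-value is below 0.05, so the null hypothesis is rejected: temperature has a significant effect on the ride count.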
from sklearn.model_selection import train_test_split
X = bike_weather[['cum_precipitation','humidity','temp','wind']]
y = bike_weather.Count
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size =0.3, random_state =123)
import statsmodels.api as sm
X1 = sm.add_constant(X_train)
model = sm.OLS(y_train, X1)
result = model.fit()
print(result.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  Count   R-squared:                       0.629
Model:                            OLS   Adj. R-squared:                  0.595
Method:                 Least Squares   F-statistic:                     18.23
Date:                Sun, 13 Jun 2021   Prob (F-statistic):           7.99e-09
Time:                        14:26:26   Log-Likelihood:                -312.08
No. Observations:                  48   AIC:                             634.2
Df Residuals:                      43   BIC:                             643.5
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              1765.8805    406.043      4.349      0.000     947.017    2584.744
cum_precipitation     3.3829     12.484      0.271      0.788     -21.793      28.558
humidity            -16.5323      2.560     -6.458      0.000     -21.695     -11.370
temp                 -4.0097     11.521     -0.348      0.730     -27.244      19.224
wind                  0.9488     30.574      0.031      0.975     -60.709      62.607
==============================================================================
Omnibus:                        4.591   Durbin-Watson:                   1.995
Prob(Omnibus):                  0.101   Jarque-Bera (JB):                4.143
Skew:                           0.643   Prob(JB):                        0.126
Kurtosis:                       2.354   Cond. No.                     1.27e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.27e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
X1 = sm.add_constant(X_test)
pred = result.predict(X1)
pred
8     359.689227
37    776.467633
40    596.610515
56    446.904312
23    398.884587
53    215.006054
9     468.031981
43    373.272545
68    662.458239
1     155.301651
60    600.559534
67    658.713630
42    399.725912
63    689.399962
24    351.858574
58    607.149355
59    650.659018
55    409.889723
29    371.885866
6     205.195444
61    543.214005
dtype: float64
from sklearn import metrics
print('MAE :', metrics.mean_absolute_error(y_test, pred))
print('MSE :', metrics.mean_squared_error(y_test, pred))
print('RMSE :', np.sqrt(metrics.mean_squared_error(y_test, pred)))
print('MAPE :', np.mean(np.abs((y_test - pred) / y_test)) * 100)
MAE : 154.96781747396312
MSE : 40194.987815230634
RMSE : 200.48687691525006
MAPE : 37.74978631438906
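For reference, the four error metrics reduce to short NumPy expressions. The sketch below works them out on three made-up points so each value can be checked by hand; the arrays are illustrative, not the notebook's predictions.

```python
import numpy as np

# Toy actual and predicted values (illustrative)
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 500.0])

err = y_true - y_pred                        # [-10, 20, -100]

mae = np.mean(np.abs(err))                   # mean absolute error
mse = np.mean(err ** 2)                      # mean squared error
rmse = np.sqrt(mse)                          # same units as the target
mape = np.mean(np.abs(err / y_true)) * 100   # percent; undefined if y_true has zeros

print(f"MAE={mae:.2f} MSE={mse:.1f} RMSE={rmse:.2f} MAPE={mape:.1f}%")
```

MAPE divides by the actual value, so hours with very few rides can dominate it.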
bike_weather['Rain_YN'] ='N'
bike_weather.loc[bike_weather.cum_precipitation > 0, 'Rain_YN'] ='Y'
bike_weather
| | Date_out | Time_out | Count | date | time | cum_precipitation | humidity | insolation | pressure | sea_lvl_pressure | sunshine | temp | wind | wind_direction | Rain_YN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-10-03 | 1 | 64 | 2019-10-03 | 1 | 2.361667 | 93.846667 | 0.00 | 993.010000 | 1002.910000 | 0 | 20.016667 | 3.290000 | 178.788333 | Y |
1 | 2019-10-03 | 2 | 73 | 2019-10-03 | 2 | 3.353333 | 93.453333 | 0.00 | 992.668333 | 1002.568333 | 0 | 19.908333 | 3.056667 | 333.400000 | Y |
2 | 2019-10-03 | 3 | 78 | 2019-10-03 | 3 | 3.930000 | 91.686667 | 0.00 | 992.253333 | 1002.153333 | 0 | 19.923333 | 2.125000 | 330.110000 | Y |
3 | 2019-10-03 | 4 | 57 | 2019-10-03 | 4 | 4.423333 | 93.061667 | 0.00 | 992.316667 | 1002.216667 | 0 | 19.928333 | 1.931667 | 251.535000 | Y |
4 | 2019-10-03 | 5 | 43 | 2019-10-03 | 5 | 4.500000 | 95.028333 | 0.00 | 992.835000 | 1002.735000 | 0 | 19.871667 | 2.886667 | 236.116667 | Y |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
64 | 2019-10-05 | 19 | 648 | 2019-10-05 | 19 | 0.600000 | 60.570000 | 8.24 | 1008.340000 | 1018.440000 | 8400 | 17.171667 | 4.085000 | 146.493333 | Y |
65 | 2019-10-05 | 20 | 619 | 2019-10-05 | 20 | 0.600000 | 61.381667 | 8.24 | 1009.146667 | 1019.273333 | 8400 | 16.610000 | 3.631667 | 125.785000 | Y |
66 | 2019-10-05 | 21 | 664 | 2019-10-05 | 21 | 0.600000 | 62.598333 | 8.24 | 1009.773333 | 1019.973333 | 8400 | 15.988333 | 3.686667 | 225.700000 | Y |
67 | 2019-10-05 | 22 | 585 | 2019-10-05 | 22 | 0.600000 | 63.560000 | 8.24 | 1010.376667 | 1020.576667 | 8400 | 15.436667 | 3.680000 | 225.253333 | Y |
68 | 2019-10-05 | 23 | 418 | 2019-10-05 | 23 | 0.600000 | 63.426667 | 8.24 | 1010.846667 | 1021.085000 | 8400 | 14.986667 | 3.401667 | 179.688333 | Y |
69 rows × 15 columns
#one hot encoding
ohe = pd.get_dummies(bike_weather['Rain_YN'])
ohe
| | N | Y |
---|---|---|
0 | 0 | 1 |
1 | 0 | 1 |
2 | 0 | 1 |
3 | 0 | 1 |
4 | 0 | 1 |
... | ... | ... |
64 | 0 | 1 |
65 | 0 | 1 |
66 | 0 | 1 |
67 | 0 | 1 |
68 | 0 | 1 |
69 rows × 2 columns
bike_weather = pd.concat([bike_weather, ohe], axis=1, sort=False)
bike_weather
| | Date_out | Time_out | Count | date | time | cum_precipitation | humidity | insolation | pressure | sea_lvl_pressure | sunshine | temp | wind | wind_direction | Rain_YN | N | Y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-10-03 | 1 | 64 | 2019-10-03 | 1 | 2.361667 | 93.846667 | 0.00 | 993.010000 | 1002.910000 | 0 | 20.016667 | 3.290000 | 178.788333 | Y | 0 | 1 |
1 | 2019-10-03 | 2 | 73 | 2019-10-03 | 2 | 3.353333 | 93.453333 | 0.00 | 992.668333 | 1002.568333 | 0 | 19.908333 | 3.056667 | 333.400000 | Y | 0 | 1 |
2 | 2019-10-03 | 3 | 78 | 2019-10-03 | 3 | 3.930000 | 91.686667 | 0.00 | 992.253333 | 1002.153333 | 0 | 19.923333 | 2.125000 | 330.110000 | Y | 0 | 1 |
3 | 2019-10-03 | 4 | 57 | 2019-10-03 | 4 | 4.423333 | 93.061667 | 0.00 | 992.316667 | 1002.216667 | 0 | 19.928333 | 1.931667 | 251.535000 | Y | 0 | 1 |
4 | 2019-10-03 | 5 | 43 | 2019-10-03 | 5 | 4.500000 | 95.028333 | 0.00 | 992.835000 | 1002.735000 | 0 | 19.871667 | 2.886667 | 236.116667 | Y | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
64 | 2019-10-05 | 19 | 648 | 2019-10-05 | 19 | 0.600000 | 60.570000 | 8.24 | 1008.340000 | 1018.440000 | 8400 | 17.171667 | 4.085000 | 146.493333 | Y | 0 | 1 |
65 | 2019-10-05 | 20 | 619 | 2019-10-05 | 20 | 0.600000 | 61.381667 | 8.24 | 1009.146667 | 1019.273333 | 8400 | 16.610000 | 3.631667 | 125.785000 | Y | 0 | 1 |
66 | 2019-10-05 | 21 | 664 | 2019-10-05 | 21 | 0.600000 | 62.598333 | 8.24 | 1009.773333 | 1019.973333 | 8400 | 15.988333 | 3.686667 | 225.700000 | Y | 0 | 1 |
67 | 2019-10-05 | 22 | 585 | 2019-10-05 | 22 | 0.600000 | 63.560000 | 8.24 | 1010.376667 | 1020.576667 | 8400 | 15.436667 | 3.680000 | 225.253333 | Y | 0 | 1 |
68 | 2019-10-05 | 23 | 418 | 2019-10-05 | 23 | 0.600000 | 63.426667 | 8.24 | 1010.846667 | 1021.085000 | 8400 | 14.986667 | 3.401667 | 179.688333 | Y | 0 | 1 |
69 rows × 17 columns
from sklearn.model_selection import train_test_split
X = bike_weather[['humidity','temp','wind','N','Y']]
y = bike_weather.Count
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.3, random_state = 123)
import statsmodels.api as sm
X1 = sm.add_constant(X_train)
model = sm.OLS(y_train, X1)
result = model.fit()
print(result.summary())
X1 = sm.add_constant(X_test)
pred = result.predict(X1)
from sklearn import metrics
print('MAE :', metrics.mean_absolute_error(y_test, pred))
print('MSE :', metrics.mean_squared_error(y_test, pred))
print('RMSE :', np.sqrt(metrics.mean_squared_error(y_test, pred)))
print('MAPE :', np.mean(np.abs((y_test - pred) / y_test)) * 100)
                            OLS Regression Results
==============================================================================
Dep. Variable:                  Count   R-squared:                       0.639
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     19.05
Date:                Sun, 13 Jun 2021   Prob (F-statistic):           4.44e-09
Time:                        14:42:30   Log-Likelihood:                -311.41
No. Observations:                  48   AIC:                             632.8
Df Residuals:                      43   BIC:                             642.2
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       1152.1179    259.635      4.437      0.000     628.514    1675.722
humidity     -16.2627      2.498     -6.510      0.000     -21.301     -11.225
temp          -1.6750     10.895     -0.154      0.879     -23.647      20.297
wind         -10.6456     31.830     -0.334      0.740     -74.838      53.546
N            546.2192    133.130      4.103      0.000     277.736     814.702
Y            605.8987    131.741      4.599      0.000     340.217     871.580
==============================================================================
Omnibus:                        5.222   Durbin-Watson:                   1.974
Prob(Omnibus):                  0.073   Jarque-Bera (JB):                5.093
Skew:                           0.759   Prob(JB):                       0.0784
Kurtosis:                       2.505   Cond. No.                     2.86e+17
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.45e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
MAE : 150.83364479946903
MSE : 41933.42705995384
RMSE : 204.77652956321396
MAPE : 37.74308826445337
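Note [2] in the summary warns that the design matrix may be singular, and `Df Model` is 4 although five predictors were passed. That is what happens when both dummy columns `N` and `Y` are used together with a constant: `N + Y` equals the constant column, so one column is redundant. A minimal sketch of the effect and of the usual fix, dropping one dummy level, on a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values, not from the dataset)
df = pd.DataFrame({
    "humidity": [90.0, 85.0, 60.0, 55.0, 70.0],
    "Rain_YN": ["Y", "Y", "N", "N", "Y"],
})
const = np.ones(len(df))

# Both dummy columns plus a constant: N + Y == const, so the matrix is rank deficient
both = pd.get_dummies(df["Rain_YN"], dtype=float)
X_trap = np.column_stack([const, df["humidity"], both["N"], both["Y"]])

# drop_first=True keeps only "Y"; "N" becomes the baseline absorbed by the constant
one = pd.get_dummies(df["Rain_YN"], drop_first=True, dtype=float)
X_ok = np.column_stack([const, df["humidity"], one["Y"]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print(f"trap: rank {rank_trap} of {X_trap.shape[1]} columns")
print(f"ok:   rank {rank_ok} of {X_ok.shape[1]} columns")
```

With one level dropped, the coefficient on `Y` reads as the difference between rainy and non-rainy hours, other predictors held fixed.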
bike_weather['over_500'] = 1
bike_weather.loc[bike_weather.Count < 500, 'over_500'] = 0
bike_weather
Date_out | Time_out | Count | date | time | cum_precipitation | humidity | insolation | pressure | sea_lvl_pressure | sunshine | temp | wind | wind_direction | Rain_YN | N | Y | over_500 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-10-03 | 1 | 64 | 2019-10-03 | 1 | 2.361667 | 93.846667 | 0.00 | 993.010000 | 1002.910000 | 0 | 20.016667 | 3.290000 | 178.788333 | Y | 0 | 1 | 0 |
1 | 2019-10-03 | 2 | 73 | 2019-10-03 | 2 | 3.353333 | 93.453333 | 0.00 | 992.668333 | 1002.568333 | 0 | 19.908333 | 3.056667 | 333.400000 | Y | 0 | 1 | 0 |
2 | 2019-10-03 | 3 | 78 | 2019-10-03 | 3 | 3.930000 | 91.686667 | 0.00 | 992.253333 | 1002.153333 | 0 | 19.923333 | 2.125000 | 330.110000 | Y | 0 | 1 | 0 |
3 | 2019-10-03 | 4 | 57 | 2019-10-03 | 4 | 4.423333 | 93.061667 | 0.00 | 992.316667 | 1002.216667 | 0 | 19.928333 | 1.931667 | 251.535000 | Y | 0 | 1 | 0 |
4 | 2019-10-03 | 5 | 43 | 2019-10-03 | 5 | 4.500000 | 95.028333 | 0.00 | 992.835000 | 1002.735000 | 0 | 19.871667 | 2.886667 | 236.116667 | Y | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
64 | 2019-10-05 | 19 | 648 | 2019-10-05 | 19 | 0.600000 | 60.570000 | 8.24 | 1008.340000 | 1018.440000 | 8400 | 17.171667 | 4.085000 | 146.493333 | Y | 0 | 1 | 1 |
65 | 2019-10-05 | 20 | 619 | 2019-10-05 | 20 | 0.600000 | 61.381667 | 8.24 | 1009.146667 | 1019.273333 | 8400 | 16.610000 | 3.631667 | 125.785000 | Y | 0 | 1 | 1 |
66 | 2019-10-05 | 21 | 664 | 2019-10-05 | 21 | 0.600000 | 62.598333 | 8.24 | 1009.773333 | 1019.973333 | 8400 | 15.988333 | 3.686667 | 225.700000 | Y | 0 | 1 | 1 |
67 | 2019-10-05 | 22 | 585 | 2019-10-05 | 22 | 0.600000 | 63.560000 | 8.24 | 1010.376667 | 1020.576667 | 8400 | 15.436667 | 3.680000 | 225.253333 | Y | 0 | 1 | 1 |
68 | 2019-10-05 | 23 | 418 | 2019-10-05 | 23 | 0.600000 | 63.426667 | 8.24 | 1010.846667 | 1021.085000 | 8400 | 14.986667 | 3.401667 | 179.688333 | Y | 0 | 1 | 0 |
69 rows × 18 columns
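The two assignments that build `over_500` can also be written as a single boolean comparison. A minimal sketch, using a handful of `Count` values taken from the table above; the frame name `demo` is hypothetical.

```python
import pandas as pd

# A few Count values from the bike_weather table above.
demo = pd.DataFrame({"Count": [64, 73, 648, 619, 664, 585, 418]})

# True where Count is at least 500, then cast to 0/1.
demo["over_500"] = (demo["Count"] >= 500).astype(int)
print(demo)
```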
from sklearn.model_selection import train_test_split
X = bike_weather[['cum_precipitation', 'humidity','temp','wind']]
y = bike_weather.over_500
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=123)
import statsmodels.api as sm
X1 = sm.add_constant(X_train)
logit_model = sm.Logit(y_train, X1)
result = logit_model.fit()
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.392092
         Iterations 7
                           Logit Regression Results
==============================================================================
Dep. Variable:               over_500   No. Observations:                   48
Model:                          Logit   Df Residuals:                       43
Method:                           MLE   Df Model:                            4
Date:                Sun, 13 Jun 2021   Pseudo R-squ.:                  0.4227
Time:                        14:48:32   Log-Likelihood:                -18.820
converged:                       True   LL-Null:                       -32.601
Covariance Type:            nonrobust   LLR p-value:                 1.530e-05
=====================================================================================
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                11.4773      8.488      1.352      0.176      -5.158      28.113
cum_precipitation     0.0593      0.217      0.273      0.785      -0.366       0.485
humidity             -0.1640      0.054     -3.018      0.003      -0.271      -0.057
temp                  0.0219      0.272      0.080      0.936      -0.511       0.554
wind                  0.2569      0.558      0.460      0.645      -0.837       1.350
=====================================================================================
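Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are usually easier to read. The sketch below types in the coefficients printed in the summary by hand; with the fitted model available, `np.exp(result.params)` should give the same numbers.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the Logit summary above.
coefs = pd.Series({
    "cum_precipitation": 0.0593,
    "humidity": -0.1640,
    "temp": 0.0219,
    "wind": 0.2569,
})

# exp(coef) is the multiplicative change in the odds of over_500 = 1
# for a one-unit increase in the predictor, holding the others fixed.
odds_ratios = np.exp(coefs)
print(odds_ratios.round(3))
```

For humidity the ratio is roughly 0.85, meaning each additional unit of humidity lowers the odds of exceeding 500 rentals by about 15%. The other three predictors are not statistically significant, so their ratios should not be over-interpreted.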
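# Pseudo R-squ. is 0.4227 and the LLR p-value is 1.530e-05, so the model as a whole is significant;
# humidity is the only individually significant predictor (p = 0.003).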
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
print('Train set accuracy : %.2f' % log_reg.score(X_train, y_train))
print('Test set accuracy : %.2f' % log_reg.score(X_test, y_test))
Train set accuracy : 0.88
Test set accuracy : 0.76
from sklearn.metrics import classification_report
y_pred = log_reg.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.70      0.78      0.74         9
           1       0.82      0.75      0.78        12

    accuracy                           0.76        21
   macro avg       0.76      0.76      0.76        21
weighted avg       0.77      0.76      0.76        21
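Every figure in the report is a ratio of confusion-matrix counts. The counts below are not printed in the notebook; they are inferred from the supports (9 and 12) and recalls (0.78 and 0.75) in the report, so treat them as a reconstruction rather than recorded output.

```python
# Confusion-matrix counts inferred from the report above (not printed in the notebook).
tn, fp = 7, 2   # true class 0: 7 predicted 0, 2 predicted 1
fn, tp = 3, 9   # true class 1: 3 predicted 0, 9 predicted 1

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
print(f"accuracy={accuracy:.2f}")
```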
import os
# Raw string so that the backslashes (e.g. "\b") are not read as escape sequences
os.environ['PATH'] += os.pathsep + r'C:\Program Files\Graphviz\bin'
#from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
X = X_train
y = y_train
dTree = DecisionTreeClassifier()
dTreeModel =dTree.fit(X,y)
dTreeModel
DecisionTreeClassifier()
# Default hyperparameters written out explicitly. `min_impurity_split` and `presort`
# are omitted because they were removed in later scikit-learn releases.
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0,
                       random_state=None, splitter='best')
DecisionTreeClassifier()
from sklearn.tree import export_graphviz
import pydotplus
from IPython.display import Image
# class_names must follow the sorted class labels: 0 (Count < 500) -> 'N', 1 (500 or more) -> 'Y'
dot_data = export_graphviz(dTreeModel, out_file=None,
                           feature_names=['cum_precipitation','humidity','temp','wind'],
                           class_names=('N','Y'), filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
# Gini impurity: each split is chosen so that the impurity decreases
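The Gini impurity of a node is one minus the sum of squared class proportions, and a split is scored by the size-weighted impurity of its two children. The sketch below is a simplified illustration of that idea, not scikit-learn's internal implementation; the helper names `gini_impurity` and `weighted_gini` are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

parent = [0, 0, 0, 1, 1, 1]
print(gini_impurity(parent))                    # evenly mixed node
print(weighted_gini([0, 0, 0], [1, 1, 1]))      # perfectly separating split
print(weighted_gini([0, 0, 1], [0, 1, 1]))      # split that barely helps
```

The perfectly separating split drives the weighted impurity to zero, which is why the tree would prefer it over the third candidate.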
dTreeModel.predict(X_test)
array([1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1], dtype=int64)
from sklearn.metrics import accuracy_score
y_pred = dTreeModel.predict(X_test)
print('Accuracy : %.2f' % accuracy_score(y_test, y_pred))
Accuracy : 0.81
y_pred = dTreeModel.predict(X_test)
print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

           0       0.86      0.67      0.75         9
           1       0.79      0.92      0.85        12

    accuracy                           0.81        21
   macro avg       0.82      0.79      0.80        21
weighted avg       0.82      0.81      0.80        21
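# Decision trees are prone to overfitting:
# training accuracy is high, while prediction (test) accuracy tends to be lower.
# On the other hand, they are intuitive, easy to interpret, and place few constraints on the data.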
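The overfitting tendency noted above can be made visible by comparing train and test accuracy for an unrestricted tree and a depth-limited one. This is a sketch on synthetic data from `make_classification` (not the bike data), and the depth of 3 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (flip_y); not the bike data.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=2,
                                     n_redundant=0, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

scores = {}
for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f} test={scores[depth][1]:.2f}")
```

An unrestricted tree memorizes the training set, noise included, so the gap between its train and test scores is the quantity to watch. Parameters such as `max_depth`, `min_samples_leaf`, or `ccp_alpha` are the usual ways to narrow it.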
import pandas as pd
weather = pd.read_csv('weather.csv')
weather
date_old | date | time | temp | cum_precipitation | humidity | insolation | sunshine | wind | wind_direction | sea_lvl_pressure | pressure | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-10-03 오전 12:00:00 | 2019-10-03 | 12 | 20.0 | 23.2 | 94.1 | 3.40 | 0 | 5.5 | 351.7 | 1004.1 | 994.2 |
1 | 2019-10-03 오전 12:01:00 | 2019-10-03 | 12 | 20.1 | 0.0 | 94.1 | 0.00 | 0 | 3.7 | 348.6 | 1004.1 | 994.2 |
2 | 2019-10-03 오전 12:02:00 | 2019-10-03 | 12 | 20.0 | 0.0 | 94.1 | 0.00 | 0 | 3.6 | 346.4 | 1004.1 | 994.2 |
3 | 2019-10-03 오전 12:03:00 | 2019-10-03 | 12 | 20.0 | 0.0 | 94.1 | 0.00 | 0 | 3.1 | 349.1 | 1004.1 | 994.2 |
4 | 2019-10-03 오전 12:04:00 | 2019-10-03 | 12 | 20.0 | 0.0 | 94.0 | 0.00 | 0 | 3.4 | 335.9 | 1004.1 | 994.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4315 | 2019-10-05 오후 11:55:00 | 2019-10-05 | 23 | 14.8 | 0.6 | 63.6 | 8.24 | 8400 | 3.2 | 350.9 | 1021.3 | 1011.0 |
4316 | 2019-10-05 오후 11:56:00 | 2019-10-05 | 23 | 14.8 | 0.6 | 63.5 | 8.24 | 8400 | 2.9 | 354.3 | 1021.3 | 1011.0 |
4317 | 2019-10-05 오후 11:57:00 | 2019-10-05 | 23 | 14.8 | 0.6 | 63.7 | 8.24 | 8400 | 3.1 | 3.9 | 1021.3 | 1011.0 |
4318 | 2019-10-05 오후 11:58:00 | 2019-10-05 | 23 | 14.7 | 0.6 | 63.9 | 8.24 | 8400 | 2.3 | 10.0 | 1021.2 | 1011.0 |
4319 | 2019-10-05 오후 11:59:00 | 2019-10-05 | 23 | 14.8 | 0.6 | 64.0 | 8.24 | 8400 | 2.6 | 351.7 | 1021.3 | 1011.0 |
4320 rows × 12 columns
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
weather = pd.read_csv('weather.csv')
X = np.array(weather.humidity).reshape(-1,1)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_scaled
array([[0.97205589], [0.97205589], [0.97205589], ..., [0.36526946], [0.36926148], [0.37125749]])
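With its default `feature_range=(0, 1)`, `MinMaxScaler` applies `(x - min) / (max - min)` to each column. The sketch below reproduces that formula with NumPy alone. The readings are hypothetical: the endpoints 45.4 and 95.5 are assumed values, chosen so that 94.1 lands on the 0.97205589 seen in the printed array.

```python
import numpy as np

# Hypothetical humidity readings; the minimum and maximum are assumptions.
humidity = np.array([45.4, 63.9, 64.0, 94.1, 95.5])

# Min-max scaling: (x - min) / (max - min), mapping the data onto [0, 1].
scaled = (humidity - humidity.min()) / (humidity.max() - humidity.min())
print(scaled)
```

Because the result depends only on the observed minimum and maximum, a single extreme reading compresses all other values, which is worth checking before scaling sensor data.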
# Count rentals per district (Gu), date, and hour; len() of any column gives the row count
n_bike = pd.pivot_table(bike_data, index=['Gu','Date_out','Time_out'], values = 'Distance', aggfunc = len)
n_bike = n_bike.reset_index()
n_bike.rename(columns = {'Distance':'Count'}, inplace=True)
n_bike
Gu | Date_out | Time_out | Count | |
---|---|---|---|---|
0 | 동대문구 | 2019-10-03 | 0 | 6 |
1 | 동대문구 | 2019-10-03 | 1 | 7 |
2 | 동대문구 | 2019-10-03 | 2 | 6 |
3 | 동대문구 | 2019-10-03 | 3 | 9 |
4 | 동대문구 | 2019-10-03 | 4 | 5 |
... | ... | ... | ... | ... |
355 | 은평구 | 2019-10-05 | 19 | 72 |
356 | 은평구 | 2019-10-05 | 20 | 63 |
357 | 은평구 | 2019-10-05 | 21 | 62 |
358 | 은평구 | 2019-10-05 | 22 | 63 |
359 | 은평구 | 2019-10-05 | 23 | 66 |
360 rows × 4 columns
# Average the hourly counts over the three dates: one row per district, one column per hour
n_bike2 = pd.pivot_table(n_bike, index = 'Gu', columns = 'Time_out', values = 'Count', aggfunc = 'mean')
n_bike2 = n_bike2.reset_index()
n_bike2
Time_out | Gu | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 동대문구 | 47.000000 | 45.000000 | 37.333333 | 29.666667 | 15.333333 | 16.000000 | 23.666667 | 39.666667 | 61.666667 | ... | 85.333333 | 84.666667 | 92.666667 | 106.666667 | 111.666667 | 99.666667 | 93.000000 | 93.333333 | 84.333333 | 83.000000 |
1 | 마포구 | 86.666667 | 70.000000 | 48.666667 | 38.666667 | 31.666667 | 27.333333 | 33.666667 | 68.666667 | 91.333333 | ... | 160.333333 | 200.333333 | 195.333333 | 233.000000 | 243.666667 | 204.333333 | 199.333333 | 201.333333 | 158.333333 | 118.000000 |
2 | 서초구 | 49.000000 | 47.000000 | 33.666667 | 32.666667 | 18.666667 | 16.333333 | 26.333333 | 47.333333 | 77.333333 | ... | 123.000000 | 135.000000 | 142.666667 | 175.000000 | 187.333333 | 148.000000 | 145.333333 | 131.666667 | 113.000000 | 80.333333 |
3 | 영등포구 | 86.333333 | 66.333333 | 49.333333 | 36.000000 | 26.333333 | 31.000000 | 53.333333 | 89.000000 | 127.333333 | ... | 194.000000 | 207.333333 | 219.333333 | 249.666667 | 275.666667 | 221.333333 | 200.000000 | 216.333333 | 187.333333 | 139.333333 |
4 | 은평구 | 48.666667 | 42.000000 | 35.333333 | 23.333333 | 12.333333 | 17.666667 | 26.000000 | 45.000000 | 65.666667 | ... | 85.333333 | 86.000000 | 91.000000 | 107.000000 | 97.333333 | 89.333333 | 93.666667 | 72.000000 | 79.333333 | 70.333333 |
5 rows × 25 columns
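The two pivot steps above first count rentals per district, date, and hour, then average those counts over dates. A compact sketch of the same idea on hypothetical ride records, using `groupby(...).size()` for the counting stage; the district names `A` and `B` and the dates `d1` and `d2` are placeholders.

```python
import pandas as pd

# Hypothetical ride-level records: one row per rental.
rides = pd.DataFrame({
    "Gu":       ["A", "A", "A", "A", "B", "B", "B"],
    "Date_out": ["d1", "d1", "d2", "d1", "d1", "d2", "d2"],
    "Time_out": [8, 8, 8, 9, 8, 8, 8],
    "Distance": [1050, 5690, 1250, 5230, 2890, 900, 1200],
})

# Stage 1: number of rentals per district, date, and hour.
counts = (rides.groupby(["Gu", "Date_out", "Time_out"])
               .size()
               .reset_index(name="Count"))

# Stage 2: average that count over dates, one column per hour.
profile = counts.pivot_table(index="Gu", columns="Time_out",
                             values="Count", aggfunc="mean")
print(profile)
```

One caveat with this approach: a district, date, and hour with no rentals produces no row in stage 1, so the stage 2 mean is taken only over dates that had at least one rental.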
from sklearn import cluster
X = n_bike2.iloc[0:5, 1:25]
y = n_bike2.Gu
km2 = cluster.KMeans(n_clusters=2).fit(X)
km3 = cluster.KMeans(n_clusters=3).fit(X)
km4 = cluster.KMeans(n_clusters=4).fit(X)
n_bike2['2_Cluster'] = km2.labels_
n_bike2['3_Cluster'] = km3.labels_
n_bike2['4_Cluster'] = km4.labels_
n_bike2[['Gu','2_Cluster','3_Cluster','4_Cluster']]
Time_out | Gu | 2_Cluster | 3_Cluster | 4_Cluster |
---|---|---|---|---|
0 | 동대문구 | 0 | 2 | 1 |
1 | 마포구 | 1 | 1 | 0 |
2 | 서초구 | 0 | 0 | 2 |
3 | 영등포구 | 1 | 1 | 3 |
4 | 은평구 | 0 | 2 | 1 |
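K-means label numbers are arbitrary and, without a fixed seed, can change from run to run; only the grouping is meaningful. The sketch below uses hypothetical four-hour profiles for five districts to show a reproducible fit with `random_state` and an explicit `n_init`, plus the inertia for a few values of k.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical hourly profiles for five districts (rows) over four hours (columns).
profiles = np.array([
    [40.0,  60.0,  90.0, 100.0],
    [85.0, 160.0, 230.0, 200.0],
    [45.0,  75.0, 110.0, 105.0],
    [90.0, 190.0, 250.0, 215.0],
    [42.0,  65.0,  95.0,  90.0],
])

# Fixing random_state makes the labels repeatable; n_init=10 keeps the best of 10 starts.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
labels = km.labels_
print(labels)

# Inertia (within-cluster sum of squares) shrinks as k grows.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(profiles).inertia_
            for k in (1, 2, 3)}
print(inertias)
```

In the notebook's table, 동대문구 and 은평구 fall in the same cluster for every k, and 마포구 and 영등포구 stay together for k = 2 and k = 3. With only five districts, a large k mostly isolates single rows, so the inertia drop alone says little about the right number of clusters.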