Project: Tianchi Jinnan Digital Manufacturing

Title: Jinnan Digital Manufacturing Algorithm Challenge [Arena 1]
URL: https://tianchi.aliyun.com/competition/entrance/231695/information

Features 01_Feature observation

INFO:__main__:columns_info:
type count count_rate unique_count unique_set mean std min 25% 50% 75% max
A1 int 1396 1.000 3 [(300, 1377), (200, 13), (250, 6)] 298.853868 10.130552 200.000 300.000 300.000 300.000 300.0000
A10 int 1396 1.000 4 [(100, 658), (102, 416), (101, 298), (103, 24)] 100.861032 0.905198 100.000 100.000 101.000 102.000 103.0000
A11 object 1396 1.000 94 [(9:00:00, 251), (17:00:00, 247), (1:00:00, 15... NaN NaN NaN NaN NaN NaN NaN
A12 float 1396 1.000 9 [(103.0, 684), (102.0, 406), (104.0, 140), (10... 102.641834 0.915387 98.000 102.000 103.000 103.000 107.0000
A13 float 1396 1.000 3 [(0.2, 1394), (0.15, 1), (0.12, 1)] 0.199907 0.002524 0.120 0.200 0.200 0.200 0.2000
A14 object 1396 1.000 92 [(10:00:00, 251), (18:00:00, 248), (2:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
A15 float 1396 1.000 10 [(104.0, 695), (103.0, 309), (105.0, 220), (10... 103.829370 0.963639 100.000 103.000 104.000 104.000 109.0000
A16 object 1396 1.000 94 [(11:00:00, 250), (19:00:00, 248), (3:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
A17 float 1396 1.000 13 [(105.0, 688), (104.0, 177), (106.0, 171), (10... 104.766905 1.401446 89.000 104.000 105.000 105.000 108.0000
A18 float 1396 1.000 2 [(0.2, 1395), (0.1, 1)] 0.199928 0.002676 0.100 0.200 0.200 0.200 0.2000
A19 int 1396 1.000 6 [(200, 906), (300, 459), (100, 27), (150, 2), ... 231.067335 50.478071 100.000 200.000 200.000 300.000 350.0000
A2 float 42 0.030 2 [(125.0, 42)] 125.000000 0.000000 125.000 125.000 125.000 125.000 125.0000
A20 object 1396 1.000 159 [(11:00-12:00, 239), (19:00-20:00, 233), (3:00... NaN NaN NaN NaN NaN NaN NaN
A21 float 1393 0.998 13 [(50.0, 1254), (40.0, 63), (30.0, 42), (35.0, ... 48.707825 4.976531 20.000 50.000 50.000 50.000 90.0000
A22 float 1396 1.000 4 [(9.0, 1216), (10.0, 174), (8.0, 5), (3.5, 1)] 9.117120 0.369152 3.500 9.000 9.000 9.000 10.0000
A23 float 1393 0.998 4 [(5.0, 1391), (10.0, 1), (4.0, 1)] 5.002872 0.136638 4.000 5.000 5.000 5.000 10.0000
A24 object 1395 0.999 92 [(12:00:00, 258), (20:00:00, 244), (4:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
A25 object 1396 1.000 15 [(80, 542), (70, 527), (78, 186), (79, 79), (7... NaN NaN NaN NaN NaN NaN NaN
A26 object 1394 0.999 89 [(13:00:00, 265), (21:00:00, 248), (5:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
A27 float 1396 1.000 13 [(73.0, 630), (78.0, 282), (75.0, 209), (72.0,... 74.396848 3.044490 45.000 73.000 73.000 77.000 80.0000
A28 object 1396 1.000 157 [(13:00-14:00, 243), (21:00-22:00, 234), (5:00... NaN NaN NaN NaN NaN NaN NaN
A3 float 1354 0.970 4 [(405.0, 1336), (270.0, 12), (340.0, 6)] 403.515510 13.348093 270.000 405.000 405.000 405.000 405.0000
A4 int 1396 1.000 4 [(700, 1336), (980, 42), (470, 12), (590, 6)] 705.974212 53.214754 470.000 700.000 700.000 700.000 980.0000
A5 object 1396 1.000 67 [(6:00:00, 269), (14:00:00, 260), (22:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
A6 float 1396 1.000 39 [(29.0, 493), (30.0, 198), (21.0, 197), (24.0,... 28.287751 6.742765 17.000 24.000 29.000 30.000 97.0000
A7 object 149 0.107 76 [(12:40:00, 13), (15:40:00, 7), (7:00:00, 5), ... NaN NaN NaN NaN NaN NaN NaN
A8 float 149 0.107 9 [(80.0, 118), (73.0, 16), (74.0, 8), (82.0, 3)... 78.818792 2.683920 70.000 80.000 80.000 80.000 82.0000
A9 object 1396 1.000 95 [(8:00:00, 252), (16:00:00, 248), (0:00:00, 15... NaN NaN NaN NaN NaN NaN NaN
B1 float 1386 0.993 22 [(320.0, 751), (300.0, 127), (350.0, 113), (34... 334.452742 105.120753 3.500 320.000 320.000 330.000 1200.0000
B10 object 1152 0.825 181 [(10:30-12:00, 166), (18:30-20:00, 165), (2:30... NaN NaN NaN NaN NaN NaN NaN
B11 object 547 0.392 38 [(20:00-21:00, 154), (12:00-13:00, 140), (4:00... NaN NaN NaN NaN NaN NaN NaN
B12 float 1395 0.999 5 [(1200.0, 777), (800.0, 584), (900.0, 20), (40... 1020.215054 205.920155 400.000 800.000 1200.000 1200.000 1200.0000
B13 float 1395 0.999 4 [(0.15, 1388), (0.03, 6), (0.06, 1)] 0.149419 0.008213 0.030 0.150 0.150 0.150 0.1500
B14 int 1396 1.000 21 [(400, 740), (420, 329), (440, 226), (460, 35)... 410.403295 26.018410 40.000 400.000 400.000 420.000 460.0000
B2 float 1394 0.999 4 [(3.5, 1374), (0.15, 19), (3.6, 1)] 3.454412 0.388585 0.150 3.500 3.500 3.500 3.6000
B3 float 1394 0.999 3 [(3.5, 1393), (3.6, 1)] 3.500072 0.002678 3.500 3.500 3.500 3.500 3.6000
B4 object 1396 1.000 178 [(14:00-15:00, 240), (22:00-23:00, 212), (6:00... NaN NaN NaN NaN NaN NaN NaN
B5 object 1395 0.999 61 [(15:00:00, 245), (23:00:00, 215), (7:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
B6 int 1396 1.000 36 [(80, 640), (65, 299), (60, 155), (79, 51), (7... 72.065186 9.161986 40.000 65.000 78.000 80.000 80.0000
B7 object 1396 1.000 58 [(17:00:00, 263), (1:00:00, 211), (9:00:00, 16... NaN NaN NaN NaN NaN NaN NaN
B8 float 1395 0.999 26 [(45.0, 1085), (40.0, 142), (50.0, 44), (28.0,... 43.709677 4.338396 20.000 45.000 45.000 45.000 73.0000
B9 object 1396 1.000 178 [(17:00-18:30, 189), (9:00-10:30, 158), (1:00-... NaN NaN NaN NaN NaN NaN NaN
rate float 1396 1.000 73 [(0.902, 305), (0.93, 128), (0.890999999999999... 0.923244 0.030880 0.624 0.902 0.925 0.943 1.0008


target_corr
pearson spearman mine
rate 1.000000 1.000000 0.998080
B14 0.478892 0.675522 0.835437
B12 0.392409 0.432610 0.522049
B6 0.365125 0.403446 0.462616
A10 0.350775 0.392945 0.438394
A19 -0.217527 -0.247848 0.377268
A27 -0.175551 -0.252845 0.360564
B1 0.102545 0.072208 0.285754
A6 0.026943 0.119990 0.268452
A15 0.206822 0.249648 0.258671
A12 0.254165 0.304103 0.243930
A17 0.177628 0.222954 0.241899
B8 0.174779 0.213332 0.187707
A22 -0.171821 -0.189534 0.142401
A8 0.210669 0.102360 0.139327
A2 NaN NaN 0.117767
A21 0.108092 0.129685 0.108338
A4 -0.213740 -0.169453 0.074376
A1 0.021058 0.012813 0.052154
B13 -0.064316 -0.063643 0.036630
A3 0.023490 0.012256 0.036331
B2 -0.096040 -0.098150 0.034360
B3 -0.008892 -0.007674 0.033967
A23 0.027710 0.051047 0.030690
A13 0.011695 0.019792 0.004721
A18 0.008885 0.007691 0.003226


The float columns are in practice mostly discrete:
type count count_rate unique_count unique_set mean std min 25% 50% 75% max
A18 float 1396 1.000 2 [(0.2, 1395), (0.1, 1)] 0.199928 0.002676 0.100 0.200 0.200 0.200 0.2000
A2 float 42 0.030 2 [(125.0, 42)] 125.000000 0.000000 125.000 125.000 125.000 125.000 125.0000
A13 float 1396 1.000 3 [(0.2, 1394), (0.15, 1), (0.12, 1)] 0.199907 0.002524 0.120 0.200 0.200 0.200 0.2000
B3 float 1394 0.999 3 [(3.5, 1393), (3.6, 1)] 3.500072 0.002678 3.500 3.500 3.500 3.500 3.6000
B2 float 1394 0.999 4 [(3.5, 1374), (0.15, 19), (3.6, 1)] 3.454412 0.388585 0.150 3.500 3.500 3.500 3.6000
A22 float 1396 1.000 4 [(9.0, 1216), (10.0, 174), (8.0, 5), (3.5, 1)] 9.117120 0.369152 3.500 9.000 9.000 9.000 10.0000
A23 float 1393 0.998 4 [(5.0, 1391), (10.0, 1), (4.0, 1)] 5.002872 0.136638 4.000 5.000 5.000 5.000 10.0000
B13 float 1395 0.999 4 [(0.15, 1388), (0.03, 6), (0.06, 1)] 0.149419 0.008213 0.030 0.150 0.150 0.150 0.1500
A3 float 1354 0.970 4 [(405.0, 1336), (270.0, 12), (340.0, 6)] 403.515510 13.348093 270.000 405.000 405.000 405.000 405.0000
B12 float 1395 0.999 5 [(1200.0, 777), (800.0, 584), (900.0, 20), (40... 1020.215054 205.920155 400.000 800.000 1200.000 1200.000 1200.0000
A12 float 1396 1.000 9 [(103.0, 684), (102.0, 406), (104.0, 140), (10... 102.641834 0.915387 98.000 102.000 103.000 103.000 107.0000
A8 float 149 0.107 9 [(80.0, 118), (73.0, 16), (74.0, 8), (82.0, 3)... 78.818792 2.683920 70.000 80.000 80.000 80.000 82.0000
A15 float 1396 1.000 10 [(104.0, 695), (103.0, 309), (105.0, 220), (10... 103.829370 0.963639 100.000 103.000 104.000 104.000 109.0000
A27 float 1396 1.000 13 [(73.0, 630), (78.0, 282), (75.0, 209), (72.0,... 74.396848 3.044490 45.000 73.000 73.000 77.000 80.0000
A21 float 1393 0.998 13 [(50.0, 1254), (40.0, 63), (30.0, 42), (35.0, ... 48.707825 4.976531 20.000 50.000 50.000 50.000 90.0000
A17 float 1396 1.000 13 [(105.0, 688), (104.0, 177), (106.0, 171), (10... 104.766905 1.401446 89.000 104.000 105.000 105.000 108.0000
B1 float 1386 0.993 22 [(320.0, 751), (300.0, 127), (350.0, 113), (34... 334.452742 105.120753 3.500 320.000 320.000 330.000 1200.0000
B8 float 1395 0.999 26 [(45.0, 1085), (40.0, 142), (50.0, 44), (28.0,... 43.709677 4.338396 20.000 45.000 45.000 45.000 73.0000
A6 float 1396 1.000 39 [(29.0, 493), (30.0, 198), (21.0, 197), (24.0,... 28.287751 6.742765 17.000 24.000 29.000 30.000 97.0000
rate float 1396 1.000 73 [(0.902, 305), (0.93, 128), (0.890999999999999... 0.923244 0.030880 0.624 0.902 0.925 0.943 1.0008


columns_info[columns_info['type']=='object']
type count count_rate unique_count unique_set mean std min 25% 50% 75% max
A11 object 1389 1.000 94 [(9:00:00, 251), (17:00:00, 245), (1:00:00, 15... NaN NaN NaN NaN NaN NaN NaN
A14 object 1389 1.000 92 [(10:00:00, 251), (18:00:00, 246), (2:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
A16 object 1389 1.000 94 [(11:00:00, 250), (19:00:00, 246), (3:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
A20 object 1389 1.000 158 [(11:00-12:00, 239), (19:00-20:00, 233), (3:00... NaN NaN NaN NaN NaN NaN NaN
A24 object 1389 1.000 91 [(12:00:00, 259), (20:00:00, 243), (4:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
A26 object 1389 1.000 87 [(13:00:00, 267), (21:00:00, 248), (5:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
A28 object 1389 1.000 157 [(13:00-14:00, 243), (21:00-22:00, 234), (5:00... NaN NaN NaN NaN NaN NaN NaN
A5 object 1389 1.000 67 [(6:00:00, 269), (14:00:00, 260), (22:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
A9 object 1389 1.000 95 [(8:00:00, 252), (16:00:00, 246), (0:00:00, 15... NaN NaN NaN NaN NaN NaN NaN
B4 object 1389 1.000 177 [(14:00-15:00, 240), (22:00-23:00, 212), (6:00... NaN NaN NaN NaN NaN NaN NaN
B5 object 1389 1.000 60 [(15:00:00, 246), (23:00:00, 215), (7:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
B7 object 1389 1.000 58 [(17:00:00, 263), (1:00:00, 211), (9:00:00, 16... NaN NaN NaN NaN NaN NaN NaN
B9 object 1389 1.000 175 [(17:00-18:30, 189), (9:00-10:30, 157), (1:00-... NaN NaN NaN NaN NaN NaN NaN

Features 02_Drop features with high missing rate or a single value

Drop features whose valid-value rate (count_rate) is below 0.90:
drop_columns
Index(['A2', 'A7', 'A8', 'B10', 'B11'], dtype='object')

Drop near-constant features:
A1, A13, A18, A23, A3, A4, B13, B2, B3
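Both drop rules can be sketched as one pandas helper. The function name and the two thresholds are my own assumptions (the notes only state "valid-value rate < 0.90" and "near-constant"), not code from the notebook:

```python
import pandas as pd

def drop_sparse_or_constant(df, min_count_rate=0.90, top_value_rate=0.95):
    """Drop columns whose non-null rate is below min_count_rate, or whose
    most frequent value covers nearly every row (near-constant columns)."""
    keep = []
    for col in df.columns:
        count_rate = df[col].notna().mean()  # share of non-missing rows
        if count_rate < min_count_rate:
            continue  # too sparse, e.g. A2 (0.030), A7/A8 (0.107)
        top_rate = df[col].value_counts(normalize=True).iloc[0]
        if top_rate > top_value_rate:
            continue  # near-constant, e.g. A1 (300 in 1377/1396 rows)
        keep.append(col)
    return df[keep]
```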

Features 03_Feature visualization corr_pic

a. Per-group target means are roughly equal; no strong association was found
b. Nominally-float features are in practice mostly discrete
c. Float features: dummy-encode those with fewer than 20 unique values; leave those with more than 20 unprocessed for now
d. Clean the single dirty value in A25 and convert the column to int








Features 04_First baseline

Base strategy 01
1. Drop features with valid-value rate below 90%: ['A2', 'A7', 'A8', 'B10', 'B11']
2. Drop near-constant features: ['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23', 'B3', 'B13']
3. Fill missing values with the mode: ['A21', 'A23', 'A24', 'A26', 'A3', 'B1', 'B2', 'B3', 'B12', 'B13', 'B5', 'B8']
4. Float features: if unique_count < 20, dummy-encode; if > 20, leave unprocessed for now
5. Predict using the available float features
feature_columns = list(dummy_result_columns) + list(cannot_dummy_columns)
name model mean std
0 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... DecisionTreeRegressor -9.625545e-04 7.795378e-05
1 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... RandomForestRegressor -7.542296e-04 6.568240e-05
2 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... XGBRegressor -7.024208e-04 8.961907e-05
3 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... SVR -1.951809e-03 9.253026e-05
4 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... LinearRegression -8.229440e+16 1.645888e+17
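Step 4's rule (dummy-encode float columns with fewer than 20 unique values, pass the rest through) could look roughly like this; the helper name and return shape are mine, but the prefix produces names like 'A12_dummy_98.0', matching the feature lists in the results above:

```python
import pandas as pd

def dummy_low_cardinality(df, float_cols, max_unique=20):
    """Dummy-encode float columns with few unique values; columns at or
    above the threshold (e.g. A6 with 39 uniques) pass through untouched."""
    dummy_cols, passthrough = [], []
    for col in float_cols:
        if df[col].nunique() < max_unique:
            dummies = pd.get_dummies(df[col], prefix=col + '_dummy')
            df = pd.concat([df.drop(columns=col), dummies], axis=1)
            dummy_cols += list(dummies.columns)
        else:
            passthrough.append(col)
    return df, dummy_cols, passthrough
```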


Base strategy 01, change 01
6. Drop samples with target < 0.85 (inserted after step 3 and before step 4 above)

name model mean std
0 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... DecisionTreeRegressor -8.219052e-04 6.105489e-05
1 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... RandomForestRegressor -6.369389e-04 4.182528e-05
2 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... XGBRegressor -5.989824e-04 4.076728e-05
3 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... SVR -8.555181e-04 4.585744e-05
4 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... LinearRegression -6.930078e+16 1.386016e+17

The errors did all decrease, by roughly 1e-04.

Aside: computing the baseline error
mean_squared_error(train_data[target_column],[train_data[target_column].mean() for i in range(0,train_data.shape[0])])
0.0008223730641064133

That is, the MSE of always predicting the target's own mean is 0.00082, which gives a rough yardstick for how far we are from blind guessing.
Our current level is basically blind guessing.
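This baseline is easy to sanity-check: the mean predictor's MSE is exactly the population variance of the target, so the same number comes out without sklearn (the values below are made-up stand-ins for the real target column):

```python
import numpy as np

y = np.array([0.902, 0.930, 0.925, 0.880, 0.950])  # hypothetical target values
baseline_mse = np.mean((y - y.mean()) ** 2)        # MSE of always predicting the mean
assert np.isclose(baseline_mse, np.var(y, ddof=0)) # identical to the population variance
```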

Features 05_Time feature processing

Typical formats:
9:00:00 : B7, B5, A9, A5, A26, A24, A16, A14, A11
11:00-12:00 : B9, B4, A28, A20

Extracted features:
Type 1: start time (hour)
Type 2: start time (hour), end time (hour), duration (hours)
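A minimal sketch of both extractions (function names are mine; the notebook's actual helpers are not shown):

```python
def parse_point(t):
    """'9:00:00' -> hour of day as a float (9.0)."""
    h, m, *_ = t.split(':')
    return int(h) + int(m) / 60.0

def parse_range(t):
    """'11:00-12:00' -> (start_hour, end_hour, duration_hours);
    the modulo handles ranges that wrap past midnight."""
    start, end = (parse_point(p) for p in t.split('-'))
    return start, end, (end - start) % 24
```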

Appended at the end of the baseline.

Score before processing (at this point the features are only the dummies derived from the float features):
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
name model mean std
0 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... DecisionTreeRegressor -8.049540e-04 7.960457e-05
1 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... RandomForestRegressor -6.537043e-04 3.725424e-05
2 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... XGBRegressor -5.961219e-04 3.902166e-05
3 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... SVR -8.561289e-04 4.623989e-05
4 71['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... LinearRegression -6.955097e+16 1.391019e+17

Time features alone, and time features combined with the original float features:
name model mean std
0 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... DecisionTreeRegressor -8.652548e-04 7.786419e-05
1 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... RandomForestRegressor -6.603347e-04 4.249767e-05
2 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... XGBRegressor -6.457705e-04 3.340577e-05
3 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... SVR -8.561289e-04 4.623989e-05
4 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... LinearRegression -7.675846e-03 1.384563e-02
5 92['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... DecisionTreeRegressor -8.042192e-04 5.273865e-05
6 92['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... RandomForestRegressor -5.910289e-04 3.373143e-05
7 92['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... XGBRegressor -5.538508e-04 2.729725e-05
8 92['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... SVR -8.561289e-04 4.623989e-05
9 92['A12_dummy_98.0', 'A12_dummy_100.0', 'A12_d... LinearRegression -1.082012e+15 2.148065e+15

The plots likewise show no large differences; values look roughly evenly spread.

2. Convert the time features to int at 10-minute granularity
name model mean std
0 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... DecisionTreeRegressor -9.235728e-04 6.219508e-05
1 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... RandomForestRegressor -6.644564e-04 5.259317e-05
2 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... XGBRegressor -6.250570e-04 2.938126e-05
3 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... SVR -8.561289e-04 4.623989e-05
4 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... LinearRegression -7.507477e-04 8.794647e-05
5 177['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm... DecisionTreeRegressor -7.938051e-04 3.984663e-05
6 177['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm... RandomForestRegressor -5.772725e-04 3.244304e-05
7 177['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm... XGBRegressor -5.519323e-04 3.310014e-05
8 177['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm... SVR -8.561289e-04 4.623989e-05
9 177['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm... LinearRegression -3.866811e+15 7.706426e+15
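The 10-minute discretization in change 2 might look like the following sketch (my assumption of the bucketing, not the notebook's code):

```python
def ten_minute_slot(t):
    """'9:30:00' -> index of the 10-minute slot since midnight (9*6 + 3 = 57)."""
    h, m = (int(x) for x in t.split(':')[:2])
    return h * 6 + m // 10
```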

3. Set cannot_dummy_columns to empty, so all float features are dummy-encoded
name model mean std
0 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... DecisionTreeRegressor -8.925815e-04 3.946585e-05
1 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... RandomForestRegressor -6.580392e-04 5.113999e-05
2 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... XGBRegressor -6.250570e-04 2.938126e-05
3 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... SVR -8.561289e-04 4.623989e-05
4 21['B7_factor_hh', 'B5_factor_hh', 'A9_factor_... LinearRegression -7.507477e-04 8.794647e-05
5 174['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm... DecisionTreeRegressor -8.027813e-04 7.862450e-05
6 174['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm... RandomForestRegressor -5.781245e-04 2.876818e-05
7 174['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm... XGBRegressor -5.493320e-04 3.110661e-05
8 174['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm... SVR -8.561289e-04 4.623989e-05
9 174['A6_dummy_17.0', 'A6_dummy_18.0', 'A6_dumm... LinearRegression -7.316864e+15 1.463373e+16

Features 06_Two-strategy comparison ref01

ref01 strategy:
1. Drop near-constant columns: ['B3', 'B13', 'A13', 'A18', 'A23']
2. Drop columns with a missing rate above 90% (as described): A1, A2, A3, A4, B2
3. train: keep only rows with target > 0.87
4. all_data.fillna(-1)
5. Convert time-of-day strings to float (seconds / 3600): ['A5', 'A7', 'A9', 'A11', 'A14', 'A16', 'A24', 'A26', 'B5', 'B7']
6. Compute time differences (seconds / 3600): ['A20', 'A28', 'B4', 'B9', 'B10', 'B11']
7. Convert all features to integer codes
8. Data sample (样本id and target are dropped later):
train.head(5)
样本id A5 A6 A7 A8 A9 A10 A11 A12 A14 A15 A16 A17 A19 A20 A21 A22 A24 A25 A26 A27 A28 B1 B4 B5 B6 B7 B8 B9 B10 B11 B12 B14 target
0 sample_1528 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.879
1 sample_1698 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 0 0 1 1 0 0.902
2 sample_639 1 1 0 0 1 2 1 1 1 1 1 1 1 0 0 0 1 2 1 1 1 1 0 1 1 2 0 0 0 1 1 0 0.936
3 sample_483 2 0 0 0 2 0 2 0 2 0 2 0 1 0 0 1 2 3 2 2 1 2 0 2 0 3 0 0 0 0 0 0 0.902
4 sample_617 3 1 0 0 3 1 3 1 3 1 3 1 1 1 0 0 3 1 3 1 1 1 0 3 1 4 0 0 0 1 1 1 0.983
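Step 7's int conversion appears, from the sample above (codes follow order of appearance down the rows), to be something like pandas factorize; a sketch under that assumption:

```python
import pandas as pd

def encode_to_int_codes(df, skip=('样本id', 'target')):
    """Replace every feature column with order-of-appearance integer codes."""
    for col in df.columns:
        if col not in skip:
            df[col] = pd.factorize(df[col])[0]  # first distinct value -> 0, next -> 1, ...
    return df
```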
9. Evaluation:
for model in [DecisionTreeRegressor(), RandomForestRegressor(), XGBRegressor(), SVR(), LinearRegression()]:
    score = cross_val_score(model, X_train, y_train, scoring='neg_mean_squared_error', cv=5)  # 'mean_squared_error' only works in old sklearn
    print(model.__class__.__name__ + ' ' + str(score.mean()) + ' ' + str(score.std()))

DecisionTreeRegressor -0.0003171201297343733 3.808162841960937e-05
RandomForestRegressor -0.000224005974881151 2.2218174576281365e-05
XGBRegressor -0.00021295863145713904 2.0333409455864207e-05
SVR -0.0009419628520902001 4.666016023199918e-05
LinearRegression -0.0006419399676652643 4.0137774687448616e-05

10. Cut target into 5 bins and dummy-encode them into 5 columns
11. For each categorical feature (the data columns above):
for f1 in categorical_columns:
    for f2 in the_5_binned_label_columns:
        order_label = train.groupby([f1])[f2].mean()
        for df in [train, test]:
            df[col_name] = df[f].map(order_label)  # the logic error is right here
This code is odd:
Oddity 1: df[f].map is a typo for df[f1].map; this can be fixed later.
Oddity 2: the author says the score dropped after fixing it. Decide later based on how large the difference turns out to be.
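With the f1 typo fixed, and assuming col_name is built from f1 and f2 (which matches the '*_intTarget_*_mean' names in step 13), step 11 would read:

```python
import pandas as pd

def target_bin_mean_encode(train, test, cat_cols, bin_cols):
    """For each category column, map each category to the mean of every
    target-bin indicator column (the corrected df[f1].map version)."""
    for f1 in cat_cols:
        for f2 in bin_cols:
            order_label = train.groupby(f1)[f2].mean()
            col_name = f'{f1}_{f2}_mean'
            for df in (train, test):
                df[col_name] = df[f1].map(order_label)  # f1, not f
    return train, test
```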

12. Drop the 5 bin columns (they only served the statistics above and are then discarded)
13. Final features:
train.columns
Index(['A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A14', 'A15',
...
'B12_intTarget_0.0_mean', 'B12_intTarget_1.0_mean', 'B12_intTarget_2.0_mean', 'B12_intTarget_3.0_mean', 'B12_intTarget_4.0_mean', 'B14_intTarget_0.0_mean', 'B14_intTarget_1.0_mean', 'B14_intTarget_2.0_mean', 'B14_intTarget_3.0_mean', 'B14_intTarget_4.0_mean'], dtype='object', length=192)
train.head(5)
A5 A6 A7 A8 A9 A10 A11 A12 A14 A15 A16 A17 A19 A20 A21 A22 A24 A25 A26 A27 A28 B1 B4 B5 B6 B7 B8 B9 B10 B11 B12 B14 A5_intTarget_0.0_mean A5_intTarget_1.0_mean A5_intTarget_2.0_mean A5_intTarget_3.0_mean A5_intTarget_4.0_mean A6_intTarget_0.0_mean A6_intTarget_1.0_mean A6_intTarget_2.0_mean A6_intTarget_3.0_mean A6_intTarget_4.0_mean A7_intTarget_0.0_mean A7_intTarget_1.0_mean A7_intTarget_2.0_mean A7_intTarget_3.0_mean A7_intTarget_4.0_mean A8_intTarget_0.0_mean A8_intTarget_1.0_mean A8_intTarget_2.0_mean A8_intTarget_3.0_mean A8_intTarget_4.0_mean A9_intTarget_0.0_mean A9_intTarget_1.0_mean A9_intTarget_2.0_mean A9_intTarget_3.0_mean A9_intTarget_4.0_mean A10_intTarget_0.0_mean A10_intTarget_1.0_mean A10_intTarget_2.0_mean A10_intTarget_3.0_mean A10_intTarget_4.0_mean A11_intTarget_0.0_mean A11_intTarget_1.0_mean A11_intTarget_2.0_mean A11_intTarget_3.0_mean A11_intTarget_4.0_mean A12_intTarget_0.0_mean A12_intTarget_1.0_mean \
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.285714 0.285714 0.428571 0.000000 0.000000 0.157895 0.210526 0.438596 0.105263 0.070175 0.161526 0.284903 0.396916 0.090097 0.056818 0.161526 0.284903 0.396916 0.090097 0.056818 0.058824 0.470588 0.235294 0.117647 0.117647 0.177743 0.272025 0.387944 0.092736 0.052550 0.058824 0.470588 0.235294 0.117647 0.117647 0.177500 0.26500
1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0 0 0 1 1 0 0.285714 0.285714 0.428571 0.000000 0.000000 0.157895 0.210526 0.438596 0.105263 0.070175 0.161526 0.284903 0.396916 0.090097 0.056818 0.161526 0.284903 0.396916 0.090097 0.056818 0.058824 0.470588 0.235294 0.117647 0.117647 0.177743 0.272025 0.387944 0.092736 0.052550 0.058824 0.470588 0.235294 0.117647 0.117647 0.177500 0.26500
2 1 1 0 0 1 2 1 1 1 1 1 1 1 0 0 0 1 2 1 1 1 1 0 1 1 2 0 0 0 1 1 0 0.285714 0.285714 0.428571 0.000000 0.000000 0.157895 0.210526 0.438596 0.105263 0.070175 0.161526 0.284903 0.396916 0.090097 0.056818 0.161526 0.284903 0.396916 0.090097 0.056818 0.058824 0.470588 0.235294 0.117647 0.117647 0.177743 0.272025 0.387944 0.092736 0.052550 0.058824 0.470588 0.235294 0.117647 0.117647 0.177500 0.26500
3 2 0 0 0 2 0 2 0 2 0 2 0 1 0 0 1 2 3 2 2 1 2 0 2 0 3 0 0 0 0 0 0 0.285714 0.285714 0.428571 0.000000 0.000000 0.157895 0.210526 0.438596 0.105263 0.070175 0.161526 0.284903 0.396916 0.090097 0.056818 0.161526 0.284903 0.396916 0.090097 0.056818 0.058824 0.470588 0.235294 0.117647 0.117647 0.177743 0.272025 0.387944 0.092736 0.052550 0.058824 0.470588 0.235294 0.117647 0.117647 0.177500 0.26500
4 3 1 0 0 3 1 3 1 3 1 3 1 1 1 0 0 3 1 3 1 1 1 0 3 1 4 0 0 0 1 1 1 0.130769 0.307692 0.369231 0.096154 0.080769 0.160243 0.292089 0.383367 0.093306 0.064909 0.250000 0.250000 0.500000 0.000000 0.000000 0.169492 0.220339 0.432203 0.110169 0.059322 0.134146 0.300813 0.394309 0.085366 0.077236 0.107383 0.325503 0.402685 0.097315 0.067114 0.130612 0.306122 0.391837 0.085714 0.077551 0.141593 0.29646
A12_intTarget_2.0_mean A12_intTarget_3.0_mean A12_intTarget_4.0_mean A14_intTarget_0.0_mean A14_intTarget_1.0_mean A14_intTarget_2.0_mean A14_intTarget_3.0_mean A14_intTarget_4.0_mean A15_intTarget_0.0_mean A15_intTarget_1.0_mean A15_intTarget_2.0_mean A15_intTarget_3.0_mean A15_intTarget_4.0_mean A16_intTarget_0.0_mean A16_intTarget_1.0_mean A16_intTarget_2.0_mean A16_intTarget_3.0_mean A16_intTarget_4.0_mean A17_intTarget_0.0_mean A17_intTarget_1.0_mean A17_intTarget_2.0_mean A17_intTarget_3.0_mean A17_intTarget_4.0_mean A19_intTarget_0.0_mean A19_intTarget_1.0_mean A19_intTarget_2.0_mean A19_intTarget_3.0_mean A19_intTarget_4.0_mean A20_intTarget_0.0_mean A20_intTarget_1.0_mean A20_intTarget_2.0_mean A20_intTarget_3.0_mean A20_intTarget_4.0_mean A21_intTarget_0.0_mean A21_intTarget_1.0_mean A21_intTarget_2.0_mean A21_intTarget_3.0_mean A21_intTarget_4.0_mean A22_intTarget_0.0_mean A22_intTarget_1.0_mean A22_intTarget_2.0_mean A22_intTarget_3.0_mean \
0 0.39250 0.095000 0.057500 0.055556 0.500000 0.222222 0.111111 0.111111 0.179402 0.275748 0.388704 0.079734 0.053156 0.058824 0.470588 0.235294 0.117647 0.117647 0.195402 0.258621 0.425287 0.074713 0.028736 0.147903 0.275938 0.412804 0.092715 0.055188 0.171815 0.281853 0.386100 0.090734 0.055985 0.158743 0.283642 0.398872 0.09025 0.058824 0.153527 0.289627 0.400000 0.087967
1 0.39250 0.095000 0.057500 0.055556 0.500000 0.222222 0.111111 0.111111 0.179402 0.275748 0.388704 0.079734 0.053156 0.058824 0.470588 0.235294 0.117647 0.117647 0.195402 0.258621 0.425287 0.074713 0.028736 0.147903 0.275938 0.412804 0.092715 0.055188 0.171815 0.281853 0.386100 0.090734 0.055985 0.158743 0.283642 0.398872 0.09025 0.058824 0.153527 0.289627 0.400000 0.087967
2 0.39250 0.095000 0.057500 0.055556 0.500000 0.222222 0.111111 0.111111 0.179402 0.275748 0.388704 0.079734 0.053156 0.058824 0.470588 0.235294 0.117647 0.117647 0.195402 0.258621 0.425287 0.074713 0.028736 0.147903 0.275938 0.412804 0.092715 0.055188 0.171815 0.281853 0.386100 0.090734 0.055985 0.158743 0.283642 0.398872 0.09025 0.058824 0.153527 0.289627 0.400000 0.087967
3 0.39250 0.095000 0.057500 0.055556 0.500000 0.222222 0.111111 0.111111 0.179402 0.275748 0.388704 0.079734 0.053156 0.058824 0.470588 0.235294 0.117647 0.117647 0.195402 0.258621 0.425287 0.074713 0.028736 0.147903 0.275938 0.412804 0.092715 0.055188 0.171815 0.281853 0.386100 0.090734 0.055985 0.158743 0.283642 0.398872 0.09025 0.058824 0.153527 0.289627 0.400000 0.087967
4 0.40118 0.082596 0.067847 0.130081 0.304878 0.394309 0.085366 0.077236 0.164978 0.280753 0.390738 0.088278 0.068017 0.130081 0.304878 0.394309 0.085366 0.077236 0.160294 0.302941 0.388235 0.080882 0.060294 0.160535 0.283166 0.399108 0.088071 0.061315 0.154676 0.281775 0.401679 0.091127 0.062350 0.000000 1.000000 0.000000 0.00000 0.000000 0.211765 0.217647 0.388235 0.105882
A22_intTarget_4.0_mean A24_intTarget_0.0_mean A24_intTarget_1.0_mean A24_intTarget_2.0_mean A24_intTarget_3.0_mean A24_intTarget_4.0_mean A25_intTarget_0.0_mean A25_intTarget_1.0_mean A25_intTarget_2.0_mean A25_intTarget_3.0_mean A25_intTarget_4.0_mean A26_intTarget_0.0_mean A26_intTarget_1.0_mean A26_intTarget_2.0_mean A26_intTarget_3.0_mean A26_intTarget_4.0_mean A27_intTarget_0.0_mean A27_intTarget_1.0_mean A27_intTarget_2.0_mean A27_intTarget_3.0_mean A27_intTarget_4.0_mean A28_intTarget_0.0_mean A28_intTarget_1.0_mean A28_intTarget_2.0_mean A28_intTarget_3.0_mean A28_intTarget_4.0_mean B1_intTarget_0.0_mean B1_intTarget_1.0_mean B1_intTarget_2.0_mean B1_intTarget_3.0_mean B1_intTarget_4.0_mean B4_intTarget_0.0_mean B4_intTarget_1.0_mean B4_intTarget_2.0_mean B4_intTarget_3.0_mean B4_intTarget_4.0_mean B5_intTarget_0.0_mean B5_intTarget_1.0_mean B5_intTarget_2.0_mean B5_intTarget_3.0_mean B5_intTarget_4.0_mean B6_intTarget_0.0_mean B6_intTarget_1.0_mean \
0 0.058921 0.090909 0.454545 0.363636 0.090909 0.000000 0.130435 0.260870 0.434783 0.000000 0.130435 0.090909 0.454545 0.363636 0.090909 0.000000 0.086957 0.304348 0.413043 0.043478 0.130435 0.153846 0.246154 0.410256 0.097436 0.071795 0.133929 0.339286 0.375000 0.080357 0.062500 0.158046 0.286398 0.396552 0.090038 0.059387 0.117647 0.176471 0.529412 0.176471 0.000000 0.180272 0.238095
1 0.058921 0.090909 0.454545 0.363636 0.090909 0.000000 0.130435 0.260870 0.434783 0.000000 0.130435 0.090909 0.454545 0.363636 0.090909 0.000000 0.086957 0.304348 0.413043 0.043478 0.130435 0.153846 0.246154 0.410256 0.097436 0.071795 0.133929 0.339286 0.375000 0.080357 0.062500 0.158046 0.286398 0.396552 0.090038 0.059387 0.117647 0.176471 0.529412 0.176471 0.000000 0.180272 0.238095
2 0.058921 0.090909 0.454545 0.363636 0.090909 0.000000 0.130435 0.260870 0.434783 0.000000 0.130435 0.090909 0.454545 0.363636 0.090909 0.000000 0.086957 0.304348 0.413043 0.043478 0.130435 0.153846 0.246154 0.410256 0.097436 0.071795 0.133929 0.339286 0.375000 0.080357 0.062500 0.158046 0.286398 0.396552 0.090038 0.059387 0.117647 0.176471 0.529412 0.176471 0.000000 0.180272 0.238095
3 0.058921 0.090909 0.454545 0.363636 0.090909 0.000000 0.130435 0.260870 0.434783 0.000000 0.130435 0.090909 0.454545 0.363636 0.090909 0.000000 0.086957 0.304348 0.413043 0.043478 0.130435 0.153846 0.246154 0.410256 0.097436 0.071795 0.133929 0.339286 0.375000 0.080357 0.062500 0.158046 0.286398 0.396552 0.090038 0.059387 0.117647 0.176471 0.529412 0.176471 0.000000 0.180272 0.238095
4 0.058824 0.127572 0.308642 0.382716 0.090535 0.082305 0.159851 0.284387 0.390335 0.100372 0.055762 0.129032 0.318548 0.383065 0.084677 0.076613 0.151515 0.271132 0.408293 0.095694 0.066986 0.161996 0.282837 0.401051 0.087566 0.057793 0.152203 0.292390 0.397864 0.089453 0.061415 0.196581 0.188034 0.410256 0.102564 0.085470 0.120930 0.316279 0.376744 0.093023 0.083721 0.154088 0.273585
B6_intTarget_2.0_mean B6_intTarget_3.0_mean B6_intTarget_4.0_mean B7_intTarget_0.0_mean B7_intTarget_1.0_mean B7_intTarget_2.0_mean B7_intTarget_3.0_mean B7_intTarget_4.0_mean B8_intTarget_0.0_mean B8_intTarget_1.0_mean B8_intTarget_2.0_mean B8_intTarget_3.0_mean B8_intTarget_4.0_mean B9_intTarget_0.0_mean B9_intTarget_1.0_mean B9_intTarget_2.0_mean B9_intTarget_3.0_mean B9_intTarget_4.0_mean B10_intTarget_0.0_mean B10_intTarget_1.0_mean B10_intTarget_2.0_mean B10_intTarget_3.0_mean B10_intTarget_4.0_mean B11_intTarget_0.0_mean B11_intTarget_1.0_mean B11_intTarget_2.0_mean B11_intTarget_3.0_mean B11_intTarget_4.0_mean B12_intTarget_0.0_mean B12_intTarget_1.0_mean B12_intTarget_2.0_mean B12_intTarget_3.0_mean B12_intTarget_4.0_mean B14_intTarget_0.0_mean B14_intTarget_1.0_mean B14_intTarget_2.0_mean B14_intTarget_3.0_mean B14_intTarget_4.0_mean
0 0.394558 0.102041 0.071429 0.111111 0.333333 0.444444 0.111111 0.000000 0.151206 0.286642 0.397032 0.089981 0.064007 0.167331 0.285857 0.392430 0.085657 0.060757 0.15762 0.286013 0.395616 0.087683 0.063674 0.171463 0.264988 0.411271 0.087530 0.049161 0.179443 0.270035 0.397213 0.090592 0.047038 0.171196 0.305707 0.366848 0.088315 0.058424
1 0.394558 0.102041 0.071429 0.111111 0.333333 0.444444 0.111111 0.000000 0.151206 0.286642 0.397032 0.089981 0.064007 0.167331 0.285857 0.392430 0.085657 0.060757 0.15762 0.286013 0.395616 0.087683 0.063674 0.171463 0.264988 0.411271 0.087530 0.049161 0.179443 0.270035 0.397213 0.090592 0.047038 0.171196 0.305707 0.366848 0.088315 0.058424
2 0.394558 0.102041 0.071429 0.111111 0.333333 0.444444 0.111111 0.000000 0.151206 0.286642 0.397032 0.089981 0.064007 0.167331 0.285857 0.392430 0.085657 0.060757 0.15762 0.286013 0.395616 0.087683 0.063674 0.171463 0.264988 0.411271 0.087530 0.049161 0.179443 0.270035 0.397213 0.090592 0.047038 0.171196 0.305707 0.366848 0.088315 0.058424
3 0.394558 0.102041 0.071429 0.111111 0.333333 0.444444 0.111111 0.000000 0.151206 0.286642 0.397032 0.089981 0.064007 0.167331 0.285857 0.392430 0.085657 0.060757 0.15762 0.286013 0.395616 0.087683 0.063674 0.171463 0.264988 0.411271 0.087530 0.049161 0.179443 0.270035 0.397213 0.090592 0.047038 0.171196 0.305707 0.366848 0.088315 0.058424
4 0.410377 0.091195 0.064465 0.117647 0.176471 0.411765 0.117647 0.176471 0.232558 0.255814 0.325581 0.139535 0.046512 0.000000 0.333333 0.666667 0.000000 0.000000 0.16318 0.255230 0.447699 0.071130 0.046025 0.145349 0.312016 0.374031 0.098837 0.065891 0.150065 0.289780 0.401035 0.089263 0.063389 0.148936 0.240122 0.431611 0.106383 0.063830
14. Evaluation results
The author runs lgb directly, but cross_val_score fails for me (the native lgb API has no fit method, only train).
At this point X_train, y_train cannot be fed directly to sklearn estimators (dtype errors),
and cross_val_score cannot wrap the native lgb API either (no fit method).
Remove rows with bad values:
np.any(np.isnan(X_train),axis=1)
=> np.logical_not(np.any(np.isnan(X_train),axis=1))
=> X_train,y_train=X_train[np.logical_not(np.any(np.isnan(X_train),axis=1))],y_train[np.logical_not(np.any(np.isnan(X_train),axis=1))]

DecisionTreeRegressor -0.00029905497736250873 3.00256432609409e-05
RandomForestRegressor -0.00020760366902823285 2.6380383953972225e-05
XGBRegressor -0.0001981767607924398 2.808226334622625e-05
SVR -0.0008958782405495207 5.1208639280237106e-05
LinearRegression -48030001685423.28 96060003370846.56


Results after fixing the bugs noted above
Bug 1: in step 11,
df[col_name] = df[f].map(order_label)  # the logic error is right here
was changed to: df[col_name] = df[f1].map(order_label)
Step-14 results after the fix:
DecisionTreeRegressor -0.0003388445203694478 2.5746253315096493e-05
RandomForestRegressor -0.00022420747637809046 2.0379324751299676e-05
XGBRegressor -0.00018904401979792253 2.739032634668878e-05
SVR -0.0009419628520902001 4.666016023199918e-05
LinearRegression -2.9286245046440532e+16 5.85713898888668e+16
Observation: DecisionTree, SVR, etc. all got worse.

Bug 2 (on top of the bug-1 fix):
Step 1 deletes the near-constant columns
['B3', 'B13', 'A13', 'A18', 'A23']
A23 float 1393 0.998 4 [(5.0, 1391), (10.0, 1), (4.0, 1)] 5.002872 0.136638 4.000 5.000 5.000 5.000 10.0000
A18 float 1396 1.000 2 [(0.2, 1395), (0.1, 1)] 0.199928 0.002676 0.100 0.200 0.200 0.200 0.2000
A13 float 1396 1.000 3 [(0.2, 1394), (0.15, 1), (0.12, 1)] 0.199907 0.002524 0.120 0.200 0.200 0.200 0.2000
B13 float 1395 0.999 4 [(0.15, 1388), (0.03, 6), (0.06, 1)] 0.149419 0.008213 0.030 0.150 0.150 0.150 0.1500
B3 float 1394 0.999 3 [(3.5, 1393), (3.6, 1)] 3.500072 0.002678 3.500 3.500 3.500 3.500 3.6000


Step 2 claims to delete features with over 90% missing values, but what it actually deletes are again near-constant features (most values concentrated on a single value):
type count count_rate unique_count unique_set mean std min 25% 50% 75% max
B2 float 1394 0.999 4 [(3.5, 1374), (0.15, 19), (3.6, 1)] 3.454412 0.388585 0.150 3.500 3.500 3.500 3.6000
A3 float 1354 0.970 4 [(405.0, 1336), (270.0, 12), (340.0, 6)] 403.515510 13.348093 270.000 405.000 405.000 405.000 405.0000
A4 int 1396 1.000 4 [(700, 1336), (980, 42), (470, 12), (590, 6)] 705.974212 53.214754 470.000 700.000 700.000 700.000 980.0000
A2 float 42 0.030 2 [(125.0, 42)] 125.000000 0.000000 125.000 125.000 125.000 125.000 125.0000
A1 int 1396 1.000 3 [(300, 1377), (200, 13), (250, 6)] 298.853868 10.130552 200.000 300.000 300.000 300.000 300.0000

This step in fact deletes the same things as step 1; it should additionally have dropped:
my own deletions: ['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23', 'B3', 'B13']
['A2', 'A7', 'A8', 'B11']

Comparing:
'A7', 'A8', 'B11'
B11 object 547 0.392 38 [(20:00-21:00, 154), (12:00-13:00, 140), (4:00... NaN NaN NaN NaN NaN NaN NaN
A7 object 149 0.107 76 [(12:40:00, 13), (15:40:00, 7), (7:00:00, 5), ... NaN NaN NaN NaN NaN NaN NaN
A8 float 149 0.107 9 [(80.0, 118), (73.0, 16), (74.0, 8), (82.0, 3)... 78.818792 2.683920 70.000 80.000 80.000 80.000 82.0000

I believe these should be dropped as well.
Step-14 results after the change:
DecisionTreeRegressor -0.0003162703674004905 3.201639535961476e-05
RandomForestRegressor -0.0002238364737549146 1.9140884255549184e-05
XGBRegressor -0.00019055478967780663 2.6732679575302225e-05
SVR -0.0009419628520902001 4.666016023199918e-05
LinearRegression -8.346887378848443e+16 1.6606005966762346e+17

Overall, essentially no impact.

Feature 07_two-strategy comparison (self)

Base strategy 01 (the ref01_bug2 version)
1. Drop features with a missing rate > 90%: ['A2', 'A7', 'A8', 'B10', 'B11']
2. Drop single-valued features: ['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23', 'B3', 'B13']
The above corresponds to ref02 steps 1-2.
Difference: B10 is additionally dropped.
3. Fill missing values with the mode: ['A21', 'A23', 'A24', 'A26', 'A3', 'B1', 'B2', 'B3', 'B12', 'B13', 'B5', 'B8']
4. Treat target < 0.87 as outlier points.
The above corresponds to ref02 steps 3-4.
Difference: missing values are filled with the mode rather than -1.
5. Special handling of the dirty data in A25.
6. oneEncoder: ['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A27', 'B1', 'B8', 'B12']
7. Clean up the dirty values in each object feature.
8.
time_features = ['B7', 'B5', 'A9', 'A5', 'A26', 'A24', 'A16', 'A14', 'A11']
period_features = ['B9', 'B4', 'A28', 'A20']
Time features: categories converted to int (one unit = 10 minutes).
Time differences computed in the same 10-minute units.
time_features: 1 -> 1
period_features: 1 -> 3 (start, end, duration)

The above corresponds to ref steps 5, 6, 7.
Differences: ref01 turns each period feature into a single feature (the time difference), here it becomes 3 features.
ref01 also processed B10, but here B10 was dropped.
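A minimal sketch of the step-8 conversions (the 10-minute unit and the 1 -> 3 period split); the helper names are my own, not the pipeline's:

```python
def time_to_units(s):
    """'9:30:00' -> 9*6 + 3 = 57, i.e. 10-minute units since midnight."""
    parts = s.split(':')
    return int(parts[0]) * 6 + int(parts[1]) // 10

def period_to_features(s):
    """'11:00-12:00' -> (start, end, duration), all in 10-minute units."""
    start_s, end_s = s.split('-')
    start, end = time_to_units(start_s), time_to_units(end_s)
    duration = end - start
    if duration < 0:          # period crosses midnight, e.g. '23:00-1:00'
        duration += 24 * 6
    return start, end, duration
```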

Effect:
feature_columns
['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A27', 'B1', 'B8', 'B12', 'B7_factor_hh', 'B5_factor_hh', 'A9_factor_hh', 'A5_factor_hh', 'A26_factor_hh', 'A24_factor_hh', 'A16_factor_hh', 'A14_factor_hh', 'A11_factor_hh', 'B9_factor_sh', 'B9_factor_eh', 'B9_factor_pd', 'B4_factor_sh', 'B4_factor_eh', 'B4_factor_pd', 'A28_factor_sh', 'A28_factor_eh', 'A28_factor_pd', 'A20_factor_sh', 'A20_factor_eh', 'A20_factor_pd']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
name model mean std
0 31['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... DecisionTreeRegressor -0.000809 0.000056
1 31['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... RandomForestRegressor -0.000585 0.000030
2 31['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... XGBRegressor -0.000543 0.000023
3 31['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... SVR -0.000943 0.000045
4 31['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... LinearRegression -0.000641 0.000079



########## Closing the gap
1. period_features: convert 1 -> 1 instead.
Effect:
feature_columns
['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A27', 'B1', 'B8', 'B12', 'B7_factor_hh', 'B5_factor_hh', 'A9_factor_hh', 'A5_factor_hh', 'A26_factor_hh', 'A24_factor_hh', 'A16_factor_hh', 'A14_factor_hh', 'A11_factor_hh', 'B9_factor_pd', 'B4_factor_pd', 'A28_factor_pd', 'A20_factor_pd']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
name model mean std
0 23['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... DecisionTreeRegressor -0.000800 0.000042
1 23['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... RandomForestRegressor -0.000572 0.000017
2 23['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... XGBRegressor -0.000537 0.000031
3 23['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... SVR -0.000943 0.000045
4 23['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... LinearRegression -0.000649 0.000088



2. Keep B10, filling missing values with '00:00-00:00'.

feature_columns
['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A27', 'B1', 'B8', 'B12', 'B7_factor_hh', 'B5_factor_hh', 'A9_factor_hh', 'A5_factor_hh', 'A26_factor_hh', 'A24_factor_hh', 'A16_factor_hh', 'A14_factor_hh', 'A11_factor_hh', 'B9_factor_pd', 'B4_factor_pd', 'A28_factor_pd', 'A20_factor_pd', 'B10_factor_pd']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
name model mean std
0 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... DecisionTreeRegressor -0.000787 0.000053
1 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... RandomForestRegressor -0.000563 0.000034
2 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... XGBRegressor -0.000537 0.000029
3 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... SVR -0.000943 0.000045
4 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... LinearRegression -0.000640 0.000083

The result improves very slightly.

Apart from these 2 points, my program should now match ref01's logic exactly,
yet ref01 reaches a score of 0.0002 by its step 11, while mine is still at 0.0005.

Details: ref01 keeps times as floats in h.m form, mine as integers.
1. Missing values: fillna(-1), later encoded as a dedicated label.
2. Encoding: mine is a 10-minute-unit int, ref01 uses label encoding.
Reworking along these 2 details (point 1 is skipped: missing values are extremely rare, so it can safely be ignored).

feature_columns
['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A27', 'B1', 'B8', 'B12', 'B7_factor_hh', 'B5_factor_hh', 'A9_factor_hh', 'A5_factor_hh', 'A26_factor_hh', 'A24_factor_hh', 'A16_factor_hh', 'A14_factor_hh', 'A11_factor_hh', 'B9_factor_pd', 'B4_factor_pd', 'A28_factor_pd', 'A20_factor_pd', 'B10_factor_pd']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
name model mean std
0 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... DecisionTreeRegressor -0.000785 0.000035
1 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... RandomForestRegressor -0.000582 0.000032
2 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... XGBRegressor -0.000537 0.000029
3 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... SVR -0.000943 0.000045
4 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... LinearRegression -0.000611 0.000052

Hmm, still no real change; let's just apply ref01's feature-processing function to these 2 feature groups directly.

feature_columns
['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A27', 'B1', 'B8', 'B12', 'A5', 'A9', 'A14', 'A16', 'A11', 'A24', 'A26', 'B5', 'B7', 'A20', 'A28', 'B4', 'B9', 'B10']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
name model mean std
0 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... DecisionTreeRegressor -0.000838 0.000029
1 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... RandomForestRegressor -0.000585 0.000031
2 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... XGBRegressor -0.000565 0.000045
3 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... SVR -0.000943 0.000045
4 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... LinearRegression -0.000667 0.000056

Still no change, as shown.
So work backwards: at which step does the divergence arise?
Next: the fillna strategy and the label-encoding strategy.
Every detail now seems consistent, but apparently it still doesn't get there:
name model mean std
0 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... DecisionTreeRegressor -0.000827 0.000066
1 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... RandomForestRegressor -0.000587 0.000025
2 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... XGBRegressor -0.000562 0.000043
3 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... SVR -0.000942 0.000047
4 24['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... LinearRegression -0.000668 0.000055

Found the cause: the feature set was simply missing the following features; with them added back, the score reaches 0.0002.
feature_columns.extend(['A10','A19','A25','B14','B6'])
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
Out[2]:
name model mean std
0 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... DecisionTreeRegressor -0.000341 0.000047
1 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... RandomForestRegressor -0.000228 0.000032
2 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... XGBRegressor -0.000217 0.000024
3 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... SVR -0.000942 0.000047
4 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... LinearRegression -0.000645 0.000042

These 5 features are all ints, so they had received no processing at all.
Rolling the related feature handling back to my own code:
Out[4]:
name model mean std
0 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... DecisionTreeRegressor -0.000285 0.000033
1 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... RandomForestRegressor -0.000200 0.000023
2 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... XGBRegressor -0.000206 0.000021
3 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... SVR -0.000942 0.000047
4 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... LinearRegression -0.000542 0.000092

Dropping the conversion of floats to one-hot encoding:
Out[2]:
name model mean std
0 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... DecisionTreeRegressor -0.000280 0.000025
1 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... RandomForestRegressor -0.000206 0.000023
2 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... XGBRegressor -0.000205 0.000021
3 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... SVR -0.000942 0.000047
4 29['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... LinearRegression -0.000544 0.000087


Restoring the 1 -> 3 split of the period features, and dropping the encoding of the pd (duration) features:
Out[2]:
name model mean std
0 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... DecisionTreeRegressor -0.000280 0.000040
1 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... RandomForestRegressor -0.000205 0.000019
2 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... XGBRegressor -0.000204 0.000017
3 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... SVR -0.000942 0.000047
4 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... LinearRegression -0.000550 0.000108

Dropping the one-hot encoding of the individual raw features: Out[2]:
name model mean std
0 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... DecisionTreeRegressor -0.000282 0.000046
1 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... RandomForestRegressor -0.000207 0.000023
2 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... XGBRegressor -0.000204 0.000017
3 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... SVR -0.000942 0.000047
4 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... LinearRegression -0.000551 0.000109


Feature screening with XGB-based RFE:
list(zip(feature_columns,rfecv.support_))
Out[18]:
[('B7_factor_hh', True),
('B5_factor_hh', False),
('A9_factor_hh', True),
('A5_factor_hh', True),
('A26_factor_hh', False),
('A24_factor_hh', True),
('A16_factor_hh', True),
('A14_factor_hh', False),
('A11_factor_hh', False),
('B9_factor_sh', False),
('B9_factor_eh', True),
('B9_factor_pd', True),
('B4_factor_sh', False),
('B4_factor_eh', False),
('B4_factor_pd', True),
('A28_factor_sh', False),
('A28_factor_eh', True),
('A28_factor_pd', False),
('A20_factor_sh', False),
('A20_factor_eh', False),
('A20_factor_pd', False),
('B10_factor_sh', True),
('B10_factor_eh', False),
('B10_factor_pd', False),
('A10', True),
('A19', False),
('A25', True),
('B14', True),
('B6', True)]
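The RFE screening above can be sketched as follows. Since xgboost may not be available here, a RandomForestRegressor (which also exposes feature_importances_) stands in for XGBRegressor, and the dataset is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Synthetic regression data standing in for train_data[feature_columns].
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=0.1, random_state=0)

rfecv = RFECV(RandomForestRegressor(n_estimators=30, random_state=0),
              step=1, cv=3, scoring='neg_mean_squared_error')
rfecv.fit(X, y)

# Same pattern as above: pair each feature name with its keep/drop flag.
feature_columns = [f'f{i}' for i in range(8)]
kept = list(zip(feature_columns, rfecv.support_))
```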

Feature 08_reassembling a baseline version

Initial score:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns])
Out[2]:
name model mean std
0 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... DecisionTreeRegressor -0.000278 0.000043
1 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... RandomForestRegressor -0.000207 0.000023
2 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... XGBRegressor -0.000204 0.000017
3 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... SVR -0.000942 0.000047
4 39['A6', 'A12', 'A15', 'A17', 'A21', 'A22', 'A... LinearRegression -0.000551 0.000109

Changes:
1. Fold the int-feature handling into the main pipeline.
2. Standardize the usage of feature_org and feature_handle.
Score:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 39['A10', 'A19', 'A25', 'B6', 'B14', 'A6', 'A1... DecisionTreeRegressor -0.000285 0.000047
1 39['A10', 'A19', 'A25', 'B6', 'B14', 'A6', 'A1... RandomForestRegressor -0.000204 0.000030
2 39['A10', 'A19', 'A25', 'B6', 'B14', 'A6', 'A1... XGBRegressor -0.000199 0.000021
3 39['A10', 'A19', 'A25', 'B6', 'B14', 'A6', 'A1... SVR -0.000942 0.000047
4 39['A10', 'A19', 'A25', 'B6', 'B14', 'A6', 'A1... LinearRegression -0.000551 0.000109


Test point 01: do not drop the single-valued features; fill them with mode() instead.
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[4]:
name model mean std
0 57['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23'... DecisionTreeRegressor -0.000287 0.000032
1 57['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23'... RandomForestRegressor -0.000201 0.000029
2 57['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23'... XGBRegressor -0.000200 0.000021
3 57['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23'... SVR -0.000942 0.000047
4 57['A1', 'A3', 'A4', 'B2', 'A13', 'A18', 'A23'... LinearRegression -0.002282 0.003443

The scores clearly degrade, so dropping the single-valued features is the better choice.

Test point 02: dummy conversion of int features with few distinct values.
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 49['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... DecisionTreeRegressor -0.000282 0.000034
1 49['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... RandomForestRegressor -0.000206 0.000024
2 49['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... XGBRegressor -0.000196 0.000022
3 49['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... SVR -0.000942 0.000047
4 49['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... LinearRegression -0.000546 0.000111

Converting all int features to dummies:
Out[2]:
name model mean std
0 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... DecisionTreeRegressor -0.000295 0.000026
1 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... RandomForestRegressor -0.000205 0.000013
2 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... XGBRegressor -0.000190 0.000020
3 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... SVR -0.000942 0.000047
4 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... LinearRegression -0.000244 0.000026

Changing the int fillna value to -1 (turns out not appropriate):
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... DecisionTreeRegressor -0.000285 0.000025
1 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... RandomForestRegressor -0.000209 0.000017
2 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... XGBRegressor -0.000190 0.000020
3 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... SVR -0.000942 0.000047
4 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... LinearRegression -0.000244 0.000026


Test point 03: floats with < 10 distinct values -> dummy; >= 10 -> keep as-is.
Out[2]:
name model mean std
0 132['A10_dummy_100', 'A10_dummy_101', 'A10_dum... DecisionTreeRegressor -2.909480e-04 3.179915e-05
1 132['A10_dummy_100', 'A10_dummy_101', 'A10_dum... RandomForestRegressor -2.110361e-04 1.625678e-05
2 132['A10_dummy_100', 'A10_dummy_101', 'A10_dum... XGBRegressor -1.904408e-04 2.114660e-05
3 132['A10_dummy_100', 'A10_dummy_101', 'A10_dum... SVR -9.419629e-04 4.666016e-05
4 132['A10_dummy_100', 'A10_dummy_101', 'A10_dum... LinearRegression -1.537115e+15 2.166613e+15
Conclusion: the scores degrade.


Dropping them entirely (no float-feature processing at all):
Out[2]:
name model mean std
0 107['A10_dummy_100', 'A10_dummy_101', 'A10_dum... DecisionTreeRegressor -2.905839e-04 2.038019e-05
1 107['A10_dummy_100', 'A10_dummy_101', 'A10_dum... RandomForestRegressor -2.293711e-04 1.696688e-05
2 107['A10_dummy_100', 'A10_dummy_101', 'A10_dum... XGBRegressor -2.014942e-04 2.127065e-05
3 107['A10_dummy_100', 'A10_dummy_101', 'A10_dum... SVR -9.419629e-04 4.666016e-05
4 107['A10_dummy_100', 'A10_dummy_101', 'A10_dum... LinearRegression -1.587048e+13 2.039310e+13

Restoring the default approach (use the floats directly):
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... DecisionTreeRegressor -0.000291 0.000036
1 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... RandomForestRegressor -0.000210 0.000024
2 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... XGBRegressor -0.000190 0.000020
3 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... SVR -0.000942 0.000047
4 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... LinearRegression -0.000244 0.000026

Feature 09_special feature handling

######  B14, clip(380, max)
Approach: analyze up front; if nothing looks wrong, replace the original B14 feature.

FeatureTools.get_column_corr(train_data,target_column,feature_columns=['B14','B14_factor_clip'])
INFO:root:column_corr:
type count count_rate unique_count unique_set mean std min 25% 50% 75% max
rate float 1381 1.000 64 [(0.902, 305), (0.93, 128), (0.890999999999999... 0.924277 0.028407 0.871 0.902 0.925 0.943 1.0008
B14 int 1381 1.000 19 [(400, 736), (420, 329), (440, 226), (460, 35)... 410.913831 25.222039 40.000 400.000 400.000 420.000 460.0000
B14_factor_clip int 1381 1.000 12 [(400, 736), (420, 329), (440, 226), (460, 35)... 412.260681 17.557032 380.000 400.000 400.000 420.000 460.0000
Out[2]:
pearson spearman mine
rate 1.000000 1.000000 0.997135
B14 0.462216 0.667185 0.835802
B14_factor_clip 0.635610 0.667125 0.835802
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['B14'],['B14_factor_clip']])
Out[3]:
name model mean std
0 1['B14'] DecisionTreeRegressor -0.000364 0.000036
1 1['B14'] RandomForestRegressor -0.000364 0.000033
2 1['B14'] XGBRegressor -0.000364 0.000036
3 1['B14'] SVR -0.000942 0.000047
4 1['B14'] LinearRegression -0.000648 0.000123
5 1['B14_factor_clip'] DecisionTreeRegressor -0.000369 0.000031
6 1['B14_factor_clip'] RandomForestRegressor -0.000369 0.000030
7 1['B14_factor_clip'] XGBRegressor -0.000369 0.000031
8 1['B14_factor_clip'] SVR -0.000942 0.000047
9 1['B14_factor_clip'] LinearRegression -0.000481 0.000040


As shown, the new feature is better; use B14_factor_clip.
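The B14 treatment is a plain lower clip; a sketch on a few of the values from the table above:

```python
import pandas as pd

# B14's extreme low value (40) gets pulled up to 380; the upper end stays put.
b14 = pd.Series([40, 400, 420, 440, 460])
b14_clip = b14.clip(lower=380)
```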

###### B6,clip(reverse(9),max)
print(FeatureTools.get_column_corr(train_data,target_column,feature_columns=['B6','B6_factor_clip']))
print(FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['B6'],['B6_factor_clip']]))
pearson spearman mine
rate 1.000000 1.000000 0.997135
B6 0.375929 0.401743 0.466027
B6_factor_clip 0.404206 0.402282 0.466027
name model mean std
0 1['B6'] DecisionTreeRegressor -0.000670 0.000037
1 1['B6'] RandomForestRegressor -0.000671 0.000040
2 1['B6'] XGBRegressor -0.000666 0.000039
3 1['B6'] SVR -0.000942 0.000047
4 1['B6'] LinearRegression -0.000694 0.000054
5 1['B6_factor_clip'] DecisionTreeRegressor -0.000674 0.000042
6 1['B6_factor_clip'] RandomForestRegressor -0.000673 0.000043
7 1['B6_factor_clip'] XGBRegressor -0.000673 0.000043
8 1['B6_factor_clip'] SVR -0.000942 0.000047
9 1['B6_factor_clip'] LinearRegression -0.000676 0.000049

Conclusion: adopt it.

A19,clip(200,400)
print(FeatureTools.get_column_corr(train_data,target_column,feature_columns=['A19','A19_factor_clip']))
print(FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['A19'],['A19_factor_clip']]))

pearson spearman mine
rate 1.000000 1.000000 0.997135
A19 -0.220994 -0.250017 0.376382
A19_factor_clip -0.283522 -0.295258 0.376382
name model mean std
0 1['A19'] DecisionTreeRegressor -0.000718 0.000053
1 1['A19'] RandomForestRegressor -0.000717 0.000051
2 1['A19'] XGBRegressor -0.000717 0.000052
3 1['A19'] SVR -0.000942 0.000047
4 1['A19'] LinearRegression -0.000769 0.000060
5 1['A19_factor_clip'] DecisionTreeRegressor -0.000743 0.000055
6 1['A19_factor_clip'] RandomForestRegressor -0.000743 0.000055
7 1['A19_factor_clip'] XGBRegressor -0.000743 0.000055
8 1['A19_factor_clip'] SVR -0.000942 0.000047
9 1['A19_factor_clip'] LinearRegression -0.000743 0.000055

Conclusion: essentially no impact, and LR improved; adopt it.


A27,clip(reverse(13),reverse(19))
print(FeatureTools.get_column_corr(train_data,target_column,feature_columns=['A27','A27_factor_clip']))
print(FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['A27'],['A27_factor_clip']]))
pearson spearman mine
rate 1.000000 1.000000 0.997135
A27 -0.174947 -0.251648 0.355126
A27_factor_clip -0.215349 -0.251601 0.355126
name model mean std
0 1['A27'] DecisionTreeRegressor -0.000685 0.000046
1 1['A27'] RandomForestRegressor -0.000687 0.000045
2 1['A27'] XGBRegressor -0.000685 0.000046
3 1['A27'] SVR -0.000942 0.000047
4 1['A27'] LinearRegression -0.000783 0.000063
5 1['A27_factor_clip'] DecisionTreeRegressor -0.000685 0.000046
6 1['A27_factor_clip'] RandomForestRegressor -0.000684 0.000047
7 1['A27_factor_clip'] XGBRegressor -0.000684 0.000046
8 1['A27_factor_clip'] SVR -0.000942 0.000047
9 1['A27_factor_clip'] LinearRegression -0.000770 0.000060



Final result (appended at the end):
Out[2]:
name model mean std
0 69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... DecisionTreeRegressor -0.000294 0.000025
1 69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... RandomForestRegressor -0.000209 0.000014
2 69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... XGBRegressor -0.000191 0.000019
3 69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... SVR -0.000942 0.000047
4 69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... LinearRegression -0.000248 0.000023

Reference comparison: score with none of the replacements applied:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... DecisionTreeRegressor -0.000290 0.000030
1 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... RandomForestRegressor -0.000215 0.000026
2 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... XGBRegressor -0.000190 0.000020
3 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... SVR -0.000942 0.000047
4 117['A10_dummy_100', 'A10_dummy_101', 'A10_dum... LinearRegression -0.000244 0.000026
The difference looks negligible, but the feature count drops considerably, so the replacements are worth making.

(Plots omitted: before/after comparisons for B14 clip(380, max), B6 clip(reverse(9), max), A19 clip(200, 400) and A27 clip(reverse(13), reverse(19)), plus the replaced and non-replaced score tables shown above.)

Feature 10_yield binning (the 'magic' feature)

1. Regular handling: applied only to non-object features with unique_count < 15.
train_data = all_data.loc[train_ids]
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['B14'],mean_columns])
Out[35]:
name model mean std
0 1['B14'] DecisionTreeRegressor -0.000369 0.000031
1 1['B14'] RandomForestRegressor -0.000369 0.000032
2 1['B14'] XGBRegressor -0.000369 0.000031
3 1['B14'] SVR -0.000942 0.000047
4 1['B14'] LinearRegression -0.000481 0.000040
5 5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i... DecisionTreeRegressor -0.000371 0.000031
6 5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i... RandomForestRegressor -0.000369 0.000031
7 5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i... XGBRegressor -0.000369 0.000031
8 5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i... SVR -0.000942 0.000047
9 5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i... LinearRegression -0.000369 0.000033

LR did indeed degrade; the others are barely affected.
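My reading of how the 'magic' feature is built, sketched on made-up data: bin the target (rate) into intervals, one-hot the bin index (intTarget_0, intTarget_1, ...), then map each feature to the per-category mean of every bin indicator, which yields column names like B14_to_B14_intTarget_0_mean. The bin edges here are arbitrary:

```python
import pandas as pd

df = pd.DataFrame({'B14': [400, 400, 420, 420],
                   'rate': [0.88, 0.90, 0.94, 0.96]})

# Bin the target into intervals and one-hot the bin index.
df['intTarget'] = pd.cut(df['rate'], bins=[0.85, 0.91, 1.01], labels=False)
dummies = pd.get_dummies(df['intTarget'], prefix='intTarget')

for f in ['B14']:
    for dcol in dummies.columns:
        # Share of each target bin within every level of f.
        order_label = dummies.groupby(df[f])[dcol].mean()
        df[f'{f}_to_{f}_{dcol}_mean'] = df[f].map(order_label)
```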

Combined effect:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_... DecisionTreeRegressor -2.972214e-04 3.689761e-05
1 74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_... RandomForestRegressor -2.131609e-04 1.459160e-05
2 74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_... XGBRegressor -1.929303e-04 2.213503e-05
3 74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_... SVR -9.419629e-04 4.666016e-05
4 74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_... LinearRegression -4.433353e+14 6.487895e+14

Reference comparison:
Before the processing:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... DecisionTreeRegressor -0.000295 0.000030
1 69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... RandomForestRegressor -0.000219 0.000024
2 69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... XGBRegressor -0.000191 0.000019
3 69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... SVR -0.000942 0.000047
4 69['A10_dummy_100', 'A10_dummy_101', 'A10_dumm... LinearRegression -0.000248 0.000023

2. Process all of the features in train_data.
train_data = all_data.loc[train_ids]
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['B14'],mean_columns])
Out[9]:
name model mean std
0 1['B14'] DecisionTreeRegressor -0.000369 0.000031
1 1['B14'] RandomForestRegressor -0.000370 0.000031
2 1['B14'] XGBRegressor -0.000369 0.000031
3 1['B14'] SVR -0.000942 0.000047
4 1['B14'] LinearRegression -0.000481 0.000040
5 5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i... DecisionTreeRegressor -0.000370 0.000031
6 5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i... RandomForestRegressor -0.000369 0.000030
7 5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i... XGBRegressor -0.000369 0.000031
8 5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i... SVR -0.000942 0.000047
9 5['B14_to_B14_intTarget_0_mean', 'B14_to_B14_i... LinearRegression -0.000369 0.000033

Final result:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_... DecisionTreeRegressor -2.919802e-04 3.105376e-05
1 74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_... RandomForestRegressor -2.144352e-04 2.369516e-05
2 74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_... XGBRegressor -1.929303e-04 2.213503e-05
3 74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_... SVR -9.419629e-04 4.666016e-05
4 74['B14_to_B14_intTarget_0_mean', 'B14_to_B14_... LinearRegression -4.433353e+14 6.487895e+14
Hardly any difference from the unprocessed version, except that LR becomes terrible.

Conclusion: do not adopt.


Aside:
03: fix the erroneous mapping inside the magic feature, and observe the effect:
train_data[col_name] = train_data['B14'].map(order_label)
=>train_data[col_name] = train_data[f1].map(order_label)
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[3]:
name model mean std
0 209['B14_to_B14_intTarget_0_mean', 'B14_to_B14... DecisionTreeRegressor -2.911757e-04 3.106674e-05
1 209['B14_to_B14_intTarget_0_mean', 'B14_to_B14... RandomForestRegressor -2.093248e-04 2.677998e-05
2 209['B14_to_B14_intTarget_0_mean', 'B14_to_B14... XGBRegressor -1.929303e-04 2.213503e-05
3 209['B14_to_B14_intTarget_0_mean', 'B14_to_B14... SVR -9.419629e-04 4.666016e-05
4 209['B14_to_B14_intTarget_0_mean', 'B14_to_B14... LinearRegression -2.595957e+15 2.975308e+15

As shown, still no major change, which suggests this treatment simply isn't effective.

04: still apply the unique < 15 and object filtering, but keep the aggregation function too, adding per-category mean information:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 80['B14_to_A19_rate_mean', 'B14_to_A21_rate_me... DecisionTreeRegressor -0.000296 0.000032
1 80['B14_to_A19_rate_mean', 'B14_to_A21_rate_me... RandomForestRegressor -0.000208 0.000019
2 80['B14_to_A19_rate_mean', 'B14_to_A21_rate_me... XGBRegressor -0.000194 0.000022
3 80['B14_to_A19_rate_mean', 'B14_to_A21_rate_me... SVR -0.000942 0.000047
4 80['B14_to_A19_rate_mean', 'B14_to_A21_rate_me... LinearRegression -0.000379 0.000158

As shown: overall little different from no processing.

Feature 11_sample id

01: after taking id mod 500, a certain periodicity is visible.
FeatureTools.get_column_corr(train_data,target_column,feature_columns=['id','id_mode500','id_div500','id_mode500_diff250'])
Out[3]:
pearson spearman mine
rate 1.000000 1.000000 0.997135
id 0.063930 0.067572 0.571742
id_mode500 -0.036691 -0.042827 0.225065
id_mode500_diff250 -0.204026 -0.153917 0.176602
id_div500 0.075036 0.079958 0.059548
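My reading of how these id features are derived ('mode' in the names meaning modulo); the formulas are an assumption reconstructed from the feature names:

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 250, 499, 500, 750, 1249]})

df['id_mode500'] = df['id'] % 500                          # phase within a 500-id cycle
df['id_div500'] = df['id'] // 500                          # index of the cycle
df['id_mode500_diff250'] = (df['id_mode500'] - 250).abs()  # distance from mid-cycle
```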

FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['id'],['id','id_mode500','id_div500'],['id','id_mode500','id_div500','id_mode500_diff250'],['id','id_mode500','id_mode500_diff250']])
Out[5]:
name model mean std
0 1['id'] DecisionTreeRegressor -0.000978 0.000056
1 1['id'] RandomForestRegressor -0.000719 0.000034
2 1['id'] XGBRegressor -0.000498 0.000039
3 1['id'] SVR -0.000942 0.000047
4 1['id'] LinearRegression -0.000805 0.000059
5 3['id', 'id_mode500', 'id_div500'] DecisionTreeRegressor -0.000962 0.000057
6 3['id', 'id_mode500', 'id_div500'] RandomForestRegressor -0.000719 0.000046
7 3['id', 'id_mode500', 'id_div500'] XGBRegressor -0.000502 0.000031
8 3['id', 'id_mode500', 'id_div500'] SVR -0.000942 0.000047
9 3['id', 'id_mode500', 'id_div500'] LinearRegression -0.000806 0.000056
10 4['id', 'id_mode500', 'id_div500', 'id_mode500... DecisionTreeRegressor -0.000961 0.000056
11 4['id', 'id_mode500', 'id_div500', 'id_mode500... RandomForestRegressor -0.000699 0.000046
12 4['id', 'id_mode500', 'id_div500', 'id_mode500... XGBRegressor -0.000502 0.000031
13 4['id', 'id_mode500', 'id_div500', 'id_mode500... SVR -0.000942 0.000047
14 4['id', 'id_mode500', 'id_div500', 'id_mode500... LinearRegression -0.000687 0.000049
15 3['id', 'id_mode500', 'id_mode500_diff250'] DecisionTreeRegressor -0.000962 0.000051
16 3['id', 'id_mode500', 'id_mode500_diff250'] RandomForestRegressor -0.000716 0.000033
17 3['id', 'id_mode500', 'id_mode500_diff250'] XGBRegressor -0.000502 0.000031
18 3['id', 'id_mode500', 'id_mode500_diff250'] SVR -0.000942 0.000047
19 3['id', 'id_mode500', 'id_mode500_diff250'] LinearRegression -0.000687 0.000049

15 2['id_mode500', 'id_mode500_diff250'] DecisionTreeRegressor -0.001172 0.000114
16 2['id_mode500', 'id_mode500_diff250'] RandomForestRegressor -0.001027 0.000093
17 2['id_mode500', 'id_mode500_diff250'] XGBRegressor -0.000709 0.000054
18 2['id_mode500', 'id_mode500_diff250'] SVR -0.000942 0.000047
19 2['id_mode500', 'id_mode500_diff250'] LinearRegression -0.000688 0.000058

As shown, keeping all 4 features is best.
Final effect:

FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 73['id', 'id_mode500', 'id_div500', 'id_mode50... DecisionTreeRegressor -0.000251 0.000022
1 73['id', 'id_mode500', 'id_div500', 'id_mode50... RandomForestRegressor -0.000161 0.000012
2 73['id', 'id_mode500', 'id_div500', 'id_mode50... XGBRegressor -0.000150 0.000016
3 73['id', 'id_mode500', 'id_div500', 'id_mode50... SVR -0.000942 0.000047
4 73['id', 'id_mode500', 'id_div500', 'id_mode50... LinearRegression -0.000225 0.000015


Aside:
02: keep only the 3 features 'id', 'id_mode500', 'id_mode500_diff250':
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 72['id', 'id_mode500', 'id_mode500_diff250', '... DecisionTreeRegressor -0.000260 0.000027
1 72['id', 'id_mode500', 'id_mode500_diff250', '... RandomForestRegressor -0.000159 0.000014
2 72['id', 'id_mode500', 'id_mode500_diff250', '... XGBRegressor -0.000150 0.000016
3 72['id', 'id_mode500', 'id_mode500_diff250', '... SVR -0.000942 0.000047
4 72['id', 'id_mode500', 'id_mode500_diff250', '... LinearRegression -0.000225 0.000015

Roughly the same; keep all 4.

04: keep only 'id'
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 70['id', 'A10_dummy_100', 'A10_dummy_101', 'A1... DecisionTreeRegressor -0.000259 0.000027
1 70['id', 'A10_dummy_100', 'A10_dummy_101', 'A1... RandomForestRegressor -0.000156 0.000018
2 70['id', 'A10_dummy_100', 'A10_dummy_101', 'A1... XGBRegressor -0.000148 0.000017
3 70['id', 'A10_dummy_100', 'A10_dummy_101', 'A1... SVR -0.000942 0.000047
4 70['id', 'A10_dummy_100', 'A10_dummy_101', 'A1... LinearRegression -0.000248 0.000022



05: with all 4 kept, test disabling the clip preprocessing of the features
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 121['id', 'id_mode500', 'id_div500', 'id_mode5... DecisionTreeRegressor -2.435147e-04 1.663540e-05
1 121['id', 'id_mode500', 'id_div500', 'id_mode5... RandomForestRegressor -1.541371e-04 9.708604e-06
2 121['id', 'id_mode500', 'id_div500', 'id_mode5... XGBRegressor -1.504820e-04 1.578562e-05
3 121['id', 'id_mode500', 'id_div500', 'id_mode5... SVR -9.419629e-04 4.666016e-05
4 121['id', 'id_mode500', 'id_div500', 'id_mode5... LinearRegression -2.666657e+12 4.283183e+12

By comparison, clip does give a slight improvement; also considering the feature-count blow-up without it, keep the clip operation for now.
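A clip step of this kind can be sketched as follows; the quantile bounds and the helper name are assumptions, since the actual thresholds used in these experiments are not recorded in the notes:

```python
import pandas as pd

def clip_by_quantile(s, lower_q=0.01, upper_q=0.99):
    """Clip a numeric column to its empirical quantile range to tame outliers."""
    lo, hi = s.quantile(lower_q), s.quantile(upper_q)
    return s.clip(lower=lo, upper=hi)

# An A17-style column with one low outlier (89.0) and one high outlier (150.0)
s = pd.Series([89.0, 104.0, 105.0, 105.0, 106.0, 150.0])
clipped = clip_by_quantile(s, 0.1, 0.9)
```

Clipping keeps the column length unchanged and only pulls the tails in, which is why disabling it mainly hurts scale-sensitive models such as LinearRegression in the tables above.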

Feature 12_process-step time-difference features

Time-column sequence:
time_series_list = ['A5', 'A9', 'A11', 'A14', 'A16', 'A20', 'A24', 'A26', 'A28', 'B4', 'B5', 'B7', 'B9', 'B10']
train_data=all_data.loc[train_ids]
FeatureTools.get_column_corr(train_data,target_column,feature_columns=diff_columns)
Out[3]:
pearson spearman mine
rate 1.000000 1.000000 0.997135
B9_B10_diff 0.225574 -0.048804 0.501285
A24_A26_diff 0.103007 0.406307 0.480982
A5_A9_diff 0.070042 0.255253 0.363899
A16_A20_diff -0.165845 -0.323325 0.362730
A26_A28_diff -0.252735 -0.307826 0.355320
B5_B7_diff -0.132855 -0.189066 0.307381
B4_B5_diff -0.088838 -0.227567 0.247909
A20_A24_diff -0.024127 0.225474 0.198716
B7_B9_diff -0.142460 -0.259364 0.176202
A28_B4_diff 0.069055 0.165902 0.155110
A9_A11_diff -0.024438 -0.060434 0.025192
A11_A14_diff -0.000033 -0.018616 0.018966
A14_A16_diff 0.010135 -0.016737 0.011144
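The *_diff columns scored above are pairwise gaps between adjacent process timestamps. A minimal sketch of how such features can be built (the helper names and the midnight-wrap handling are assumptions; the notes do not show the actual construction code):

```python
import pandas as pd

def hhmmss_to_hours(s):
    """Parse 'H:MM:SS' strings into fractional hours; unparseable values become NaN."""
    t = pd.to_datetime(s, format='%H:%M:%S', errors='coerce')
    return t.dt.hour + t.dt.minute / 60 + t.dt.second / 3600

def add_time_diffs(df, time_cols):
    """For each adjacent pair in time_cols, add '<a>_<b>_diff' = elapsed hours.

    A negative raw difference means the later step crossed midnight,
    so we wrap with mod 24.
    """
    for a, b in zip(time_cols[:-1], time_cols[1:]):
        diff = hhmmss_to_hours(df[b]) - hhmmss_to_hours(df[a])
        df[f'{a}_{b}_diff'] = diff.mod(24)
    return df

df = pd.DataFrame({'A5': ['22:00:00', '9:00:00'], 'A9': ['1:00:00', '10:30:00']})
df = add_time_diffs(df, ['A5', 'A9'])
```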

Final result:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[4]:
name model mean std
0 86['id', 'id_mode500', 'id_div500', 'id_mode50... DecisionTreeRegressor -0.000269 0.000038
1 86['id', 'id_mode500', 'id_div500', 'id_mode50... RandomForestRegressor -0.000159 0.000019
2 86['id', 'id_mode500', 'id_div500', 'id_mode50... XGBRegressor -0.000155 0.000018
3 86['id', 'id_mode500', 'id_div500', 'id_mode50... SVR -0.000942 0.000047
4 86['id', 'id_mode500', 'id_div500', 'id_mode50... LinearRegression -0.000247 0.000029

Baseline for comparison: no diff features
Out[1]:
name model mean std
0 73['id', 'id_mode500', 'id_div500', 'id_mode50... DecisionTreeRegressor -0.000258 0.000017
1 73['id', 'id_mode500', 'id_div500', 'id_mode50... RandomForestRegressor -0.000153 0.000016
2 73['id', 'id_mode500', 'id_div500', 'id_mode50... XGBRegressor -0.000150 0.000016
3 73['id', 'id_mode500', 'id_div500', 'id_mode50... SVR -0.000942 0.000047
4 73['id', 'id_mode500', 'id_div500', 'id_mode50... LinearRegression -0.000225 0.000015

Comparison: use only ['B9_B10_diff','A24_A26_diff']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 75['id', 'id_mode500', 'id_div500', 'id_mode50... DecisionTreeRegressor -0.000264 0.000035
1 75['id', 'id_mode500', 'id_div500', 'id_mode50... RandomForestRegressor -0.000162 0.000019
2 75['id', 'id_mode500', 'id_div500', 'id_mode50... XGBRegressor -0.000153 0.000018
3 75['id', 'id_mode500', 'id_div500', 'id_mode50... SVR -0.000942 0.000047
4 75['id', 'id_mode500', 'id_div500', 'id_mode50... LinearRegression -0.000226 0.000015

Conclusion: not useful; adding no diff features is best.

Feature 13_train/test distribution differences

Focus on A21 and A22:

train_data['A21'].value_counts()
test_data['A21'].value_counts()

train_data['A21'].value_counts()
Out[6]:
50.0 1254
40.0 63
30.0 42
35.0 15
20.0 7
60.0 4
45.0 2
55.0 2
80.0 1
70.0 1
25.0 1
90.0 1
Name: A21, dtype: int64
test_data['A21'].value_counts()
Out[7]:
50 135
40 8
30 3
35 2
25 2
Name: A21, dtype: int64


train_data['A22'].value_counts()
test_data['A22'].value_counts()

train_data['A22'].value_counts()
Out[8]:
9.0 1216
10.0 174
8.0 5
3.5 1
Name: A22, dtype: int64
test_data['A22'].value_counts()
Out[9]:
9 131
10 19
Name: A22, dtype: int64

Check the target mean grouped by A22:
pd.DataFrame(train_data[['A22',target_column]].groupby(by=['A22'])[target_column].mean()).join(train_data['A22'].value_counts())
Out[16]:
rate A22
A22
3.5 0.902000 1
8.0 0.939800 5
9.0 0.925468 1216
10.0 0.907345 174
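A compact way to put the train/test distribution of a column side by side (the helper name is an assumption; the A22 counts below mirror the value_counts output above):

```python
import pandas as pd

def compare_dist(train, test):
    """Side-by-side normalized value counts for a column in train vs test.

    Values present in only one split get rate 0.0 in the other.
    """
    df = pd.DataFrame({
        'train_rate': train.value_counts(normalize=True),
        'test_rate': test.value_counts(normalize=True),
    }).fillna(0.0)
    return df.sort_values('train_rate', ascending=False)

train_a22 = pd.Series([9.0] * 1216 + [10.0] * 174 + [8.0] * 5 + [3.5])
test_a22 = pd.Series([9.0] * 131 + [10.0] * 19)
dist = compare_dist(train_a22, test_a22)
```

This makes it easy to spot values like A22=8.0 and A22=3.5 that exist only in train, which is what this section is probing.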








Feature 14_second-batch data transformation tests

1. Inspect the features that are filled with the median
fillna_columns = ['A21', 'A24', 'A26', 'B1', 'B12', 'B5', 'B8']
FeatureTools.get_columns_info(all_data,fillna_columns)
Out[3]:
type count nan_count count_rate unique_count unique_set mean std min 25% 50% 75% max
A21 float 1543 3 0.998 13 [(50.0, 1389), (40.0, 71), (30.0, 45), (35.0, ... 48.690862 4.954759 20.0 50.0 50.0 50.0 90.0
A24 object 1545 1 0.999 94 [(12:00:00, 289), (20:00:00, 268), (4:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
A26 object 1544 2 0.999 91 [(13:00:00, 297), (21:00:00, 271), (5:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
B1 float 1535 11 0.993 22 [(320.0, 835), (300.0, 142), (350.0, 130), (37... 335.336482 107.244534 3.5 320.0 320.0 330.0 1200.0
B12 float 1545 1 0.999 5 [(1200.0, 859), (800.0, 646), (900.0, 24), (40... 1019.805825 206.114344 400.0 800.0 1200.0 1200.0 1200.0
B5 object 1545 1 0.999 63 [(15:00:00, 276), (23:00:00, 236), (7:00:00, 1... NaN NaN NaN NaN NaN NaN NaN
B8 float 1545 1 0.999 26 [(45.0, 1204), (40.0, 157), (50.0, 47), (28.0,... 43.711327 4.328276 20.0 45.0 45.0 45.0 73.0

The amount of missing data is actually quite small, so no other special handling is needed.
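A minimal sketch of the median fill for these columns (the helper name is an assumption; the non-numeric time columns such as A24/A26/B5 would be handled separately):

```python
import numpy as np
import pandas as pd

def fillna_with_median(df, columns):
    """Fill NaNs in numeric columns with the column median; skip non-numeric columns."""
    for col in columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
    return df

df = pd.DataFrame({'A21': [50.0, np.nan, 40.0, 50.0],
                   'B8': [45.0, 45.0, np.nan, 40.0]})
df = fillna_with_median(df, ['A21', 'B8'])
```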

2. Everywhere else that fillna is used, missing values are filled with a distinguishing -1. The missing rates there are very low, so no further handling is necessary.

3. Remove the clip on B14, B6, A19, A27
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[3]:
name model mean std
0 121['id', 'id_mode500', 'id_div500', 'id_mode5... DecisionTreeRegressor -2.528032e-04 2.835551e-05
1 121['id', 'id_mode500', 'id_div500', 'id_mode5... RandomForestRegressor -1.578689e-04 1.122595e-05
2 121['id', 'id_mode500', 'id_div500', 'id_mode5... XGBRegressor -1.504820e-04 1.578562e-05
3 121['id', 'id_mode500', 'id_div500', 'id_mode5... SVR -9.419629e-04 4.666016e-05
4 121['id', 'id_mode500', 'id_div500', 'id_mode5... LinearRegression -9.161214e+12 1.511062e+13

Keep the removal.
Test dummy-encoding all float features:

FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 264['id', 'id_mode500', 'id_div500', 'id_mode5... DecisionTreeRegressor -2.517986e-04 2.809433e-05
1 264['id', 'id_mode500', 'id_div500', 'id_mode5... RandomForestRegressor -1.574401e-04 1.260128e-05
2 264['id', 'id_mode500', 'id_div500', 'id_mode5... XGBRegressor -1.495908e-04 1.611149e-05
3 264['id', 'id_mode500', 'id_div500', 'id_mode5... SVR -9.419629e-04 4.666016e-05
4 264['id', 'id_mode500', 'id_div500', 'id_mode5... LinearRegression -9.540508e+12 1.439639e+13

Dummy-encode float features with <10 unique values; leave those with >10 unchanged:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 136['id', 'id_mode500', 'id_div500', 'id_mode5... DecisionTreeRegressor -2.381848e-04 1.889620e-05
1 136['id', 'id_mode500', 'id_div500', 'id_mode5... RandomForestRegressor -1.573015e-04 1.784003e-05
2 136['id', 'id_mode500', 'id_div500', 'id_mode5... XGBRegressor -1.499761e-04 1.690648e-05
3 136['id', 'id_mode500', 'id_div500', 'id_mode5... SVR -9.419629e-04 4.666016e-05
4 136['id', 'id_mode500', 'id_div500', 'id_mode5... LinearRegression -6.445922e+13 1.014251e+14
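The "dummy floats with <10 unique values" rule can be sketched as follows (the helper name and threshold parameterization are assumptions):

```python
import pandas as pd

def dummy_low_cardinality_floats(df, max_unique=10):
    """One-hot encode float columns with few distinct values; leave high-cardinality floats as-is."""
    low_card = [c for c in df.select_dtypes(include='float').columns
                if df[c].nunique() < max_unique]
    return pd.get_dummies(df, columns=low_card, prefix=low_card)

# A13 has 2 distinct values (dummied); B1 has 3 (left unchanged at max_unique=3)
df = pd.DataFrame({'A13': [0.2, 0.2, 0.15], 'B1': [320.0, 300.0, 350.0]})
out = dummy_low_cardinality_floats(df, max_unique=3)
```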

4. On top of 3, skip the RFECV step
mode mse r2 best_estimator
0 cv 6.776363e-04 1.596780e-01 LinearSVR(C=0.1, dual=True, epsilon=0.01, fit_...
1 cv 1.414319e-04 8.246133e-01 RandomForestRegressor(bootstrap=False, criteri...
2 cv 7.663601e+20 -9.503465e+23 LinearRegression(copy_X=True, fit_intercept=Tr...
3 cv 1.504837e-04 8.133884e-01 GradientBoostingRegressor(alpha=0.9, criterion...
4 cv 3.571991e-04 5.570452e-01 ElasticNetCV(alphas=None, copy_X=True, cv='war...
5 cv 1.381657e-04 8.286637e-01 XGBRegressor(base_score=0.5, booster='gbtree',...

Submissions 08, 09, 10 are built on this.


5. On top of 4, restore RFECV and raise the hyperparameter-search iterations to 100.
RFECV with XGB as the estimator:
INFO:common.gscvTools:ret_df:
mode mse r2 best_estimator
0 cv 0.000255 0.684052 LinearRegression(copy_X=True, fit_intercept=Tr...
1 cv 0.000136 0.831958 XGBRegressor(base_score=0.5, booster='gbtree',...
2 cv 0.000752 0.067833 LinearSVR(C=25.0, dual=True, epsilon=0.01, fit...
3 cv 0.000376 0.533659 ElasticNetCV(alphas=None, copy_X=True, cv=None...
4 cv 0.000133 0.835663 RandomForestRegressor(bootstrap=True, criterio...
5 cv 0.000139 0.827897 GradientBoostingRegressor(alpha=0.9, criterion...

RFECV with RFR as the estimator:
mode mse r2 best_estimator
0 cv 0.000142 0.823842 XGBRegressor(base_score=0.5, booster='gbtree',...
1 cv 0.000140 0.825954 RandomForestRegressor(bootstrap=True, criterio...
2 cv 0.000220 0.727425 LinearRegression(copy_X=True, fit_intercept=Tr...
3 cv 0.000138 0.829087 GradientBoostingRegressor(alpha=0.99, criterio...
4 cv 0.000900 -0.115566 LinearSVR(C=5.0, dual=True, epsilon=0.01, fit_...
5 cv 0.000357 0.557045 ElasticNetCV(alphas=None, copy_X=True, cv=None...

Without RFECV:
mode mse r2 best_estimator
0 cv 1.378900e-04 8.290056e-01 RandomForestRegressor(bootstrap=False, criteri...
1 cv 1.463699e-04 8.184899e-01 GradientBoostingRegressor(alpha=0.95, criterio...
2 cv 1.182782e+22 -1.466742e+25 LinearRegression(copy_X=True, fit_intercept=Tr...
3 cv 1.409546e-04 8.252052e-01 XGBRegressor(base_score=0.5, booster='gbtree',...
4 cv 7.924101e-04 1.734945e-02 LinearSVR(C=5.0, dual=True, epsilon=0.01, fit_...
5 cv 3.571991e-04 5.570452e-01 ElasticNetCV(alphas=None, copy_X=True, cv=None...

Skipping RFECV works best.

6. Without RFECV, replace the id feature with the prediction of a KNN regressor fitted on id:
INFO:common.gscvTools:ret_df:
mode mse r2 best_estimator
0 cv 1.252570e-04 8.446715e-01 RandomForestRegressor(bootstrap=False, criteri...
1 cv 2.016451e+22 -2.500557e+25 LinearRegression(copy_X=True, fit_intercept=Tr...
2 cv 2.929551e-04 6.367128e-01 ElasticNetCV(alphas=None, copy_X=True, cv=None...
3 cv 1.431421e-04 8.224926e-01 XGBRegressor(base_score=0.5, booster='gbtree',...
4 cv 6.831870e-04 1.527946e-01 LinearSVR(C=1.0, dual=True, epsilon=0.01, fit_...
5 cv 1.286914e-04 8.404125e-01 GradientBoostingRegressor(alpha=0.99, criterio...

Stacking 3 algorithms (xgb, gbr, rfr); submission 12:
INFO:__main__:StackingRegressor meta_model_scores:{'XGBRegressor': -0.000137, 'SVR': -0.000942, 'LinearRegression': -0.000134, 'Ridge': -0.000183, 'LinearSVR': -0.000129}
INFO:__main__:StackingCVRegressor meta_model_scores:{'XGBRegressor': -0.000133, 'SVR': -0.000942, 'LinearRegression': -0.000125, 'Ridge': -0.000195, 'LinearSVR': -0.000128}
INFO:__main__:get_StackingCV_OOF meta_model_scores:{'XGBRegressor': -0.000131, 'SVR': -0.000942, 'LinearRegression': -0.000126, 'Ridge': -0.000197, 'LinearSVR': -0.000132}

StackingCV + LR: 0.000125 local, 0.00008351 online

Feature 15_comparing someone else's pipeline against my own

Reference pipeline: ref02
1. Their features + their algorithms
lgb: CV score: 0.00012217
xgb: CV score: 0.00012019
stacking score: 0.0001159681469760565

2. My features + their algorithms
lgb: CV score: 0.00014090
xgb: CV score: 0.00012429
stacking score: 0.00012452678699482447

Custom model-scoring code:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
import pandas as pd

scoring = 'neg_mean_squared_error'
cv = 5
score_df = pd.DataFrame(columns=['name', 'model', 'mean', 'std'])
models = [DecisionTreeRegressor(), RandomForestRegressor(), XGBRegressor(), LinearRegression()]
for model in models:
    print(model.__class__.__name__)
    scores = cross_val_score(model, X_train, y_train, scoring=scoring, cv=cv)
    tmp_series = pd.Series(
        {'name': 'xxxx', 'model': model.__class__.__name__,
         'mean': scores.mean(),
         'std': scores.std()})
    score_df = score_df.append(tmp_series, ignore_index=True)
print('score_df:\n%s' % score_df)

3. Their features + my scoring code
score_df:
name model mean std
0 xxxx DecisionTreeRegressor -0.000219 0.000049
1 xxxx RandomForestRegressor -0.000140 0.000016
2 xxxx XGBRegressor -0.000137 0.000017
3 xxxx LinearRegression -0.008746 0.001601

4. My features + my scoring code
score_df:
name model mean std
0 xxxx DecisionTreeRegressor -2.645024e-04 2.672804e-05
1 xxxx RandomForestRegressor -1.488141e-04 7.269487e-06
2 xxxx XGBRegressor -1.408158e-04 5.988981e-06
3 xxxx LinearRegression -2.253568e+13 2.939562e+13

1 vs 2 => their features fit their algorithms better.
3 vs 4 => their features also score better under my evaluation.
Together => their features beat mine.

5. Try their features with my own parameter search and ensembling.

My quick-evaluation code:
score_df:
name model mean std
0 xxxx DecisionTreeRegressor -0.000213 0.000035
1 xxxx RandomForestRegressor -0.000139 0.000014
2 xxxx XGBRegressor -0.000137 0.000017
3 xxxx LinearRegression -0.008746 0.001601

Best algorithms and parameters without RFECV:
mode mse r2 \
0 cv 0.042856 -52.144454
1 cv 0.008748 -9.847654
2 cv 0.000126 0.844056
3 cv 0.000128 0.841737
4 cv 0.000365 0.547830
5 cv 0.000123 0.847924
best_estimator
0 LinearSVR(C=25.0, dual=True, epsilon=0.01, fit_intercept=True,\n intercept_scaling=1.0, loss...
1 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
2 GradientBoostingRegressor(alpha=0.99, criterion='friedman_mse', init=None,\n learnin...
3 RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,\n max_features=...
4 ElasticNetCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,\n l1_ratio=...
5 XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n colsample_bytree=1, ...

RFECV variant: results too noisy; shelved for now.

Feature 16_error analysis

Observing continuous-vs-continuous relationships:
all_data_sort=all_data.sort_values(by=['id'])
all_data_sort['%s_tmp1'%target_column]=all_data_sort[target_column].replace(0,np.nan).fillna(method='ffill')

plt.scatter(all_data_sort['id'],all_data_sort['%s_tmp1'%target_column].rolling(20).mean());plt.show()
plt.plot(all_data_sort['%s_tmp1'%target_column].rolling(20).mean());plt.show()

plt.plot(all_data_sort[target_column]);plt.show()
plt.plot(all_data_sort[all_data_sort[target_column]>0][target_column]);plt.show()
plt.plot(all_data_sort[all_data_sort[target_column]>0][target_column].rolling(20).mean());plt.show()


Error analysis: observations on the training set.
Sort by the feature under analysis:
train_data['y_predict']=y_predict
train_data_sort=train_data.sort_values(by=['id'])

Relationship between the feature under analysis and the target:
plt.scatter(train_data_sort['id'],train_data_sort[target_column]);plt.show()
plt.scatter(train_data_sort['id'],train_data_sort['y_predict']);plt.show()

Point view, and its smoothed version:
plt.plot(train_data_sort[target_column]);plt.show()
plt.plot(train_data_sort['y_predict']);plt.show()
plt.plot(train_data_sort['y_diff']);plt.show()

Line view: all data, valid data, and smoothed valid data:
plt.plot(train_data_sort[target_column].rolling(20).mean());plt.show()
plt.plot(train_data_sort['y_predict'].rolling(20).mean());plt.show()
plt.plot(train_data_sort['y_diff'].rolling(20).mean());plt.show()

Scatter of target_column against the error term y_diff:
plt.scatter(train_data_sort[target_column],train_data_sort['y_diff']);plt.show()


First, check how id_predict performs:
mean_squared_error(train_data[target_column],train_data[target_column])
Out[70]: 0.0

mean_squared_error(train_data[target_column],train_data[target_column].rolling(5).mean().fillna(method='bfill'))
Out[71]: 0.000640307076611152
mean_squared_error(train_data[target_column],train_data['id_predict'])
Out[72]: 0.00038063561661347604
mean_squared_error(train_data[target_column],train_data[target_column].rolling(2).mean().fillna(method='bfill'))
Out[74]: 0.0003902581028240407
In other words, the predicted id_predict is fairly reliable, roughly equivalent to a 2-sample rolling mean. Note that a rolling mean will necessarily increase MSE relative to the original data.

New ideas:
1. Use interpolation to fill id_predict on the test data.
Interpolation actually made things worse: with 20-fold or 3-fold splits it lands around 0.0015 (mean filling gives about 0.0008), so it is a no-go; ffill filling at 20-fold gives about 0.0009, also not good.

2. Stick with KNN, but fill id_predict only for the test data; on train, set id_predict = target.
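Idea 2 can be sketched with sklearn's KNeighborsRegressor; the toy data, the value of k, and distance weighting are all assumptions, since the notes only describe the scheme:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy stand-in: even ids are "train" rows, odd ids are "test" rows,
# and the target is a smooth function of id.
train_ids = np.arange(0, 100, 2).reshape(-1, 1)
train_y = np.sin(train_ids.ravel() / 10.0)
test_ids = np.arange(1, 100, 2).reshape(-1, 1)

knn = KNeighborsRegressor(n_neighbors=5, weights='distance')
knn.fit(train_ids, train_y)

train_id_predict = train_y                # train rows: id_predict = target itself
test_id_predict = knn.predict(test_ids)   # test rows: id_predict = KNN estimate
```

Using the true target on train and the KNN estimate only on test is what distinguishes this from idea 1's interpolation fill.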

3. Apply a similar treatment to B14.
Quick feature scoring (using the quick-evaluation code from Feature 15):
score_df:
name model mean std
0 xxxx DecisionTreeRegressor -2.495737e-04 1.781087e-05
1 xxxx RandomForestRegressor -1.430621e-04 1.691274e-05
2 xxxx XGBRegressor -1.296319e-04 9.291542e-06
3 xxxx LinearRegression -1.645531e+13 2.947468e+13
Hyperparameter search:
mode mse r2 best_estimator
0 cv 8.359894e-04 -3.669229e-02 LinearSVR(C=0.1, dual=True, epsilon=0.01, fit_...
1 cv 2.929551e-04 6.367127e-01 ElasticNetCV(alphas=None, copy_X=True, cv=None...
2 cv 1.248960e-04 8.451191e-01 XGBRegressor(base_score=0.5, booster='gbtree',...
3 cv 1.644659e+13 -2.039506e+16 LinearRegression(copy_X=True, fit_intercept=Tr...
4 cv 1.168506e-04 8.550961e-01 GradientBoostingRegressor(alpha=0.99, criterio...
5 cv 1.163944e-04 8.556618e-01 RandomForestRegressor(bootstrap=False, criteri...

Ensemble hyperparameter search:
mode mse r2 best_estimator
0 cv 0.000035 0.956183 ElasticNetCV(alphas=None, copy_X=True, cv=None...
1 cv 0.000034 0.957520 LinearRegression(copy_X=True, fit_intercept=Tr...
2 cv 0.000037 0.953728 GradientBoostingRegressor(alpha=0.8, criterion...
3 cv 0.000048 0.940388 LinearSVR(C=5.0, dual=True, epsilon=0.01, fit_...
4 cv 0.000038 0.952951 RandomForestRegressor(bootstrap=True, criterio...
5 cv 0.000037 0.954310 XGBRegressor(base_score=0.5, booster='gbtree',...

Final choice: LR stacking, local 0.000034 (submission 16); submitted online.

Ensembling 01_mlxtend

Single-model performance:
mode mse r2 best_estimator
0 cv 0.000188 0.766592 GradientBoostingRegressor(alpha=0.85, criterio...
--1 cv 0.001061 -0.315957 LinearSVR(C=0.01, dual=True, epsilon=0.01, fit...
--2 cv 0.000403 0.500398 ElasticNetCV(alphas=None, copy_X=True, cv='war...
3 cv 0.000192 0.762168 RandomForestRegressor(bootstrap=False, criteri...
4 cv 0.000241 0.700835 LinearRegression(copy_X=True, fit_intercept=Tr...
5 cv 0.000191 0.762554 XGBRegressor(base_score=0.5, booster='gbtree',...

Stacking (the 4 best models):
len(candidate_models)
Out[17]: 4
INFO:__main__:StackingRegressor meta_model_scores:{'Ridge': -0.00022, 'LinearSVR': -0.000204, 'LinearRegression': -0.000207, 'XGBRegressor': -0.000205, 'SVR': -0.000942}
INFO:__main__:StackingCVRegressor meta_model_scores:{'Ridge': -0.000231, 'LinearSVR': -0.000189, 'LinearRegression': -0.000185, 'XGBRegressor': -0.000188, 'SVR': -0.000942}
INFO:__main__:get_StackingCV_OOF meta_model_scores:{'LinearSVR': -0.000187, 'LinearRegression': -0.00018, 'Ridge': -0.000232, 'SVR': -0.000942, 'XGBRegressor': -0.000186}

Final choice: the LR meta-model from get_StackingCV_OOF.
Generated output: 提交02.txt
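get_StackingCV_OOF comes from the author's own tooling, but the underlying idea is standard out-of-fold (OOF) stacking: each base model predicts every training row from folds it never saw, those predictions become meta-features, and the meta-learner is fit on them. A minimal sklearn sketch under assumed base models (RFR/GBR instead of the full set used here):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

base_models = [RandomForestRegressor(n_estimators=50, random_state=0),
               GradientBoostingRegressor(random_state=0)]

# OOF meta-features: every row is predicted by a model that never trained on it.
oof = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_models])

meta = LinearRegression().fit(oof, y)  # meta-learner trained on OOF predictions

# At inference time, refit the bases on all data and feed their predictions to meta.
for m in base_models:
    m.fit(X, y)
pred = meta.predict(np.column_stack([m.predict(X) for m in base_models]))
```

Training the meta-learner on OOF predictions, rather than in-fold ones, is what keeps the stack from overfitting to the base models' training-set optimism.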

Submissions

Submission 01_single model_RFECV(XGB)_XGB_000187

File: 提交01_单算法_RFECV(XGB)_XGB_000187.csv
Approach: single model; RFECV feature filtering with an XGB estimator; final model XGB.
from sklearn.feature_selection import RFECV
from xgboost import XGBRegressor

rfecv = RFECV(estimator=XGBRegressor(),  # base estimator
              step=1,  # features removed per iteration
              cv=5,
              scoring='neg_mean_squared_error',  # evaluation metric
              verbose=0,
              n_jobs=-1
              ).fit(X_train, y_train)
best_rfe_model=rfecv
X_train = best_rfe_model.transform(X_train)
X_test = best_rfe_model.transform(X_test)
tmp_df = GSCVTools.best_modelAndParam_reg(X_train, None, y_train, None)
best_model = tmp_df.loc[2, 'best_estimator']  # XGBRegressor, 0.000187
Online: 0.000138

Submission 02_ensemble_RFECV(XGB)_4 models_StackingCV-OOF(LR)_000185

Submission 03_single model_RFECV(XGB)_RFR_000138

INFO:root:scores on each model after RFECV selection with each estimator:
LinearRegression LinearSVR RandomForestRegressor XGBRegressor
XGBRegressor -0.000267 -0.001267 -0.000150 -0.000146
RandomForestRegressor -0.000225 -0.001068 -0.000170 -0.000146
LinearSVR -0.000282 -0.000307 -0.000274 -0.000276
LinearRegression -0.000224 -0.001264 -0.000191 -0.000180

Selected: XGBRegressor

INFO:common.gscvTools:ret_df:
mode mse r2 best_estimator
0 cv 0.000267 0.669200 LinearRegression(copy_X=True, fit_intercept=Tr...
1 cv 0.000144 0.821609 XGBRegressor(base_score=0.5, booster='gbtree',...
2 cv 0.000138 0.828979 RandomForestRegressor(bootstrap=False, criteri...
--3 cv 0.000797 0.012188 LinearSVR(C=10.0, dual=True, epsilon=0.01, fit...
--4 cv 0.000406 0.496977 ElasticNetCV(alphas=None, copy_X=True, cv='war...
5 cv 0.000148 0.816161 GradientBoostingRegressor(alpha=0.95, criterio...

Best single model: RandomForestRegressor, 0.000138; online score: 0.00009482

Ensemble: pick the 4 best models.
INFO:__main__:StackingRegressor meta_model_scores: {'LinearSVR': -0.000144, 'SVR': -0.000942, 'Ridge': -0.000174, 'LinearRegression': -0.000149, 'XGBRegressor': -0.000156}
INFO:__main__:StackingCVRegressor meta_model_scores:{'LinearSVR': -0.000141, 'SVR': -0.000942, 'Ridge': -0.000186, 'LinearRegression': -0.000135, 'XGBRegressor': -0.000135}
INFO:__main__:get_StackingCV_OOF meta_model_scores: {'LinearSVR': -0.000138, 'SVR': -0.000942, 'Ridge': -0.000184, 'LinearRegression': -0.000132, 'XGBRegressor': -0.000131}

Best: CV-OOF with XGB, 0.000131; online score 0.00008971

Submission 05_single model_RFECV(XGB)_RF(HYTools)_000134


INFO:common.hyperoptTools:ret_df: mode mse r2 best_estimator other
4 cv 0.000148 0.816798 Pipeline(memory=None,\n steps=[('randomfor... {'learner': (DecisionTreeRegressor(criterion='...
5 cv 0.000151 0.812444 Pipeline(memory=None,\n steps=[('minmaxsca... {'learner': XGBRegressor(base_score=0.5, boost...
2 cv 0.000267 0.669208 Pipeline(memory=None,\n steps=[('standards... {'learner': ElasticNet(alpha=6.740345196335565...
3 cv 0.000573 0.289710 Pipeline(memory=None,\n steps=[('pca', PCA... {'learner': SVR(C=412.7656895152976, cache_siz...
0 cv 0.000683 0.153145 Pipeline(memory=None,\n steps=[('normalize... {'learner': SVR(C=0.6522903726264077, cache_si...
1 cv 0.011210 -12.901622 Pipeline(memory=None,\n steps=[('normalize... {'learner': SGDRegressor(alpha=1.2944476618533...

Analysis: for xgb, hyperopt did not beat my own randomized grid search.
tmp_df['best_estimator'][1]
:Pipeline(memory=None,
steps=[('normalizer', Normalizer(copy=True, norm='l1')), ('sgdregressor', SGDRegressor(alpha=1.2944476618533275e-06, average=False,
early_stopping=False, epsilon=24.444822037999185,
eta0=1.2861442205377238e-05, fit_intercept=True,
l1_ratio=0.3394045112936167, learning_rate='invs...True, tol=5.4490942553738105e-05, validation_fraction=0.1,
verbose=False, warm_start=False))])
Standalone prediction: preprocessing=[]
Result: 0.000276

Standalone prediction: preprocessing=None
Result: 0.000159

svr
Pipeline(memory=None,
steps=[('normalizer', Normalizer(copy=True, norm='l1')), ('svr', SVR(C=0.6522903726264077, cache_size=512, coef0=0.0, degree=1,
epsilon=0.00841190976846431, gamma='auto', kernel='linear',
max_iter=10852978.0, shrinking=False, tol=0.00014583123854438414,
verbose=False))])

Standalone prediction: preprocessing=[]
Result: 0.000140
Standalone prediction: preprocessing=None
Result: 0.000139

Clearly the results are dominated by chance; increase the hyperopt search time and iteration count.
INFO:common.hyperoptTools:ret_df: mode mse r2 best_estimator other
3 cv 0.000134 0.833866 Pipeline(memory=None,\n steps=[('randomfor... {'learner': (DecisionTreeRegressor(criterion='...
4 cv 0.000138 0.829081 Pipeline(memory=None,\n steps=[('xgbregres... {'learner': XGBRegressor(base_score=0.5, boost...
2 cv 0.000267 0.669218 Pipeline(memory=None,\n steps=[('elasticne... {'learner': ElasticNet(alpha=3.807842541114131...
1 cv 0.000267 0.669214 Pipeline(memory=None,\n steps=[('lasso', L... {'learner': Lasso(alpha=6.933484177963561e-06,...
0 cv 0.000413 0.488276 Pipeline(memory=None,\n steps=[('svr', SVR... {'learner': SVR(C=0.0001510031509641054, cache...

tmp_df['best_estimator'][3]
Out[4]:
Pipeline(memory=None,
steps=[('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features=0.6445046040302925, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=3, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=699, n_jobs=1,
oob_score=False, random_state=1, verbose=False,
warm_start=False))])
tmp_df['best_estimator'][4]
Out[5]:
Pipeline(memory=None,
steps=[('xgbregressor', XGBRegressor(base_score=0.5, booster='gbtree',
colsample_bylevel=0.9827719764498459,
colsample_bytree=0.7267517612763672, gamma=0.001208164431091291,
learning_rate=0.005837587884215306, max_delta_step=0, max_depth=10,
min_child_weight=9, missing=na...590563218164,
scale_pos_weight=1, seed=4, silent=True,
subsample=0.7183586790188525))])

Single-model choice: RandomForestRegressor, 0.000134.
Online score: 0.00009485

Ensemble:
array([Pipeline(memory=None,
steps=[('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features=0.6445046040302925, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=3, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=699, n_jobs=1,
oob_score=False, random_state=1, verbose=False,
warm_start=False))]),
Pipeline(memory=None,
steps=[('xgbregressor', XGBRegressor(base_score=0.5, booster='gbtree',
colsample_bylevel=0.9827719764498459,
colsample_bytree=0.7267517612763672, gamma=0.001208164431091291,
learning_rate=0.005837587884215306, max_delta_step=0, max_depth=10,
min_child_weight=9, missing=na...590563218164,
scale_pos_weight=1, seed=4, silent=True,
subsample=0.7183586790188525))])], dtype=object)

INFO:__main__:StackingRegressor meta_model_scores:{'XGBRegressor': -0.000146, 'LinearSVR': -0.00014, 'Ridge': -0.000237, 'SVR': -0.000942, 'LinearRegression': -0.000152}
INFO:__main__:StackingCVRegressor meta_model_scores:{'XGBRegressor': -0.000134, 'LinearSVR': -0.000135, 'Ridge': -0.000252, 'SVR': -0.000942, 'LinearRegression': -0.000132}
INFO:__main__:get_StackingCV_OOF meta_model_scores:{'XGBRegressor': -0.000136, 'LinearSVR': -0.000139, 'Ridge': -0.000376, 'SVR': -0.000942, 'LinearRegression': -0.000136}

Best: StackingCVRegressor with LinearRegression, 0.000132.
Self-test score of the submitted version: 0.000134 (the model's score() return value, which may differ from cross-validation).
Online: 0.00009062

Submission 08_single model_XGBRegressor_000138

Single models:
INFO:common.gscvTools:ret_df:
mode mse r2 best_estimator
0 cv 6.776363e-04 1.596780e-01 LinearSVR(C=0.1, dual=True, epsilon=0.01, fit_...
1 cv 1.414319e-04 8.246133e-01 RandomForestRegressor(bootstrap=False, criteri...
2 cv 7.663601e+20 -9.503465e+23 LinearRegression(copy_X=True, fit_intercept=Tr...
3 cv 1.504837e-04 8.133884e-01 GradientBoostingRegressor(alpha=0.9, criterion...
4 cv 3.571991e-04 5.570452e-01 ElasticNetCV(alphas=None, copy_X=True, cv='war...
5 cv 1.381657e-04 8.286637e-01 XGBRegressor(base_score=0.5, booster='gbtree',...


array([RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=None,
max_features=0.5, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=5, min_samples_split=5,
min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
oob_score=False, random_state=None, verbose=0, warm_start=False),
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=6,
max_features=0.6000000000000001, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=8, min_samples_split=3,
min_weight_fraction_leaf=0.0, n_estimators=200,
n_iter_no_change=None, presort='auto', random_state=None,
subsample=0.7000000000000002, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False),
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=5, min_child_weight=8, missing=None, n_estimators=100,
n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=0.7500000000000001)], dtype=object)


INFO:__main__:StackingRegressor meta_model_scores:{'Ridge': -0.000186, 'XGBRegressor': -0.000155, 'LinearSVR': -0.000151, 'LinearRegression': -0.000169, 'SVR': -0.000942}
INFO:__main__:StackingCVRegressor meta_model_scores:{'Ridge': -0.000201, 'XGBRegressor': -0.000138, 'LinearSVR': -0.000138, 'LinearRegression': -0.000136, 'SVR': -0.000942}
INFO:__main__:get_StackingCV_OOF meta_model_scores:{'Ridge': -0.0002, 'XGBRegressor': -0.000139, 'LinearSVR': -0.000135, 'LinearRegression': -0.000132, 'SVR': -0.000942}
Best: OOF with LinearRegression; online 0.00008888



Raise the hyperparameter-search iterations to 100.
Submission 10 is built on this:
INFO:common.gscvTools:ret_df:
mode mse r2 best_estimator
0 cv 1.234367e-03 -5.307122e-01 LinearSVR(C=0.5, dual=True, epsilon=0.01, fit_...
1 cv 1.372586e-04 8.297886e-01 XGBRegressor(base_score=0.5, booster='gbtree',...
2 cv 7.663601e+20 -9.503465e+23 LinearRegression(copy_X=True, fit_intercept=Tr...
3 cv 1.390731e-04 8.275385e-01 RandomForestRegressor(bootstrap=True, criterio...
4 cv 1.372350e-04 8.298178e-01 GradientBoostingRegressor(alpha=0.95, criterio...
5 cv 3.571991e-04 5.570452e-01 ElasticNetCV(alphas=None, copy_X=True, cv='war...


INFO:__main__:StackingRegressor meta_model_scores:{'LinearRegression': -0.000147, 'Ridge': -0.00019, 'SVR': -0.000942, 'LinearSVR': -0.000135, 'XGBRegressor': -0.000143}
INFO:__main__:StackingCVRegressor meta_model_scores:{'LinearRegression': -0.000133, 'Ridge': -0.000203, 'SVR': -0.000942, 'LinearSVR': -0.000137, 'XGBRegressor': -0.000134}
INFO:__main__:get_StackingCV_OOF meta_model_scores:{'LinearRegression': -0.000133, 'Ridge': -0.000205, 'SVR': -0.000942, 'LinearSVR': -0.000136, 'XGBRegressor': -0.000131}

Best: OOF with XGB; self-evaluation 0.0001317283727497388, online: 0.00008679

Submission 12:
id replaced with the KNN-predicted id value.

Submission 14_ensemble (tuned meta)_3 models_RFR stacking_0.0005


Add hyperparameter tuning for the stacking meta-learner:
mode mse r2 best_estimator
0 cv 0.000052 0.935504 RandomForestRegressor(bootstrap=True, criterio...
1 cv 0.000053 0.934192 XGBRegressor(base_score=0.5, booster='gbtree',...
2 cv 0.000054 0.932460 LinearRegression(copy_X=True, fit_intercept=Tr...
3 cv 0.000056 0.930052 LinearSVR(C=0.5, dual=True, epsilon=0.01, fit_...
4 cv 0.000055 0.932196 ElasticNetCV(alphas=None, copy_X=True, cv=None...
5 cv 0.000052 0.935667 GradientBoostingRegressor(alpha=0.99, criterio...

Final choice: RandomForestRegressor as the stacking meta-learner.
Submission 14; online 0.00007808

Final result: top 2.5% (68/2682)

