Project: Feature Engineering for the Tianchi Industrial Steam Prediction Competition

01_columns_info

print(columns_info)
type count count_rate unique_count unique_set mean std min 25% 50% 75% max
target float 2888 1.000 1916 [(0.669, 7), (0.8170000000000001, 7), (0.451, ... 0.126353 0.983966 -3.044 -0.35025 0.3130 0.79325 2.538
V0 float 2888 1.000 1801 [(0.875, 7), (0.425, 7), (0.65, 6), (0.4679999... 0.123048 0.928031 -4.335 -0.29700 0.3590 0.72600 2.121
V1 float 2888 1.000 1759 [(0.38, 8), (0.402, 7), (0.449, 7), (0.47, 7),... 0.056068 0.941515 -5.122 -0.22625 0.2725 0.59900 1.918
V2 float 2888 1.000 1948 [(0.066, 6), (0.777, 6), (0.887, 5), (0.858, 5... 0.289720 0.911236 -3.420 -0.31300 0.3860 0.91825 2.828
V3 float 2888 1.000 1820 [(-0.321, 9), (-0.196, 8), (-0.218, 7), (-0.07... -0.067790 0.970298 -3.956 -0.65225 -0.0445 0.62400 2.457
V4 float 2888 1.000 1824 [(-0.049, 7), (0.125, 6), (0.01300000000000000... 0.012921 0.888377 -4.742 -0.38500 0.1100 0.55025 2.689
V5 float 2888 1.000 1452 [(-0.452, 8), (-0.475, 7), (-0.196, 7), (-0.22... -0.558565 0.517957 -2.182 -0.85300 -0.4660 -0.15400 0.489
V6 float 2888 1.000 1834 [(0.517, 6), (0.861, 6), (0.506, 6), (-0.02799... 0.182892 0.918054 -4.576 -0.31000 0.3880 0.83125 1.895
V7 float 2888 1.000 1353 [(0.474, 9), (0.75, 8), (0.64, 8), (1.077, 8),... 0.116155 0.955116 -5.048 -0.29500 0.3440 0.78225 1.918
V8 float 2888 1.000 1807 [(0.857, 9), (0.75, 7), (0.595, 7), (0.348, 7)... 0.177856 0.895444 -4.692 -0.15900 0.3620 0.72600 2.245
V9 float 2888 1.000 72 [(0.042, 1219), (-0.39, 355), (0.904, 308), (0... -0.169452 0.953813 -12.891 -0.39000 0.0420 0.04200 1.335
V10 float 2888 1.000 1888 [(0.26899999999999996, 7), (-2.583, 7), (-2.58... 0.034319 0.968272 -2.584 -0.42050 0.1570 0.61925 4.830
V11 float 2888 1.000 1766 [(0.282, 23), (0.28800000000000003, 7), (0.408... -0.364465 0.858504 -3.160 -0.80325 -0.1120 0.24700 1.455
V12 float 2888 1.000 1841 [(-0.151, 7), (0.374, 6), (-0.125, 5), (0.718,... 0.023177 0.894092 -5.165 -0.41900 0.1230 0.61600 2.657
V13 float 2888 1.000 1935 [(0.20199999999999999, 6), (-0.535, 6), (0.839... 0.195738 0.922757 -3.675 -0.39800 0.2895 0.86425 2.475
V14 float 2888 1.000 1935 [(0.204, 6), (-0.484, 6), (-0.264, 6), (-0.261... 0.016081 1.015585 -2.455 -0.66800 -0.1610 0.82975 2.558
V15 float 2888 1.000 1781 [(-0.847, 7), (0.055, 6), (-0.662, 6), (-0.223... 0.096146 1.033048 -2.903 -0.66225 -0.0005 0.73000 4.314
V16 float 2888 1.000 1773 [(0.35100000000000003, 8), (0.466, 6), (0.081,... 0.113505 0.983128 -5.981 -0.30000 0.3060 0.77425 2.861
V17 float 2888 1.000 99 [(0.165, 721), (-0.366, 473), (0.43, 456), (-0... -0.043458 0.655857 -2.224 -0.36600 0.1650 0.43000 2.023
V18 float 2888 1.000 940 [(0.069, 30), (0.07400000000000001, 29), (0.07... 0.055034 0.953466 -3.582 -0.36750 0.0820 0.51325 4.441
V19 float 2888 1.000 2066 [(-1.361, 21), (0.20199999999999999, 5), (0.67... -0.114884 1.108859 -3.704 -0.98750 -0.0005 0.73725 3.431
V20 float 2888 1.000 1676 [(0.414, 8), (-0.319, 8), (0.147, 7), (0.71900... -0.186226 0.788511 -3.402 -0.67550 -0.1565 0.30400 3.525
V21 float 2888 1.000 1815 [(-0.135, 6), (0.285, 6), (0.237, 6), (0.066, ... -0.056556 0.781471 -2.643 -0.51700 -0.0565 0.43150 2.259
V22 float 2888 1.000 73 [(0.133, 456), (0.314, 336), (-0.063, 333), (0... 0.302893 0.639186 -1.375 -0.06300 0.2165 0.87200 2.018
V23 float 2888 1.000 760 [(0.342, 52), (0.34299999999999997, 50), (0.34... 0.155978 0.978757 -5.542 0.09725 0.3380 0.36825 1.906
V24 float 2888 1.000 682 [(-1.1909999999999998, 120), (-1.3219999999999... -0.021813 1.033403 -1.344 -1.19100 0.0950 0.93125 2.423
V25 float 2888 1.000 1781 [(-0.006, 9), (-0.14, 7), (-0.153, 7), (0.221,... -0.051679 0.915957 -3.808 -0.55725 -0.0760 0.35600 7.284
V26 float 2888 1.000 1898 [(0.474, 6), (0.05, 6), (0.127, 6), (-0.265, 5... 0.072092 0.889771 -5.131 -0.45200 0.0750 0.64425 2.980
V27 float 2888 1.000 983 [(0.312, 14), (0.337, 14), (0.439, 13), (0.275... 0.272407 0.270374 -1.164 0.15775 0.3250 0.44200 0.925
V28 float 2888 1.000 576 [(-0.45799999999999996, 126), (-0.456, 124), (... 0.137712 0.929899 -2.435 -0.45500 -0.4470 0.73000 4.671
V29 float 2888 1.000 1851 [(-0.20600000000000002, 7), (-0.654, 6), (0.26... 0.097648 1.061200 -2.912 -0.66400 -0.0230 0.74525 4.580
V30 float 2888 1.000 1636 [(-4.497, 9), (-0.022000000000000002, 7), (-0.... 0.055477 0.901934 -4.507 -0.28300 0.0535 0.48800 2.689
V31 float 2888 1.000 1703 [(0.498, 7), (0.478, 7), (0.114, 7), (0.513, 6... 0.127791 0.873028 -5.859 -0.17025 0.2995 0.63500 2.013
V32 float 2888 1.000 1748 [(-4.0489999999999995, 9), (-0.065, 8), (0.734... 0.020806 0.902584 -4.053 -0.40725 0.0390 0.55700 2.395
V33 float 2888 1.000 429 [(-0.04, 351), (0.419, 278), (-0.843, 205), (0... 0.007801 1.006995 -4.627 -0.49900 -0.0400 0.46200 5.465
V34 float 2888 1.000 419 [(0.16, 450), (0.273, 373), (-0.29, 324), (0.0... 0.006715 1.003291 -4.789 -0.29000 0.1600 0.27300 5.110
V35 float 2888 1.000 224 [(0.364, 1157), (0.8390000000000001, 316), (-0... 0.197764 0.985675 -5.695 -0.20250 0.3640 0.60200 2.324
V36 float 2888 1.000 1847 [(-2.608, 17), (-2.5639999999999996, 13), (0.4... 0.030658 0.970812 -2.608 -0.41300 0.1370 0.64425 5.238
V37 float 2888 1.000 2009 [(-0.677, 6), (0.18100000000000002, 6), (-0.81... -0.130330 1.017196 -3.630 -0.79825 -0.1855 0.49525 3.000
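
For reference, a minimal sketch of how a summary table like this can be built with pandas alone; the helper name and the exact set of columns are assumptions, not the original implementation:

import pandas as pd

def get_columns_info(df):
    rows = {}
    for col in df.columns:
        s = df[col]
        desc = s.describe()  # count/mean/std/min/25%/50%/75%/max for numeric columns
        rows[col] = {
            'type': s.dtype.name,
            'count': int(s.count()),
            'count_rate': round(s.count() / len(df), 3),
            'unique_count': s.nunique(),
            # the most frequent values with their counts, as in the unique_set column
            'unique_set': list(s.value_counts().head(5).items()),
            **{k: desc[k] for k in ['mean', 'std', 'min', '25%', '50%', '75%', 'max']},
        }
    return pd.DataFrame.from_dict(rows, orient='index')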

02_column_corr

FeatureTools.get_column_corr(train_data,target_column)
pearson spearman mine
target 1.000000 1.000000 1.000000
V0 0.873212 0.866709 0.589606
V1 0.871846 0.832457 0.529013
V8 0.831904 0.799280 0.487379
V27 0.812585 0.765133 0.431041
V31 0.750297 0.749034 0.428427
V2 0.638878 0.630160 0.332839
V4 0.603984 0.574775 0.276028
V12 0.594189 0.542429 0.256378
V37 -0.565795 -0.497162 0.253475
V16 0.536748 0.510025 0.239708
V3 0.512074 0.501114 0.221762
V10 0.394767 0.371067 0.195347
V20 0.444965 0.420424 0.180910
V36 0.319309 0.287696 0.167771
V24 -0.264815 -0.296056 0.162487
V25 -0.019373 0.049352 0.161249
V5 -0.314676 -0.345683 0.158044
V29 0.123329 0.198244 0.146577
V15 0.154020 0.213490 0.146141
V6 0.370037 0.264778 0.144643
V7 0.287815 0.164981 0.143749
V11 -0.263988 -0.293261 0.134465
V23 0.226331 0.043597 0.127452
V18 0.170721 0.162710 0.124676
V9 0.139704 -0.054385 0.124243
V19 -0.114976 -0.171120 0.120048
V28 0.100080 0.081030 0.119935
V14 0.008424 0.011385 0.112018
V22 -0.107813 -0.112248 0.107193
V33 0.077273 0.090670 0.105531
V26 -0.046724 -0.083314 0.105282
V13 0.203373 0.177317 0.100499
V32 0.066606 -0.034094 0.099469
V30 0.187311 0.123423 0.099304
V34 -0.006034 -0.001032 0.095402
V21 -0.010063 0.034762 0.095052
V17 0.104605 0.084335 0.093713
V35 0.140294 0.033038 0.088233
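
The pearson and spearman columns come straight from pandas; the mine column is presumably the maximal information coefficient (MIC). A sketch assuming the minepy package:

import pandas as pd
from minepy import MINE

def get_column_corr(df, target):
    pearson = df.corr(method='pearson')[target]
    spearman = df.corr(method='spearman')[target]
    mic = {}
    for col in df.columns:
        m = MINE()
        m.compute_score(df[col].values, df[target].values)
        mic[col] = m.mic()
    out = pd.DataFrame({'pearson': pearson, 'spearman': spearman, 'mine': pd.Series(mic)})
    return out.sort_values('mine', ascending=False)  # the listing above is sorted by mine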

03_all_data_corr

train_data.corr()
target V0 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37
target 1.000000 0.873212 0.871846 0.638878 0.512074 0.603984 -0.314676 0.370037 0.287815 0.831904 0.139704 0.394767 -0.263988 0.594189 0.203373 0.008424 0.154020 0.536748 0.104605 0.170721 -0.114976 0.444965 -0.010063 -0.107813 0.226331 -0.264815 -0.019373 -0.046724 0.812585 0.100080 0.123329 0.187311 0.750297 0.066606 0.077273 -0.006034 0.140294 0.319309 -0.565795
V0 0.873212 1.000000 0.908607 0.463643 0.409576 0.781212 -0.327028 0.189267 0.141294 0.794013 0.077888 0.298443 -0.295420 0.751830 0.185144 -0.004144 0.314520 0.347357 0.044722 0.148622 -0.100294 0.462493 -0.029285 -0.105643 0.231136 -0.324959 -0.200706 -0.125140 0.733198 0.035119 0.302145 0.156968 0.675003 0.050951 0.056439 -0.019342 0.138933 0.231417 -0.494076
V1 0.871846 0.908607 1.000000 0.506514 0.383924 0.657790 -0.227289 0.276805 0.205023 0.874650 0.138849 0.310120 -0.197317 0.656186 0.157518 -0.006268 0.164702 0.435606 0.072619 0.123862 -0.092673 0.459795 -0.012911 -0.102421 0.222574 -0.233556 -0.070627 -0.043012 0.824198 0.077346 0.147096 0.175997 0.769745 0.085604 0.035129 -0.029115 0.146329 0.235299 -0.494043
V2 0.638878 0.463643 0.506514 1.000000 0.410148 0.057697 -0.322417 0.615938 0.477114 0.703431 0.047874 0.346006 -0.256407 0.059941 0.204762 -0.106282 -0.224573 0.782474 -0.019008 0.132105 -0.161802 0.298385 -0.030932 -0.212023 0.065509 0.010225 0.481785 0.035370 0.726250 0.229575 -0.275764 0.175943 0.653764 0.033942 0.050309 -0.025620 0.043648 0.316462 -0.734956
V3 0.512074 0.409576 0.383924 0.410148 1.000000 0.315046 -0.206307 0.233896 0.197836 0.411946 -0.063717 0.321262 -0.100489 0.306397 -0.003636 -0.232677 0.143457 0.394517 0.123900 0.022868 -0.246008 0.289594 0.114373 -0.291236 0.081374 -0.237326 -0.100569 -0.027685 0.392006 0.159039 0.117610 0.043966 0.421954 -0.092423 -0.007159 -0.031898 0.080034 0.324475 -0.229613
V4 0.603984 0.781212 0.657790 0.057697 0.315046 1.000000 -0.233959 -0.117529 -0.052370 0.449542 -0.031816 0.141129 -0.162507 0.927685 0.075993 0.023853 0.615704 0.023818 0.044803 0.136022 -0.205729 0.291309 0.174025 -0.028534 0.196530 -0.529866 -0.444375 -0.080487 0.412083 -0.044620 0.659093 0.022807 0.447016 -0.026186 0.062367 0.028659 0.100010 0.113609 -0.031054
V5 -0.314676 -0.327028 -0.227289 -0.322417 -0.206307 -0.233959 1.000000 -0.028995 0.081069 -0.182281 0.038810 0.054060 0.863890 -0.306672 -0.414517 -0.015671 -0.195037 -0.044543 0.348211 -0.190197 0.171611 -0.073232 0.115553 0.146545 -0.158441 0.275480 0.045551 0.294934 -0.218495 -0.042210 -0.175836 -0.074214 -0.121290 -0.061886 -0.132727 -0.105801 -0.075191 0.026596 0.404799
V6 0.370037 0.189267 0.276805 0.615938 0.233896 -0.117529 -0.028995 1.000000 0.917502 0.468233 0.450096 0.415660 -0.147990 -0.087312 0.138367 0.072911 -0.431542 0.847119 0.134715 0.110570 0.215290 0.136091 -0.051806 -0.068158 0.069901 0.072418 0.438610 0.106055 0.474441 0.093427 -0.467980 0.188907 0.546535 0.144550 0.054210 -0.002914 0.044992 0.433804 -0.404817
V7 0.287815 0.141294 0.205023 0.477114 0.197836 -0.052370 0.081069 0.917502 1.000000 0.389987 0.446611 0.310982 -0.064402 -0.036791 0.110973 0.163931 -0.291272 0.752683 0.239448 0.098691 0.158371 0.089399 -0.065300 0.077358 0.125180 -0.030292 0.316744 0.160566 0.424185 0.058800 -0.311363 0.170113 0.475254 0.122707 0.034508 -0.019103 0.111166 0.340479 -0.292285
V8 0.831904 0.794013 0.874650 0.703431 0.411946 0.449542 -0.182281 0.468233 0.389987 1.000000 0.100672 0.419703 -0.146689 0.420557 0.153299 0.008138 0.018366 0.680031 0.112053 0.093682 -0.144693 0.412868 -0.047839 -0.097908 0.174124 -0.136898 0.173320 0.015724 0.901100 0.122050 -0.011091 0.150258 0.878072 0.038430 0.026843 -0.036297 0.179167 0.326586 -0.553121
V9 0.139704 0.077888 0.138849 0.047874 -0.063717 -0.031816 0.038810 0.450096 0.446611 0.100672 1.000000 0.120208 -0.114374 -0.011889 -0.040705 0.118176 -0.199159 0.193681 0.167310 0.260079 0.358149 0.116111 -0.018681 0.098401 0.380050 -0.008549 0.078928 0.128494 0.114315 -0.064595 -0.221623 0.293026 0.121712 0.289891 0.115655 0.094856 0.141703 0.129542 -0.112503
V10 0.394767 0.298443 0.310120 0.346006 0.321262 0.141129 0.054060 0.415660 0.310982 0.419703 0.120208 1.000000 0.052378 0.140462 -0.059553 -0.077543 -0.046737 0.546975 0.273876 -0.024693 0.074903 0.207612 0.082288 -0.127544 -0.066537 -0.029420 0.079805 0.072366 0.246085 0.056484 -0.105042 -0.036705 0.560213 -0.093213 0.016739 -0.026994 0.026846 0.922190 -0.045851
V11 -0.263988 -0.295420 -0.197317 -0.256407 -0.100489 -0.162507 0.863890 -0.147990 -0.064402 -0.146689 -0.114374 0.052378 1.000000 -0.236665 -0.436294 -0.038986 -0.092213 -0.064671 0.219873 -0.189103 -0.125301 -0.125197 0.246936 0.072084 -0.168470 0.158754 0.009678 0.268350 -0.189870 0.018931 -0.084938 -0.153304 -0.084298 -0.153126 -0.095359 -0.053865 -0.032951 0.003413 0.459867
V12 0.594189 0.751830 0.656186 0.059941 0.306397 0.927685 -0.306672 -0.087312 -0.036791 0.420557 -0.011889 0.140462 -0.236665 1.000000 0.098771 0.020069 0.642081 0.025736 0.013940 0.119833 -0.148319 0.271559 0.144371 -0.045522 0.180049 -0.550881 -0.448877 -0.124111 0.374380 -0.062193 0.666775 0.028866 0.441963 -0.007658 0.046674 0.010122 0.081963 0.112150 -0.054827
V13 0.203373 0.185144 0.157518 0.204762 -0.003636 0.075993 -0.414517 0.138367 0.110973 0.153299 -0.040705 -0.059553 -0.436294 0.098771 1.000000 0.546881 0.028476 0.131708 -0.073260 0.255564 -0.099820 0.014640 -0.501143 0.334983 0.239335 -0.078634 -0.052681 0.046338 0.177295 0.052531 0.008235 0.027328 0.113743 0.130598 0.157513 0.116944 0.219906 -0.024751 -0.379714
V14 0.008424 -0.004144 -0.006268 -0.106282 -0.232677 0.023853 -0.015671 0.072911 0.163931 0.008138 0.118176 -0.077543 -0.038986 0.020069 0.546881 1.000000 0.068812 0.020016 0.081153 0.089314 0.048184 -0.183244 -0.180023 0.687064 0.199934 -0.051725 -0.163607 -0.028209 0.007920 -0.059098 0.056814 -0.004057 0.010989 0.106581 0.073535 0.043218 0.233523 -0.086217 0.010553
V15 0.154020 0.314520 0.164702 -0.224573 0.143457 0.615704 -0.195037 -0.431542 -0.291272 0.018366 -0.199159 -0.046737 -0.092213 0.642081 0.028476 0.068812 1.000000 -0.301427 0.009384 0.064751 -0.210738 0.037809 0.137895 0.052120 0.124758 -0.654452 -0.592635 -0.162995 -0.019880 -0.176732 0.951314 -0.111311 0.011768 -0.104618 0.050254 0.048602 0.100817 -0.051861 0.245635
V16 0.536748 0.347357 0.435606 0.782474 0.394517 0.023818 -0.044543 0.847119 0.752683 0.680031 0.193681 0.546975 -0.064671 0.025736 0.131708 0.020016 -0.301427 1.000000 0.205478 0.080302 -0.001260 0.200339 -0.029517 -0.074705 0.054720 0.012726 0.499362 0.090949 0.650810 0.177611 -0.342210 0.154794 0.778538 0.041474 0.028878 -0.054775 0.082293 0.551880 -0.420053
V17 0.104605 0.044722 0.072619 -0.019008 0.123900 0.044803 0.348211 0.134715 0.239448 0.112053 0.167310 0.273876 0.219873 0.013940 -0.073260 0.081153 0.009384 0.205478 1.000000 -0.075920 0.199407 0.045276 0.013992 0.362668 0.012545 0.039153 -0.043306 0.120531 0.067457 0.053196 0.004855 -0.010787 0.150118 -0.051377 -0.055996 -0.064533 0.072320 0.312751 0.045842
V18 0.170721 0.148622 0.123862 0.132105 0.022868 0.136022 -0.190197 0.110570 0.098691 0.093682 0.260079 -0.024693 -0.189103 0.119833 0.255564 0.089314 0.064751 0.080302 -0.075920 1.000000 0.002867 0.075928 -0.159924 -0.019774 0.469207 -0.155036 0.011666 -0.026248 0.129602 0.033263 0.053958 0.470341 0.079718 0.411967 0.512139 0.365410 0.152088 0.019603 -0.181937
V19 -0.114976 -0.100294 -0.092673 -0.161802 -0.246008 -0.205729 0.171611 0.215290 0.158371 -0.144693 0.358149 0.074903 -0.125301 -0.148319 -0.099820 0.048184 -0.210738 -0.001260 0.199407 0.002867 1.000000 0.043169 -0.061507 0.076857 -0.093989 0.218265 0.132375 -0.207373 -0.171929 -0.028678 -0.205409 0.100133 -0.131542 0.144018 -0.021517 -0.079753 -0.220737 0.087605 0.012115
V20 0.444965 0.462493 0.459795 0.298385 0.289594 0.291309 -0.073232 0.136091 0.089399 0.412868 0.116111 0.207612 -0.125197 0.271559 0.014640 -0.183244 0.037809 0.200339 0.045276 0.075928 0.043169 1.000000 -0.031573 -0.120368 0.126863 -0.045432 0.174834 0.003042 0.389767 -0.005863 0.016233 0.086165 0.326863 0.050699 0.009358 -0.000979 0.048981 0.161315 -0.322006
V21 -0.010063 -0.029285 -0.012911 -0.030932 0.114373 0.174025 0.115553 -0.051806 -0.065300 -0.047839 -0.018681 0.082288 0.246936 0.144371 -0.501143 -0.180023 0.137895 -0.029517 0.013992 -0.159924 -0.061507 -0.031573 1.000000 -0.125745 -0.209717 -0.118741 -0.060683 0.014606 -0.071355 0.032875 0.157097 -0.077945 0.053025 -0.159128 -0.087561 -0.053707 -0.199398 0.047340 0.315470
V22 -0.107813 -0.105643 -0.102421 -0.212023 -0.291236 -0.028534 0.146545 -0.068158 0.077358 -0.097908 0.098401 -0.127544 0.072084 -0.045522 0.334983 0.687064 0.052120 -0.074705 0.362668 -0.019774 0.076857 -0.120368 -0.125745 1.000000 0.150300 0.061481 -0.165004 0.100445 -0.100171 -0.012907 0.053349 -0.039953 -0.108088 0.057179 -0.019107 -0.002095 0.205423 -0.130607 0.099282
V23 0.226331 0.231136 0.222574 0.065509 0.081374 0.196530 -0.158441 0.069901 0.125180 0.174124 0.380050 -0.066537 -0.168470 0.180049 0.239335 0.199934 0.124758 0.054720 0.012545 0.469207 -0.093989 0.126863 -0.209717 0.150300 1.000000 -0.232245 -0.088441 0.015688 0.211179 -0.034477 0.116122 0.363963 0.129783 0.367086 0.183666 0.196681 0.635252 -0.035949 -0.187582
V24 -0.264815 -0.324959 -0.233556 0.010225 -0.237326 -0.529866 0.275480 0.072418 -0.030292 -0.136898 -0.008549 -0.029420 0.158754 -0.550881 -0.078634 -0.051725 -0.654452 0.012726 0.039153 -0.155036 0.218265 -0.045432 -0.118741 0.061481 -0.232245 1.000000 0.383885 0.175273 -0.135937 0.141918 -0.642370 0.033532 -0.202097 0.060608 -0.134320 -0.095588 -0.243738 -0.041325 -0.137614
V25 -0.019373 -0.200706 -0.070627 0.481785 -0.100569 -0.444375 0.045551 0.438610 0.316744 0.173320 0.078928 0.079805 0.009678 -0.448877 -0.052681 -0.163607 -0.592635 0.499362 -0.043306 0.011666 0.132375 0.174834 -0.060683 -0.165004 -0.088441 0.383885 1.000000 0.111695 0.218115 0.132823 -0.575154 0.088238 0.201243 0.065501 -0.013312 -0.030747 -0.093948 0.069302 -0.246742
V26 -0.046724 -0.125140 -0.043012 0.035370 -0.027685 -0.080487 0.294934 0.106055 0.160566 0.015724 0.128494 0.072366 0.268350 -0.124111 0.046338 -0.028209 -0.162995 0.090949 0.120531 -0.026248 -0.207373 0.003042 0.014606 0.100445 0.015688 0.175273 0.111695 1.000000 0.023140 0.199076 -0.133694 -0.057247 0.062879 -0.004545 -0.034596 0.051294 0.085576 0.064963 0.010880
V27 0.812585 0.733198 0.824198 0.726250 0.392006 0.412083 -0.218495 0.474441 0.424185 0.901100 0.114315 0.246085 -0.189870 0.374380 0.177295 0.007920 -0.019880 0.650810 0.067457 0.129602 -0.171929 0.389767 -0.071355 -0.100171 0.211179 -0.135937 0.218115 0.023140 1.000000 0.125607 -0.032772 0.208074 0.790239 0.095127 0.030135 -0.036123 0.159884 0.226713 -0.617771
V28 0.100080 0.035119 0.077346 0.229575 0.159039 -0.044620 -0.042210 0.093427 0.058800 0.122050 -0.064595 0.056484 0.018931 -0.062193 0.052531 -0.059098 -0.176732 0.177611 0.053196 0.033263 -0.028678 -0.005863 0.032875 -0.012907 -0.034477 0.141918 0.132823 0.199076 0.125607 1.000000 -0.154572 0.054546 0.123403 0.013142 -0.024866 -0.058462 -0.080237 0.061601 -0.149326
V29 0.123329 0.302145 0.147096 -0.275764 0.117610 0.659093 -0.175836 -0.467980 -0.311363 -0.011091 -0.221623 -0.105042 -0.084938 0.666775 0.008235 0.056814 0.951314 -0.342210 0.004855 0.053958 -0.205409 0.016233 0.157097 0.053349 0.116122 -0.642370 -0.575154 -0.133694 -0.032772 -0.154572 1.000000 -0.122817 -0.004364 -0.110699 0.035272 0.035392 0.078588 -0.099309 0.285581
V30 0.187311 0.156968 0.175997 0.175943 0.043966 0.022807 -0.074214 0.188907 0.170113 0.150258 0.293026 -0.036705 -0.153304 0.028866 0.027328 -0.004057 -0.111311 0.154794 -0.010787 0.470341 0.100133 0.086165 -0.077945 -0.039953 0.363963 0.033532 0.088238 -0.057247 0.208074 0.054546 -0.122817 1.000000 0.114318 0.695725 0.083693 -0.028573 -0.027987 0.006961 -0.256814
V31 0.750297 0.675003 0.769745 0.653764 0.421954 0.447016 -0.121290 0.546535 0.475254 0.878072 0.121712 0.560213 -0.084298 0.441963 0.113743 0.010989 0.011768 0.778538 0.150118 0.079718 -0.131542 0.326863 0.053025 -0.108088 0.129783 -0.202097 0.201243 0.062879 0.790239 0.123403 -0.004364 0.114318 1.000000 0.016782 0.016733 -0.047273 0.152314 0.510851 -0.357785
V32 0.066606 0.050951 0.085604 0.033942 -0.092423 -0.026186 -0.061886 0.144550 0.122707 0.038430 0.289891 -0.093213 -0.153126 -0.007658 0.130598 0.106581 -0.104618 0.041474 -0.051377 0.411967 0.144018 0.050699 -0.159128 0.057179 0.367086 0.060608 0.065501 -0.004545 0.095127 0.013142 -0.110699 0.695725 0.016782 1.000000 0.105255 0.069300 0.016901 -0.054411 -0.162417
V33 0.077273 0.056439 0.035129 0.050309 -0.007159 0.062367 -0.132727 0.054210 0.034508 0.026843 0.115655 0.016739 -0.095359 0.046674 0.157513 0.073535 0.050254 0.028878 -0.055996 0.512139 -0.021517 0.009358 -0.087561 -0.019107 0.183666 -0.134320 -0.013312 -0.034596 0.030135 -0.024866 0.035272 0.083693 0.016733 0.105255 1.000000 0.719126 0.167597 0.031586 -0.062715
V34 -0.006034 -0.019342 -0.029115 -0.025620 -0.031898 0.028659 -0.105801 -0.002914 -0.019103 -0.036297 0.094856 -0.026994 -0.053865 0.010122 0.116944 0.043218 0.048602 -0.054775 -0.064533 0.365410 -0.079753 -0.000979 -0.053707 -0.002095 0.196681 -0.095588 -0.030747 0.051294 -0.036123 -0.058462 0.035392 -0.028573 -0.047273 0.069300 0.719126 1.000000 0.233616 -0.019032 -0.006854
V35 0.140294 0.138933 0.146329 0.043648 0.080034 0.100010 -0.075191 0.044992 0.111166 0.179167 0.141703 0.026846 -0.032951 0.081963 0.219906 0.233523 0.100817 0.082293 0.072320 0.152088 -0.220737 0.048981 -0.199398 0.205423 0.635252 -0.243738 -0.093948 0.085576 0.159884 -0.080237 0.078588 -0.027987 0.152314 0.016901 0.167597 0.233616 1.000000 0.025401 -0.077991
V36 0.319309 0.231417 0.235299 0.316462 0.324475 0.113609 0.026596 0.433804 0.340479 0.326586 0.129542 0.922190 0.003413 0.112150 -0.024751 -0.086217 -0.051861 0.551880 0.312751 0.019603 0.087605 0.161315 0.047340 -0.130607 -0.035949 -0.041325 0.069302 0.064963 0.226713 0.061601 -0.099309 0.006961 0.510851 -0.054411 0.031586 -0.019032 0.025401 1.000000 -0.039478
V37 -0.565795 -0.494076 -0.494043 -0.734956 -0.229613 -0.031054 0.404799 -0.404817 -0.292285 -0.553121 -0.112503 -0.045851 0.459867 -0.054827 -0.379714 0.010553 0.245635 -0.420053 0.045842 -0.181937 0.012115 -0.322006 0.315470 0.099282 -0.187582 -0.137614 -0.246742 0.010880 -0.617771 -0.149326 0.285581 -0.256814 -0.357785 -0.162417 -0.062715 -0.006854 -0.077991 -0.039478 1.000000

04_column_pic

Feature plot: train_data['target'].hist()
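
The plots themselves were not preserved; they can be regenerated along these lines (a sketch, assuming matplotlib):

import matplotlib.pyplot as plt

train_data['target'].hist(bins=50)          # target distribution
plt.show()
train_data.hist(bins=50, figsize=(20, 16))  # one histogram per feature
plt.show()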

06_feature_processing_01_PCA_to_remove_collinearity

Group: ['V0','V1','V8','V27','V31']
PCA threshold: 0.95
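
How the V0_pca, V1_pca, ... columns are produced is not shown; a sketch of the assumed step, using sklearn's PCA with a float n_components as the explained-variance threshold:

from sklearn.decomposition import PCA

def add_pca_features(data, group, threshold=0.95):
    pca = PCA(n_components=threshold)  # keep components until the variance threshold is reached
    comps = pca.fit_transform(data[group])
    for i in range(comps.shape[1]):
        data[group[i] + '_pca'] = comps[:, i]  # named after the group's leading features
    return data

# with threshold 0.95 this group keeps 2 components: V0_pca, V1_pca
train_data = add_pca_features(train_data, ['V0', 'V1', 'V8', 'V27', 'V31'], 0.95)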
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[target_high_rate_list,['V0_pca','V1_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V0', 'V1', 'V8', 'V27', 'V31'] DecisionTreeRegressor -0.337543 0.037634
1 ['V0', 'V1', 'V8', 'V27', 'V31'] RandomForestRegressor -0.193548 0.032157
2 ['V0', 'V1', 'V8', 'V27', 'V31'] XGBRegressor -0.169513 0.031840
3 ['V0', 'V1', 'V8', 'V27', 'V31'] SVR -0.164338 0.033965
4 ['V0', 'V1', 'V8', 'V27', 'V31'] LinearRegression -0.163307 0.030822
5 ['V0_pca', 'V1_pca'] DecisionTreeRegressor -0.348955 0.041524
6 ['V0_pca', 'V1_pca'] RandomForestRegressor -0.210917 0.029065
7 ['V0_pca', 'V1_pca'] XGBRegressor -0.175843 0.030416
8 ['V0_pca', 'V1_pca'] SVR -0.176460 0.031768
9 ['V0_pca', 'V1_pca'] LinearRegression -0.176673 0.028164
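
The mean/std columns read like cross-validated negative-MSE scores (closer to 0 is better). A minimal sketch of what FeatureTools.get_score_by_models presumably does:

import pandas as pd
from sklearn.model_selection import cross_val_score

def get_score_by_models(data, target_column, feature_lists, models, cv=5):
    rows = []
    for features in feature_lists:
        for model in models:
            scores = cross_val_score(model, data[features], data[target_column],
                                     scoring='neg_mean_squared_error', cv=cv)
            rows.append({'name': str(features), 'model': type(model).__name__,
                         'mean': scores.mean(), 'std': scores.std()})
    return pd.DataFrame(rows)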

PCA threshold: 0.99
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[target_high_rate_list,['V0_pca','V1_pca','V8_pca','V27_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V0', 'V1', 'V8', 'V27', 'V31'] DecisionTreeRegressor -0.324587 0.033146
1 ['V0', 'V1', 'V8', 'V27', 'V31'] RandomForestRegressor -0.193224 0.033540
2 ['V0', 'V1', 'V8', 'V27', 'V31'] XGBRegressor -0.169513 0.031840
3 ['V0', 'V1', 'V8', 'V27', 'V31'] SVR -0.164338 0.033965
4 ['V0', 'V1', 'V8', 'V27', 'V31'] LinearRegression -0.163307 0.030822
5 ['V0_pca', 'V1_pca', 'V8_pca', 'V27_pca'] DecisionTreeRegressor -0.333768 0.039077
6 ['V0_pca', 'V1_pca', 'V8_pca', 'V27_pca'] RandomForestRegressor -0.206710 0.028182
7 ['V0_pca', 'V1_pca', 'V8_pca', 'V27_pca'] XGBRegressor -0.178987 0.030680
8 ['V0_pca', 'V1_pca', 'V8_pca', 'V27_pca'] SVR -0.171663 0.033606
9 ['V0_pca', 'V1_pca', 'V8_pca', 'V27_pca'] LinearRegression -0.175378 0.029132

PCA threshold: 0.97
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[target_high_rate_list,['V0_pca','V1_pca','V8_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V0', 'V1', 'V8', 'V27', 'V31'] DecisionTreeRegressor -0.336270 0.035375
1 ['V0', 'V1', 'V8', 'V27', 'V31'] RandomForestRegressor -0.193649 0.034210
2 ['V0', 'V1', 'V8', 'V27', 'V31'] XGBRegressor -0.169513 0.031840
3 ['V0', 'V1', 'V8', 'V27', 'V31'] SVR -0.164338 0.033965
4 ['V0', 'V1', 'V8', 'V27', 'V31'] LinearRegression -0.163307 0.030822
5 ['V0_pca', 'V1_pca', 'V8_pca'] DecisionTreeRegressor -0.351772 0.040581
6 ['V0_pca', 'V1_pca', 'V8_pca'] RandomForestRegressor -0.209066 0.029745
7 ['V0_pca', 'V1_pca', 'V8_pca'] XGBRegressor -0.177376 0.029603
8 ['V0_pca', 'V1_pca', 'V8_pca'] SVR -0.173976 0.035523
9 ['V0_pca', 'V1_pca', 'V8_pca'] LinearRegression -0.176499 0.029985

Conclusion: do not apply PCA to this group.


Group: ['V4','V12']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[target_high_rate_list,['V4_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V4', 'V12'] DecisionTreeRegressor -1.206150 0.155729
1 ['V4', 'V12'] RandomForestRegressor -0.748485 0.138379
2 ['V4', 'V12'] XGBRegressor -0.636875 0.151053
3 ['V4', 'V12'] SVR -0.654426 0.133919
4 ['V4', 'V12'] LinearRegression -0.629891 0.145093
5 ['V4_pca'] DecisionTreeRegressor -1.285653 0.186179
6 ['V4_pca'] RandomForestRegressor -0.969461 0.195460
7 ['V4_pca'] XGBRegressor -0.647809 0.160294
8 ['V4_pca'] SVR -0.669553 0.136717
9 ['V4_pca'] LinearRegression -0.624793 0.145420


Conclusion: do not apply PCA.


Group: ['V6','V7','V16']
Threshold: 0.95, actual explained variance: 0.98
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[target_high_rate_list,['V6_pca','V7_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V6', 'V7', 'V16'] DecisionTreeRegressor -1.590754 0.270260
1 ['V6', 'V7', 'V16'] RandomForestRegressor -0.852200 0.114089
2 ['V6', 'V7', 'V16'] XGBRegressor -0.700185 0.071235
3 ['V6', 'V7', 'V16'] SVR -0.683392 0.065502
4 ['V6', 'V7', 'V16'] LinearRegression -0.678508 0.088188
5 ['V6_pca', 'V7_pca'] DecisionTreeRegressor -1.377137 0.105200
6 ['V6_pca', 'V7_pca'] RandomForestRegressor -0.853283 0.098639
7 ['V6_pca', 'V7_pca'] XGBRegressor -0.688797 0.079210
8 ['V6_pca', 'V7_pca'] SVR -0.667027 0.083559
9 ['V6_pca', 'V7_pca'] LinearRegression -0.675531 0.088737
Conclusion: use PCA with the 0.95 threshold.



Group: ['V5','V11']
Threshold: 0.95, actual: 1
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[target_high_rate_list,['V5_pca','V11_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
name model mean std
0 ['V5', 'V11'] DecisionTreeRegressor -1.714300 0.217039
1 ['V5', 'V11'] RandomForestRegressor -1.050632 0.067146
2 ['V5', 'V11'] XGBRegressor -0.903512 0.073647
3 ['V5', 'V11'] SVR -0.885599 0.060962
4 ['V5', 'V11'] LinearRegression -0.881005 0.076114
5 ['V5_pca', 'V11_pca'] DecisionTreeRegressor -1.783053 0.073916
6 ['V5_pca', 'V11_pca'] RandomForestRegressor -1.087913 0.055959
7 ['V5_pca', 'V11_pca'] XGBRegressor -0.899808 0.078691
8 ['V5_pca', 'V11_pca'] SVR -0.885599 0.060962
9 ['V5_pca', 'V11_pca'] LinearRegression -0.881005 0.076114

Threshold: 0.94, actual: 0.947
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[target_high_rate_list,['V5_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V5', 'V11'] DecisionTreeRegressor -1.709464 0.205808
1 ['V5', 'V11'] RandomForestRegressor -1.048534 0.083421
2 ['V5', 'V11'] XGBRegressor -0.903512 0.073647
3 ['V5', 'V11'] SVR -0.885599 0.060962
4 ['V5', 'V11'] LinearRegression -0.881005 0.076114
5 ['V5_pca'] DecisionTreeRegressor -1.738627 0.089285
6 ['V5_pca'] RandomForestRegressor -1.328079 0.052006
7 ['V5_pca'] XGBRegressor -0.904260 0.077531
8 ['V5_pca'] SVR -0.902859 0.081526
9 ['V5_pca'] LinearRegression -0.898830 0.083833
Conclusion: do not apply PCA.

Side note:
Question: PCA made things worse here. Is that because this feature set is so small on its own? Combined with other strong features, PCA might actually improve the results.
Test:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0', 'V1', 'V8', 'V27', 'V31']+target_high_rate_list,['V0', 'V1', 'V8', 'V27', 'V31','V5_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V0', 'V1', 'V8', 'V27', 'V31', 'V5', 'V11'] DecisionTreeRegressor -0.330471 0.034977
1 ['V0', 'V1', 'V8', 'V27', 'V31', 'V5', 'V11'] RandomForestRegressor -0.189248 0.030511
2 ['V0', 'V1', 'V8', 'V27', 'V31', 'V5', 'V11'] XGBRegressor -0.166913 0.035273
3 ['V0', 'V1', 'V8', 'V27', 'V31', 'V5', 'V11'] SVR -0.162350 0.038070
4 ['V0', 'V1', 'V8', 'V27', 'V31', 'V5', 'V11'] LinearRegression -0.160023 0.035017
5 ['V0', 'V1', 'V8', 'V27', 'V31', 'V5_pca'] DecisionTreeRegressor -0.338560 0.045987
6 ['V0', 'V1', 'V8', 'V27', 'V31', 'V5_pca'] RandomForestRegressor -0.194929 0.031102
7 ['V0', 'V1', 'V8', 'V27', 'V31', 'V5_pca'] XGBRegressor -0.170267 0.033375
8 ['V0', 'V1', 'V8', 'V27', 'V31', 'V5_pca'] SVR -0.167098 0.036562
9 ['V0', 'V1', 'V8', 'V27', 'V31', 'V5_pca'] LinearRegression -0.162314 0.032627

Threshold: 0.98, actual: 1
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[target_high_rate_list,['V5_pca','V11_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V5', 'V11'] DecisionTreeRegressor -1.724229 0.198293
1 ['V5', 'V11'] RandomForestRegressor -1.065805 0.096598
2 ['V5', 'V11'] XGBRegressor -0.903512 0.073647
3 ['V5', 'V11'] SVR -0.885599 0.060962
4 ['V5', 'V11'] LinearRegression -0.881005 0.076114
5 ['V5_pca', 'V11_pca'] DecisionTreeRegressor -1.780515 0.133518
6 ['V5_pca', 'V11_pca'] RandomForestRegressor -1.060753 0.065842
7 ['V5_pca', 'V11_pca'] XGBRegressor -0.899808 0.078691
8 ['V5_pca', 'V11_pca'] SVR -0.885599 0.060962
9 ['V5_pca', 'V11_pca'] LinearRegression -0.881005 0.076114

Conclusion: no PCA is still best; the hypothesis above does not hold.


Group: ['V10','V36']
Threshold: 0.95, actual: 0.96
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[target_high_rate_list,['V10_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V10', 'V36'] DecisionTreeRegressor -1.446965 0.145960
1 ['V10', 'V36'] RandomForestRegressor -0.936744 0.088230
2 ['V10', 'V36'] XGBRegressor -0.765859 0.071264
3 ['V10', 'V36'] SVR -0.756399 0.087264
4 ['V10', 'V36'] LinearRegression -0.830502 0.099861
5 ['V10_pca'] DecisionTreeRegressor -1.617832 0.087963
6 ['V10_pca'] RandomForestRegressor -1.215397 0.064845
7 ['V10_pca'] XGBRegressor -0.853255 0.069104
8 ['V10_pca'] SVR -0.849245 0.091274
9 ['V10_pca'] LinearRegression -0.865060 0.096112
Threshold: 0.98, actual: 1
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[target_high_rate_list,['V10_pca','V36_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V10', 'V36'] DecisionTreeRegressor -1.435331 0.144751
1 ['V10', 'V36'] RandomForestRegressor -0.917391 0.094793
2 ['V10', 'V36'] XGBRegressor -0.765859 0.071264
3 ['V10', 'V36'] SVR -0.756399 0.087264
4 ['V10', 'V36'] LinearRegression -0.830502 0.099861
5 ['V10_pca', 'V36_pca'] DecisionTreeRegressor -1.379996 0.110129
6 ['V10_pca', 'V36_pca'] RandomForestRegressor -0.901878 0.074246
7 ['V10_pca', 'V36_pca'] XGBRegressor -0.742560 0.077888
8 ['V10_pca', 'V36_pca'] SVR -0.756399 0.087264
9 ['V10_pca', 'V36_pca'] LinearRegression -0.830502 0.099861
Conclusion: the mean scores change little (the XGB error improves by about 2 points), but the std improves almost across the board.
Adopt PCA with the 0.98 threshold.



Group: ['V15','V29']
Threshold: 0.95, actual: 0.975
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[target_high_rate_list,['V15_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
name model mean std
0 ['V15', 'V29'] DecisionTreeRegressor -1.761736 0.124785
1 ['V15', 'V29'] RandomForestRegressor -1.120907 0.078181
2 ['V15', 'V29'] XGBRegressor -0.930967 0.072820
3 ['V15', 'V29'] SVR -0.932632 0.081966
4 ['V15', 'V29'] LinearRegression -0.963071 0.108939
5 ['V15_pca'] DecisionTreeRegressor -1.771067 0.130386
6 ['V15_pca'] RandomForestRegressor -1.375482 0.124175
7 ['V15_pca'] XGBRegressor -0.947765 0.079940
8 ['V15_pca'] SVR -0.978946 0.087199
9 ['V15_pca'] LinearRegression -0.969395 0.100013

Threshold: 0.98, actual: 1
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[target_high_rate_list,['V15_pca','V29_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V15', 'V29'] DecisionTreeRegressor -1.753007 0.113313
1 ['V15', 'V29'] RandomForestRegressor -1.117463 0.075852
2 ['V15', 'V29'] XGBRegressor -0.930967 0.072820
3 ['V15', 'V29'] SVR -0.932632 0.081966
4 ['V15', 'V29'] LinearRegression -0.963071 0.108939
5 ['V15_pca', 'V29_pca'] DecisionTreeRegressor -1.810228 0.141944
6 ['V15_pca', 'V29_pca'] RandomForestRegressor -1.129080 0.116537
7 ['V15_pca', 'V29_pca'] XGBRegressor -0.925074 0.087392
8 ['V15_pca', 'V29_pca'] SVR -0.932632 0.081966
9 ['V15_pca', 'V29_pca'] LinearRegression -0.963071 0.108939

Conclusion: do not apply PCA.

Summary:
Groups to process with PCA:
['V6','V7','V16']: threshold 0.95
['V10','V36']: threshold 0.98
Check the effect:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V6','V7','V16','V10','V36'],['V6_pca','V7_pca','V10_pca','V36_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
name model mean std
0 ['V6', 'V7', 'V16', 'V10', 'V36'] DecisionTreeRegressor -1.257105 0.238803
1 ['V6', 'V7', 'V16', 'V10', 'V36'] RandomForestRegressor -0.733379 0.097895
2 ['V6', 'V7', 'V16', 'V10', 'V36'] XGBRegressor -0.639716 0.074017
3 ['V6', 'V7', 'V16', 'V10', 'V36'] SVR -0.624387 0.074436
4 ['V6', 'V7', 'V16', 'V10', 'V36'] LinearRegression -0.653913 0.082878
5 ['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca'] DecisionTreeRegressor -1.230450 0.162682
6 ['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca'] RandomForestRegressor -0.706818 0.129942
7 ['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca'] XGBRegressor -0.621447 0.076736
8 ['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca'] SVR -0.602430 0.086750
9 ['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca'] LinearRegression -0.649565 0.084521
Overall performance improved.

Combined with the features ['V0', 'V1', 'V8', 'V27', 'V31']:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0', 'V1', 'V8', 'V27', 'V31']+['V6','V7','V16','V10','V36'],['V0', 'V1', 'V8', 'V27', 'V31']+['V6_pca','V7_pca','V10_pca','V36_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V0', 'V1', 'V8', 'V27', 'V31', 'V6', 'V7', '... DecisionTreeRegressor -0.327180 0.048550
1 ['V0', 'V1', 'V8', 'V27', 'V31', 'V6', 'V7', '... RandomForestRegressor -0.177635 0.030780
2 ['V0', 'V1', 'V8', 'V27', 'V31', 'V6', 'V7', '... XGBRegressor -0.158466 0.028713
3 ['V0', 'V1', 'V8', 'V27', 'V31', 'V6', 'V7', '... SVR -0.151949 0.035026
4 ['V0', 'V1', 'V8', 'V27', 'V31', 'V6', 'V7', '... LinearRegression -0.141915 0.027381
5 ['V0', 'V1', 'V8', 'V27', 'V31', 'V6_pca', 'V7... DecisionTreeRegressor -0.324919 0.029349
6 ['V0', 'V1', 'V8', 'V27', 'V31', 'V6_pca', 'V7... RandomForestRegressor -0.175699 0.025374
7 ['V0', 'V1', 'V8', 'V27', 'V31', 'V6_pca', 'V7... XGBRegressor -0.156714 0.026560
8 ['V0', 'V1', 'V8', 'V27', 'V31', 'V6_pca', 'V7... SVR -0.154009 0.034241
9 ['V0', 'V1', 'V8', 'V27', 'V31', 'V6_pca', 'V7... LinearRegression -0.141616 0.025737

Apart from SVR, which degrades slightly, everything else improves a little (about 0.2%).


Addendum: PCA on V0 and V1
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0','V1'],['V0_pca','V1_pca']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V0', 'V1'] DecisionTreeRegressor -0.357428 0.037233
1 ['V0', 'V1'] RandomForestRegressor -0.234839 0.031499
2 ['V0', 'V1'] XGBRegressor -0.190226 0.035271
3 ['V0', 'V1'] SVR -0.186778 0.034362
4 ['V0', 'V1'] LinearRegression -0.200945 0.036035
5 ['V0_pca', 'V1_pca'] DecisionTreeRegressor -0.358901 0.030301
6 ['V0_pca', 'V1_pca'] RandomForestRegressor -0.233183 0.034905
7 ['V0_pca', 'V1_pca'] XGBRegressor -0.188024 0.034054
8 ['V0_pca', 'V1_pca'] SVR -0.186778 0.034362
9 ['V0_pca', 'V1_pca'] LinearRegression -0.200945 0.036035

07_feature_processing_02_L1_L2_collinearity_and_importance_filtering

Lasso: drop features whose coefficient is 0
V36 -0.219109
V7 -0.148457
V8 -0.147689
V5 -0.064422
V37 -0.053234
V24 -0.041133
V29 -0.035303
V25 -0.019774
V35 -0.015706
V28 -0.007174
V32 -0.002442
V34 -0.001933
V13 -0.001645
V31 -0.000000
V15 -0.000000
V21 0.000000
V20 0.005791
V18 0.011792
V23 0.012298
V22 0.012897
V30 0.013064
V19 0.014526
V33 0.015806
V26 0.018327
V16 0.023137
V11 0.026168
V4 0.030810
V9 0.034640
V14 0.052781
V17 0.084712
V12 0.095925
V6 0.115816
V3 0.124919
V2 0.166106
V1 0.187754
V10 0.293613
V0 0.345178
V27 0.952527
dtype: float64
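
A sketch of the assumed Lasso step behind this listing; the alpha value is a placeholder, the real one is not given:

import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

X, y = train_data[feature_columns], train_data[target_column]
lasso = Lasso(alpha=0.001).fit(X, y)  # alpha is an assumption
coef = pd.Series(lasso.coef_, index=feature_columns).sort_values()
print(coef)

# drop features whose coefficient magnitude falls below a threshold
threshold = 1e-6  # "is it 0"; the experiments below also try 0.005 and 0.01
drop_columns = coef[np.abs(coef) < threshold].index.values
feature_columns = coef[np.abs(coef) >= threshold].index.values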


Drop features V15, V21, V31:
INFO:root:score_df:
name model mean std
0 ['V0' 'V1' 'V2' 'V3' 'V4' 'V5' 'V6' 'V7' 'V8' ... DecisionTreeRegressor -0.323648 0.059886
1 ['V0' 'V1' 'V2' 'V3' 'V4' 'V5' 'V6' 'V7' 'V8' ... RandomForestRegressor -0.160019 0.026342
2 ['V0' 'V1' 'V2' 'V3' 'V4' 'V5' 'V6' 'V7' 'V8' ... XGBRegressor -0.139043 0.024366
3 ['V0' 'V1' 'V2' 'V3' 'V4' 'V5' 'V6' 'V7' 'V8' ... SVR -0.159603 0.034162
4 ['V0' 'V1' 'V2' 'V3' 'V4' 'V5' 'V6' 'V7' 'V8' ... LinearRegression -0.118851 0.023679

Drop features with |coef| < 0.005: feature_columns = coef[np.abs(coef) > 0.005].index.values
Dropped features: coef[np.abs(coef) < 0.005].index.values
['V32', 'V34', 'V13', 'V31', 'V15', 'V21']
Effect:
INFO:root:score_df:
name model mean std
0 ['V36' 'V7' 'V8' 'V5' 'V37' 'V24' 'V29' 'V25' ... DecisionTreeRegressor -0.311194 0.061370
1 ['V36' 'V7' 'V8' 'V5' 'V37' 'V24' 'V29' 'V25' ... RandomForestRegressor -0.154084 0.022158
2 ['V36' 'V7' 'V8' 'V5' 'V37' 'V24' 'V29' 'V25' ... XGBRegressor -0.137108 0.022506
3 ['V36' 'V7' 'V8' 'V5' 'V37' 'V24' 'V29' 'V25' ... SVR -0.157262 0.031500
4 ['V36' 'V7' 'V8' 'V5' 'V37' 'V24' 'V29' 'V25' ... LinearRegression -0.118579 0.023579

Drop features with |coef| < 0.01:
INFO:root:score_df:
name model mean std
0 ['V36' 'V7' 'V8' 'V5' 'V37' 'V24' 'V29' 'V25' ... DecisionTreeRegressor -0.320576 0.080739
1 ['V36' 'V7' 'V8' 'V5' 'V37' 'V24' 'V29' 'V25' ... RandomForestRegressor -0.160188 0.027430
2 ['V36' 'V7' 'V8' 'V5' 'V37' 'V24' 'V29' 'V25' ... XGBRegressor -0.138529 0.024386
3 ['V36' 'V7' 'V8' 'V5' 'V37' 'V24' 'V29' 'V25' ... SVR -0.156310 0.035335
4 ['V36' 'V7' 'V8' 'V5' 'V37' 'V24' 'V29' 'V25' ... LinearRegression -0.118228 0.023514

Final conclusion:
Drop the |coef| < 0.005 features, i.e. ['V32', 'V34', 'V13', 'V31', 'V15', 'V21']

Next, use L2 (Ridge) for feature selection; the L1 step above only tests whether a coefficient is 0.
INFO:__main__:Ridge coef:V36 -0.260455
V8 -0.203721
V7 -0.182319
V5 -0.112276
V37 -0.057344
V24 -0.042646
V29 -0.037042
V25 -0.031488
V35 -0.017645
V28 -0.010704
V32 -0.008210
V13 -0.007708
V34 -0.006462
V20 0.012151
V18 0.013222
V22 0.015514
V30 0.016995
V23 0.017536
V33 0.019093
V19 0.020612
V26 0.026117
V9 0.029347
V4 0.034870
V16 0.042453
V11 0.048848
V14 0.054658
V17 0.096162
V12 0.096255
V3 0.121323
V6 0.150298
V2 0.152474
V1 0.181212
V10 0.333551
V0 0.343823
V27 1.124369
dtype: float64

Threshold: 0.01
Features to drop: V32, V13, V34, the same conclusion as with L1.
INFO:__main__:drop_columns(l1):len(6)['V28' 'V32' 'V13' 'V34' 'V20' 'V18']
remain feature_columns:len(29)['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' 'V35' 'V22' 'V30' 'V23'
'V33' 'V19' 'V26' 'V9' 'V4' 'V16' 'V11' 'V14' 'V17' 'V12' 'V3' 'V6' 'V2'
'V1' 'V10' 'V0' 'V27']
INFO:root:score_df:
name model mean std
0 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... DecisionTreeRegressor -0.311387 0.063702
1 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... RandomForestRegressor -0.155467 0.024065
2 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... XGBRegressor -0.137706 0.025275
3 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... SVR -0.154197 0.035370
4 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... LinearRegression -0.118129 0.023470

Compared with dropping only V32, V13, V34 this is slightly worse: LR improves in the 4th decimal place, everything else degrades.

Threshold: 0.011
INFO:__main__:drop_columns(l1):len(4)['V28' 'V32' 'V13' 'V34']
remain feature_columns:len(31)['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' 'V35' 'V20' 'V18' 'V22'
'V30' 'V23' 'V33' 'V19' 'V26' 'V9' 'V4' 'V16' 'V11' 'V14' 'V17' 'V12'
'V3' 'V6' 'V2' 'V1' 'V10' 'V0' 'V27']

INFO:root:score_df:
name model mean std
0 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... DecisionTreeRegressor -0.301161 0.046436
1 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... RandomForestRegressor -0.161671 0.030504
2 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... XGBRegressor -0.138263 0.023748
3 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... SVR -0.156915 0.033124
4 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... LinearRegression -0.118192 0.023716
Similar to the best result; keep this for now.

Conclusion:
L1, |coef| threshold 0.000001 (effectively 0): drop ['V31' 'V15' 'V21']
L2, |coef| threshold 0.010: drop ['V32' 'V13' 'V34']
Final result:
INFO:__main__:drop_columns(l1):len(3)['V32' 'V13' 'V34']
remain feature_columns:len(32)['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' 'V35' 'V28' 'V20' 'V18'
'V22' 'V30' 'V23' 'V33' 'V19' 'V26' 'V9' 'V4' 'V16' 'V11' 'V14' 'V17'
'V12' 'V3' 'V6' 'V2' 'V1' 'V10' 'V0' 'V27']
INFO:root:score_df:
name model mean std
0 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... DecisionTreeRegressor -0.308981 0.048053
1 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... RandomForestRegressor -0.161922 0.026347
2 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... XGBRegressor -0.137111 0.022508
3 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... SVR -0.157262 0.031500
4 ['V36' 'V8' 'V7' 'V5' 'V37' 'V24' 'V29' 'V25' ... LinearRegression -0.118579 0.023579

08_feature_processing_03_combining_01_PCA_with_02_L1_L2

Run PCA first, then L1 and L2.
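
A sketch of this ordering, reusing the helpers sketched in sections 06 and 07 (both assumptions): transform the two chosen groups with PCA, drop the raw columns, then run the L1 filter on the augmented set.

# PCA on the groups chosen in section 06, then drop the raw columns
for group, threshold in [(['V6', 'V7', 'V16'], 0.95), (['V10', 'V36'], 0.98)]:
    train_data = add_pca_features(train_data, group, threshold)
    train_data = train_data.drop(columns=group)
# then apply the Lasso-coefficient filter from section 07 to what remains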

PCA only:
INFO:root:score_df:
name model mean std
0 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... DecisionTreeRegressor -0.300560 0.028233
1 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... RandomForestRegressor -0.162721 0.027562
2 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... XGBRegressor -0.135133 0.027914
3 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... SVR -0.165093 0.033164
4 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... LinearRegression -0.119697 0.025247


Apply L1:
Lasso coef:V36_pca -0.396163
V8 -0.179194
V7_pca -0.122315
V5 -0.119443
V37 -0.057500
V29 -0.051609
V24 -0.039309
V25 -0.033803
V35 -0.017655
V31 -0.014906
V28 -0.012344
V6_pca -0.002550
V34 -0.000000
V32 -0.000000
V22 -0.000000
V15 -0.000000
V13 0.000271
V21 0.004335
V20 0.009629
V30 0.009796
V23 0.010874
V18 0.012799
V33 0.015368
V26 0.022982
V19 0.028851
V4 0.031641
V9 0.041711
V11 0.044574
V14 0.051736
V10_pca 0.053542
V17 0.076241
V12 0.094225
V3 0.120042
V2 0.159088
V1 0.205517
V0 0.341014
V27 0.998343

INFO:__main__:drop_columns(l1):len(4)['V34' 'V32' 'V22' 'V15']
remain feature_columns:len(33)['V36_pca' 'V8' 'V7_pca' 'V5' 'V37' 'V29' 'V24' 'V25' 'V35' 'V31' 'V28'
'V6_pca' 'V13' 'V21' 'V20' 'V30' 'V23' 'V18' 'V33' 'V26' 'V19' 'V4' 'V9'
'V11' 'V14' 'V10_pca' 'V17' 'V12' 'V3' 'V2' 'V1' 'V0' 'V27']

INFO:root:score_df:
name model mean std
0 ['V36_pca' 'V8' 'V7_pca' 'V5' 'V37' 'V29' 'V24... DecisionTreeRegressor -0.293691 0.035344
1 ['V36_pca' 'V8' 'V7_pca' 'V5' 'V37' 'V29' 'V24... RandomForestRegressor -0.154106 0.025903
2 ['V36_pca' 'V8' 'V7_pca' 'V5' 'V37' 'V29' 'V24... XGBRegressor -0.133265 0.026316
3 ['V36_pca' 'V8' 'V7_pca' 'V5' 'V37' 'V29' 'V24... SVR -0.164783 0.031384
4 ['V36_pca' 'V8' 'V7_pca' 'V5' 'V37' 'V29' 'V24... LinearRegression -0.117350 0.024908

Apply L2:
INFO:__main__:Ridge coef:V36_pca -0.420732
V8 -0.201090
V5 -0.145248
V7_pca -0.138410
V37 -0.058596
V29 -0.054312
V24 -0.040038
V25 -0.039798
V31 -0.022494
V35 -0.018770
V28 -0.014279
V6_pca -0.008476
V13 0.001115
V21 0.004544
V30 0.010439
V20 0.012407
V23 0.012421
V18 0.012587
V33 0.015457
V26 0.026000
V19 0.033021
V4 0.033940
V9 0.040915
V14 0.050854
V10_pca 0.054400
V11 0.057880
V17 0.079718
V12 0.096246
V3 0.117980
V2 0.153222
V1 0.203653
V0 0.338806
V27 1.077792
dtype: float64
INFO:__main__:drop_columns(l1):len(3)['V6_pca' 'V13' 'V21']
remain feature_columns:len(30)['V36_pca' 'V8' 'V5' 'V7_pca' 'V37' 'V29' 'V24' 'V25' 'V31' 'V35' 'V28'
'V30' 'V20' 'V23' 'V18' 'V33' 'V26' 'V19' 'V4' 'V9' 'V14' 'V10_pca' 'V11'
'V17' 'V12' 'V3' 'V2' 'V1' 'V0' 'V27']
INFO:root:score_df:
name model mean std
0 ['V36_pca' 'V8' 'V5' 'V7_pca' 'V37' 'V29' 'V24... DecisionTreeRegressor -0.302435 0.032792
1 ['V36_pca' 'V8' 'V5' 'V7_pca' 'V37' 'V29' 'V24... RandomForestRegressor -0.156806 0.022907
2 ['V36_pca' 'V8' 'V5' 'V7_pca' 'V37' 'V29' 'V24... XGBRegressor -0.134421 0.026439
3 ['V36_pca' 'V8' 'V5' 'V7_pca' 'V37' 'V29' 'V24... SVR -0.156116 0.034362
4 ['V36_pca' 'V8' 'V5' 'V7_pca' 'V37' 'V29' 'V24... LinearRegression -0.116577 0.025306

Observation: L1 gives good results; L2 is worse overall, although LR and SVR do improve after L2.
So skip the L2 step for now and keep the L1-selected features.


An alternative idea: after PCA, drop the features chosen by the original pipeline, i.e. remove the sets below (equivalent to running L1/L2 first and PCA afterwards):
['V31' 'V15' 'V21']
['V32' 'V13' 'V34']
Compared with the approach above, the difference lies in how V6_pca, V15, and V22 are handled.

INFO:root:score_df:
name model mean std
0 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... DecisionTreeRegressor -0.303125 0.036104
1 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... RandomForestRegressor -0.157291 0.023390
2 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... XGBRegressor -0.135129 0.024343
3 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... SVR -0.161478 0.030688
4 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... LinearRegression -0.117744 0.025338


As shown, the result is still worse than PCA first followed by L1/L2.
Adopt PCA first, then L1, and skip L2.
Final result (the first score_df above):
INFO:root:score_df:
name model mean std
0 ['V36_pca' 'V8' 'V7_pca' 'V5' 'V37' 'V29' 'V24... DecisionTreeRegressor -0.293691 0.035344
1 ['V36_pca' 'V8' 'V7_pca' 'V5' 'V37' 'V29' 'V24... RandomForestRegressor -0.154106 0.025903
2 ['V36_pca' 'V8' 'V7_pca' 'V5' 'V37' 'V29' 'V24... XGBRegressor -0.133265 0.026316
3 ['V36_pca' 'V8' 'V7_pca' 'V5' 'V37' 'V29' 'V24... SVR -0.164783 0.031384
4 ['V36_pca' 'V8' 'V7_pca' 'V5' 'V37' 'V29' 'V24... LinearRegression -0.117350 0.024908

09_feature_processing_04_feature_arithmetic

Highly correlated feature pairs:
high_relate_pair = [('V0', 'V1'), ('V1', 'V8'), ('V8', 'V27'), ('V8', 'V31')]
test_extend_columns(train_data, high_relate_pair, target_column)


Take V0 and V1 as an example.
Scale to a positive range first, otherwise the multiply/divide features misbehave (products are not monotonic around zero, and division can blow up toward infinity):
data[column1] = MinMaxScaler(feature_range=(1, 2)).fit_transform(data[[column1]])
data[column2] = MinMaxScaler(feature_range=(1, 2)).fit_transform(data[[column2]])
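
The four derived columns scored below are presumably built like this (a sketch; the column naming follows the logs):

prefix = column1 + '_' + column2            # e.g. 'V0_V1'
data[prefix + '_add'] = data[column1] + data[column2]
data[prefix + '_minus'] = data[column1] - data[column2]
data[prefix + '_multi'] = data[column1] * data[column2]
data[prefix + '_divise'] = data[column1] / data[column2]  # safe: both columns now live in [1, 2]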

After scaling, check correlations with the target:
FeatureTools.get_column_corr(data, target_column, column_list)
pearson spearman mine
target 1.000000 1.000000 1.000000
V0_V1_multi 0.897923 0.875195 0.615303
V0_V1_add 0.893173 0.875144 0.610832
V0_V1_divise 0.175642 0.193329 0.162765
V0_V1_minus 0.148422 0.157893 0.152119

FeatureTools.get_score_by_models(data, target_column, feature_lists=[[column1, column2], ['V0_V1_multi'],[column1, column2,'V0_V1_multi']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V0', 'V1'] DecisionTreeRegressor -0.356096 0.032305
1 ['V0', 'V1'] RandomForestRegressor -0.233660 0.034715
2 ['V0', 'V1'] XGBRegressor -0.190067 0.034968
3 ['V0', 'V1'] SVR -0.185141 0.034713
4 ['V0', 'V1'] LinearRegression -0.200945 0.036035
5 ['V0_V1_multi'] DecisionTreeRegressor -0.374331 0.027902
6 ['V0_V1_multi'] RandomForestRegressor -0.291238 0.037976
7 ['V0_V1_multi'] XGBRegressor -0.191586 0.038359
8 ['V0_V1_multi'] SVR -0.187534 0.037808
9 ['V0_V1_multi'] LinearRegression -0.190452 0.036744
10 ['V0', 'V1', 'V0_V1_multi'] DecisionTreeRegressor -0.369322 0.031049
11 ['V0', 'V1', 'V0_V1_multi'] RandomForestRegressor -0.221279 0.032529
12 ['V0', 'V1', 'V0_V1_multi'] XGBRegressor -0.188337 0.033998
13 ['V0', 'V1', 'V0_V1_multi'] SVR -0.183552 0.035001
14 ['V0', 'V1', 'V0_V1_multi'] LinearRegression -0.189452 0.035923

Conclusion: V0_V1_multi works well as a new feature.


FeatureTools.get_score_by_models(data, target_column, feature_lists=[[column1, column2], ['V0_V1_multi','V0_V1_add'],[column1, column2,'V0_V1_multi','V0_V1_add']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V0', 'V1'] DecisionTreeRegressor -0.354313 0.037623
1 ['V0', 'V1'] RandomForestRegressor -0.229222 0.030077
2 ['V0', 'V1'] XGBRegressor -0.190067 0.034968
3 ['V0', 'V1'] SVR -0.185141 0.034713
4 ['V0', 'V1'] LinearRegression -0.200945 0.036035
5 ['V0_V1_multi', 'V0_V1_add'] DecisionTreeRegressor -0.367472 0.022726
6 ['V0_V1_multi', 'V0_V1_add'] RandomForestRegressor -0.267526 0.031264
7 ['V0_V1_multi', 'V0_V1_add'] XGBRegressor -0.191315 0.036703
8 ['V0_V1_multi', 'V0_V1_add'] SVR -0.187214 0.037378
9 ['V0_V1_multi', 'V0_V1_add'] LinearRegression -0.190216 0.036958
10 ['V0', 'V1', 'V0_V1_multi', 'V0_V1_add'] DecisionTreeRegressor -0.362111 0.042525
11 ['V0', 'V1', 'V0_V1_multi', 'V0_V1_add'] RandomForestRegressor -0.232095 0.033035
12 ['V0', 'V1', 'V0_V1_multi', 'V0_V1_add'] XGBRegressor -0.187517 0.034220
13 ['V0', 'V1', 'V0_V1_multi', 'V0_V1_add'] SVR -0.184084 0.035393
14 ['V0', 'V1', 'V0_V1_multi', 'V0_V1_add'] LinearRegression -0.189454 0.035921

Conclusion: V0_V1_add is not needed; it adds little over multi.

[V1,V8]
pearson spearman mine
target 1.000000 1.000000 1.000000
V1_V8_add 0.880251 0.848516 0.562550
V1_V8_multi 0.883604 0.848621 0.559336
V1_V8_divise 0.099952 -0.071685 0.184833
V1_V8_minus 0.139722 -0.041255 0.155289

INFO:root:score_df:
name model mean std
0 ['V0', 'V1'] DecisionTreeRegressor -0.356593 0.034993
1 ['V0', 'V1'] RandomForestRegressor -0.229738 0.036497
2 ['V0', 'V1'] XGBRegressor -0.190067 0.034968
3 ['V0', 'V1'] SVR -0.185141 0.034713
4 ['V0', 'V1'] LinearRegression -0.200945 0.036035
5 ['V0_V1_multi', 'V0_V1_add'] DecisionTreeRegressor -0.386711 0.028323
6 ['V0_V1_multi', 'V0_V1_add'] RandomForestRegressor -0.267109 0.031418
7 ['V0_V1_multi', 'V0_V1_add'] XGBRegressor -0.191315 0.036703
8 ['V0_V1_multi', 'V0_V1_add'] SVR -0.187214 0.037378
9 ['V0_V1_multi', 'V0_V1_add'] LinearRegression -0.190216 0.036958
10 ['V0', 'V1', 'V0_V1_multi', 'V0_V1_add'] DecisionTreeRegressor -0.365555 0.041061
11 ['V0', 'V1', 'V0_V1_multi', 'V0_V1_add'] RandomForestRegressor -0.230082 0.040307
12 ['V0', 'V1', 'V0_V1_multi', 'V0_V1_add'] XGBRegressor -0.187517 0.034220
13 ['V0', 'V1', 'V0_V1_multi', 'V0_V1_add'] SVR -0.184084 0.035393
14 ['V0', 'V1', 'V0_V1_multi', 'V0_V1_add'] LinearRegression -0.189454 0.035921

Conclusion: little improvement.

[V8,V27]
pearson spearman mine
target 1.000000 1.000000 1.000000
V8_V27_add 0.843348 0.805118 0.502860
V8_V27_multi 0.843656 0.805145 0.499674
V8_V27_divise 0.004108 0.053091 0.155982
V8_V27_minus 0.038498 0.064156 0.130330

INFO:root:score_df:
name model mean std
0 ['V8', 'V27'] DecisionTreeRegressor -0.542286 0.069032
1 ['V8', 'V27'] RandomForestRegressor -0.355172 0.048595
2 ['V8', 'V27'] XGBRegressor -0.280696 0.047153
3 ['V8', 'V27'] SVR -0.280820 0.049616
4 ['V8', 'V27'] LinearRegression -0.283327 0.043570
5 ['V8_V27_add'] DecisionTreeRegressor -0.567947 0.052520
6 ['V8_V27_add'] RandomForestRegressor -0.417035 0.042116
7 ['V8_V27_add'] XGBRegressor -0.291230 0.048234
8 ['V8_V27_add'] SVR -0.283485 0.052332
9 ['V8_V27_add'] LinearRegression -0.283745 0.045758
10 ['V8_V27_multi'] DecisionTreeRegressor -0.586547 0.035946
11 ['V8_V27_multi'] RandomForestRegressor -0.438554 0.036800
12 ['V8_V27_multi'] XGBRegressor -0.291766 0.047530
13 ['V8_V27_multi'] SVR -0.283685 0.051521
14 ['V8_V27_multi'] LinearRegression -0.282695 0.047602
15 ['V8', 'V27', 'V8_V27_add'] DecisionTreeRegressor -0.558820 0.035910
16 ['V8', 'V27', 'V8_V27_add'] RandomForestRegressor -0.344266 0.043386
17 ['V8', 'V27', 'V8_V27_add'] XGBRegressor -0.280546 0.044378
18 ['V8', 'V27', 'V8_V27_add'] SVR -0.280706 0.051028
19 ['V8', 'V27', 'V8_V27_add'] LinearRegression -0.283315 0.043601

Conclusion: little improvement.

Why does the MIC improve while the models show no gain? Likely a data-distribution issue: a peculiar distribution inflates the MIC estimate, but the models fail to capture that distribution information.

10_feature_processing_05_variance_based_feature_selection

Variance threshold: 0.75
INFO:__main__:drop_columns:len(7)['V11' 'V17' 'V20' 'V21' 'V22' 'V27' 'V5']
feature_columns:len(31)['V0' 'V1' 'V10' 'V12' 'V13' 'V14' 'V15' 'V16' 'V18' 'V19' 'V2' 'V23'
'V24' 'V25' 'V26' 'V28' 'V29' 'V3' 'V30' 'V31' 'V32' 'V33' 'V34' 'V35'
'V36' 'V37' 'V4' 'V6' 'V7' 'V8' 'V9']
INFO:root:score_df:
name model mean std
0 ['V0' 'V1' 'V10' 'V12' 'V13' 'V14' 'V15' 'V16'... DecisionTreeRegressor -0.289822 0.051043
1 ['V0' 'V1' 'V10' 'V12' 'V13' 'V14' 'V15' 'V16'... RandomForestRegressor -0.158007 0.025817
2 ['V0' 'V1' 'V10' 'V12' 'V13' 'V14' 'V15' 'V16'... XGBRegressor -0.138493 0.025707
3 ['V0' 'V1' 'V10' 'V12' 'V13' 'V14' 'V15' 'V16'... SVR -0.167792 0.040511
4 ['V0' 'V1' 'V10' 'V12' 'V13' 'V14' 'V15' 'V16'... LinearRegression -0.134964 0.024082

Variance threshold: 0.80
Note: the 4 below means 4 more features dropped on top of the 7 dropped at 0.75, so 11 are dropped in total.
INFO:__main__:drop_columns:len(4)['V12' 'V26' 'V31' 'V4']
feature_columns:len(27)['V0' 'V1' 'V10' 'V13' 'V14' 'V15' 'V16' 'V18' 'V19' 'V2' 'V23' 'V24'
'V25' 'V28' 'V29' 'V3' 'V30' 'V32' 'V33' 'V34' 'V35' 'V36' 'V37' 'V6'
'V7' 'V8' 'V9']
INFO:root:score_df:
name model mean std
0 ['V0' 'V1' 'V10' 'V13' 'V14' 'V15' 'V16' 'V18'... DecisionTreeRegressor -0.313083 0.060275
1 ['V0' 'V1' 'V10' 'V13' 'V14' 'V15' 'V16' 'V18'... RandomForestRegressor -0.155927 0.025921
2 ['V0' 'V1' 'V10' 'V13' 'V14' 'V15' 'V16' 'V18'... XGBRegressor -0.139670 0.024843
3 ['V0' 'V1' 'V10' 'V13' 'V14' 'V15' 'V16' 'V18'... SVR -0.170734 0.039895
4 ['V0' 'V1' 'V10' 'V13' 'V14' 'V15' 'V16' 'V18'... LinearRegression -0.135090 0.023313

Variance threshold: 0.90
INFO:__main__:drop_columns:len(10)['V0' 'V1' 'V13' 'V2' 'V25' 'V28' 'V30' 'V32' 'V6' 'V8']
feature_columns:len(17)['V10' 'V14' 'V15' 'V16' 'V18' 'V19' 'V23' 'V24' 'V29' 'V3' 'V33' 'V34'
'V35' 'V36' 'V37' 'V7' 'V9']
INFO:root:score_df:
name model mean std
0 ['V10' 'V14' 'V15' 'V16' 'V18' 'V19' 'V23' 'V2... DecisionTreeRegressor -0.582852 0.080416
1 ['V10' 'V14' 'V15' 'V16' 'V18' 'V19' 'V23' 'V2... RandomForestRegressor -0.303207 0.044482
2 ['V10' 'V14' 'V15' 'V16' 'V18' 'V19' 'V23' 'V2... XGBRegressor -0.253222 0.040596
3 ['V10' 'V14' 'V15' 'V16' 'V18' 'V19' 'V23' 'V2... SVR -0.287795 0.042762
4 ['V10' 'V14' 'V15' 'V16' 'V18' 'V19' 'V23' 'V2... LinearRegression -0.295447 0.048600


Variance threshold: 0.70
INFO:__main__:drop_columns:len(6)['V17' 'V20' 'V21' 'V22' 'V27' 'V5']
feature_columns:len(32)['V0' 'V1' 'V10' 'V11' 'V12' 'V13' 'V14' 'V15' 'V16' 'V18' 'V19' 'V2'
'V23' 'V24' 'V25' 'V26' 'V28' 'V29' 'V3' 'V30' 'V31' 'V32' 'V33' 'V34'
'V35' 'V36' 'V37' 'V4' 'V6' 'V7' 'V8' 'V9']
INFO:root:score_df:
name model mean std
0 ['V0' 'V1' 'V10' 'V11' 'V12' 'V13' 'V14' 'V15'... DecisionTreeRegressor -0.289383 0.056952
1 ['V0' 'V1' 'V10' 'V11' 'V12' 'V13' 'V14' 'V15'... RandomForestRegressor -0.157342 0.023655
2 ['V0' 'V1' 'V10' 'V11' 'V12' 'V13' 'V14' 'V15'... XGBRegressor -0.139033 0.026136
3 ['V0' 'V1' 'V10' 'V11' 'V12' 'V13' 'V14' 'V15'... SVR -0.165809 0.039976
4 ['V0' 'V1' 'V10' 'V11' 'V12' 'V13' 'V14' 'V15'... LinearRegression -0.136147 0.023294
Compared with the 0.75 threshold the selection differs by one feature (V11 is kept here); the scores are essentially on par, slightly better in places and slightly worse in others.

Conclusion: adopt the 0.70 variance threshold.
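
A minimal sketch of this kind of filtering with sklearn's VarianceThreshold, assuming feature_columns holds the candidate columns (illustrative, not the project's exact code):

from sklearn.feature_selection import VarianceThreshold

# keep only columns whose variance is at least 0.70
vt = VarianceThreshold(threshold=0.70)
vt.fit(train_data[feature_columns])
kept = [c for c, keep in zip(feature_columns, vt.get_support()) if keep]
print('kept %d of %d features' % (len(kept), len(feature_columns)))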

11_Feature Processing 06_RFECV Feature Selection

INFO:root:model_name:XGBRegressor
INFO:__main__:selected feature count : 19
INFO:__main__:feature ranking : [1, 1, 1, 1, 1, 4, 1, 1, 2, 1, 1, 19, 1, 20, 1, 1, 11, 13, 9, 1, 7, 5, 1, 16, 6, 1, 15, 1, 14, 3, 8, 18, 10, 17, 1, 12, 1, 1]
INFO:root:model_name:RandomForestRegressor
INFO:__main__:selected feature count : 23
INFO:__main__:feature ranking : [1, 1, 1, 1, 5, 11, 1, 1, 1, 15, 1, 1, 1, 7, 1, 1, 2, 12, 10, 1, 1, 1, 13, 1, 4, 1, 1, 1, 3, 9, 6, 1, 1, 8, 14, 16, 1, 1]
INFO:root:model_name:LinearSVR
INFO:__main__:selected feature count : 36
INFO:__main__:feature ranking : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1]
INFO:root:model_name:LinearRegression
INFO:__main__:selected feature count : 33
INFO:__main__:feature ranking : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 4, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 5, 1, 1, 1]
cv_score_df
LinearRegression LinearSVR RandomForestRegressor XGBRegressor
XGBRegressor -0.124106 -0.123530 -0.161114 -0.135132
RandomForestRegressor -0.127921 -0.126541 -0.160051 -0.138591
LinearSVR -0.119212 -0.119961 -0.155650 -0.137180
LinearRegression -0.118657 -0.119523 -0.159599 -0.138608

Both the LinearSVR and LinearRegression selections score well; following the keep-more principle, adopt the LinearSVR selection, which drops only 2 of the 38 features (ranks 2 and 3 above).


Combining this with Feature Processing 03, then running RFECV selection:
['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9', 'V11', 'V12',
'V13', 'V14', 'V17', 'V18', 'V19', 'V20', 'V21', 'V23', 'V24', 'V25',
'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V33', 'V35', 'V37', 'V6_pca',
'V7_pca', 'V10_pca', 'V36_pca']
INFO:root:model_name:XGBRegressor
INFO:__main__:selected feature count : 27
INFO:__main__:feature ranking : [1, 1, 1, 1, 1, 1, 1, 1, 6, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 7, 4, 3, 1, 1, 1, 1, 1]
INFO:root:model_name:RandomForestRegressor
INFO:__main__:selected feature count : 14
INFO:__main__:feature ranking : [1, 1, 1, 1, 9, 17, 1, 18, 11, 3, 14, 1, 19, 2, 1, 4, 8, 10, 7, 1, 13, 1, 15, 12, 5, 1, 16, 20, 1, 1, 6, 1, 1]
INFO:root:model_name:LinearSVR
INFO:__main__:selected feature count : 23
INFO:__main__:feature ranking : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 5, 2, 11, 1, 1, 4, 1, 7, 1, 8, 1, 10, 9, 1, 6, 1, 1, 1]
INFO:root:model_name:LinearRegression
INFO:__main__:selected feature count : 23
INFO:__main__:feature ranking : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 11, 1, 1, 1, 1, 7, 10, 5, 1, 1, 2, 1, 3, 1, 8, 1, 6, 4, 1, 9, 1, 1, 1]
cv_score_df
LinearRegression LinearSVR RandomForestRegressor XGBRegressor
XGBRegressor -0.117958 -0.119107 -0.155186 -0.133451
RandomForestRegressor -0.125237 -0.126292 -0.156786 -0.140854
LinearSVR -0.115788 -0.116445 -0.154647 -0.136111
LinearRegression -0.115510 -0.117012 -0.156685 -0.134615

Conclusion: a slight improvement.
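
A minimal sketch of the RFECV step, assuming a LinearSVR estimator as adopted above (ranking_ corresponds to the "feature ranking" lists in the logs; rank 1 means kept):

from sklearn.feature_selection import RFECV
from sklearn.svm import LinearSVR

# recursively eliminate features, choosing the feature count by cross-validated score
rfecv = RFECV(estimator=LinearSVR(), step=1, cv=5, scoring='neg_mean_squared_error')
rfecv.fit(train_data[feature_columns], train_data[target_column])
print('selected feature count:', rfecv.n_features_)
print('feature ranking:', list(rfecv.ranking_))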

12_Feature Processing 07_Log-Normalizing Features (trying other models)

Trying other models (algorithms):
FeatureTools.get_score_by_models(train_data, target_column, feature_lists=[['V0', 'V1', 'V8', 'V27', 'V31']],
models=[BayesianRidge(compute_score=True),
GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10,
alpha=0.1),
ElasticNetCV(alphas=[0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10],
l1_ratio=[.01, .1, .5, .9, .99], max_iter=5000)])

name model mean std
0 ['V0', 'V1', 'V8', 'V27', 'V31'] BayesianRidge -0.163311 0.030762
1 ['V0', 'V1', 'V8', 'V27', 'V31'] GaussianProcessRegressor -0.164119 0.034371
2 ['V0', 'V1', 'V8', 'V27', 'V31'] ElasticNetCV -0.163244 0.030637

By inspection, V1 and V31 look left-right symmetric and can be transformed toward a normal distribution;
target, V0, V8 and V27 are not symmetric.
INFO:root:score_df:
name model mean std
0 ['V1', 'V31'] DecisionTreeRegressor -0.427769 0.043110
1 ['V1', 'V31'] RandomForestRegressor -0.274793 0.047642
2 ['V1', 'V31'] XGBRegressor -0.214375 0.040099
3 ['V1', 'V31'] SVR -0.211947 0.043653
4 ['V1', 'V31'] LinearRegression -0.221655 0.035591
5 ['V1_exp', 'V31_exp'] DecisionTreeRegressor -0.424883 0.044053
6 ['V1_exp', 'V31_exp'] RandomForestRegressor -0.276417 0.046985
7 ['V1_exp', 'V31_exp'] XGBRegressor -0.214300 0.040064
8 ['V1_exp', 'V31_exp'] SVR -0.215313 0.044065
9 ['V1_exp', 'V31_exp'] LinearRegression -0.232610 0.042458
The scores get worse. Since a maxmin scaling was applied beforehand, could the maxmin step itself be what degrades the results?

Adding a test of maxmin alone:
INFO:root:score_df:
name model mean std
0 ['V1', 'V31'] DecisionTreeRegressor -0.428160 0.041686
1 ['V1', 'V31'] RandomForestRegressor -0.268419 0.050182
2 ['V1', 'V31'] XGBRegressor -0.214375 0.040099
3 ['V1', 'V31'] SVR -0.211947 0.043653
4 ['V1', 'V31'] LinearRegression -0.221655 0.035591
5 ['V1_exp', 'V31_exp'] DecisionTreeRegressor -0.420133 0.046145
6 ['V1_exp', 'V31_exp'] RandomForestRegressor -0.277211 0.047870
7 ['V1_exp', 'V31_exp'] XGBRegressor -0.214300 0.040064
8 ['V1_exp', 'V31_exp'] SVR -0.215313 0.044065
9 ['V1_exp', 'V31_exp'] LinearRegression -0.232610 0.042458
10 ['V1_maxmin', 'V31_maxmin'] DecisionTreeRegressor -0.425427 0.039514
11 ['V1_maxmin', 'V31_maxmin'] RandomForestRegressor -0.274118 0.045047
12 ['V1_maxmin', 'V31_maxmin'] XGBRegressor -0.214416 0.040078
13 ['V1_maxmin', 'V31_maxmin'] SVR -0.213334 0.036041
14 ['V1_maxmin', 'V31_maxmin'] LinearRegression -0.221655 0.035591
Conclusion: maxmin has essentially no effect on any model; the exp transform is indeed what degrades the results.

Could it be that target itself is non-normal, which makes the transformed V1/V31 perform worse?

Aside: noticed by chance that a sqrt transform of target looks similar to V1, so we might as well move V1's distribution toward target (hence the squaring below):
train_data['V1_square']=np.square(MinMaxScaler(feature_range=(0, 1)).fit_transform(train_data[['V1']])[:, 0])
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V1'],['V1_square']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V1'] DecisionTreeRegressor -0.400508 0.042364
1 ['V1'] RandomForestRegressor -0.333950 0.037930
2 ['V1'] XGBRegressor -0.230583 0.043138
3 ['V1'] SVR -0.226020 0.043922
4 ['V1'] LinearRegression -0.238118 0.038222
5 ['V1_square'] DecisionTreeRegressor -0.401792 0.039568
6 ['V1_square'] RandomForestRegressor -0.334999 0.036464
7 ['V1_square'] XGBRegressor -0.230584 0.042827
8 ['V1_square'] SVR -0.222740 0.041922
9 ['V1_square'] LinearRegression -0.226618 0.042718

train_data['V31_square']=np.square(MinMaxScaler(feature_range=(0, 1)).fit_transform(train_data[['V31']])[:, 0])
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V1','V31'],['V1_square','V31_square']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 ['V1', 'V31'] DecisionTreeRegressor -0.420399 0.047095
1 ['V1', 'V31'] RandomForestRegressor -0.271898 0.044306
2 ['V1', 'V31'] XGBRegressor -0.214375 0.040099
3 ['V1', 'V31'] SVR -0.211947 0.043653
4 ['V1', 'V31'] LinearRegression -0.221655 0.035591
5 ['V1_square', 'V31_square'] DecisionTreeRegressor -0.425474 0.040900
6 ['V1_square', 'V31_square'] RandomForestRegressor -0.277752 0.046195
7 ['V1_square', 'V31_square'] XGBRegressor -0.214401 0.040019
8 ['V1_square', 'V31_square'] SVR -0.212134 0.037723
9 ['V1_square', 'V31_square'] LinearRegression -0.218316 0.039889
Conclusion: a tiny improvement, negligible.


Using the pseudo-normal distributions obtained from V1 and V31 (V1_exp, V31_exp) together with a transformed target:
train_data['target_log']=np.exp(MinMaxScaler(feature_range=(0, 1)).fit_transform(train_data[[target_column]])[:, 0])
(note: the column is named target_log, but the code applies an exp transform)
Scoring the features:
INFO:root:score_df:
name model mean std
0 ['V1', 'V31'] DecisionTreeRegressor -0.042817 0.002237
1 ['V1', 'V31'] RandomForestRegressor -0.027296 0.004342
2 ['V1', 'V31'] XGBRegressor -0.021184 0.003640
3 ['V1', 'V31'] SVR -0.020691 0.003894
4 ['V1', 'V31'] LinearRegression -0.023324 0.003075
5 ['V1_exp', 'V31_exp'] DecisionTreeRegressor -0.043045 0.002020
6 ['V1_exp', 'V31_exp'] RandomForestRegressor -0.026970 0.004456
7 ['V1_exp', 'V31_exp'] XGBRegressor -0.021171 0.003624
8 ['V1_exp', 'V31_exp'] SVR -0.021025 0.003969
9 ['V1_exp', 'V31_exp'] LinearRegression -0.021113 0.003563
Conclusion: essentially no impact overall; LR improves slightly, SVR gets slightly worse.
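
Since this section judges normality by eye, a quick numeric check can back it up; a sketch using scipy's skewness and normality test (column names from above):

from scipy import stats

# skewness near 0 and a large normaltest p-value both point toward normality
for col in ['V1', 'V31', 'V1_exp', 'V31_exp']:
    x = train_data[col].values
    stat, p = stats.normaltest(x)
    print('%s skew=%.3f normaltest p=%.3g' % (col, stats.skew(x), p))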

13_Feature Processing 08_MaxMin and Polynomial Regression

Tested with PCA and L1 feature dropping already applied (to reduce noise from useless features and to limit the feature blow-up of the polynomial expansion).
1. Before the polynomial expansion, first maxmin-scale to (0,1), to avoid the awkward kinks that products of mixed-sign values introduce.
tmp_df = FeatureTools.get_score_by_models(train_data_tmp, target_column_int, feature_lists=[feature_columns_int],
models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... DecisionTreeRegressor -0.295940 0.043340
1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... RandomForestRegressor -0.141260 0.025087
2 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... XGBRegressor -0.128168 0.027043
3 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... SVR -0.127694 0.022072
4 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... LinearRegression -0.348927 0.122491

2. Scale to (1,2) instead, to avoid the loss of monotonicity that multiplication causes (see the sketch after the table below).
INFO:root:score_df:
name model mean std
0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... DecisionTreeRegressor -0.290899 0.048220
1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... RandomForestRegressor -0.146798 0.024461
2 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... XGBRegressor -0.127410 0.024972
3 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... SVR -0.117449 0.024547
4 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... LinearRegression -0.348927 0.122491
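
A minimal sketch of the scale-then-expand idea from item 2, assuming sklearn's MinMaxScaler and PolynomialFeatures (illustrative, not the project's exact code):

from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# scale into (1, 2) first: all values share one sign, so pairwise products stay monotone
X = MinMaxScaler(feature_range=(1, 2)).fit_transform(train_data[feature_columns])
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X.shape, '->', X_poly.shape)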


3. Filter the expanded features with L1 and L2 selection, to avoid feature explosion.
remain feature_columns:len(41)[ 34 35 45 67 75 99 105 108 118 121 134 137 147 148 151 170 182 207
212 230 238 243 244 250 274 307 345 373 405 408 447 480 487 498 505 527
529 542 563 580 589]
after l1:
name model mean std
0 [ 34 35 45 67 75 99 105 108 118 121 134 1... DecisionTreeRegressor -0.293998 0.031499
1 [ 34 35 45 67 75 99 105 108 118 121 134 1... RandomForestRegressor -0.146368 0.029795
2 [ 34 35 45 67 75 99 105 108 118 121 134 1... XGBRegressor -0.131009 0.028920
3 [ 34 35 45 67 75 99 105 108 118 121 134 1... SVR -0.114042 0.027338
4 [ 34 35 45 67 75 99 105 108 118 121 134 1... LinearRegression -0.117496 0.025198

INFO:__main__:drop_columns(l2):len(1)[28]
remain feature_columns:len(40)[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 29 30 31 32 33 34 35 36 37 38 39 40]

after l2:
name model mean std
0 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ... DecisionTreeRegressor -0.289977 0.035529
1 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ... RandomForestRegressor -0.150889 0.028751
2 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ... XGBRegressor -0.133111 0.027415
3 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ... SVR -0.120787 0.024142
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ... LinearRegression -0.116718 0.024732

Conclusion: scores degrade after L2, so skip L2 and run L1 only.
The final result is the L1 result.
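
A sketch of what such an L1-based filter can look like, assuming SelectFromModel over a lasso model and X_poly from the earlier sketch (the project's own L1 filter may differ):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# features whose lasso coefficients shrink to (near) zero are dropped
selector = SelectFromModel(LassoCV(cv=5), threshold=1e-5)
selector.fit(X_poly, train_data[target_column])
print('kept %d of %d features' % (selector.get_support().sum(), X_poly.shape[1]))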


Supplement:
Effect of maxmin scaling alone.
maxmin(1,2)
INFO:root:score_df:
name model mean std
0 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... DecisionTreeRegressor -0.304760 0.029997
1 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... RandomForestRegressor -0.157866 0.030015
2 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... XGBRegressor -0.133211 0.026132
3 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... SVR -0.124498 0.021788
4 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... LinearRegression -0.117350 0.024908

maxmin(0,1)
INFO:root:score_df:
name model mean std
0 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... DecisionTreeRegressor -0.297987 0.040978
1 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... RandomForestRegressor -0.153990 0.028065
2 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... XGBRegressor -0.133215 0.026177
3 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... SVR -0.124498 0.021788
4 ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9... LinearRegression -0.117350 0.024908

14_Feature Processing 09_Head/Tail Outlier Cleanup Combined with PCA/L1/MaxMin

This step runs first; the PCA and L1 filtering run afterwards.
score_df =FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0', 'V1', 'V8', 'V27', 'V31','V2', 'V12', 'V37', 'V16','V3']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
print(score_df)
INFO:root:score_df:
name model mean std
0 ['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12', ... DecisionTreeRegressor -0.289841 0.030571
1 ['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12', ... RandomForestRegressor -0.162636 0.029171
2 ['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12', ... XGBRegressor -0.146362 0.028829
3 ['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12', ... SVR -0.139642 0.032062
4 ['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12', ... LinearRegression -0.134779 0.027630




1. Clip head/tail points directly to the threshold points.
score_df =FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0_factor_trip', 'V1_factor_trip', 'V8_factor_trip', 'V27_factor_trip', 'V31_factor_trip','V2_factor_trip', 'V12_factor_trip', 'V37_factor_trip', 'V16_factor_trip','V3_factor_trip']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
print(score_df)
INFO:root:score_df:
name model mean std
0 ['V0_factor_trip', 'V1_factor_trip', 'V8_facto... DecisionTreeRegressor -0.291960 0.029490
1 ['V0_factor_trip', 'V1_factor_trip', 'V8_facto... RandomForestRegressor -0.155464 0.025455
2 ['V0_factor_trip', 'V1_factor_trip', 'V8_facto... XGBRegressor -0.147303 0.028138
3 ['V0_factor_trip', 'V1_factor_trip', 'V8_facto... SVR -0.139020 0.032732
4 ['V0_factor_trip', 'V1_factor_trip', 'V8_facto... LinearRegression -0.131265 0.028645

Conclusion: the transformed features have a slight edge.
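
For step 1, the clipping itself can be as simple as the following sketch (min_v/max_v stand for the per-column threshold points and are illustrative names):

# squeeze values beyond the low/high thresholds back onto the thresholds
train_data['V0_factor_trip'] = train_data['V0'].clip(lower=min_v, upper=max_v)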


2. Add new columns holding each head/tail point's distance to the threshold point (0 for in-range points).
# reverse_maxmin is a project helper mapping box boundaries back to the original scale
min_v, max_v = reverse_maxmin([min_box, max_box], 20, columns_info['max'][column_org],
                              columns_info['min'][column_org])
# distance below the lower threshold (0 if inside the range)
train_data['%s_factor_trip_split_0' % column_org] = train_data[column_org].apply(
    lambda x: (min_v - x) if (min_v - x) > 0 else 0)
# distance above the upper threshold (0 if inside the range)
train_data['%s_factor_trip_split_1' % column_org] = train_data[column_org].apply(
    lambda x: (x - max_v) if (x - max_v) > 0 else 0)
models = [DecisionTreeRegressor(), RandomForestRegressor(), XGBRegressor(), SVR(), LinearRegression()]
feature_lists=[org_feature,org_feature+add_feature]
FeatureTools.get_score_by_models(train_data, target_column,
feature_lists=feature_lists,
models=models)
INFO:root:score_df:
name model mean std
0 10['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12'... DecisionTreeRegressor -0.291840 0.029416
1 10['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12'... RandomForestRegressor -0.165230 0.029926
2 10['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12'... XGBRegressor -0.146362 0.028829
3 10['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12'... SVR -0.139642 0.032062
4 10['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12'... LinearRegression -0.134779 0.027630
5 30['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12'... DecisionTreeRegressor -0.284726 0.020562
6 30['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12'... RandomForestRegressor -0.159105 0.027762
7 30['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12'... XGBRegressor -0.146362 0.028829
8 30['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12'... SVR -0.132639 0.032518
9 30['V0', 'V1', 'V8', 'V27', 'V31', 'V2', 'V12'... LinearRegression -0.151084 0.028138

Conclusion: slight gains for DTR, RFR and SVR, but LR actually loses accuracy, which is counter-intuitive.

Aside: analyzing V0, V1, V8, V27, V31 one at a time
FeatureTools.get_score_by_models(train_data, target_column,
feature_lists=[['V31'],['V31','V31_factor_trip_split_0','V31_factor_trip_split_1']],
models=models)
Each case shows that LR improves on the processed data.

Finally:
FeatureTools.get_score_by_models(train_data, target_column,
feature_lists=[['V0','V1','V8','V27','V31'],['V0','V0_factor_trip_split_0','V0_factor_trip_split_1','V1','V1_factor_trip_split_0','V1_factor_trip_split_1','V8','V8_factor_trip_split_0','V8_factor_trip_split_1','V27','V27_factor_trip_split_0','V27_factor_trip_split_1','V31','V31_factor_trip_split_0','V31_factor_trip_split_1']],
models=models)
INFO:root:score_df:
name model mean std
0 5['V0', 'V1', 'V8', 'V27', 'V31'] DecisionTreeRegressor -0.332029 0.042951
1 5['V0', 'V1', 'V8', 'V27', 'V31'] RandomForestRegressor -0.191806 0.033341
2 5['V0', 'V1', 'V8', 'V27', 'V31'] XGBRegressor -0.169513 0.031840
3 5['V0', 'V1', 'V8', 'V27', 'V31'] SVR -0.164338 0.033965
4 5['V0', 'V1', 'V8', 'V27', 'V31'] LinearRegression -0.163307 0.030822
5 15['V0', 'V0_factor_trip_split_0', 'V0_factor_... DecisionTreeRegressor -0.332988 0.036594
6 15['V0', 'V0_factor_trip_split_0', 'V0_factor_... RandomForestRegressor -0.199520 0.033707
7 15['V0', 'V0_factor_trip_split_0', 'V0_factor_... XGBRegressor -0.169513 0.031840
8 15['V0', 'V0_factor_trip_split_0', 'V0_factor_... SVR -0.163018 0.033176
9 15['V0', 'V0_factor_trip_split_0', 'V0_factor_... LinearRegression -0.161437 0.031390
So LR still improves on the processed data here;
the drop seen earlier when many features were combined is probably due to a few badly processed features, or to other disturbing points in the data interfering with LR.

Checking the before/after scores of the remaining features one by one:
FeatureTools.get_score_by_models(train_data, target_column,
feature_lists=[['V2'],['V2','V2_factor_trip_split_0','V2_factor_trip_split_1'],['V12'],['V12','V12_factor_trip_split_0','V12_factor_trip_split_1'],['V37'],['V37','V37_factor_trip_split_0','V37_factor_trip_split_1'],['V16'],['V16','V16_factor_trip_split_0','V16_factor_trip_split_1'],['V3'],['V3','V3_factor_trip_split_0','V3_factor_trip_split_1']],
models=models)
INFO:root:score_df:
name model mean std
0 1['V2'] DecisionTreeRegressor -1.056743 0.033718
1 1['V2'] RandomForestRegressor -0.854399 0.034813
2 1['V2'] XGBRegressor -0.596280 0.066494
3 1['V2'] SVR -0.579717 0.089939
4 1['V2'] LinearRegression -0.586650 0.116199
5 3['V2', 'V2_factor_trip_split_0', 'V2_factor_t... DecisionTreeRegressor -1.056743 0.033718
6 3['V2', 'V2_factor_trip_split_0', 'V2_factor_t... RandomForestRegressor -0.849940 0.043134
7 3['V2', 'V2_factor_trip_split_0', 'V2_factor_t... XGBRegressor -0.596280 0.066494
8 3['V2', 'V2_factor_trip_split_0', 'V2_factor_t... SVR -0.581749 0.087866
9 3['V2', 'V2_factor_trip_split_0', 'V2_factor_t... LinearRegression -0.597143 0.107218
10 1['V12'] DecisionTreeRegressor -1.134138 0.152630
11 1['V12'] RandomForestRegressor -0.931918 0.163116
12 1['V12'] XGBRegressor -0.662974 0.162456
13 1['V12'] SVR -0.681987 0.145719
14 1['V12'] LinearRegression -0.644462 0.156669
15 3['V12', 'V12_factor_trip_split_0', 'V12_facto... DecisionTreeRegressor -1.134138 0.152630
16 3['V12', 'V12_factor_trip_split_0', 'V12_facto... RandomForestRegressor -0.909475 0.134845
17 3['V12', 'V12_factor_trip_split_0', 'V12_facto... XGBRegressor -0.662974 0.162456
18 3['V12', 'V12_factor_trip_split_0', 'V12_facto... SVR -0.679097 0.143162
19 3['V12', 'V12_factor_trip_split_0', 'V12_facto... LinearRegression -0.647601 0.158662
20 1['V37'] DecisionTreeRegressor -1.000137 0.084243
21 1['V37'] RandomForestRegressor -0.816063 0.087382
22 1['V37'] XGBRegressor -0.600524 0.074321
23 1['V37'] SVR -0.598275 0.063253
24 1['V37'] LinearRegression -0.672050 0.081303
25 3['V37', 'V37_factor_trip_split_0', 'V37_facto... DecisionTreeRegressor -1.000137 0.084243
26 3['V37', 'V37_factor_trip_split_0', 'V37_facto... RandomForestRegressor -0.825340 0.082795
27 3['V37', 'V37_factor_trip_split_0', 'V37_facto... XGBRegressor -0.600524 0.074321
28 3['V37', 'V37_factor_trip_split_0', 'V37_facto... SVR -0.596480 0.064296
29 3['V37', 'V37_factor_trip_split_0', 'V37_facto... LinearRegression -0.668819 0.082960
30 1['V16'] DecisionTreeRegressor -1.301747 0.115904
31 1['V16'] RandomForestRegressor -1.028127 0.096118
32 1['V16'] XGBRegressor -0.711924 0.077875
33 1['V16'] SVR -0.706045 0.098666
34 1['V16'] LinearRegression -0.714668 0.096363
35 3['V16', 'V16_factor_trip_split_0', 'V16_facto... DecisionTreeRegressor -1.302018 0.115554
36 3['V16', 'V16_factor_trip_split_0', 'V16_facto... RandomForestRegressor -1.026950 0.093788
37 3['V16', 'V16_factor_trip_split_0', 'V16_facto... XGBRegressor -0.711924 0.077875
38 3['V16', 'V16_factor_trip_split_0', 'V16_facto... SVR -0.711096 0.104044
39 3['V16', 'V16_factor_trip_split_0', 'V16_facto... LinearRegression -0.763959 0.207139
40 1['V3'] DecisionTreeRegressor -1.348183 0.154870
41 1['V3'] RandomForestRegressor -1.118868 0.140378
42 1['V3'] XGBRegressor -0.782377 0.138656
43 1['V3'] SVR -0.794189 0.156294
44 1['V3'] LinearRegression -0.726371 0.124292
45 3['V3', 'V3_factor_trip_split_0', 'V3_factor_t... DecisionTreeRegressor -1.347154 0.156108
46 3['V3', 'V3_factor_trip_split_0', 'V3_factor_t... RandomForestRegressor -1.097910 0.127703
47 3['V3', 'V3_factor_trip_split_0', 'V3_factor_t... XGBRegressor -0.782377 0.138656
48 3['V3', 'V3_factor_trip_split_0', 'V3_factor_t... SVR -0.802975 0.152541
49 3['V3', 'V3_factor_trip_split_0', 'V3_factor_t... LinearRegression -0.757352 0.162558

Only V37 improves for LR after processing; all the others get worse, which explains why LR degraded earlier on the processed data.
Since this feature treatment will not be rolled out further, the reasons these features degrade are not investigated in depth.

3. Considering the combined effect of the PCA and L1 processing
feature_columns
['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', 'V9', 'V11', 'V12', 'V13', 'V14', 'V17', 'V18', 'V19', 'V20', 'V21', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V33', 'V35', 'V37', 'V0_factor_trip_split_0', 'V0_factor_trip_split_1', 'V1_factor_trip_split_0', 'V1_factor_trip_split_1', 'V8_factor_trip_split_0', 'V8_factor_trip_split_1', 'V27_factor_trip_split_0', 'V27_factor_trip_split_1', 'V31_factor_trip_split_0', 'V31_factor_trip_split_1', 'V6_pca', 'V7_pca', 'V10_pca', 'V36_pca']
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... DecisionTreeRegressor -0.294613 0.027293
1 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... RandomForestRegressor -0.158844 0.024083
2 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... XGBRegressor -0.133216 0.026157
3 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... SVR -0.149661 0.029361
4 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... LinearRegression -0.116203 0.025682


4. On top of step 3, try maxmin scaling right before the final prediction.
(0,1) scaling
train_data_tmp=train_data.copy()
train_data_tmp[feature_columns]=MinMaxScaler((0,1)).fit_transform(train_data_tmp[feature_columns].values)
FeatureTools.get_score_by_models(train_data_tmp,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... DecisionTreeRegressor -0.303448 0.022633
1 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... RandomForestRegressor -0.157743 0.021598
2 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... XGBRegressor -0.133215 0.026177
3 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... SVR -0.125909 0.022025
4 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... LinearRegression -0.116203 0.025682

(1,2) scaling
train_data_tmp=train_data.copy()
train_data_tmp[feature_columns]=MinMaxScaler((1,2)).fit_transform(train_data_tmp[feature_columns].values)
FeatureTools.get_score_by_models(train_data_tmp,target_column,feature_lists=[feature_columns],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... DecisionTreeRegressor -0.299531 0.031394
1 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... RandomForestRegressor -0.159906 0.023090
2 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... XGBRegressor -0.133211 0.026132
3 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... SVR -0.125909 0.022025
4 43['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... LinearRegression -0.116203 0.025682

Scaling gives SVR a small boost and barely affects the others; (0,1) and (1,2) scaling show no real difference either.

15_train_test_dist

(no text content survived in this section; presumably train/test distribution plots)

16_Feature Processing 10_Reference_Feature Normalization

1. Manual feature normalization
train_data['V0_exp']=np.exp(MinMaxScaler().fit_transform(train_data[['V0']])[:,0])
train_data['V1_exp']=np.exp(MinMaxScaler().fit_transform(train_data[['V1']])[:,0])
train_data['V6_exp']=np.exp(MinMaxScaler().fit_transform(train_data[['V6']])[:,0])
train_data['V30_exp']=np.log1p(MinMaxScaler().fit_transform(train_data[['V30']])[:,0])  # note: V30 actually uses log1p despite the _exp name

FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0','V1','V6','V30'],['V0_exp','V1_exp','V6_exp','V30_exp']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 4['V0', 'V1', 'V6', 'V30'] DecisionTreeRegressor -0.335311 0.044208
1 4['V0', 'V1', 'V6', 'V30'] RandomForestRegressor -0.205098 0.032117
2 4['V0', 'V1', 'V6', 'V30'] XGBRegressor -0.170536 0.029407
3 4['V0', 'V1', 'V6', 'V30'] SVR -0.170965 0.030396
4 4['V0', 'V1', 'V6', 'V30'] LinearRegression -0.177673 0.030028
5 4['V0_exp', 'V1_exp', 'V6_exp', 'V30_exp'] DecisionTreeRegressor -0.342167 0.039208
6 4['V0_exp', 'V1_exp', 'V6_exp', 'V30_exp'] RandomForestRegressor -0.202817 0.032076
7 4['V0_exp', 'V1_exp', 'V6_exp', 'V30_exp'] XGBRegressor -0.170561 0.029406
8 4['V0_exp', 'V1_exp', 'V6_exp', 'V30_exp'] SVR -0.164628 0.027604
9 4['V0_exp', 'V1_exp', 'V6_exp', 'V30_exp'] LinearRegression -0.165561 0.026854

Only LR and SVR improve, LR by the larger margin.

2. Automatic normalization: boxcox-normalize V0, V1, V6, V30
train_data['V0_boxcox'],_ =boxcox(MinMaxScaler((0,1)).fit_transform(train_data[['V0']])[:,0]+1)
train_data['V1_boxcox'],_ =boxcox(MinMaxScaler((0,1)).fit_transform(train_data[['V1']])[:,0]+1)
train_data['V6_boxcox'],_ =boxcox(MinMaxScaler((0,1)).fit_transform(train_data[['V6']])[:,0]+1)
train_data['V30_boxcox'],_ =boxcox(MinMaxScaler((0,1)).fit_transform(train_data[['V30']])[:,0]+1)

FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0','V1','V6','V30'],['V0_boxcox','V1_boxcox','V6_boxcox','V30_boxcox']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 4['V0', 'V1', 'V6', 'V30'] DecisionTreeRegressor -0.338376 0.044145
1 4['V0', 'V1', 'V6', 'V30'] RandomForestRegressor -0.203333 0.033735
2 4['V0', 'V1', 'V6', 'V30'] XGBRegressor -0.170536 0.029407
3 4['V0', 'V1', 'V6', 'V30'] SVR -0.170965 0.030396
4 4['V0', 'V1', 'V6', 'V30'] LinearRegression -0.177673 0.030028
5 4['V0_boxcox', 'V1_boxcox', 'V6_boxcox', 'V30_... DecisionTreeRegressor -0.339478 0.041485
6 4['V0_boxcox', 'V1_boxcox', 'V6_boxcox', 'V30_... RandomForestRegressor -0.196820 0.033951
7 4['V0_boxcox', 'V1_boxcox', 'V6_boxcox', 'V30_... XGBRegressor -0.170578 0.029350
8 4['V0_boxcox', 'V1_boxcox', 'V6_boxcox', 'V30_... SVR -0.178230 0.032629
9 4['V0_boxcox', 'V1_boxcox', 'V6_boxcox', 'V30_... LinearRegression -0.193524 0.028727
Overall slightly worse: SVR and LR degrade a little, the others are unchanged.

3. Normalize the target as well
train_data['target_boxcox'],_ =boxcox(MinMaxScaler((0,1)).fit_transform(train_data[['target']])[:,0]+1)
FeatureTools.get_score_by_models(train_data,'target_boxcox',feature_lists=[['V0','V1','V6','V30'],['V0_boxcox','V1_boxcox','V6_boxcox','V30_boxcox']],models=[DecisionTreeRegressor(),RandomForestRegressor(),XGBRegressor(),SVR(),LinearRegression()])
INFO:root:score_df:
name model mean std
0 4['V0', 'V1', 'V6', 'V30'] DecisionTreeRegressor -0.109808 0.007635
1 4['V0', 'V1', 'V6', 'V30'] RandomForestRegressor -0.066104 0.008844
2 4['V0', 'V1', 'V6', 'V30'] XGBRegressor -0.057001 0.009568
3 4['V0', 'V1', 'V6', 'V30'] SVR -0.056226 0.010883
4 4['V0', 'V1', 'V6', 'V30'] LinearRegression -0.072965 0.009936
5 4['V0_boxcox', 'V1_boxcox', 'V6_boxcox', 'V30_... DecisionTreeRegressor -0.112020 0.008949
6 4['V0_boxcox', 'V1_boxcox', 'V6_boxcox', 'V30_... RandomForestRegressor -0.066155 0.009630
7 4['V0_boxcox', 'V1_boxcox', 'V6_boxcox', 'V30_... XGBRegressor -0.057059 0.009598
8 4['V0_boxcox', 'V1_boxcox', 'V6_boxcox', 'V30_... SVR -0.060291 0.011917
9 4['V0_boxcox', 'V1_boxcox', 'V6_boxcox', 'V30_... LinearRegression -0.055599 0.008160
LR's error drops noticeably; the others are basically unchanged.

4. Normalize all features
This needs a custom mse scorer: a plain mse measures differences on the transformed scale, but we need the prediction error on the original scale, otherwise the true effect before and after the transform cannot be compared.
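
A sketch of what such a scorer could look like, assuming the target was transformed as boxcox(x + 1) with fitted lambda bc_lambda (the preceding maxmin step is omitted here, and the project's boxcox_to_mse_loss may differ in detail):

from scipy.special import inv_boxcox
from sklearn.metrics import mean_squared_error

def boxcox_to_mse_loss(y_true, y_pred, bc_lambda=None):
    # invert the boxcox(x + 1) transform first, so the mse is on the original scale
    if bc_lambda is not None:
        y_true = inv_boxcox(y_true, bc_lambda) - 1
        y_pred = inv_boxcox(y_pred, bc_lambda) - 1
    return mean_squared_error(y_true, y_pred)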
Verifying the function's correctness:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0']],scoring=make_scorer(boxcox_to_mse_loss,greater_is_better=False))
INFO:root:score_df:
name model mean std
0 1['V0'] LinearRegression -0.236525 0.050054
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0']])
INFO:root:score_df:
name model mean std
0 1['V0'] LinearRegression -0.236525 0.050054
Parameter-passing test:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0']],scoring=make_scorer(boxcox_to_mse_loss,greater_is_better=False,bc_lambda=1))
INFO:root:score_df:
name model mean std
0 1['V0'] LinearRegression -0.236525 0.050054

Testing the performance with all features normalized:
tmp_df1 = FeatureTools.get_score_by_models(train_data, '%s_maxmin' % target_column,
feature_lists=[maxmin_features, boxcox_features],
models=[DecisionTreeRegressor(), RandomForestRegressor(), XGBRegressor(),
SVR(), LinearRegression()])
print('ori target tmp_df1:\n%s' % tmp_df1)
# evaluate the original and new features against the new label (plain mse)
tmp_df1 = FeatureTools.get_score_by_models(train_data, '%s_boxcox' % target_column,
feature_lists=[maxmin_features, boxcox_features],
models=[DecisionTreeRegressor(), RandomForestRegressor(), XGBRegressor(),
SVR(), LinearRegression()])
print('boxcox target tmp_df1:\n%s' % tmp_df1)
# evaluate the new features against the new label (inverted, original-scale mse)
tmp_df1 = FeatureTools.get_score_by_models(train_data, '%s_boxcox' % target_column,
feature_lists=[maxmin_features, boxcox_features],
models=[DecisionTreeRegressor(), RandomForestRegressor(), XGBRegressor(),
SVR(), LinearRegression()],
scoring=make_scorer(boxcox_to_mse_loss, greater_is_better=False,
bc_lambda=bc_lambda))
print('boxcox target boxcox scoring tmp_df1:\n%s' % tmp_df1)
ori target tmp_df1:
name model mean std
0 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... DecisionTreeRegressor -7.414063e-06 5.186008e-06
1 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... RandomForestRegressor -1.315218e-05 1.208911e-05
2 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... XGBRegressor -6.840412e-06 2.218897e-06
3 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... SVR -2.368646e-03 5.793053e-04
4 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... LinearRegression -1.784238e-30 1.797323e-30
5 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... DecisionTreeRegressor -9.039408e-06 1.018367e-05
6 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... RandomForestRegressor -9.881853e-06 8.081113e-06
7 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... XGBRegressor -6.841682e-06 2.224286e-06
8 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... SVR -1.231152e-02 3.005906e-03
9 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... LinearRegression -1.697870e-05 3.533257e-06
The original features pair best with the original target.

boxcox target tmp_df1:
name model mean std
0 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... DecisionTreeRegressor -1.007518e-05 9.916235e-06
1 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... RandomForestRegressor -8.887330e-06 8.948251e-06
2 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... XGBRegressor -3.391289e-06 2.562888e-06
3 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... SVR -2.591207e-03 5.152387e-04
4 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... LinearRegression -9.038032e-06 1.769326e-06
5 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... DecisionTreeRegressor -7.412290e-06 1.041329e-05
6 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... RandomForestRegressor -7.967786e-06 6.474672e-06
7 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... XGBRegressor -3.402686e-06 2.576434e-06
8 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... SVR -8.836725e-03 2.132070e-03
9 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... LinearRegression -7.337456e-30 7.570509e-30
With the boxcox-transformed target, the transformed features mostly improve, LR gaining the most; SVR instead drops.

boxcox target boxcox scoring tmp_df1:
name model mean std
0 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... DecisionTreeRegressor -1.289235e-05 1.175630e-05
1 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... RandomForestRegressor -1.003189e-05 7.551921e-06
2 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... XGBRegressor -4.438732e-06 2.588488e-06
3 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... SVR -3.943498e-03 7.541397e-04
4 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... LinearRegression -1.330264e-05 2.005674e-06
5 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... DecisionTreeRegressor -1.590726e-05 1.184785e-05
6 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... RandomForestRegressor -1.124201e-05 9.010259e-06
7 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... XGBRegressor -4.451820e-06 2.602956e-06
8 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... SVR -1.320442e-02 3.049519e-03
9 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... LinearRegression -1.102203e-29 1.119338e-29
Again the boxcox target pairs best with the boxcox features; LR improves, SVR degrades.

Comparing the top half of "ori target tmp_df1" with the bottom half of "boxcox target boxcox scoring tmp_df1":

name model mean std
0 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... DecisionTreeRegressor -7.414063e-06 5.186008e-06
1 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... RandomForestRegressor -1.315218e-05 1.208911e-05
2 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... XGBRegressor -6.840412e-06 2.218897e-06
3 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... SVR -2.368646e-03 5.793053e-04
4 38['V0_maxmin', 'V1_maxmin', 'V2_maxmin', 'V3_... LinearRegression -1.784238e-30 1.797323e-30

5 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... DecisionTreeRegressor -1.590726e-05 1.184785e-05
6 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... RandomForestRegressor -1.124201e-05 9.010259e-06
7 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... XGBRegressor -4.451820e-06 2.602956e-06
8 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... SVR -1.320442e-02 3.049519e-03
9 38['V0_boxcox', 'V1_boxcox', 'V2_boxcox', 'V3_... LinearRegression -1.102203e-29 1.119338e-29
The error actually grows, i.e. boxcox does not improve the results.

18_Feature Processing 11_Reference_Dropping Differently-Distributed Features

Run after the PCA and L1 filtering.
Scores before dropping:
INFO:root:score_df:
name model mean std
0 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... DecisionTreeRegressor -0.293784 0.025977
1 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... RandomForestRegressor -0.159338 0.022503
2 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... XGBRegressor -0.133216 0.026157
3 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... SVR -0.164783 0.031384
4 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... LinearRegression -0.117350 0.024908

Scores after dropping:
Remaining columns: ['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', 'V13', 'V14', 'V18', 'V19', 'V20', 'V21', 'V23', 'V24', 'V25', 'V26', 'V27', 'V29', 'V30', 'V31', 'V33', 'V35', 'V37', 'V6_pca', 'V7_pca', 'V10_pca', 'V36_pca']
INFO:root:score_df:
name model mean std
0 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... DecisionTreeRegressor -0.305993 0.026047
1 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... RandomForestRegressor -0.158522 0.030375
2 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... XGBRegressor -0.134680 0.027181
3 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... SVR -0.164861 0.038353
4 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... LinearRegression -0.122900 0.024525

Conclusion: dropping has only a slight effect; LR is hit hardest, down about 0.005.
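
A sketch of how differently-distributed columns can be flagged, assuming a two-sample KS test between train and test sets (test_data and the 0.01 cutoff are illustrative assumptions):

from scipy.stats import ks_2samp

# columns whose train/test samples are unlikely to share one distribution
drop_candidates = [c for c in feature_columns
                   if ks_2samp(train_data[c], test_data[c]).pvalue < 0.01]
print('candidates to drop:', drop_candidates)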

19_Feature Processing 12_Reference_Outlier Removal

Before cleanup
INFO:root:score_df:
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.303074 0.054441
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.155416 0.026407
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.136554 0.025657
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.161049 0.033630
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.119611 0.023264

After cleanup:
INFO:root:score_df:
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.260773 0.019046
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.134456 0.018420
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.113453 0.017832
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.138301 0.025517
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.097389 0.016397

Outlier cleanup clearly brings a large improvement.

Testing how well different models screen the outliers.
INFO:__main__:train_data len:2888 index len:0
outlier model:DecisionTreeRegressor
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.298378 0.054090
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.162575 0.029197
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.136554 0.025657
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.161049 0.033630
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.119611 0.023264
INFO:__main__:train_data len:2888 index len:51
outlier model:RandomForestRegressor
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.271380 0.031380
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.134937 0.017557
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.115334 0.016620
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.141392 0.026413
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.102117 0.015856
INFO:__main__:train_data len:2888 index len:39
outlier model:XGBRegressor
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.251924 0.016581
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.132325 0.017074
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.110447 0.016339
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.140086 0.024203
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.098925 0.013474
INFO:__main__:train_data len:2888 index len:61
outlier model:SVR
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.254170 0.015434
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.126017 0.014498
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.107743 0.014618
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.128184 0.022065
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.093924 0.014470
INFO:__main__:train_data len:2888 index len:31
outlier model:LinearRegression
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.265263 0.018719
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.136412 0.016660
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.113453 0.017832
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.138301 0.025517
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.097389 0.016397
INFO:__main__:train_data len:2888 index len:31
outlier model:RidgeCV
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.266263 0.026159
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.138554 0.020745
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.113453 0.017832
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.138301 0.025517
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.097389 0.016397
INFO:__main__:train_data len:2888 index len:32
outlier model:LassoCV
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.269029 0.021648
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.136151 0.014470
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.111612 0.018509
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.137784 0.023715
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.097337 0.015386
INFO:__main__:train_data len:2888 index len:33
outlier model:LinearSVR
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.268280 0.026130
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.133375 0.016441
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.114092 0.018261
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.137797 0.023300
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.096242 0.016010

SVR-based screening works best, achieving the top scores under nearly every model.
But these numbers are all computed on the cleaned dataset, so they may not fully reflect the real training set.
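
A sketch of the residual-based screening idea, assuming a fit-predict-residual loop with a 3-sigma cutoff (the project's exact rule may differ):

import numpy as np
from sklearn.svm import SVR

# fit once, standardize the residuals, and drop rows with |z| > 3
model = SVR().fit(train_data[feature_columns], train_data[target_column])
resid = train_data[target_column] - model.predict(train_data[feature_columns])
z = (resid - resid.mean()) / resid.std()
train_data_clean = train_data[np.abs(z) <= 3]
print('dropped %d rows' % int((np.abs(z) > 3).sum()))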

20_Feature Processing 13_Combining Multiple Processing Steps

Base strategy 01:
1. First use the SVR filter to remove outlier rows.
INFO:__main__:train_data len:2888 index len:61
outlier model:SVR
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.254583 0.015420
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.126443 0.014312
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.107743 0.014618
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.128184 0.022065
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.093924 0.014470

2. PCA filtering
name model mean std
0 37['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... DecisionTreeRegressor -0.246533 0.019663
1 37['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... RandomForestRegressor -0.129520 0.014540
2 37['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... XGBRegressor -0.106539 0.016954
3 37['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... SVR -0.131678 0.021323
4 37['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... LinearRegression -0.093725 0.015912

3. Drop 4 features
drop 4 feature:
name model mean std
0 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... DecisionTreeRegressor -0.240781 0.026782
1 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... RandomForestRegressor -0.128367 0.011586
2 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... XGBRegressor -0.105745 0.018051
3 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... SVR -0.131549 0.020833
4 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... LinearRegression -0.092577 0.015777

4. Drop the 6 differently-distributed features (one was already dropped earlier, so only 5 are actually removed)
name model mean std
0 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... DecisionTreeRegressor -0.232744 0.023978
1 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... RandomForestRegressor -0.124071 0.015044
2 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... XGBRegressor -0.106402 0.017246
3 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... SVR -0.133891 0.027492
4 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... LinearRegression -0.098457 0.016716


Base strategy 01_variant 01
Move the feature dropping of the base strategy to the very start: drop features first, then drop rows.
That is, reassemble the steps in the order 3, 4, 1, 2.
tmp_02
name model mean std
0 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... DecisionTreeRegressor -0.238354 0.012864
1 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... RandomForestRegressor -0.130878 0.017220
2 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... XGBRegressor -0.106126 0.016203
3 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... SVR -0.133249 0.021006
4 33['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V8', '... LinearRegression -0.093221 0.016074

Base strategy 01_variant 01_variant 01
Add maxmin(0,1) at the very beginning of the pipeline.
Final result:
name model mean std
0 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... DecisionTreeRegressor -0.240605 0.019270
1 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... RandomForestRegressor -0.126842 0.017003
2 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... XGBRegressor -0.109181 0.017691
3 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... SVR -0.106070 0.020124
4 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... LinearRegression -0.100764 0.017924


Base strategy 02
Re-running each step from scratch.
Initial
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.301752 0.046047
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.161876 0.024799
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.136554 0.025657
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.161049 0.033630
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.119611 0.023264

L1
INFO:__main__:drop_columns(l1):len(3)['V31', 'V15', 'V21']
remain feature_columns:len(35)['V36', 'V7', 'V8', 'V5', 'V37', 'V24', 'V29', 'V25', 'V35', 'V28', 'V32', 'V34', 'V13', 'V20', 'V18', 'V23', 'V22', 'V30', 'V19', 'V33', 'V26', 'V16', 'V11', 'V4', 'V9', 'V14', 'V17', 'V12', 'V6', 'V3', 'V2', 'V1', 'V10', 'V0', 'V27']
name model mean std
0 35['V36', 'V7', 'V8', 'V5', 'V37', 'V24', 'V29... DecisionTreeRegressor -0.314407 0.055367
1 35['V36', 'V7', 'V8', 'V5', 'V37', 'V24', 'V29... RandomForestRegressor -0.156533 0.022986
2 35['V36', 'V7', 'V8', 'V5', 'V37', 'V24', 'V29... XGBRegressor -0.140072 0.024400
3 35['V36', 'V7', 'V8', 'V5', 'V37', 'V24', 'V29... SVR -0.159603 0.034162
4 35['V36', 'V7', 'V8', 'V5', 'V37', 'V24', 'V29... LinearRegression -0.118851 0.023679

L2
INFO:__main__:drop_columns(l2):len(3)['V32', 'V13', 'V34']
remain feature_columns:len(32)['V36', 'V8', 'V7', 'V5', 'V37', 'V24', 'V29', 'V25', 'V35', 'V28', 'V20', 'V18', 'V22', 'V30', 'V23', 'V33', 'V19', 'V26', 'V9', 'V4', 'V16', 'V11', 'V14', 'V17', 'V12', 'V3', 'V6', 'V2', 'V1', 'V10', 'V0', 'V27']
name model mean std
0 32['V36', 'V8', 'V7', 'V5', 'V37', 'V24', 'V29... DecisionTreeRegressor -0.304581 0.048247
1 32['V36', 'V8', 'V7', 'V5', 'V37', 'V24', 'V29... RandomForestRegressor -0.163034 0.026763
2 32['V36', 'V8', 'V7', 'V5', 'V37', 'V24', 'V29... XGBRegressor -0.137111 0.022508
3 32['V36', 'V8', 'V7', 'V5', 'V37', 'V24', 'V29... SVR -0.157262 0.031500
4 32['V36', 'V8', 'V7', 'V5', 'V37', 'V24', 'V29... LinearRegression -0.118579 0.023579

PCA
name model mean std
0 31['V8', 'V5', 'V37', 'V24', 'V29', 'V25', 'V3... DecisionTreeRegressor -0.306925 0.035987
1 31['V8', 'V5', 'V37', 'V24', 'V29', 'V25', 'V3... RandomForestRegressor -0.155268 0.027546
2 31['V8', 'V5', 'V37', 'V24', 'V29', 'V25', 'V3... XGBRegressor -0.134823 0.024214
3 31['V8', 'V5', 'V37', 'V24', 'V29', 'V25', 'V3... SVR -0.161478 0.030688
4 31['V8', 'V5', 'V37', 'V24', 'V29', 'V25', 'V3... LinearRegression -0.117744 0.025338

SVR
name model mean std
0 31['V8', 'V5', 'V37', 'V24', 'V29', 'V25', 'V3... DecisionTreeRegressor -0.246123 0.012659
1 31['V8', 'V5', 'V37', 'V24', 'V29', 'V25', 'V3... RandomForestRegressor -0.126124 0.016573
2 31['V8', 'V5', 'V37', 'V24', 'V29', 'V25', 'V3... XGBRegressor -0.105483 0.016860
3 31['V8', 'V5', 'V37', 'V24', 'V29', 'V25', 'V3... SVR -0.130822 0.020550
4 31['V8', 'V5', 'V37', 'V24', 'V29', 'V25', 'V3... LinearRegression -0.093589 0.016626

Drop the 6 differently-distributed features
name model mean std
0 25['V8', 'V37', 'V24', 'V29', 'V25', 'V35', 'V... DecisionTreeRegressor -0.237193 0.008986
1 25['V8', 'V37', 'V24', 'V29', 'V25', 'V35', 'V... RandomForestRegressor -0.123084 0.015617
2 25['V8', 'V37', 'V24', 'V29', 'V25', 'V35', 'V... XGBRegressor -0.105656 0.016326
3 25['V8', 'V37', 'V24', 'V29', 'V25', 'V35', 'V... SVR -0.137090 0.029198
4 25['V8', 'V37', 'V24', 'V29', 'V25', 'V35', 'V... LinearRegression -0.099489 0.017874

Base strategy 02_variant 01

Add maxmin (0,1) scaling of all features before the initial step.
Scores after scaling:
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.293681 0.055959
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.153561 0.025228
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.136521 0.025700
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.129900 0.021798
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.119611 0.023264

SVR improves markedly.
L1
drop_columns(l1):len(8)['V20', 'V16', 'V13', 'V15', 'V31', 'V34', 'V32', 'V25']

name model mean std
0 30['V36', 'V7', 'V8', 'V37', 'V29', 'V24', 'V5... DecisionTreeRegressor -0.311610 0.059784
1 30['V36', 'V7', 'V8', 'V37', 'V29', 'V24', 'V5... RandomForestRegressor -0.160163 0.023681
2 30['V36', 'V7', 'V8', 'V37', 'V29', 'V24', 'V5... XGBRegressor -0.135760 0.025817
3 30['V36', 'V7', 'V8', 'V37', 'V29', 'V24', 'V5... SVR -0.125948 0.021718
4 30['V36', 'V7', 'V8', 'V37', 'V29', 'V24', 'V5... LinearRegression -0.118137 0.022044

L2: no change
drop_columns(l2):len(0)[]


IF01_branch01: V16 is gone, so PCA is disabled.


IF01_branch02: V16 is specially kept during L1, so PCA proceeds here.
name model mean std
0 29['V8', 'V37', 'V5', 'V29', 'V24', 'V35', 'V2... DecisionTreeRegressor -0.306813 0.038430
1 29['V8', 'V37', 'V5', 'V29', 'V24', 'V35', 'V2... RandomForestRegressor -0.158987 0.026039
2 29['V8', 'V37', 'V5', 'V29', 'V24', 'V35', 'V2... XGBRegressor -0.133585 0.025091
3 29['V8', 'V37', 'V5', 'V29', 'V24', 'V35', 'V2... SVR -0.126072 0.022261
4 29['V8', 'V37', 'V5', 'V29', 'V24', 'V35', 'V2... LinearRegression -0.116638 0.024050


IF01_branch01: SVR
outlier model:SVR
name model mean std
0 30['V36', 'V8', 'V7', 'V37', 'V5', 'V29', 'V24... DecisionTreeRegressor -0.250153 0.021936
1 30['V36', 'V8', 'V7', 'V37', 'V5', 'V29', 'V24... RandomForestRegressor -0.131978 0.016049
2 30['V36', 'V8', 'V7', 'V37', 'V5', 'V29', 'V24... XGBRegressor -0.112816 0.017442
3 30['V36', 'V8', 'V7', 'V37', 'V5', 'V29', 'V24... SVR -0.104641 0.016567
4 30['V36', 'V8', 'V7', 'V37', 'V5', 'V29', 'V24... LinearRegression -0.097441 0.013544

IF01_branch02:
outlier model:SVR
name model mean std
0 29['V8', 'V37', 'V5', 'V29', 'V24', 'V35', 'V2... DecisionTreeRegressor -0.241877 0.016470
1 29['V8', 'V37', 'V5', 'V29', 'V24', 'V35', 'V2... RandomForestRegressor -0.131061 0.015154
2 29['V8', 'V37', 'V5', 'V29', 'V24', 'V35', 'V2... XGBRegressor -0.111531 0.016127
3 29['V8', 'V37', 'V5', 'V29', 'V24', 'V35', 'V2... SVR -0.104441 0.017566
4 29['V8', 'V37', 'V5', 'V29', 'V24', 'V35', 'V2... LinearRegression -0.095736 0.016731

IF01_branch02:
Drop the feature columns with inconsistent distributions
name model mean std
0 23['V8', 'V37', 'V29', 'V24', 'V35', 'V18', 'V... DecisionTreeRegressor -0.243059 0.015813
1 23['V8', 'V37', 'V29', 'V24', 'V35', 'V18', 'V... RandomForestRegressor -0.128891 0.017159
2 23['V8', 'V37', 'V29', 'V24', 'V35', 'V18', 'V... XGBRegressor -0.108672 0.016404
3 23['V8', 'V37', 'V29', 'V24', 'V35', 'V18', 'V... SVR -0.104399 0.019012
4 23['V8', 'V37', 'V29', 'V24', 'V35', 'V18', 'V... LinearRegression -0.099833 0.018218

21_Feature Processing 14_Column Edge-Point Cleanup

Scores before cleanup
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.294308 0.047980
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.155771 0.023292
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.136554 0.025657
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.161049 0.033630
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.119611 0.023264

0.025,0.975
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.298952 0.050870
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.162002 0.022290
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.136554 0.025657
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.161049 0.033630
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.119611 0.023264

0.001,0.999
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.304234 0.065717
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.160022 0.024145
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.136554 0.025657
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.161049 0.033630
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.119611 0.023264

Essentially no effect.

22_Feature Processing 15_MaxMin Combined with Quantile Head/Tail Clipping

First clip each column's head/tail at quantiles, then apply maxmin, and check the effect.
Baseline: no quantile clipping, maxmin(0,1) only:
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.298398 0.053159
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.157217 0.025550
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.136521 0.025700
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.129900 0.021798
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.119611 0.023264

Quantiles: 0.0250, 0.9750
Scores after maxmin(0,1):
name model mean std
0 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... DecisionTreeRegressor -0.311182 0.060743
1 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... RandomForestRegressor -0.161535 0.026480
2 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... XGBRegressor -0.136521 0.025700
3 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... SVR -0.129900 0.021798
4 38['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', '... LinearRegression -0.119611 0.023264

Adding the quantile clipping actually makes things worse, suggesting the head/tail regions carry fairly important information.
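
For reference, the quantile clipping tested above reduces to something like this sketch (per column, quantiles as in the log):

# cut each column at its 2.5% / 97.5% quantiles before the maxmin step
for col in feature_columns:
    low, high = train_data[col].quantile(0.025), train_data[col].quantile(0.975)
    train_data[col] = train_data[col].clip(low, high)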

23_Feature Processing 13_Combining Multiple Processing Steps 02

Move the removal of the differently-distributed features to the front.
Scores after removal:
name model mean std
0 32['V0', 'V1', 'V2', 'V3', 'V4', 'V6', 'V7', '... DecisionTreeRegressor -0.313528 0.054179
1 32['V0', 'V1', 'V2', 'V3', 'V4', 'V6', 'V7', '... RandomForestRegressor -0.152624 0.021431
2 32['V0', 'V1', 'V2', 'V3', 'V4', 'V6', 'V7', '... XGBRegressor -0.136853 0.026886
3 32['V0', 'V1', 'V2', 'V3', 'V4', 'V6', 'V7', '... SVR -0.164745 0.041024
4 32['V0', 'V1', 'V2', 'V3', 'V4', 'V6', 'V7', '... LinearRegression -0.126327 0.021495

Scores after maxmin:
name model mean std
0 32['V0', 'V1', 'V2', 'V3', 'V4', 'V6', 'V7', '... DecisionTreeRegressor -0.304595 0.061182
1 32['V0', 'V1', 'V2', 'V3', 'V4', 'V6', 'V7', '... RandomForestRegressor -0.157570 0.029900
2 32['V0', 'V1', 'V2', 'V3', 'V4', 'V6', 'V7', '... XGBRegressor -0.136763 0.026932
3 32['V0', 'V1', 'V2', 'V3', 'V4', 'V6', 'V7', '... SVR -0.129541 0.023458
4 32['V0', 'V1', 'V2', 'V3', 'V4', 'V6', 'V7', '... LinearRegression -0.126327 0.021495

Scores after L1 and L2:
L1:INFO:__main__:drop_columns(l1):len(7)['V25', 'V34', 'V32', 'V31', 'V13', 'V15', 'V20']
L2:0
name model mean std
0 25['V36', 'V8', 'V7', 'V37', 'V29', 'V24', 'V3... DecisionTreeRegressor -0.301248 0.049425
1 25['V36', 'V8', 'V7', 'V37', 'V29', 'V24', 'V3... RandomForestRegressor -0.158541 0.029336
2 25['V36', 'V8', 'V7', 'V37', 'V29', 'V24', 'V3... XGBRegressor -0.136325 0.025932
3 25['V36', 'V8', 'V7', 'V37', 'V29', 'V24', 'V3... SVR -0.126222 0.023398
4 25['V36', 'V8', 'V7', 'V37', 'V29', 'V24', 'V3... LinearRegression -0.124461 0.019800

Scores after PCA:
name model mean std
0 24['V8', 'V37', 'V29', 'V24', 'V35', 'V21', 'V... DecisionTreeRegressor -0.295245 0.028296
1 24['V8', 'V37', 'V29', 'V24', 'V35', 'V21', 'V... RandomForestRegressor -0.158338 0.031556
2 24['V8', 'V37', 'V29', 'V24', 'V35', 'V21', 'V... XGBRegressor -0.132782 0.026935
3 24['V8', 'V37', 'V29', 'V24', 'V35', 'V21', 'V... SVR -0.126053 0.023664
4 24['V8', 'V37', 'V29', 'V24', 'V35', 'V21', 'V... LinearRegression -0.121502 0.023083

Scores after SVR:
name model mean std
0 24['V8', 'V37', 'V29', 'V24', 'V35', 'V21', 'V... DecisionTreeRegressor -0.248327 0.014599
1 24['V8', 'V37', 'V29', 'V24', 'V35', 'V21', 'V... RandomForestRegressor -0.126420 0.019059
2 24['V8', 'V37', 'V29', 'V24', 'V35', 'V21', 'V... XGBRegressor -0.109142 0.016907
3 24['V8', 'V37', 'V29', 'V24', 'V35', 'V21', 'V... SVR -0.103804 0.019733
4 24['V8', 'V37', 'V29', 'V24', 'V35', 'V21', 'V... LinearRegression -0.099841 0.018675
The result matches the variant of "Combining Multiple Processing Steps" that drops the differently-distributed features last.

24_Algorithm Tuning

1. Convert all train_data processing to operate on all_data instead.

Feature scores returned to the caller:
name model mean std
0 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... DecisionTreeRegressor -0.239078 0.020925
1 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... RandomForestRegressor -0.126839 0.016588
2 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... XGBRegressor -0.108165 0.015992
3 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... SVR -0.109138 0.020015
4 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... LinearRegression -0.100616 0.018068

Compared with fitting the processing on the train data alone, accuracy dropped.

Aside: final result after dropping the first-step MinMax(0, 1) scaling:
name model mean std
0 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... DecisionTreeRegressor -0.239589 0.026468
1 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... RandomForestRegressor -0.129222 0.018301
2 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... XGBRegressor -0.105165 0.015818
3 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... SVR -0.134867 0.025223
4 28['V0', 'V1', 'V2', 'V3', 'V4', 'V8', 'V12', ... LinearRegression -0.098778 0.016635
Keep the min-max normalization: SVR degrades sharply without it (-0.1349 vs -0.1091).

2. Best algorithm and parameters (GSCVTools):
mode mse r2 best_estimator
0 cv 0.113225 0.880167 RandomForestRegressor(bootstrap=True, criterio...
1 cv 0.111303 0.882202 ElasticNetCV(alphas=None, copy_X=True, cv=None...
2 cv 0.109897 0.883690 GradientBoostingRegressor(alpha=0.9, criterion...
3 cv 0.100619 0.893510 LinearRegression(copy_X=True, fit_intercept=Tr...
4 cv 0.102203 0.891833 LinearSVR(C=10.0, dual=True, epsilon=0.1, fit_...
5 cv 0.110920 0.882607 XGBRegressor(base_score=0.5, booster='gbtree',...
Corresponding estimators:
tmp_df['best_estimator'][0]
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features=0.4, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=3, min_samples_split=3,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
tmp_df['best_estimator'][1]
ElasticNetCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
l1_ratio=0.11, max_iter=1000, n_alphas=100, n_jobs=1,
normalize=False, positive=False, precompute='auto',
random_state=None, selection='cyclic', tol=0.1, verbose=0)
tmp_df['best_estimator'][2]
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=6, max_features=0.4,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=6,
min_samples_split=6, min_weight_fraction_leaf=0.0,
n_estimators=100, presort='auto', random_state=None,
subsample=0.6000000000000001, verbose=0, warm_start=False)
tmp_df['best_estimator'][3]
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
tmp_df['best_estimator'][4]
LinearSVR(C=10.0, dual=True, epsilon=0.1, fit_intercept=True,
intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
random_state=None, tol=0.01, verbose=0)

3. RFE feature filtering (rows: model used inside RFE to rank features; columns: model used to score the selected subset):
LinearRegression LinearSVR RandomForestRegressor XGBRegressor
XGBRegressor -0.100340 -0.108244 -0.129722 -0.109598
RandomForestRegressor -0.102333 -0.111913 -0.128601 -0.107871
LinearSVR -0.101739 -0.107734 -0.126510 -0.108310
LinearRegression -0.099910 -0.108498 -0.129063 -0.108629
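
A sketch of how such a grid can be produced, assuming X_train is a numpy array; n_features_to_select is an assumption:

from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

models = [LinearRegression(), LinearSVR(), RandomForestRegressor(), XGBRegressor()]
for ranker in models:
    # Rank features with one model and keep its selected subset...
    support = RFE(ranker, n_features_to_select=20).fit(X_train, y_train).support_
    for scorer in models:
        # ...then score that subset with every model (neg MSE, as in the table).
        score = cross_val_score(scorer, X_train[:, support], y_train,
                                scoring='neg_mean_squared_error', cv=5).mean()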

4. Best algorithm and parameters after feature filtering (HyperoptTools):
mode mse r2 best_estimator
0 cv 0.108189 0.885498 GradientBoostingRegressor(alpha=0.95, criterio...
1 cv 0.105793 0.888033 XGBRegressor(base_score=0.5, booster='gbtree',...
2 cv 0.109604 0.884000 RandomForestRegressor(bootstrap=True, criterio...
3 cv 0.102343 0.891684 LinearRegression(copy_X=True, fit_intercept=Tr...
4 cv 0.104656 0.889237 LinearSVR(C=5.0, dual=True, epsilon=0.01, fit_...
5 cv 0.102552 0.891464 ElasticNetCV(alphas=None, copy_X=True, cv=None...

tmp_df.loc[0,'best_estimator']
GradientBoostingRegressor(alpha=0.95, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=4, max_features=0.4,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=7,
min_samples_split=3, min_weight_fraction_leaf=0.0,
n_estimators=100, presort='auto', random_state=None,
subsample=0.7000000000000002, verbose=0, warm_start=False)
tmp_df.loc[1,'best_estimator']
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=8, missing=None, n_estimators=200,
n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=0.5)
tmp_df.loc[2,'best_estimator']
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features=0.3, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=3, min_samples_split=8,
min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False)
tmp_df.loc[3,'best_estimator']
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
tmp_df.loc[4,'best_estimator']
LinearSVR(C=5.0, dual=True, epsilon=0.01, fit_intercept=True,
intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
random_state=None, tol=0.01, verbose=0)
tmp_df.loc[5,'best_estimator']
ElasticNetCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
l1_ratio=0.81, max_iter=1000, n_alphas=100, n_jobs=1,
normalize=False, positive=False, precompute='auto',
random_state=None, selection='cyclic', tol=0.001, verbose=0)

25_restructuring_code_refactor_submission_11

Feature evaluation:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[3]:
name model mean std
0 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... DecisionTreeRegressor -0.247148 0.022159
1 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... RandomForestRegressor -0.120660 0.017508
2 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... XGBRegressor -0.105817 0.015420
3 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... SVR -0.106799 0.017221
4 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... LinearRegression -0.097708 0.017362

Before RFECV feature selection, evaluate the best parameters and algorithms:
GSCVTools.best_modelAndParam_reg(X_train, None, y_train, None)
INFO:common.gscvTools:ret_df:
mode mse r2 best_estimator
0 cv 0.102335 0.891641 LinearSVR(C=10.0, dual=True, epsilon=0.1, fit_...
1 cv 0.097707 0.896541 LinearRegression(copy_X=True, fit_intercept=Tr...
2 cv 0.098402 0.895806 ElasticNetCV(alphas=None, copy_X=True, cv=None...
3 cv 0.109009 0.884575 RandomForestRegressor(bootstrap=False, criteri...
4 cv 0.111037 0.882428 XGBRegressor(base_score=0.5, booster='gbtree',...
5 cv 0.107018 0.886683 GradientBoostingRegressor(alpha=0.75, criterio...


RFECV run:
INFO:root:model_name:XGBRegressor: 28
INFO:root:model_name:RandomForestRegressor: 11
INFO:root:model_name:LinearSVR: 21
INFO:root:model_name:LinearRegression: 26
INFO:root:Scores of each RFECV-selected feature set on each model (rows: RFECV ranker; columns: scoring model):
LinearRegression LinearSVR RandomForestRegressor XGBRegressor
XGBRegressor -0.097708 -0.105039 -0.122920 -0.105817
RandomForestRegressor -0.099386 -0.106292 -0.125183 -0.109702
LinearSVR -0.098815 -0.104662 -0.122127 -0.106191
LinearRegression -0.097279 -0.104937 -0.127173 -0.106121

Choose XGB; its RFECV kept all 28 features of X_train, i.e., nothing was dropped.
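
A sketch of one RFECV run as logged above (scoring and cv are assumptions); RFECV picks the feature count itself, which is how XGB ends up keeping all 28:

from sklearn.feature_selection import RFECV
from xgboost import XGBRegressor

selector = RFECV(XGBRegressor(), cv=5,
                 scoring='neg_mean_squared_error').fit(X_train, y_train)
print(selector.n_features_)  # e.g. 28 for XGBRegressor, matching the log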

Three-algorithm stacking: LinearSVR, LinearRegression, ElasticNetCV
INFO:__main__:StackingCVRegressor meta_model_scores:{'Ridge': -0.097715, 'LinearSVR': -0.096427, 'LinearRegression': -0.098501, 'SVR': -0.0961, 'XGBRegressor': -0.100299}
INFO:__main__:get_StackingCV_OOF meta_model_scores:{'Ridge': -0.091734, 'LinearSVR': -0.0917, 'LinearRegression': -0.091849, 'SVR': -0.091524, 'XGBRegressor': -0.096136}
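
StackingCVRegressor here is presumably mlxtend's; a sketch of how one meta-model score in the log could be produced (cv counts are assumptions):

from mlxtend.regressor import StackingCVRegressor
from sklearn.linear_model import LinearRegression, ElasticNetCV, Ridge
from sklearn.svm import LinearSVR
from sklearn.model_selection import cross_val_score

stack = StackingCVRegressor(
    regressors=(LinearSVR(), LinearRegression(), ElasticNetCV()),  # the three base learners
    meta_regressor=Ridge(),  # one of the tested meta-models
    cv=5)
score = cross_val_score(stack, X_train, y_train,
                        scoring='neg_mean_squared_error', cv=5).mean()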

INFO:common.gscvTools:ret_df:
mode mse r2 best_estimator
0 cv 0.089245 0.905502 LinearRegression(copy_X=True, fit_intercept=Tr...
1 cv 0.092259 0.902311 RandomForestRegressor(bootstrap=True, criterio...
2 cv 0.090524 0.904147 GradientBoostingRegressor(alpha=0.99, criterio...
3 cv 0.089358 0.905383 ElasticNetCV(alphas=None, copy_X=True, cv=None...
4 cv 0.090305 0.904380 XGBRegressor(base_score=0.5, booster='gbtree',...
5 cv 0.094092 0.900370 LinearSVR(C=0.5, dual=True, epsilon=0.1, fit_i...

Choice: LinearRegression.
Predict and submit (submission 11); online score: 2.2169.

26_code_comparison_ref02

1. Own features + own evaluation
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[feature_columns_handle])
Out[2]:
name model mean std
0 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... DecisionTreeRegressor -0.236436 0.026856
1 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... RandomForestRegressor -0.126009 0.017412
2 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... XGBRegressor -0.105817 0.015420
3 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... SVR -0.106799 0.017221
4 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... LinearRegression -0.097708 0.017362



2. The reference author's features + own evaluation
tmp_list = df_train.columns.drop(['V5', 'V27', 'target'])
FeatureTools.get_score_by_models(df_train, 'target', feature_lists=[tmp_list.values])

Out[22]:
name model mean std
0 36['V0' 'V1' 'V2' 'V3' 'V4' 'V6' 'V7' 'V8' 'V9... DecisionTreeRegressor -0.304975 0.063183
1 36['V0' 'V1' 'V2' 'V3' 'V4' 'V6' 'V7' 'V8' 'V9... RandomForestRegressor -0.160282 0.028651
2 36['V0' 'V1' 'V2' 'V3' 'V4' 'V6' 'V7' 'V8' 'V9... XGBRegressor -0.137417 0.025867
3 36['V0' 'V1' 'V2' 'V3' 'V4' 'V6' 'V7' 'V8' 'V9... SVR -0.165672 0.033844
4 36['V0' 'V1' 'V2' 'V3' 'V4' 'V6' 'V7' 'V8' 'V9... LinearRegression -0.133415 0.023612
So our own feature engineering holds up well; port the reference author's algorithm over instead.

3. Own features + the reference author's algorithm; submission 12 (rfe); online score: 0.1547.

27_feature_processing_add_mean_of_v0_v1

Inspect V0 and V1:
all_data[target_column].loc[test_ids] = np.nan
FeatureTools.detect_reg_target_column(all_data, target_column, feature_columns=['V1'])
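
The exact definition of the added "mean" features isn't shown; since the feature count below goes from 28 to 30, one hypothetical version adds a smoothed copy of each column (window size and names are guesses):

for col in ['V0', 'V1']:
    # Hypothetical: centered rolling mean as a denoised copy of the column.
    all_data[col + '_mean'] = all_data[col].rolling(window=5, center=True,
                                                    min_periods=1).mean()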

Evaluation before adding the mean features:
Out[2]:
name model mean std
0 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... DecisionTreeRegressor -0.243120 0.022309
1 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... RandomForestRegressor -0.121830 0.013058
2 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... XGBRegressor -0.105817 0.015420
3 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... SVR -0.106799 0.017221
4 28['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... LinearRegression -0.097708 0.017362

Evaluation after adding:
Out[6]:
name model mean std
0 30['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... DecisionTreeRegressor -0.231876 0.017706
1 30['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... RandomForestRegressor -0.120214 0.016816
2 30['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... XGBRegressor -0.097736 0.013530
3 30['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... SVR -0.091472 0.013535
4 30['V6_pca', 'V7_pca', 'V10_pca', 'V36_pca', '... LinearRegression -0.087384 0.015655

Single-algorithm evaluation:
INFO:common.gscvTools:ret_df:
mode mse r2 best_estimator
0 cv 0.100832 0.893233 XGBRegressor(base_score=0.5, booster='gbtree',...
1 cv 0.094748 0.899675 LinearSVR(C=10.0, dual=True, epsilon=0.1, fit_...
2 cv 0.094950 0.899462 GradientBoostingRegressor(alpha=0.9, criterion...
3 cv 0.087384 0.907472 LinearRegression(copy_X=True, fit_intercept=Tr...
4 cv 0.094764 0.899659 ElasticNetCV(alphas=None, copy_X=True, cv=None...
5 cv 0.104864 0.888964 RandomForestRegressor(bootstrap=True, criterio...

Choose LR at 0.087384 and submit (submission 13); online score 1.1490, badly overfit.


Three-algorithm ensemble: LinearSVR, GradientBoostingRegressor, LinearRegression
INFO:__main__:StackingCVRegressor meta_model_scores:{'Ridge': -0.086582, 'XGBRegressor': -0.08986, 'LinearRegression': -0.086853, 'SVR': -0.087477, 'LinearSVR': -0.087375}
INFO:__main__:get_StackingCV_OOF meta_model_scores:{'Ridge': -0.078527, 'XGBRegressor': -0.083848, 'LinearRegression': -0.078782, 'SVR': -0.079408, 'LinearSVR': -0.078675}
Choose the OOF variant with Ridge as meta-model.
Since the ensembles keep overfitting, skip further parameter tuning of the ensemble.

Submission 14: CV 0.078434, online score 0.3075.

28_ensemble_attempt_cascade_algorithms

1. Cascade algorithm, idea 1:
target correlates most with V0, V1; best algorithm: SVR
svr_diff = target - svr(v0,v1 -> target)
svr_diff correlates most with V16, V3; best algorithm: LR
final prediction: lr(v16,v3 -> svr_diff) + svr(v0,v1 -> target)

from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, RANSACRegressor, TheilSenRegressor
from sklearn.metrics import mean_squared_error

# Fit each candidate model on (V0, V1) -> target, keep the in-sample prediction
# and the residual (target - prediction) as new columns.
for model in [SVR(), LinearRegression(), RANSACRegressor(), TheilSenRegressor()]:
    name = model.__class__.__name__
    y_predict = model.fit(train_data[['V0', 'V1']].values,
                          train_data[target_column].values).predict(train_data[['V0', 'V1']].values)
    train_data['%s_predict' % name] = y_predict
    train_data['%s_predict_diff' % name] = train_data[target_column] - train_data['%s_predict' % name]

# Visualize each model's prediction against the target.
for name in ['SVR', 'LinearRegression', 'RANSACRegressor', 'TheilSenRegressor']:
    FeatureTools.show_reg_diff(train_data, target_column, '%s_predict' % name)

# The SVR residual correlates most with V2 and V16; on that residual,
# LinearRegression scores best.
column_diff = 'SVR_predict_diff'
tmp_df = FeatureTools.get_column_corr(train_data, column_diff, feature_columns=feature_columns)
tmp_df = FeatureTools.get_score_by_models(train_data, 'SVR_predict_diff', feature_lists=[['V2', 'V16']])

# Stage 1: SVR on (V0, V1) predicts the target; stage 2: LinearRegression on
# (V3, V16) predicts the stage-1 residual; the final prediction is their sum.
predict1 = SVR().fit(train_data[['V0', 'V1']].values,
                     train_data[target_column].values).predict(train_data[['V0', 'V1']].values)
train_data['SVR_predict_diff'] = train_data[target_column] - predict1
predict2 = LinearRegression().fit(train_data[['V3', 'V16']].values,
                                  train_data['SVR_predict_diff'].values).predict(train_data[['V3', 'V16']].values)
predict_sum = predict1 + predict2
print(mean_squared_error(train_data[target_column].values, predict_sum))
print(mean_squared_error(train_data[target_column].values, predict1))
print(mean_squared_error(train_data[target_column].values, predict2))
0.12569019948517646
0.15342296758436166
0.7889472360169055

Using the raw features directly, without the cascade:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0','V1','V16','V3']])
Out[17]:
name model mean std
0 4['V0', 'V1', 'V16', 'V3'] DecisionTreeRegressor -0.251579 0.012851
1 4['V0', 'V1', 'V16', 'V3'] RandomForestRegressor -0.150692 0.011259
2 4['V0', 'V1', 'V16', 'V3'] XGBRegressor -0.129981 0.015396
3 4['V0', 'V1', 'V16', 'V3'] SVR -0.130179 0.019409
4 4['V0', 'V1', 'V16', 'V3'] LinearRegression -0.124888 0.017317
5 4['V0', 'V1', 'V16', 'V3'] RANSACRegressor -0.141129 0.011837
6 4['V0', 'V1', 'V16', 'V3'] TheilSenRegressor -0.126173 0.017355

Compared with the two-stage strategy, DecisionTreeRegressor and LinearRegression come out worse here.

Conclusion: the cascade is no better than applying a single algorithm directly.


2. Refinement of idea 1: feed the stage-1 prediction to the stage-2 model as an extra feature.
train_data['v0v1_svr_predict'] = SVR().fit(
    train_data[['V0', 'V1']].values,
    train_data[target_column].values).predict(train_data[['V0', 'V1']].values)
train_data['v0v1_svr_predict_diff'] = train_data[target_column] - train_data['v0v1_svr_predict']
# Stage 2 now also sees the stage-1 prediction as a feature.
predict2 = LinearRegression().fit(
    train_data[['V3', 'V16', 'v0v1_svr_predict']].values,
    train_data['v0v1_svr_predict_diff'].values).predict(train_data[['V3', 'V16', 'v0v1_svr_predict']].values)
result_predict = train_data['v0v1_svr_predict'].values + predict2
print(mean_squared_error(train_data[target_column].values, result_predict))
print(mean_squared_error(train_data[target_column].values, train_data['v0v1_svr_predict'].values))
print(mean_squared_error(train_data[target_column].values, predict2))
0.11634669194548065
0.15342296758436166
0.9276657978731919

Conclusion: the best result so far.

3. Idea 2: grow the cascade greedily, one single-feature stage at a time.
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0']])
Out[23]:
name model mean std
0 1['V0'] DecisionTreeRegressor -0.363566 0.046211
1 1['V0'] RandomForestRegressor -0.290335 0.045171
2 1['V0'] XGBRegressor -0.207838 0.041876
3 1['V0'] SVR -0.201277 0.041952
4 1['V0'] LinearRegression -0.212035 0.038301
5 1['V0'] RANSACRegressor -0.231937 0.050335
6 1['V0'] TheilSenRegressor -0.215243 0.046469

Chosen: V0 with SVR.

tmp_df=FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['v0_svr_predict'],['V0'],['V0','v0_svr_predict']])
Out[98]:
name model mean std
0 1['v0_svr_predict'] DecisionTreeRegressor -0.363456 0.042780
1 1['v0_svr_predict'] RandomForestRegressor -0.291032 0.045882
2 1['v0_svr_predict'] XGBRegressor -0.207703 0.041555
3 1['v0_svr_predict'] SVR -0.199969 0.041112
4 1['v0_svr_predict'] LinearRegression -0.196663 0.038813
5 1['v0_svr_predict'] RANSACRegressor -0.226800 0.041728
6 1['v0_svr_predict'] TheilSenRegressor -0.196189 0.037259
7 1['v0_svr_predict'] cal(mean,abs.min) -0.240259 0.196189
8 2['V0', 'v0_svr_predict'] DecisionTreeRegressor -0.365555 0.043402
9 2['V0', 'v0_svr_predict'] RandomForestRegressor -0.292444 0.041286
10 2['V0', 'v0_svr_predict'] XGBRegressor -0.206747 0.041528
11 2['V0', 'v0_svr_predict'] SVR -0.201076 0.042165
12 2['V0', 'v0_svr_predict'] LinearRegression -0.196914 0.038686
13 2['V0', 'v0_svr_predict'] RANSACRegressor -0.567830 0.462893
14 2['V0', 'v0_svr_predict'] TheilSenRegressor -0.200270 0.040422
15 2['V0', 'v0_svr_predict'] cal(mean,abs.min) -0.290120 0.196914

best:v0_svr_predict

tmp_df=get_score_by_models(train_data,'v0_svr_predict_diff',feature_lists=[['v0_svr_predict'],['V0','v0_svr_predict']])
Out[100]:
name model mean std
0 1['v0_svr_predict'] DecisionTreeRegressor -0.363486 0.042791
1 1['v0_svr_predict'] RandomForestRegressor -0.288932 0.044656
2 1['v0_svr_predict'] XGBRegressor -0.206600 0.040966
3 1['v0_svr_predict'] SVR -0.199621 0.040700
4 1['v0_svr_predict'] LinearRegression -0.196663 0.038813
5 1['v0_svr_predict'] RANSACRegressor -0.224353 0.071938
6 1['v0_svr_predict'] TheilSenRegressor -0.197327 0.038473
7 1['v0_svr_predict'] cal(mean,abs.min) -0.239569 0.196663
8 2['V0', 'v0_svr_predict'] DecisionTreeRegressor -0.363011 0.044443
9 2['V0', 'v0_svr_predict'] RandomForestRegressor -0.288492 0.045940
10 2['V0', 'v0_svr_predict'] XGBRegressor -0.206531 0.041072
11 2['V0', 'v0_svr_predict'] SVR -0.200554 0.041249
12 2['V0', 'v0_svr_predict'] LinearRegression -0.196914 0.038686
13 2['V0', 'v0_svr_predict'] RANSACRegressor -0.254002 0.066314
14 2['V0', 'v0_svr_predict'] TheilSenRegressor -0.197738 0.037650
15 2['V0', 'v0_svr_predict'] cal(mean,abs.min) -0.243892 0.196914

best:v0_svr_predict

tmp_df=FeatureTools.get_column_corr(train_data,'v0_svr_predict_diff',feature_columns)
tmp_df
Out[104]:
pearson spearman mine
v0_svr_predict_diff 1.000000 1.000000 1.000000
V2 0.457827 0.433365 0.200423
V16 0.459216 0.417436 0.194544
V6 0.438596 0.379474 0.170362
V3 0.316996 0.295157 0.151405

tmp_df = get_score_by_models(train_data, 'v0_svr_predict_diff',
feature_lists=[['V2'], ['V16'],['V6']])
tmp_df
Out[108]:
name model mean std
0 1['V2'] DecisionTreeRegressor -0.293368 0.032078
1 1['V2'] RandomForestRegressor -0.232855 0.026666
2 1['V2'] XGBRegressor -0.160976 0.027252
3 1['V2'] SVR -0.155710 0.026874
4 1['V2'] LinearRegression -0.153716 0.027008
5 1['V2'] RANSACRegressor -0.165102 0.029690
6 1['V2'] TheilSenRegressor -0.153646 0.027628
7 1['V2'] cal(mean,abs.min) -0.187910 0.153646
8 1['V16'] DecisionTreeRegressor -0.284740 0.017916
9 1['V16'] RandomForestRegressor -0.229113 0.013521
10 1['V16'] XGBRegressor -0.160495 0.021926
11 1['V16'] SVR -0.157742 0.023016
12 1['V16'] LinearRegression -0.154759 0.025295
13 1['V16'] RANSACRegressor -0.169428 0.022714
14 1['V16'] TheilSenRegressor -0.155666 0.024267
15 1['V16'] cal(mean,abs.min) -0.187421 0.154759
16 1['V6'] DecisionTreeRegressor -0.288668 0.030742
17 1['V6'] RandomForestRegressor -0.240561 0.028791
18 1['V6'] XGBRegressor -0.169208 0.027279
19 1['V6'] SVR -0.163992 0.025893
20 1['V6'] LinearRegression -0.160812 0.028050
21 1['V6'] RANSACRegressor -0.178296 0.023641
22 1['V6'] TheilSenRegressor -0.160108 0.026480
23 1['V6'] cal(mean,abs.min) -0.194521 0.160108

best: V2 with TheilSenRegressor

largest_corr_column='V3'
tmp_df = get_score_by_models(train_data, this_column_diff,
feature_lists=[[this_column_predict], [largest_corr_column],
[this_column_predict, largest_corr_column]])
tmp_df
Out[148]:
name model mean std
0 1['V2_TheilSenRegressor_predict'] DecisionTreeRegressor -0.278492 0.019969
1 1['V2_TheilSenRegressor_predict'] RandomForestRegressor -0.211088 0.021690
2 1['V2_TheilSenRegressor_predict'] XGBRegressor -0.145769 0.021569
3 1['V2_TheilSenRegressor_predict'] SVR -0.140189 0.022860
4 1['V2_TheilSenRegressor_predict'] LinearRegression -0.140056 0.023351
5 1['V2_TheilSenRegressor_predict'] RANSACRegressor -0.153968 0.022227
6 1['V2_TheilSenRegressor_predict'] TheilSenRegressor -0.139979 0.022880
7 1['V2_TheilSenRegressor_predict'] cal(mean,abs.min) -0.172791 0.139979
8 1['V3'] DecisionTreeRegressor -0.238346 0.026239
9 1['V3'] RandomForestRegressor -0.191921 0.023376
10 1['V3'] XGBRegressor -0.134054 0.022177
11 1['V3'] SVR -0.131620 0.020382
12 1['V3'] LinearRegression -0.129346 0.019946
13 1['V3'] RANSACRegressor -0.132507 0.025866
14 1['V3'] TheilSenRegressor -0.129817 0.019615
15 1['V3'] cal(mean,abs.min) -0.155373 0.129346
16 2['V2_TheilSenRegressor_predict', 'V3'] DecisionTreeRegressor -0.271233 0.030877
17 2['V2_TheilSenRegressor_predict', 'V3'] RandomForestRegressor -0.167869 0.019176
18 2['V2_TheilSenRegressor_predict', 'V3'] XGBRegressor -0.133797 0.020181
19 2['V2_TheilSenRegressor_predict', 'V3'] SVR -0.131832 0.019264
20 2['V2_TheilSenRegressor_predict', 'V3'] LinearRegression -0.129718 0.020525
21 2['V2_TheilSenRegressor_predict', 'V3'] RANSACRegressor -0.140676 0.020167
22 2['V2_TheilSenRegressor_predict', 'V3'] TheilSenRegressor -0.130322 0.020299
23 2['V2_TheilSenRegressor_predict', 'V3'] cal(mean,abs.min) -0.157921 0.129718

this_column, this_model = 'V3', LinearRegression()
tmp_df
Out[153]:
pearson spearman mine
V3_LinearRegression_predict_diff 1.000000 1.000000 1.000000
V37 -0.016049 0.045578 0.116787
V7 0.137677 0.101421 0.115634
V8 0.103105 0.030846 0.114812
V12 0.038223 0.041866 0.114678
V31 0.130202 0.086553 0.113808
V6 0.164739 0.140440 0.109957

Combined cascade stages:
V0 -> SVR()
V2 -> TheilSenRegressor()
V3 -> LinearRegression()



3. Idea 2, restarted from scratch:
action_list=[('V0',SVR()),('V2',SVR()),('V3',TheilSenRegressor())]
mean_squared_error(train_data[target_column].values,train_predict.values)
Out[13]: 0.12006262348733662
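
A hedged sketch of the cascade driver that action_list implies: each (column, model) stage fits the running residual on one feature and the stage predictions are summed (the helper itself is an assumption):

from sklearn.metrics import mean_squared_error

def run_cascade(df, target_column, action_list):
    # Stage k fits column_k -> current residual; the residual shrinks each stage.
    residual = df[target_column].copy()
    total = 0.0
    for column, model in action_list:
        X = df[[column]].values
        pred = model.fit(X, residual.values).predict(X)
        total = total + pred
        residual = residual - pred
    return total  # the project's train_predict may be a Series instead

train_predict = run_cascade(train_data, target_column, action_list)
print(mean_squared_error(train_data[target_column].values, train_predict))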

FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0','V2','V3']])
Out[15]:
name model mean std
0 3['V0', 'V2', 'V3'] DecisionTreeRegressor -0.261602 0.010449
1 3['V0', 'V2', 'V3'] RandomForestRegressor -0.158531 0.010718
2 3['V0', 'V2', 'V3'] XGBRegressor -0.132225 0.018796
3 3['V0', 'V2', 'V3'] SVR -0.134189 0.016505
4 3['V0', 'V2', 'V3'] LinearRegression -0.130411 0.019779
5 3['V0', 'V2', 'V3'] RANSACRegressor -0.165684 0.028795
6 3['V0', 'V2', 'V3'] TheilSenRegressor -0.130851 0.019087
7 3['V0', 'V2', 'V3'] cal(mean,abs.min) -0.159070 0.130411

y_predict=model.fit(train_data[['V0','V2','V3']].values,train_data[target_column].values).predict(train_data[['V0','V2','V3']].values)
mean_squared_error(train_data[target_column].values,y_predict)
Out[5]: 0.12707897431440907



4. Idea 3: force the TS (TheilSenRegressor) algorithm at every stage and ensemble through it.


5. Idea 4: EM-style iterative feedback refinement.



6. Idea 5: greedy selection by correlation:
pearson spearman mine
target 1.000000 1.000000 1.000000
V0 0.884281 0.875604 0.600301
V1 0.881703 0.838538 0.539492
V8 0.844625 0.806765 0.493244
V31 0.761133 0.756125 0.439836
V27 0.820771 0.770214 0.432607
V2 0.647061 0.638264 0.339456
V4 0.609505 0.577577 0.282280
V12 0.599999 0.544452 0.258994
V37 -0.572586 -0.501714 0.255334
V16 0.543457 0.516076 0.241186
V3 0.514266 0.503984 0.224364
V10 0.395467 0.371644 0.200723
V20 0.456667 0.428626 0.183905
V36 0.321002 0.289142 0.171838
V25 -0.017251 0.055327 0.167250
V24 -0.272737 -0.304275 0.165798
V29 0.130980 0.203324 0.150947
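
A sketch of how a ranking table like the above can be built, assuming pandas for the pearson/spearman columns and minepy's MINE for the 'mine' (MIC) column; whether FeatureTools.get_column_corr does exactly this is an assumption:

import pandas as pd
from minepy import MINE

def column_corr(df, target, features):
    mine = MINE()
    rows = {}
    for col in [target] + list(features):
        mine.compute_score(df[col].values, df[target].values)
        rows[col] = {'pearson': df[col].corr(df[target], method='pearson'),
                     'spearman': df[col].corr(df[target], method='spearman'),
                     'mine': mine.mic()}
    return pd.DataFrame(rows).T.sort_values('mine', ascending=False)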

FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0'],['V0','V1','V8','V31'],['V0','V1','V8','V31','V27','V2','V4','V12']])
7 1 [V0] cal(mean,abs.min) -0.247422 -0.201277
15 4 [V0, V1, V8, V31] cal(mean,abs.min) -0.181498 -0.147381
23 8 [V0, V1, V8, V31, V27, V2, V4, V12] cal(mean,abs.min) -0.147831 -0.119854
feature_lists=[tmp_df.index.tolist()[1:18],tmp_df.index.tolist()[1:-1]]
FeatureTools.get_score_by_models(train_data,target_column,feature_lists)
7 17 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16, ... cal(mean,abs.min) -0.129066 -0.100021
15 31 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16, ... cal(mean,abs.min) -0.141875 -0.100791

Between 9 and 17 features, try 13 next:
feature_lists=[tmp_df.index.tolist()[1:13]]
FeatureTools.get_score_by_models(train_data,target_column,feature_lists)
7 12 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16, ... cal(mean,abs.min) -0.133706 -0.105397
Between 9 and 13, try 11:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[tmp_df.index.tolist()[1:11]])
7 10 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16] cal(mean,abs.min) -0.148553 -0.118586
Between 11 and 13, try 12:
7 11 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16, V3] cal(mean,abs.min) -0.134726 -0.110393
Keep slice [1:13] of the ranking:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[tmp_df.index.tolist()[1:13]])
Out[16]:
length name model mean std
0 12 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16, ... DecisionTreeRegressor -0.221955 0.022204
1 12 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16, ... RandomForestRegressor -0.122938 0.018605
2 12 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16, ... XGBRegressor -0.110106 0.016433
3 12 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16, ... SVR -0.105751 0.017824
4 12 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16, ... LinearRegression -0.105397 0.018763
5 12 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16, ... RANSACRegressor -0.130239 0.019546
6 12 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16, ... TheilSenRegressor -0.108986 0.019621
7 12 [V0, V1, V8, V31, V27, V2, V4, V12, V37, V16, ... cal(mean,abs.min) -0.129339 -0.105397

Final choice: features [1:13] of the corr ranking, algorithm LR.


Second batch of the search, on the residual:
FeatureTools.get_score_by_models(train_data, last_column_diff, feature_lists=[tmp_df.index.tolist()[1:10]])
7 9 [V14, V7, V33, V6, V30, V29, V21, V34, V18] cal(mean,abs.min) -0.136039 -0.099557

FeatureTools.get_score_by_models(train_data, last_column_diff, feature_lists=[tmp_df.index.tolist()[1:5]])
7 4 [V14, V7, V33, V6] cal(mean,abs.min) -0.122485 -0.099127

FeatureTools.get_score_by_models(train_data, last_column_diff, feature_lists=[tmp_df.index.tolist()[1:3]])
7 2 [V14, V7] cal(mean,abs.min) -0.124343 -0.100398

FeatureTools.get_score_by_models(train_data, last_column_diff, feature_lists=[tmp_df.index.tolist()[1:2]])
0 1 [V14] DecisionTreeRegressor -0.178184 0.011463
1 1 [V14] RandomForestRegressor -0.143556 0.007787
2 1 [V14] XGBRegressor -0.105564 0.019177
3 1 [V14] SVR -0.101005 0.018589
4 1 [V14] LinearRegression -0.099722 0.017354
5 1 [V14] RANSACRegressor -0.103450 0.017433
6 1 [V14] TheilSenRegressor -0.099831 0.017215
7 1 [V14] cal(mean,abs.min) -0.118759 -0.099722

Doesn't seem to help at all.

29_sequential_trial_notes

Letting each stage choose its own algorithm:

Mse:0.19xx

Out[15]: 0.12717401411909554

Mse:-0.1195

Out[29]: 0.11581053983886108

Final result: V0 -> SVR, V2 -> SVR, V3 -> LR, V14 -> TS
8 4 [V0, V2, V3, V14] DecisionTreeRegressor -0.261275 0.012765
9 4 [V0, V2, V3, V14] RandomForestRegressor -0.153314 0.016808
10 4 [V0, V2, V3, V14] XGBRegressor -0.132162 0.017646
11 4 [V0, V2, V3, V14] SVR -0.143015 0.020804
12 4 [V0, V2, V3, V14] LinearRegression -0.126406 0.019589
13 4 [V0, V2, V3, V14] RANSACRegressor -0.161940 0.043721
14 4 [V0, V2, V3, V14] TheilSenRegressor -0.128057 0.018856
15 4 [V0, V2, V3, V14] cal(mean,abs.min) -0.158024 -0.126406

Idea 3: force TS everywhere


Mse:0.14453591848344105


V3,0.13859449690181827


0.13625953441828087

Final result: V0 -> TS, V16 -> TS, V3 -> TS, V29 -> TS

length name model mean std
0 4 [V0, V16, V3, V29] DecisionTreeRegressor -0.250558 0.016142
1 4 [V0, V16, V3, V29] RandomForestRegressor -0.154108 0.011610
2 4 [V0, V16, V3, V29] XGBRegressor -0.135431 0.017501
3 4 [V0, V16, V3, V29] SVR -0.134296 0.018930
4 4 [V0, V16, V3, V29] LinearRegression -0.137952 0.017510
5 4 [V0, V16, V3, V29] RANSACRegressor -0.155902 0.021989
6 4 [V0, V16, V3, V29] TheilSenRegressor -0.140288 0.017884
7 4 [V0, V16, V3, V29] cal(mean,abs.min) -0.158362 -0.134296

Idea 4: EM-style feedback refinement

V0 and V2 together:
FeatureTools.get_score_by_models(train_data,target_column,feature_lists=[['V0','V2']])
Out[14]:
length name model mean std
0 2 [V0, V2] DecisionTreeRegressor -0.277917 0.016859
1 2 [V0, V2] RandomForestRegressor -0.174734 0.013681
2 2 [V0, V2] XGBRegressor -0.140162 0.019175
3 2 [V0, V2] SVR -0.137948 0.017945
4 2 [V0, V2] LinearRegression -0.138543 0.021356
5 2 [V0, V2] RANSACRegressor -0.154074 0.016480

Conclusion: not as good as the cascade.

Pick the idea-2 features for the feedback refinement.
Final stages: V0 -> SVR, V2 -> SVR, V3 -> LR, V14 -> TS
V2 -> SVR:
Mse: 0.5179102573895963

V0 -> LR:
0.1342823440195082
