
kNN Hyperparameter Tuning: Hyperparameters and Grid Search


Ways to improve machine learning prediction accuracy

  • Work on the data (feature engineering)
    • add more rows of sample data (more samples)
    • add more columns of sample data (more features)
  • Work on the algorithm
    • switch packages/algorithms
    • tune the parameters

  • Hyperparameter: a parameter that must be decided before the algorithm runs
  • Model parameter: a parameter the algorithm learns during training
    • e.g. in y = a * x + b, the coefficients a and b are model parameters

For example, in a decision tree, max_depth (the maximum depth of the tree) is a hyperparameter, while the split thresholds at each node are model parameters.
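
A minimal sketch of the distinction, using sklearn's LinearRegression on hypothetical toy data (this example is not from the original notebook): fit_intercept is a hyperparameter chosen up front, while coef_ and intercept_ are model parameters learned by fit().

import numpy as np
from sklearn.linear_model import LinearRegression

X_toy = np.arange(10, dtype=float).reshape(-1, 1)  # 10 samples, 1 feature
y_toy = 3 * X_toy.ravel() + 2                      # data generated by y = a * x + b

reg = LinearRegression(fit_intercept=True)  # hyperparameter: decided before training
reg.fit(X_toy, y_toy)                       # training learns the model parameters
reg.coef_, reg.intercept_                   # (array([3.]), 2.0): the learned a and b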

To find hyperparameters that neither overfit nor sacrifice prediction accuracy, you need the following. (Like wanting something both cheap and good, the two goals conflict; take a middle value with the best trade-off.)

  • Domain knowledge: understand the business problem
  • Empirical values: start from the library defaults
  • Exhaustive search: run and compare many hyperparameter values, then pick the best
    • grid search
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# the handwritten digits dataset
digits = datasets.load_digits()
digits
{'data': array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
        ...,
        [ 0.,  0., 10., ..., 12.,  1.,  0.]]),
 'target': array([0, 1, 2, ..., 8, 9, 8]),
 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 'images': array([...]),   # 1797 images of shape 8x8; long output elided
 'DESCR': '.. _digits_dataset: ...'}   # UCI optical handwritten digits description; long output elided
digits.keys()
dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

Features: 64 pixels (each 8x8 image flattened into one row)
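
A quick shape check (not part of the original run) confirms the flattening:

digits.images.shape, digits.data.shape
((1797, 8, 8), (1797, 64))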

digits.data
array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])
X = digits.data
y = digits.target
X,X.shape
(array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ..., 10.,  0.,  0.],
        [ 0.,  0.,  0., ..., 16.,  9.,  0.],
        ...,
        [ 0.,  0.,  1., ...,  6.,  0.,  0.],
        [ 0.,  0.,  2., ..., 12.,  0.,  0.],
        [ 0.,  0., 10., ..., 12.,  1.,  0.]]), (1797, 64))
y,y.shape
(array([0, 1, 2, ..., 8, 9, 8]), (1797,))
X[666]
array([ 0.,  0.,  5., 15., 14.,  3.,  0.,  0.,  0.,  0., 13., 15.,  9.,
       15.,  2.,  0.,  0.,  4., 16., 12.,  0., 10.,  6.,  0.,  0.,  8.,
       16.,  9.,  0.,  8., 10.,  0.,  0.,  7., 15.,  5.,  0., 12., 11.,
        0.,  0.,  7., 13.,  0.,  5., 16.,  6.,  0.,  0.,  0., 16., 12.,
       15., 13.,  1.,  0.,  0.,  0.,  6., 16., 12.,  2.,  0.,  0.])
# reshape back to the original 2D array (a grayscale image)
X[666].reshape(8, 8)
array([[ 0.,  0.,  5., 15., 14.,  3.,  0.,  0.],
       [ 0.,  0., 13., 15.,  9., 15.,  2.,  0.],
       [ 0.,  4., 16., 12.,  0., 10.,  6.,  0.],
       [ 0.,  8., 16.,  9.,  0.,  8., 10.,  0.],
       [ 0.,  7., 15.,  5.,  0., 12., 11.,  0.],
       [ 0.,  7., 13.,  0.,  5., 16.,  6.,  0.],
       [ 0.,  0., 16., 12., 15., 13.,  1.,  0.],
       [ 0.,  0.,  6., 16., 12.,  2.,  0.,  0.]])
plt.figure(figsize=(1, 1))
plt.imshow(
    X[666].reshape(8, 8),
    cmap='gray',
    interpolation='lanczos',
)
<matplotlib.image.AxesImage at 0xf77b7b8>

[Figure: the digit sample X[666] rendered as an 8x8 grayscale image (output_10_1.png)]

kNN training and prediction

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=5)  # create the classifier; hyperparameter k = 5 (the default)
knn_clf.fit(X_train, y_train)  # train
knn_clf.score(X_test, y_test)  # accuracy on the test set
0.9888888888888889

Tuning: searching for the best k

best_r = 0  # best accuracy so far
best_k = 0  # best k so far

for i in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=i)  # classifier with k = i
    knn_clf.fit(X_train, y_train)  # train
    re = knn_clf.score(X_test, y_test)  # accuracy on the test set

    if re > best_r:
        best_r = re
        best_k = i

print('best accuracy:', best_r)
print('best k:', best_k)
best accuracy: 0.9916666666666667
best k: 4

Another kNN hyperparameter

weights: 'uniform' / 'distance'
# 'uniform' (default): every one of the k neighbors gets the same vote weight
# 'distance': closer neighbors get larger vote weights

Distance weighting: the usual choice is to use the inverse of the distance as the weight in the vote.

Example: suppose the 3 nearest neighbors of an unknown point are:

red point a, distance 1
blue point b, distance 3
blue point c, distance 4
  • By plain majority vote, the unknown point is predicted blue.
  • But the red point is by far the closest, so weight each vote by the inverse distance:
    • red: 1
    • blue: 1/3 + 1/4 = 7/12
    • 1 > 7/12, so the unknown point is predicted red (see the sketch below).
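
A minimal sketch of this weighted vote in plain Python, using the toy numbers from the example above:

from collections import defaultdict

neighbors = [('red', 1.0), ('blue', 3.0), ('blue', 4.0)]  # (label, distance) pairs

votes = defaultdict(float)
for label, dist in neighbors:
    votes[label] += 1.0 / dist  # inverse-distance weight

max(votes, key=votes.get)  # 'red': 1.0 beats 1/3 + 1/4 = 7/12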

Adding a second hyperparameter loop: distance weighting

best_score = 0.0  # best accuracy so far
best_k = -1  # best k so far
best_weight = ''  # best weights setting ('uniform' or 'distance')

for j in ['uniform', 'distance']:
    for i in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=i, weights=j)  # classifier with k = i and weights = j
        knn_clf.fit(X_train, y_train)  # train
        re = knn_clf.score(X_test, y_test)  # accuracy on the test set

        if re > best_score:
            best_score = re
            best_k = i
            best_weight = j

print('best_score = {}'.format(best_score))
print('best_k = {}'.format(best_k))
print('best_weight = {}'.format(best_weight))
best_score = 0.9916666666666667
best_k = 4
best_weight = uniform

Predicting with distance weighting instead of uniform, to check the weighted accuracy:

knn_clf = KNeighborsClassifier(n_neighbors=4, weights='distance')  # classifier with k = 4 and distance weighting
knn_clf.fit(X_train, y_train)  # train
score = knn_clf.score(X_test, y_test)  # accuracy on the test set
score
0.9888888888888889
  1. Euclidean distance: $d_{12}=\sqrt{\sum_{k=1}^{n}\left(x_{1k}-x_{2k}\right)^{2}}$
  2. Manhattan distance: $d_{12}=\sum_{k=1}^{n}\left|x_{1k}-x_{2k}\right|$
  3. Minkowski distance: $d_{12}=\left(\sum_{k=1}^{n}\left|x_{1k}-x_{2k}\right|^{p}\right)^{1/p}$

The Minkowski distance subsumes the first two (a numeric check follows this list):

  • p = 1: Manhattan distance
  • p = 2: Euclidean distance
  • p → ∞: Chebyshev distance
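
A minimal numpy sketch (toy vectors chosen for illustration, not from the original) showing that Minkowski with p=1 and p=2 reproduces the Manhattan and Euclidean distances:

import numpy as np

x1 = np.array([0.0, 0.0])
x2 = np.array([3.0, 4.0])

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

minkowski(x1, x2, 1)      # 7.0: the Manhattan distance |3| + |4|
minkowski(x1, x2, 2)      # 5.0: the Euclidean distance sqrt(9 + 16)
np.max(np.abs(x1 - x2))   # 4.0: the Chebyshev distance (the p -> infinity limit)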

Keep adding hyperparameters: search over different distance metrics

p = 2 / 1 / ...
# p=2 (the default) is the Euclidean distance, p=1 is the Manhattan distance,
# and other values give the corresponding Minkowski distance

Hyperparameters can depend on each other; here, p is only searched together with weights='distance', where distance weighting is in play.

best_score = 0.0
best_k = -1
best_weight = 'distance'
best_p = 0

for j in range(1, 6):  # candidate p values
    for i in range(1, 11):  # if the best k lands at the edge of the range, widen the range and rerun
        knn_clf = KNeighborsClassifier(n_neighbors=i, weights=best_weight, p=j)  # classifier with k = i and p = j
        knn_clf.fit(X_train, y_train)  # train
        score = knn_clf.score(X_test, y_test)  # accuracy on the test set

        if best_score < score:
            best_score = score
            best_k = i
            best_p = j

print('with best_weight = {}:'.format(best_weight))
print('best_score = {}'.format(best_score))
print('best_k = {}'.format(best_k))
print('best_p = {}'.format(best_p))
with best_weight = distance:
best_score = 0.9888888888888889
best_k = 5
best_p = 1

The grid search implemented by hand above can also be done with sklearn's built-in tool.

sklearn's built-in grid search: Grid Search (GridSearchCV)

###############################################################
# first, define the parameters to search:
# a list of dicts,
# where each dict describes one grid, giving the candidate values for every parameter
# note: p is only meaningful when weights='distance'

param_grid = [
    {  # 10 combinations to try
        'weights': ['uniform'],  # candidate values for this parameter
        'n_neighbors': [i for i in range(1, 11)]  # other forms such as np.arange() also work
        # no p parameter here
    },
    {  # 50 combinations to try
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]

# 60 combinations to search in total (confirmed with ParameterGrid below)
[i for i in range(1, 11)]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
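
As a sanity check (not in the original notebook), sklearn's ParameterGrid can enumerate the same grid and confirm the count:

from sklearn.model_selection import ParameterGrid

len(ParameterGrid(param_grid))  # 60: 10 uniform combinations + 50 distance combinations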
knn_clf = KNeighborsClassifier()  # classifier with default parameters

from sklearn.model_selection import GridSearchCV  # the CV suffix: each candidate is scored with cross-validation
grid_search = GridSearchCV(knn_clf, param_grid, cv=5)  # configure the grid search with 5-fold CV

grid_search.fit(X_train, y_train)  # fit the whole grid; fairly slow, about 4 minutes on the original machine
GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=None,
             param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'weights': ['uniform']},
                         {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

Attributes like the ones below that end in an underscore follow the sklearn convention: they are not user-supplied parameters but values computed from the user-supplied parameters during fitting.

# the best estimator found by the grid search
grid_search.best_estimator_
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
# the best values among the parameters we searched over
grid_search.best_params_
{'n_neighbors': 1, 'weights': 'uniform'}

The best parameters found here differ from those of the manual grid search because sklearn scores each candidate with CV (cross-validation) rather than with the test set.

Although the test accuracy comes out slightly lower, the choice is less overfit to this particular test split, which is better.
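
The cross-validated score that the search actually optimized is exposed as best_score_ (shown here without output, since it was not part of the original run):

grid_search.best_score_  # mean cross-validated accuracy of best_params_ on the training folds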

grid_search.score(X_test, y_test)  # test accuracy of the model refit with the best parameters
0.9833333333333333
# assign the refit best model to a variable; it can be used like any other trained model
knn_clf = grid_search.best_estimator_
knn_clf.predict(X_test)  # predict directly
knn_clf.score(X_test, y_test)  # accuracy
0.9833333333333333

In practice, manual tuning and grid search are combined: narrow a large parameter range by hand first, then grid-search the small remaining range. The aim is always to train fewer models and tune faster (a coarse-to-fine sketch follows).
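
A hedged sketch of that coarse-to-fine workflow (the ranges and variable names are chosen for illustration, not taken from the original):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# coarse pass: scan k on a wide, sparse range
coarse_grid = [{'weights': ['distance'], 'n_neighbors': [1, 5, 10, 20, 40]}]
coarse = GridSearchCV(KNeighborsClassifier(), coarse_grid, cv=5)
coarse.fit(X_train, y_train)
k0 = coarse.best_params_['n_neighbors']

# fine pass: grid-search only the narrow window around the coarse winner
fine_grid = [{'weights': ['distance'],
              'n_neighbors': list(range(max(1, k0 - 3), k0 + 4))}]
fine = GridSearchCV(KNeighborsClassifier(), fine_grid, cv=5)
fine.fit(X_train, y_train)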

Grid search takes a long time; the verbose parameter makes it print progress information during the run, and larger values print more detail.

Note: the run takes a long time, and with parallel execution on all CPU cores (n_jobs=-1) the system can become unresponsive; using only some of the cores is enough.

  • Quantitative trading: "miners"
  • Programming: "code farmers"
  • Machine learning: "package callers" and "parameter tweakers"
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=2, cv=5, verbose=2)
grid_search.fit(X_train, y_train)

# 60 parameter combinations to search in total: 10 uniform + 50 distance
Fitting 5 folds for each of 60 candidates, totalling 300 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  53 tasks      | elapsed:    3.2s
[Parallel(n_jobs=2)]: Done 210 tasks      | elapsed:   34.7s
[Parallel(n_jobs=2)]: Done 300 out of 300 | elapsed:   55.9s finished





GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=1, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=2,
             param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'weights': ['uniform']},
                         {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=2)
grid_search.score(X_test, y_test)  # test accuracy of the grid-searched model
0.9833333333333333
