Machine Learning: k-Nearest Neighbors

Overview

3. The k-Nearest Neighbors Algorithm

3.1 Implementing the kNN Algorithm

The k-nearest neighbors algorithm (k-Nearest Neighbors, kNN) is one of the simplest and most fundamental algorithms. Its basic idea: split the data into a training set and a test set, kept disjoint so that the accuracy estimate is meaningful; for each sample in the test set, compute its distance to every sample in the training set, take the k training samples with the smallest distances, and tally their labels; the most frequent label is the prediction for that test sample. Predict every test sample this way and compute the accuracy.

3.1.1 Implementing kNN by Hand

The distance used is the Euclidean distance. In n-dimensional space, the distance between points x and y is:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

#Listing 3-1

import numpy as np

from sklearn import datasets

from math import sqrt

from collections import Counter

 

iris = datasets.load_iris()

'''

data: iris feature matrix, shape (150, 4)

target: iris label vector, shape (150,)

target_names: names of the labels, ['setosa' 'versicolor' 'virginica']

DESCR: description document for the dataset

feature_names: names of the features,

    ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

filename: location of the dataset file,

    D:\Program Files (x86)\Python\lib\site-packages\sklearn\datasets\data\iris.csv

'''

print(iris.keys())

 

np.random.seed(1)

random_index = np.random.permutation(len(iris.data))

proportion = 0.2

 

train_size = int((1-proportion)*len(iris.data))

train_index = random_index[:train_size]

test_index = random_index[train_size:]

 

#training set

#or X_train = iris.data[train_index]: omit the column slice and fancy-index the rows only

X_train = iris.data[train_index,:]

y_train = iris.target[train_index]

#test set

X_test = iris.data[test_index,:]

y_test = iris.target[test_index]

 

print('X_train:\n',X_train.shape)

print('y_train:\n',y_train.shape)

 

def distance(k,X_train,y_train,x):

    #compute the distance from x to every point in the training set X_train,

    # argsort the distances and take the first k indices to pull k labels out of y_train,

    # then count those labels and return the label with the highest count

    distances = []

    for x_train in X_train:

        dist = sqrt(sum((x_train - x)**2))

        distances.append(dist)

    distances_arg = np.argsort(distances)

    topK_y = [y_train[top_arg] for top_arg in distances_arg[:k]]

    top_counter = Counter(topK_y)

    top_most = top_counter.most_common(1)

    return top_most[0][0]

 

y_predict = []

for x_test in X_test:

    #compute the distances for each element of X_test and append the returned label to y_predict

    y_dis = distance(8,X_train,y_train,x_test)

    y_predict.append(y_dis)

accuracy_rate = sum(y_predict==y_test)/len(y_test)

print('accuracy: ',accuracy_rate)

Output:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

X_train:

 (120, 4)

y_train:

 (120,)

accuracy:  0.9666666666666667

3.1.2 Using the kNN Classifier Bundled with sklearn

Official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.fit

#Listing 3-2

import numpy as np

from sklearn import datasets

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

 

iris = datasets.load_iris()

'''
iris dataset: see the field description in Listing 3-1
'''

print(iris.keys())

 

#train_size is the fraction of the data used for training

#test_size is the fraction used for testing

#train_size + test_size = 1; prefer specifying test_size

#random_state is the random seed

#equivalent to X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,

#                                       train_size=0.8,random_state=123)

X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,

                                test_size=0.2,random_state=123)

 

print('X_train:\n',X_train.shape)

print('X_test:\n',X_test.shape)

print('y_train:\n',y_train.shape)

print('y_test:\n',y_test.shape)

 

iris_kNNClassifier = KNeighborsClassifier(n_neighbors=8)

#fit the kNN classifier on the training set to obtain the model

iris_kNNClassifier.fit(X_train,y_train)

 

#use the model to predict the test set

# y_predict = iris_kNNClassifier.predict(X_test)

# accuracy_rate = accuracy_score(y_test,y_predict)

accuracy_rate = iris_kNNClassifier.score(X_test,y_test)

print('accuracy: ',accuracy_rate)

Output:

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

X_train:

 (120, 4)

X_test:

 (30, 4)

y_train:

 (120,)

y_test:

 (30,)

accuracy:  0.9666666666666667

3.2 Hyperparameters

The same learning algorithm produces different models under different parameter settings, so parameter tuning is an important part of using any learning algorithm.

Hyperparameters are parameters that must be fixed before the algorithm runs; in kNN they are k, the distance weighting, and the Minkowski distance parameter p. Model parameters are learned during training; kNN has no model parameters.

3.2.1 The Hyperparameter k

Which value of k yields the highest accuracy?

#Listing 3-3

import numpy as np

from sklearn import datasets

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

 

iris = datasets.load_iris()

'''
iris dataset: see the field description in Listing 3-1
'''

X = iris.data

y = iris.target

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=123)

 

best_k = -1

best_score = 0.0

for k in range(1,11):

    iris_kNNClassifier = KNeighborsClassifier(n_neighbors=k)

    iris_kNNClassifier.fit(X_train,y_train)

    score = iris_kNNClassifier.score(X_test,y_test)

    if score > best_score:

        best_score = score

        best_k = k

 

print('k = %d gives the highest accuracy, score = %f'%(best_k,best_score))

Output:

k = 5 gives the highest accuracy, score = 0.977778

3.2.2 Distance Weights

Suppose the 4 training samples nearest a test sample are: one setosa at distance 1, two versicolor at distances 4 and 5, and one virginica at distance 3.

A plain vote would favor versicolor (two neighbors), but with inverse-distance weights setosa scores 1, versicolor scores 1/4 + 1/5 = 9/20, and virginica scores 1/3. Because setosa's weight is the largest, the prediction is setosa.
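
A minimal sketch of that vote (the labels and distances are the hypothetical ones from the example above):

from collections import defaultdict

#the 4 hypothetical nearest neighbors: (label, distance)
neighbors = [('setosa', 1), ('versicolor', 4), ('versicolor', 5), ('virginica', 3)]

weights = defaultdict(float)
for label, dist in neighbors:
    weights[label] += 1/dist   #inverse-distance weight

#setosa: 1.0, versicolor: 0.45, virginica: 0.33... -> prints 'setosa'
print(max(weights, key=weights.get))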

In the official documentation, https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier, the weights parameter of KNeighborsClassifier defaults to 'uniform', which ignores distance; setting weights='distance' enables distance weighting.

#Listing 3-4

import numpy as np

from sklearn import datasets

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

 

iris = datasets.load_iris()

'''
iris dataset: see the field description in Listing 3-1
'''

X = iris.data

y = iris.target

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=123)

 

weights = ['uniform','distance']

best_weight = ''

best_k = -1

best_score = 0.0

for weight in weights:

    for k in range(1,10):

        iris_kNNClassifier = KNeighborsClassifier(n_neighbors=k,weights=weight)

        iris_kNNClassifier.fit(X_train,y_train)

        score = iris_kNNClassifier.score(X_test,y_test)

        if score > best_score:

            best_score = score

            best_k = k

            best_weight = weight

 

print('best_weight:%s,best_k:%d,best_score:%f'%(best_weight,best_k,best_score))

Output:

best_weight:uniform,best_k:5,best_score:0.977778

Searching for the best parameters with hand-written for loops is tedious; sklearn.model_selection provides grid search instead. Official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.

#Listing 3-5

import numpy as np

from sklearn import datasets

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split,GridSearchCV

 

iris = datasets.load_iris()

'''
iris dataset: see the field description in Listing 3-1
'''

X = iris.data

y = iris.target

 

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=123)

 

print(X_train.shape)

print(y_train.shape)

kNN_Classifier = KNeighborsClassifier()  

 

#kNN_grid = [

#               {

#               'weights': ['uniform','distance'],

#               'n_neighbors': [i for i in range(1, 11)]

#               }

#           ]

kNN_grid = {

            'weights': ['uniform','distance'],

            'n_neighbors': [i for i in range(1, 11)]

            }

 

kNN_grid_search = GridSearchCV(kNN_Classifier,kNN_grid,n_jobs=-1,iid=True,cv=3,verbose=2)

#fit on the training set

kNN_grid_search.fit(X_train,y_train)

#the classifier with the best parameters

print('best_estimator_:',kNN_grid_search.best_estimator_)

#the highest accuracy observed during the search

print('best_score_:',kNN_grid_search.best_score_)

#the best parameter values found by the grid search

print('best_params_:',kNN_grid_search.best_params_)

 

#predict the test set

iris_kNN = kNN_grid_search.best_estimator_

y_predict = iris_kNN.predict(X_test)

score = iris_kNN.score(X_test,y_test)

print('accuracy:',score)

Output:

(105, 4)

(105,)

Fitting 3 folds for each of 20 candidates, totalling 60 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.

[CV] n_neighbors=1, weights=uniform ..................................

[CV] ................... n_neighbors=1, weights=uniform, total=   0.0s

[CV] n_neighbors=2, weights=uniform ..................................

[CV] n_neighbors=1, weights=uniform ..................................

[CV] n_neighbors=1, weights=uniform ..................................

[CV] n_neighbors=1, weights=distance .................................

[CV] ................... n_neighbors=2, weights=uniform, total=   0.0s

[CV] .................. n_neighbors=1, weights=distance, total=   0.0s

[CV] ................... n_neighbors=1, weights=uniform, total=   0.0s

[CV] ................... n_neighbors=1, weights=uniform, total=   0.0s

[CV] n_neighbors=2, weights=distance .................................

[CV] n_neighbors=2, weights=distance .................................

[CV] n_neighbors=2, weights=distance .................................

[CV] .................. n_neighbors=2, weights=distance, total=   0.0s

[CV] n_neighbors=1, weights=distance .................................

[CV] .................. n_neighbors=2, weights=distance, total=   0.0s

[CV] n_neighbors=3, weights=uniform ..................................

[CV] .................. n_neighbors=2, weights=distance, total=   0.0s

[CV] n_neighbors=3, weights=uniform ..................................

[CV] n_neighbors=1, weights=distance .................................

[CV] .................. n_neighbors=1, weights=distance, total=   0.0s

[CV] .................. n_neighbors=1, weights=distance, total=   0.0s

[CV] ................... n_neighbors=3, weights=uniform, total=   0.0s

[CV] n_neighbors=3, weights=uniform ..................................

[CV] n_neighbors=3, weights=distance .................................

[CV] .................. n_neighbors=3, weights=distance, total=   0.0s

[CV] n_neighbors=3, weights=distance .................................

[CV] ................... n_neighbors=3, weights=uniform, total=   0.0s

[CV] .................. n_neighbors=3, weights=distance, total=   0.0s

[CV] n_neighbors=2, weights=uniform ..................................

[CV] n_neighbors=3, weights=distance .................................

[CV] n_neighbors=4, weights=uniform ..................................

[CV] n_neighbors=4, weights=uniform ..................................

[CV] n_neighbors=4, weights=uniform ..................................

[CV] n_neighbors=4, weights=distance .................................

[CV] .................. n_neighbors=3, weights=distance, total=   0.0s

[CV] ................... n_neighbors=3, weights=uniform, total=   0.0s

[CV] n_neighbors=4, weights=distance .................................

[CV] n_neighbors=4, weights=distance .................................

[CV] .................. n_neighbors=4, weights=distance, total=   0.0s

[CV] n_neighbors=5, weights=uniform ..................................

[CV] .................. n_neighbors=4, weights=distance, total=   0.0s

[CV] n_neighbors=5, weights=uniform ..................................

[CV] .................. n_neighbors=4, weights=distance, total=   0.0s

[CV] ................... n_neighbors=4, weights=uniform, total=   0.0s

[CV] n_neighbors=5, weights=uniform ..................................

[CV] ................... n_neighbors=4, weights=uniform, total=   0.0s

[CV] ................... n_neighbors=2, weights=uniform, total=   0.0s

[CV] ................... n_neighbors=5, weights=uniform, total=   0.0s

[CV] n_neighbors=5, weights=distance .................................

[CV] ................... n_neighbors=5, weights=uniform, total=   0.0s

[CV] n_neighbors=5, weights=distance .................................

[CV] n_neighbors=5, weights=distance .................................

[CV] ................... n_neighbors=4, weights=uniform, total=   0.0s

[CV] n_neighbors=6, weights=uniform ..................................

[CV] .................. n_neighbors=5, weights=distance, total=   0.0s

[CV] .................. n_neighbors=5, weights=distance, total=   0.0s

[CV] n_neighbors=6, weights=uniform ..................................

[CV] .................. n_neighbors=5, weights=distance, total=   0.0s

[CV] n_neighbors=6, weights=uniform ..................................

[CV] n_neighbors=6, weights=distance .................................

[CV] ................... n_neighbors=5, weights=uniform, total=   0.0s

[CV] n_neighbors=6, weights=distance .................................

[CV] ................... n_neighbors=6, weights=uniform, total=   0.0s

[CV] .................. n_neighbors=6, weights=distance, total=   0.0s

[CV] n_neighbors=6, weights=distance .................................

[CV] ................... n_neighbors=6, weights=uniform, total=   0.0s

[CV] ................... n_neighbors=6, weights=uniform, total=   0.0s

[CV] n_neighbors=7, weights=uniform ..................................

[CV] .................. n_neighbors=6, weights=distance, total=   0.0s

[CV] .................. n_neighbors=6, weights=distance, total=   0.0s

[CV] n_neighbors=7, weights=uniform ..................................

[CV] n_neighbors=7, weights=uniform ..................................

[CV] ................... n_neighbors=7, weights=uniform, total=   0.0s

[CV] n_neighbors=7, weights=distance .................................

[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    1.2s

[CV] ................... n_neighbors=7, weights=uniform, total=   0.0s

[CV] .................. n_neighbors=7, weights=distance, total=   0.0s

[CV] n_neighbors=7, weights=distance .................................

[CV] ................... n_neighbors=7, weights=uniform, total=   0.0s

[CV] n_neighbors=7, weights=distance .................................

[CV] .................. n_neighbors=7, weights=distance, total=   0.0s

[CV] .................. n_neighbors=7, weights=distance, total=   0.0s

[CV] n_neighbors=8, weights=uniform ..................................

[CV] n_neighbors=8, weights=uniform ..................................

[CV] n_neighbors=8, weights=distance .................................

[CV] ................... n_neighbors=8, weights=uniform, total=   0.0s

[CV] n_neighbors=9, weights=uniform ..................................

[CV] n_neighbors=2, weights=uniform ..................................

[CV] ................... n_neighbors=8, weights=uniform, total=   0.0s

[CV] n_neighbors=9, weights=uniform ..................................

[CV] n_neighbors=9, weights=distance .................................

[CV] n_neighbors=8, weights=uniform ..................................

[CV] n_neighbors=8, weights=distance .................................

[CV] .................. n_neighbors=8, weights=distance, total=   0.0s

[CV] n_neighbors=8, weights=distance .................................

[CV] .................. n_neighbors=8, weights=distance, total=   0.0s

[CV] ................... n_neighbors=9, weights=uniform, total=   0.0s

[CV] .................. n_neighbors=9, weights=distance, total=   0.0s

[CV] ................... n_neighbors=8, weights=uniform, total=   0.0s

[CV] ................... n_neighbors=9, weights=uniform, total=   0.0s

[CV] n_neighbors=9, weights=distance .................................

[CV] n_neighbors=9, weights=distance .................................

[CV] n_neighbors=9, weights=uniform ..................................

[CV] ................... n_neighbors=2, weights=uniform, total=   0.0s

[CV] .................. n_neighbors=8, weights=distance, total=   0.0s

[CV] .................. n_neighbors=9, weights=distance, total=   0.0s

[CV] .................. n_neighbors=9, weights=distance, total=   0.0s

[CV] n_neighbors=10, weights=uniform .................................

[CV] n_neighbors=10, weights=uniform .................................

[CV] n_neighbors=10, weights=distance ................................

[CV] ................... n_neighbors=9, weights=uniform, total=   0.0s

[Parallel(n_jobs=-1)]: Done  45 out of  60 | elapsed:    1.2s remaining:    0.3s

[CV] ................. n_neighbors=10, weights=distance, total=   0.0s

[CV] n_neighbors=10, weights=distance ................................

[CV] .................. n_neighbors=10, weights=uniform, total=   0.0s

[CV] n_neighbors=10, weights=uniform .................................

[CV] .................. n_neighbors=10, weights=uniform, total=   0.0s

[CV] n_neighbors=10, weights=distance ................................

[CV] ................. n_neighbors=10, weights=distance, total=   0.0s

[CV] ................. n_neighbors=10, weights=distance, total=   0.0s

[CV] .................. n_neighbors=10, weights=uniform, total=   0.0s

[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:    1.2s finished

best_estimator_: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',

           metric_params=None, n_jobs=None, n_neighbors=6, p=2,

           weights='distance')

best_score_: 0.9714285714285714

best_params_: {'n_neighbors': 6, 'weights': 'distance'}

accuracy: 0.9777777777777777

Why can best_score_ be obtained from the training set alone?

Because GridSearchCV uses cross-validation: k-fold cross-validation splits the samples randomly and evenly into k folds, trains on k-1 folds and tests on the remaining fold in turn, and takes the mean of the k accuracies as the final accuracy.
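
A minimal sketch of that procedure with sklearn's cross_val_score, reusing X_train, y_train, and KNeighborsClassifier from Listing 3-5 (cv=3 matches the grid search above):

from sklearn.model_selection import cross_val_score

#three per-fold accuracies, computed on the training set only
scores = cross_val_score(KNeighborsClassifier(n_neighbors=6, weights='distance'),
                         X_train, y_train, cv=3)
print(scores.mean())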

Function signature: GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', error_score='raise-deprecating', return_train_score='warn')

estimator: the classifier to tune, e.g. estimator=RandomForestClassifier(min_samples_split=100,min_samples_leaf=20,max_depth=8,max_features='sqrt',random_state=10), constructed with every parameter except those being searched. The estimator must provide a scoring parameter or a score method.

param_grid: a dict, or a list of dicts, giving the candidate values of the parameters to optimize.

scoring: the evaluation metric. Defaults to None, which falls back to the estimator's score method; options such as scoring='roc_auc' depend on the chosen model.

cv: the cross-validation setting. Defaults to None, meaning 3-fold cross-validation.

refit: defaults to True; after the search ends, the estimator is refit on the whole dataset with the best parameters.

iid: defaults to True, meaning the samples are assumed identically distributed across folds; the error estimate is then the sum over all samples rather than the average over folds.

verbose: verbosity (int); 0 prints nothing during training, 1 prints occasionally, >1 prints a line for every sub-model.

n_jobs: the number of parallel jobs (int); defaults to 1 (one CPU core); -1 uses all cores.
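
Besides best_params_, the full per-candidate results live in the cv_results_ attribute. A minimal sketch of inspecting them (this assumes pandas is installed, which the listings above do not otherwise use; kNN_grid_search is the fitted object from Listing 3-5):

import pandas as pd

#cv_results_ is a dict of arrays with one entry per parameter combination
results = pd.DataFrame(kNN_grid_search.cv_results_)
print(results[['params','mean_test_score','rank_test_score']])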

Reference: https://blog.csdn.net/u012969412/article/details/72973055

3.2.3 The Minkowski Distance Parameter p

In n-dimensional space, consider the distance between points a = (a_1, a_2, ..., a_n) and b = (b_1, b_2, ..., b_n).

Manhattan distance: d(a, b) = \sum_{i=1}^{n} |a_i - b_i|

Euclidean distance: d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}

Minkowski distance: d(a, b) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p}

 

In KNeighborsClassifier, the parameter p is precisely the Minkowski p, so p is a hyperparameter too. Note that p only takes effect when distances are weighted, i.e. when weights='distance'.
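
Before the listing, a quick numeric check, with hypothetical points a and b, that Manhattan (p=1) and Euclidean (p=2) drop out of the Minkowski formula as special cases:

import numpy as np

a = np.array([1.0, 2.0, 3.0])   #hypothetical points, for illustration only
b = np.array([4.0, 0.0, 3.0])

def minkowski(a, b, p):
    #(sum_i |a_i - b_i|^p)^(1/p)
    return (np.abs(a - b)**p).sum()**(1/p)

print(minkowski(a, b, 1))   #5.0, the Manhattan distance
print(minkowski(a, b, 2))   #3.6055..., the Euclidean distance sqrt(13)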

#Listing 3-6

import numpy as np

from sklearn import datasets

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split,GridSearchCV

 

iris = datasets.load_iris()

'''
iris dataset: see the field description in Listing 3-1
'''

X = iris.data

y = iris.target

 

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=123)

 

kNN_Classifier = KNeighborsClassifier()  

 

kNN_grid = [

              {

              'weights': ['uniform'],

              'n_neighbors': [i for i in range(1, 11)]

              },

              {

              'weights': ['distance'],

              'n_neighbors': [i for i in range(1, 11)],

              'p': [i for i in range(1, 7)]

              }

          ]

 

kNN_grid_search = GridSearchCV(kNN_Classifier,kNN_grid,n_jobs=-1,iid=True,cv=3,verbose=0)

#fit on the training set

kNN_grid_search.fit(X_train,y_train)

#the classifier with the best parameters

print('best_estimator_:',kNN_grid_search.best_estimator_)

#the highest accuracy observed during the search

print('best_score_:',kNN_grid_search.best_score_)

#the best parameter values found by the grid search

print('best_params_:',kNN_grid_search.best_params_)

 

#predict the test set

iris_kNN = kNN_grid_search.best_estimator_

y_predict = iris_kNN.predict(X_test)

score = iris_kNN.score(X_test,y_test)

print('accuracy:',score)

Output:

best_estimator_: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',

           metric_params=None, n_jobs=None, n_neighbors=6, p=2,

           weights='distance')

best_score_: 0.9714285714285714

best_params_: {'n_neighbors': 6, 'p': 2, 'weights': 'distance'}

accuracy: 0.9777777777777777

3.3 Data Normalization

In machine learning, the features of a feature vector come in different units and on different scales, and this distorts the results of any analysis. To remove the effect of differing scales, the data must be standardized, most typically by normalization.

kNN computes distances between samples. Take the Euclidean distance: it sums the squared differences of the feature values, so features on different scales must be normalized first.

The goal of normalization is to map all data onto a common scale; the usual methods are min-max normalization and mean-variance normalization (standardization).

Min-max normalization maps all data into [0, 1]:

x_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}}

Min-max normalization suits data with clear bounds, such as exam scores; it does not suit unbounded data, such as house prices.

Mean-variance normalization maps all data to a distribution with mean 0 and variance 1:

x_{scale} = \frac{x - x_{mean}}{s}

where x_{mean} is the mean and s is the standard deviation. Mean-variance normalization suits unbounded data and data with extreme values, and it also works well on bounded data.

3.3.1 Implementing Normalization by Hand

#Listing 3-7: min-max normalization

import numpy as np

 

X = np.random.randint(0,100,(20,2))

X = np.array(X,dtype=float)

 

col = X.shape[1]

i = 0

while i < col:

    #min-max normalization

    X[:,i] = (X[:,i]-np.min(X[:,i]))/(np.max(X[:,i])-np.min(X[:,i]))

    i += 1

print(X)

Output:

[[0.37209302 0.        ]

 [1.         0.38461538]

 [0.48837209 1.        ]

 [0.1744186  0.59340659]

 [0.61627907 0.96703297]

 [0.55813953 0.49450549]

 [0.77906977 0.93406593]

 [0.97674419 0.52747253]

 [0.43023256 0.87912088]

 [0.18604651 0.1978022 ]

 [0.45348837 0.8021978 ]

 [0.89534884 0.36263736]

 [0.61627907 0.56043956]

 [0.05813953 0.35164835]

 [1.         0.87912088]

 [0.8255814  0.96703297]

 [0.68604651 0.9010989 ]

 [0.         0.68131868]

 [0.24418605 0.7032967 ]

 [0.90697674 0.28571429]]

#Listing 3-8: mean-variance normalization

import numpy as np

 

X = np.random.randint(0,100,(20,2))

X = np.array(X,dtype=float)

 

col = X.shape[1]

i = 0

while i < col:

    #mean-variance normalization

    X[:,i] = (X[:,i]-np.mean(X[:,i]))/np.std(X[:,i])

    print('feature mean:',np.mean(X[:,i]))

    print('feature std:',np.std(X[:,i]))

    i += 1

print(X)

Output:

feature mean: -2.2204460492503132e-17

feature std: 1.0

feature mean: 6.938893903907228e-17

feature std: 1.0

[[-0.83151186  2.08977336]

 [-0.90197897  0.91751646]

 [ 0.01409342  0.06086719]

 [-1.50094937 -0.52526126]

 [-0.76104475  1.05277687]

 [-0.76104475  1.27821089]

 [-0.05637369  0.01578038]

 [-1.21908095  0.51173522]

 [ 1.388202   -0.11948003]

 [ 1.63483688 -0.20965364]

 [ 1.03586647  0.01578038]

 [ 0.29596185 -0.39000085]

 [ 1.24726779 -1.51717095]

 [ 0.47212962 -0.34491405]

 [ 0.47212962  0.46664842]

 [ 0.26072829 -0.48017446]

 [ 1.8462382  -1.87786538]

 [-1.00767963  1.27821089]

 [-0.90197897 -1.74260497]

 [-0.7258112  -0.48017446]]

3.3.2 Using sklearn's preprocessing Module

#Listing 3-9

import numpy as np

from sklearn import datasets

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split,GridSearchCV

from sklearn.preprocessing import StandardScaler

 

iris = datasets.load_iris()

'''
iris dataset: see the field description in Listing 3-1
'''

X = iris.data

y = iris.target

 

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=123)

 

stdScale = StandardScaler()

#fit the StandardScaler on the training set to obtain the per-feature means and standard deviations

stdScale.fit(X_train)

print('training set means:',stdScale.mean_)

print('training set standard deviations:',stdScale.scale_)

#transform both the training set and the test set, using the training set's means and standard deviations for both

X_train_std = stdScale.transform(X_train)

X_test_std = stdScale.transform(X_test)

 

kNN_Classifier = KNeighborsClassifier()  

 

kNN_grid = [

              {

              'weights': ['uniform'],

              'n_neighbors': [i for i in range(1, 11)]

              },

              {

              'weights': ['distance'],

              'n_neighbors': [i for i in range(1, 11)],

              'p': [i for i in range(1, 7)]

              }

          ]

 

kNN_grid_search = GridSearchCV(kNN_Classifier,kNN_grid,n_jobs=-1,iid=True,cv=3,verbose=0)

#fit on the training set

kNN_grid_search.fit(X_train_std,y_train)

#the classifier with the best parameters

print('best_estimator_:',kNN_grid_search.best_estimator_)

#the highest accuracy observed during the search

print('best_score_:',kNN_grid_search.best_score_)

#the best parameter values found by the grid search

print('best_params_:',kNN_grid_search.best_params_)

 

#predict the test set

iris_kNN = kNN_grid_search.best_estimator_

y_predict = iris_kNN.predict(X_test_std)

score = iris_kNN.score(X_test_std,y_test)

print('accuracy:',score)

Output:

training set means: [5.86952381 3.05047619 3.83714286 1.23238095]

training set standard deviations: [0.792239   0.40710916 1.70608196 0.73568824]

best_estimator_: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',

           metric_params=None, n_jobs=None, n_neighbors=5, p=2,

           weights='uniform')

best_score_: 0.9714285714285714

best_params_: {'n_neighbors': 5, 'weights': 'uniform'}

accuracy: 0.9333333333333333

Listing 3-9 transforms the data with the StandardScaler class from sklearn's preprocessing module.
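
The min-max normalization of Listing 3-7 has an analogous class in the same module, MinMaxScaler; a minimal sketch (fit on the training set only, as with StandardScaler):

from sklearn.preprocessing import MinMaxScaler

minMaxScale = MinMaxScaler()            #maps each feature into [0, 1]
minMaxScale.fit(X_train)                #learns per-feature min and max from the training set
X_train_mm = minMaxScale.transform(X_train)
X_test_mm = minMaxScale.transform(X_test)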

3.4 Assessing the kNN Algorithm

Advantages: kNN handles multi-class tasks naturally, is simple to implement, and works well. Besides classification, kNN can also solve regression tasks.
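
sklearn exposes the regression variant as KNeighborsRegressor, which predicts the (optionally distance-weighted) average of the k nearest targets; a minimal sketch on hypothetical data:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

#hypothetical 1-D regression data, for illustration only
X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = np.sin(X).ravel()

reg = KNeighborsRegressor(n_neighbors=3, weights='distance')
reg.fit(X, y)
print(reg.predict([[2.3]]))   #distance-weighted average of the 3 nearest targets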

Disadvantages: it is inefficient. With tens of thousands of features and millions of samples, plain kNN is clearly impractical.
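
One common mitigation, at least for moderate sizes, is to replace brute-force search with a spatial index; KNeighborsClassifier's algorithm parameter accepts 'kd_tree' and 'ball_tree' for exactly this. A minimal sketch:

from sklearn.neighbors import KNeighborsClassifier

#a kd-tree pays an indexing cost up front for much faster neighbor queries
#in low dimensions; the default algorithm='auto' picks a structure automatically
clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')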
