sklearn Decision Tree: Steps and Results for a Multi-class Classification Problem


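I'm 靓丽路人, a blogger on 靠谱客. This article, which I collected during recent development work, walks through the steps and results of a multi-class classification problem with sklearn's decision tree. I found it useful and am sharing it here as a reference.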

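Overview

Table of contents

Steps
- Building the model
  - Feature selection
  - Pre-pruning
- Preprocessing
- Training
- Testing and evaluating the model
- Visualization
- Training decision trees with different criterion and max_depth values
Results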

Steps

Building the model

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

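Parameter details
The signature above comes from an older scikit-learn release; min_impurity_split and presort have since been removed.
The most important parameters:
- criterion: {'gini', 'entropy'}. Selects the impurity measure used to score splits, either Gini impurity or entropy (information gain). This is what drives feature selection.
- max_depth: an integer, or None for no limit. Used for pre-pruning.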

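Feature selection

Three algorithms are commonly used to choose the splitting feature:
- ID3 (uses information gain)
- C4.5 (uses information gain ratio)
- CART (uses the Gini index)
In the model this is set through the criterion parameter. Note that scikit-learn always builds a binary tree with an optimized version of CART: criterion='gini' uses the Gini index as in classic CART, while criterion='entropy' scores splits by information gain, the measure ID3 uses, without switching to the ID3 algorithm itself.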
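
To make the two criteria concrete, here is a minimal pure-Python sketch of Gini impurity, entropy, and the impurity decrease of a candidate split. The function names and the toy labels are illustrative assumptions, not part of scikit-learn, and the library's own implementation is more elaborate (sample weights, optimized search).

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    return -sum(p * math.log2(p) for p in class_proportions(labels) if p > 0)


def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical toy node with three balanced classes, split into two children.
parent = ["A", "A", "B", "B", "C", "C"]
left = ["A", "A"]
right = ["B", "B", "C", "C"]

print("gini(parent)      =", gini(parent))
print("entropy(parent)   =", entropy(parent))
print("gini decrease     =", impurity_decrease(parent, left, right, gini))
print("information gain  =", impurity_decrease(parent, left, right, entropy))
```

With entropy as the impurity, the decrease computed here is exactly the information gain of the split.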

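Pre-pruning

Pre-pruning is done by setting max_depth, the maximum depth the tree is allowed to reach. Too large a value tends to overfit; too small a value tends to underfit.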
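
The effect of max_depth can be checked directly on a fitted tree. The sketch below assumes scikit-learn is installed and uses the bundled iris dataset as a stand-in for the article's data (an assumption; the article's CSV is not available here). It compares an unrestricted tree with one capped at depth 2.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: the iris dataset, not the article's classify.csv.
X_demo, y_demo = load_iris(return_X_y=True)

depths = {}
for max_depth in (None, 2):
    clf = DecisionTreeClassifier(random_state=1, max_depth=max_depth)
    clf.fit(X_demo, y_demo)
    # get_depth() and get_n_leaves() report the size of the fitted tree.
    depths[max_depth] = (clf.get_depth(), clf.get_n_leaves())
    print(f"max_depth={max_depth}: depth={clf.get_depth()}, leaves={clf.get_n_leaves()}")
```

The capped tree can never grow deeper than the limit, which is what keeps it from memorizing the training set.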

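Preprocessing

Mainly check for missing values and outliers.
Categorical variables must also be converted to numeric ones (they cannot be left as strings).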

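import pandas as pd

# Load the data
path = "../Data/classify.csv"
rawdata = pd.read_csv(path)
X = rawdata.iloc[:, :13]     # features: the first 13 columns (indices 0-12)
Y = rawdata.iloc[:, 14]      # labels: column index 14 (index 13 is not used)
Y = pd.Categorical(Y).codes  # encode classes as integers: {"A": 0, "B": 1, "C": 2}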
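
For string labels without missing values, pd.Categorical(...).codes amounts to sorting the distinct labels and numbering them from 0. The following pure-Python sketch mimics that behavior for illustration; categorical_codes is a hypothetical helper, not a pandas function.

```python
def categorical_codes(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    categories = sorted(set(labels))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[label] for label in labels], mapping


codes, mapping = categorical_codes(["B", "A", "C", "A"])
print(codes)    # [1, 0, 2, 0]
print(mapping)  # {'A': 0, 'B': 1, 'C': 2}
```

Keeping the mapping around makes it possible to translate predicted codes back to the original class names.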

Training

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Split into training and test sets (30% held out for testing)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)

# Fit the tree
tree = DecisionTreeClassifier(random_state=1, criterion="gini", max_depth=5)
tree.fit(x_train, y_train)

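Testing and evaluating the model

Classification models are usually evaluated with accuracy and recall. Here the model's built-in score method is used, which returns the mean accuracy.

acu_test = tree.score(x_test, y_test)     # accuracy on the test set
acu_train = tree.score(x_train, y_train)  # accuracy on the training set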
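
Since score only reports accuracy, per-class recall has to be computed separately. Below is a small pure-Python sketch of both metrics on made-up labels (the y_true and y_pred values are assumptions for illustration); in practice sklearn.metrics.recall_score or classification_report would do the same job on tree.predict(x_test).

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_per_class(y_true, y_pred):
    """For each class, the fraction of its true samples that were predicted correctly."""
    recalls = {}
    for cls in sorted(set(y_true)):
        predictions_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls[cls] = sum(p == cls for p in predictions_for_cls) / len(predictions_for_cls)
    return recalls


# Hypothetical labels for a three-class problem (codes 0, 1, 2).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print("accuracy:", accuracy(y_true, y_pred))
print("recall per class:", recall_per_class(y_true, y_pred))
```

A model can have decent overall accuracy while one class has poor recall, which is why both numbers are worth looking at in a multi-class setting.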

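Visualization

A decision tree can be visualized, here with pydotplus and graphviz.

import pydotplus

# out_file=None makes export_graphviz return the DOT source as a string
dot_data = export_graphviz(
    tree,
    out_file=None,
    feature_names=list(X.columns),
    class_names=["A", "B", "C"],
    filled=True,
    rounded=True,
)
graph = pydotplus.graph_from_dot_data(dot_data)
# Write to a separate output file; "tree.pdf" is a placeholder name.
# Reusing `path` here would overwrite the CSV data file.
graph.write_pdf("tree.pdf")

Note that pydotplus and graphviz must be installed beforehand; graphviz also requires installing the Graphviz software itself and configuring the environment so its executables can be found.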
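
When Graphviz is not available, the learned rules can still be inspected as plain text. This is a different technique from the PDF rendering above: a minimal sketch using sklearn.tree.export_text, with the iris dataset standing in for the article's data (an assumption).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: the iris dataset, not the article's classify.csv.
iris = load_iris()
clf = DecisionTreeClassifier(random_state=1, max_depth=2)
clf.fit(iris.data, iris.target)

# export_text returns the decision rules as an indented string.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The text form needs no external software, at the cost of being harder to read for deep trees.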

Training decision trees with different criterion and max_depth values

Code to be improved

import matplotlib.pyplot as plt

def tree_mdl(x_train, x_test, y_train, y_test, criterion, max_depth):
    """Fit one tree, save its diagram as a PDF, and return train/test accuracy."""
    tree = DecisionTreeClassifier(random_state=1, criterion=criterion, max_depth=max_depth)
    tree.fit(x_train, y_train)
    acu_train = tree.score(x_train, y_train)
    acu_test = tree.score(x_test, y_test)
    # out_file=None makes export_graphviz return the DOT source as a string
    dot_data = export_graphviz(
        tree,
        out_file=None,
        feature_names=list(X.columns),
        class_names=["A", "B", "C"],
        filled=True,
        rounded=True,
    )
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_pdf(f"C:/Users/apple/Desktop/treegraph/tree-{criterion}max{max_depth}.pdf")
    return acu_train, acu_test

def run_tree(test_size):
    """Compare both criteria over max_depth 1..10 on a single train/test split."""
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=test_size)
    result = {"criterion": [],
              "max_depth": [],
              "acu_train": [],
              "acu_test": []
              }
    depths = range(1, 11)
    for criterion in ["gini", "entropy"]:
        acu_tr_lis = []
        acu_te_lis = []
        for max_depth in depths:
            acu_train, acu_test = tree_mdl(x_train, x_test, y_train, y_test, criterion, max_depth)
            acu_tr_lis.append(acu_train)
            acu_te_lis.append(acu_test)
            result["criterion"].append(criterion)
            result["max_depth"].append(max_depth)
            result["acu_train"].append(acu_train)
            result["acu_test"].append(acu_test)
        # One accuracy-vs-depth plot per criterion
        plt.plot(depths, acu_tr_lis, "o-", label="acu-train")
        plt.plot(depths, acu_te_lis, "*-", label="acu-test")
        plt.xlabel("max_depth")
        plt.ylabel("accuracy")
        plt.title("Criterion = " + str(criterion))
        plt.legend()
        plt.show()

    return pd.DataFrame(result)
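
One possible improvement, offered as a sketch and not as the author's method, is to replace the manual loop over a single train/test split with a cross-validated grid search. The example assumes scikit-learn is installed and again uses the iris dataset as a stand-in for the article's data; the parameter grid mirrors the criterion and max_depth values tried above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: the iris dataset, not the article's classify.csv.
X_demo, y_demo = load_iris(return_X_y=True)

# Same search space as the manual loop: 2 criteria x 10 depths.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(1, 11)),
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    cv=5,                # 5-fold cross-validation instead of one split
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", search.best_score_)
```

Averaging over folds makes the comparison between settings less sensitive to how one particular split happened to fall, and search.cv_results_ holds the per-setting scores if a table like the DataFrame above is needed.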