决策树（Decision Tree）

122 阅读 0 评论 81 点赞

我是靠谱客的博主忧心鸵鸟，最近开发中收集的这篇文章主要介绍决策树（Decision Tree），觉得挺不错的，现在分享给大家，希望可以做个参考。

概述

决策树（Decision Tree）算法应用案例：

1. 什么是决策树/判定树？

决策树是一个类似于流程图的树状结构。其中，每个内部结点表示在一个属性上的测试，每个分支代表一个属性输出，而每一个树叶结点代表类或类分布。树的最顶层是根结点。

2. 熵（Entropy）的概念：

抽象的“信息”该如何来度量？
1948年，香农提出了“信息熵”的概念。
一条信息的信息量大小和它的不确定性有着直接的关系。要弄清楚一件非常非常不确定的事情，或者是我们一无所知的事情，需要了解大量信息 ==> 信息量的度量就相当于不确定性的多少。

用比特（bit）来衡量信息的多少：
$sum_{x}^{ }P(x)log_{2}[P(x)]$
变量的不确定性越大，相应的熵也就越大。

3. 决策树归纳算法（ID3）

1970-1980年，由 J.Ross.Quinlan 提出 ID3 算法。

选择属性判断结点

信息获取量（Information Gain）:

$Gain(A) = Info(D) - Info_{A}(D)$
表示通过 A 来作为结点分类获取了多少信息。

数据集如下：

表 1 训练样本集 $$| RID | age | income | student | credit_rating | Class_buys_computer | |-----|-------------|--------|---------|---------------|---------------------| | 1 | youth | high | no | fair | no | | 2 | youth | high | no | excellent | no | | 3 | middle_aged | high | no | fair | yes | | 4 | senior | medium | no | fair | yes | | 5 | senior | low | yes | fair | yes | | 6 | senior | low | yes | excellent | no | | 7 | middle_aged | low | yes | excellent | yes | | 8 | youth | medium | no | fair | no | | 9 | youth | low | yes | fair | yes | | 10 | senior | medium | yes | fair | yes | | 11 | youth | medium | yes | excellent | yes | | 12 | middle_aged | medium | no | excellent | yes | | 13 | middle_aged | high | yes | fair | yes | | 14 | senior | medium | no | excellent | no |$$

数据集获取：点击

计算公式：

$-frac{9}{14}log_{2}(frac{9}{14}) - frac{5}{14}log_{2}(frac{5}{14})$

$Info_{age}(D) = -frac{5}{14}(log_{2}(frac{2}{5}) - frac{3}{5}log_{2}(frac{3}{5})) + frac{4}{14}(- frac{4}{4}log_{2}frac{4}{4} - frac{0}{4}log_{2}frac{0}{4}) + frac{5}{14}(-frac{3}{5}log_{2}(frac{3}{5}) - frac{2}{5}log_{2}(frac{2}{5})) = 0.694 bits$

$Gain(age)=Info(D) - Info_{age}(D) = 0.940-0.694=0.246 bits$

类似地， $G a i n (i n c o m e) = 0.029$ ; $G a i n (s t u d e n t) = 0.151$ ; $Gain(credit_rating)=0.048$ .

其中， $G a i n (a g e) = 0.246$ 最大，所以选择 age 作为第一个根结点。

以 age 为根结点

图 2 以 age 为根结点重复以上过程。。。。

算法流程：

树以代表训练样本的单个结点开始。
如果样本都在同一个类中，则该结点成为树叶结点，并用该类标号。
否则，算法使用称为“信息增益”的基于熵的度量作为启发信息，选择能够最好地将样本分类的属性。该属性成为该结点的“测试”或“判定”属性。
在该版本的算法中，所有的属性都是分类的，即离散值。连续值属性必须进过离散化。
对测试属性的每个已知的值，创建一个分支，并据此划分样本。
算法使用同样的过程，递归地形成每个划分上的样本判定树。一旦一个属性出现在一个结点上，就不必在该结点的任何后代上考虑它。
递归划分步骤仅当下列条件之一成立时停止：
（a）给定结点的所有样本属于同一类。
（b）没有剩余属性可以用来进一步划分样本。在此情况下，使用多数表决。这涉及将给定的结点转换成树叶结点，并用样本中的多数所在的类来标记它。替换地，可以存放结点样本的类分布。
（c）分枝： $test_attribute = a_{i}$ 没有样本。在这种情况下，以Samples 中的多数类创建一个树叶。

具体的代码如下：

本代码是在 Anaconda Jupyter 中运行的版本，Anaconda 默认已经安装了 numpy 、SciPy 以及 matplotlib 等常用的库，可以大大减少运行环境搭建的步骤。

from sklearn.feature_extraction import DictVectorizer
import csv
from sklearn import preprocessing
from sklearn import tree
from sklearn.externals.six import StringIO

allElectronicsData = open(r'D:harrytsz-MachineLearningSpaceAllElectronics.csv', 'rt')
reader = csv.reader(allElectronicsData)
headers = next(reader)

print(headers)

[‘RID’, ‘age’, ‘income’, ‘student’, ‘credit_rating’, ‘Class_buys_computer’]

featureList = []
labelList = []

for row in reader:
    labelList.append(row[-1])
    rowDict = {}
    for i in range(1, len(row)-1):
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)
print(featureList)

[{‘age’: ‘youth’, ‘income’: ‘high’, ‘student’: ‘no’, ‘credit_rating’: ‘fair’}, {‘age’: ‘youth’, ‘income’: ‘high’, ‘student’: ‘no’, ‘credit_rating’: ‘excellent’}, {‘age’: ‘middle_aged’, ‘income’: ‘high’, ‘student’: ‘no’, ‘credit_rating’: ‘fair’}, {‘age’: ‘senior’, ‘income’: ‘medium’, ‘student’: ‘no’, ‘credit_rating’: ‘fair’}, {‘age’: ‘senior’, ‘income’: ‘low’, ‘student’: ‘yes’, ‘credit_rating’: ‘fair’}, {‘age’: ‘senior’, ‘income’: ‘low’, ‘student’: ‘yes’, ‘credit_rating’: ‘excellent’}, {‘age’: ‘middle_aged’, ‘income’: ‘low’, ‘student’: ‘yes’, ‘credit_rating’: ‘excellent’}, {‘age’: ‘youth’, ‘income’: ‘medium’, ‘student’: ‘no’, ‘credit_rating’: ‘fair’}, {‘age’: ‘youth’, ‘income’: ‘low’, ‘student’: ‘yes’, ‘credit_rating’: ‘fair’}, {‘age’: ‘senior’, ‘income’: ‘medium’, ‘student’: ‘yes’, ‘credit_rating’: ‘fair’}, {‘age’: ‘youth’, ‘income’: ‘medium’, ‘student’: ‘yes’, ‘credit_rating’: ‘excellent’}, {‘age’: ‘middle_aged’, ‘income’: ‘medium’, ‘student’: ‘no’, ‘credit_rating’: ‘excellent’}, {‘age’: ‘middle_aged’, ‘income’: ‘high’, ‘student’: ‘yes’, ‘credit_rating’: ‘fair’}, {‘age’: ‘senior’, ‘income’: ‘medium’, ‘student’: ‘no’, ‘credit_rating’: ‘excellent’}]

vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()

print("dummyX: n" + str(dummyX))

print(vec.get_feature_names())

print("labelList: n" + str(labelList))

dummyX:
[[0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
[0. 0. 1. 1. 0. 1. 0. 0. 1. 0.]
[1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
[0. 1. 0. 0. 1. 0. 0. 1. 1. 0.]
[0. 1. 0. 0. 1. 0. 1. 0. 0. 1.]
[0. 1. 0. 1. 0. 0. 1. 0. 0. 1.]
[1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
[0. 0. 1. 0. 1. 0. 0. 1. 1. 0.]
[0. 0. 1. 0. 1. 0. 1. 0. 0. 1.]
[0. 1. 0. 0. 1. 0. 0. 1. 0. 1.]
[0. 0. 1. 1. 0. 0. 0. 1. 0. 1.]
[1. 0. 0. 1. 0. 0. 0. 1. 1. 0.]
[1. 0. 0. 0. 1. 1. 0. 0. 0. 1.]
[0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
[‘age=middle_aged’, ‘age=senior’, ‘age=youth’, ‘credit_rating=excellent’, ‘credit_rating=fair’, ‘income=high’, ‘income=low’, ‘income=medium’, ‘student=no’, ‘student=yes’]
labelList:
[‘no’, ‘no’, ‘yes’, ‘yes’, ‘yes’, ‘no’, ‘yes’, ‘no’, ‘yes’, ‘yes’, ‘yes’, ‘yes’, ‘yes’, ‘no’]

lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY: n" + str(dummyY))

clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf: n" + str(clf))

dummyY:
[[0]
[0]
[1]
[1]
[1]
[0]
[1]
[0]
[1]
[1]
[1]
[1]
[1]
[0]]
clf:
DecisionTreeClassifier(class_weight=None, criterion=‘entropy’, max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter=‘best’)

#Visulize model

with open("allElectronicInformationGainOri.dot", 'w') as f:
    f = tree.export_graphviz(clf, feature_names=vec.get_feature_names(), out_file = f)

以上代码段调用 tree.export_graphviz() 函数，将训练好的模型保存为 .dot 格式的文件。可以通过使用 Graphviz 软件来将 dot 文件转化成 pdf 等可视化的决策树。

Windows 系统用户可以在成功安装 Graphviz 软件之后，将 Graphviz 软件的安装路径添加至 Windows系统的环境变量 path 中。然后，可以在命令行（cmd）中运行以下命令，来显示训练好的决策树模型，打开 DecisionTreeModel.pdf 文件即可。

dot -T pdf D:MachineLearningSpaceallElectronicInformationGainOri.dot -o DecisionTreeModel.pdf

DecisionTreeResult

图 3 决策树模型 DecisionTreeResult 的可视化

训练好决策树模型之后，接下来，针对新输入的数据进行预测：

# predict
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))

newRowX = oneRowX
newRowX[0] = 1
newRowX[2] = 0
print("newRowX: " + str(newRowX))

predictedY = clf.predict(newRowX.reshape(1,-1))
print("predictedY: " + str(predictedY))

oneRowX: [0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
newRowX: [1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
predictedY: [1]

从输出的结果 predictedY:[1] 可以看出针对新输入数据 newRowX 决策树模型给出的预测结果为 1，即 “会购买”。

在模型训练的过程中遇到的几个问题以及解决方案如下：

ValueError: Expected 2D array, got 1D array instead:
https://blog.csdn.net/dongyanwen6036/article/details/78864585

AttributeError: ‘_csv.reader’ object has no attribute ‘next’
https://www.cnblogs.com/hfdkd/p/7719134.html

https://blog.csdn.net/zhyh1435589631/article/details/53869803/

最后

以上就是忧心鸵鸟为你收集整理的决策树（Decision Tree）的全部内容，希望文章能够帮你解决决策树（Decision Tree）所遇到的程序开发问题。

如果觉得靠谱客网站的内容还不错，欢迎将靠谱客网站推荐给程序员好友。

本图文内容来源于网友提供，作为学习参考使用，或来自网络收集整理，版权属于原作者所有。

本文分类：Python
浏览次数：122 次浏览
发布日期：2023-05-31 02:53:01
本文链接：https://www.kaopuke.com/article/k-p-k_14_ujogfz_13_jcy.html

决策树（Decision Tree）

概述

决策树（Decision Tree）算法应用案例：

1. 什么是决策树/判定树？

2. 熵（Entropy）的概念：

3. 决策树归纳算法（ID3）

具体的代码如下：

最后

评论列表共有 0 条评论

发表评论取消回复

决策树（Decision Tree）

概述

决策树（Decision Tree）算法应用案例：

1. 什么是决策树/判定树？

2. 熵（Entropy）的概念：

3. 决策树归纳算法（ID3）

具体的代码如下：

最后

相关文章

评论列表共有 0 条评论

发表评论 取消回复

发表评论取消回复