机器学习Python学习——决策树

230 阅读 0 评论 152 点赞

我是靠谱客的博主妩媚水池，这篇文章主要介绍机器学习Python学习——决策树，现在分享给大家，希望可以做个参考。

1.1 决策树的概念

1.2 信息增益

1.3 划分数据集

1.3.1 使用信息增益率划分数据集

1.3.2 使用基尼指数划分数据集

2.1 递归构建决策树

3.1 在Python中使用Matplotlib注解绘制树形图

3.2 Matplotlib注解

4.1 示例：使用决策树预测井字棋选择的好坏

5.1 总结

1.1 决策树的概念

决策树(Decision Tree）是在已知各种情况发生概率的基础上，通过构成决策树来求取净现值的期望值大于等于零的概率，评价项目风险，判断其可行性的决策分析方法，是直观运用概率分析的一种图解法。由于这种决策分支画成图形很像一棵树的枝干，故称决策树。在机器学习中，决策树是一个预测模型，他代表的是对象属性与对象值之间的一种映射关系。同时，分类树（决策树）是一种十分常用的分类方法。它是一种监督学习，所谓监督学习就是给定一堆样本，每个样本都有一组属性和一个类别，这些类别是事先确定的，那么通过学习得到一个分类器，这个分类器能够对新出现的对象给出正确的分类。

1.2 信息增益

划分数据集的大原则是，将无序的数据变得更加有序。我们可以使用多种方法划分数据集，但是每种方法都各自的优缺点。在划分数据集之前之后信息发生的变化成为信息增益，知道如何计算信息增益，我们就可以计算每个特征值划分数据集获得的信息增益，获得信息增益最高的特征就是最好的选择。

下面我们将使用Python计算信息熵，创建名为trees.py的文件，将下列代码录入到trees.py文件中，此代码的功能是计算给定数据集的熵。

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
            labelCounts[currentLabel] += 1
        ShannonEnt = 0.0
        for key in labelCounts:
            prob = float(labelCounts[key]) / numEntries
            shannonEnt -= prob * log(prob, 2)
        return ShannonEnt

在trees.py文件中，我们可以利用createDataSet()函数得到鉴定数据集，你可以输入自己的createDataSet()函数：

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing','flippers']
    #change to discrete values
    return dataSet, labels

在Python命令提示符下输入下列命令：

>>> import trees
>>> myDat,labels=trees.createDataSet()
>>> myDat
>>> trees.calcShannonEnt(myDat)

可以得到这样的结果：

1.3 划分数据集

分类算法除了需要测量信息熵，还需要划分数据集，度量花费数据集的熵，以便判断当前是否正确地划分了数据集。我们将对每个特征划分数据集的结果计算一次信息熵，然后判断按照哪个特征划分数据集是最好的划分方式。

要划分数据集，打开文本编辑器，在trees.py中加入下列代码：

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

这个代码使用了三个输入参数：带划分的数据集、划分数据集的特征、特征的返回值。

我们假定两个列表，a和b：

如果执行a.append(b)，则列表得到了第四个元素，而且第四个元素也是一个列表。然后如果使用extend方法：

一个包含了a、b所有元素的列表。

我们可以在前面的简单样本数据上测试函数splitDataSet()。在Python命令提示符下输入下述命令：

importlib.reload(trees)
myDat,labels=trees.createDataSet()
myDat
trees.splitDataSet(myDat,0,1)
trees.splitDataSet(myDat,0,0)

我们可以得到下面的结果：

接下来我们将遍历整个数据集，循环计算香农熵和splitDataSet()函数，找到最好的特征划分方式。熵计算会告诉我们如何划分数据集是最好的数据组织方式。在trees.py中加入下面代码：

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)     
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

该函数中调用的数据需要满足一定的要求：第一个要求是，数据必须是一种由列表元素组成的列表，而且所有的列表元素都要具有相同的数据长度；第二个要求是，数据的最后一列或者每个实例的最后一个元素是当前实例的看类别标签。数据集一旦满足上述要求，我们就可以在函数的第一行判定当前数据集包含多少特征属性。我们无需限定list中的数据类型，它们既可以是数字也可以是字符串，并不影响实际计算。

现在我们可以测试上面代码的实际输出结果，在Python命令提示符下输入下列命令：

importlib.reload(trees)
myDat,label=trees.createDataSet()
trees.chooseBestFeatureToSplit(myDat)
myDat

可以得到下面的结果：

这个结果告诉我们，第1个特征是最好的用于划分数据集的特征。

类似的，下面会再给出两种划分数据集的方式，分别是使用信息增益率来划分数据集和使用基尼指数划分数据集。

1.3.1 使用信息增益率划分数据集

信息增益率的公式为：

其中

称为属性a的“固有值“，属性a的可能取值数目越多（即V越大），则IV(a)的值通常就越大。

我们使用下列代码来实现信息增益率划分数据集。

def chooseBestFeatureToSplit2(dataSet):
    numFeatures = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGainration = 0.0
    bestFeature = -1
    sum=0.0
    m=0
    for j in range(numFeatures):
        m=m+1
        featList = [example[j] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        IV=0.0
        #0/1分类，遍历
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, j, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
            IV-=prob * log(prob,2)
        infoGain = baseEntropy - newEntropy
        sum+=infoGain
    avg=sum/m
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        IV=0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
            IV-=prob * log(prob,2)
        infoGain = baseEntropy - newEntropy
        if IV==0.0:
            infoGainration=0.0
        else:
            infoGainration=float(infoGain)/float(IV)
        if (infoGainration > bestInfoGainration and infoGainration>avg):
            bestInfoGainration = infoGainration         
            bestFeature = i
    return bestFeature

1.3.2 使用基尼指数划分数据集

分类问题中，假设D有K个类，样本点属于第k类的概念为pk，则概率分布的基尼值定义为：

Gini(D)越小，数据集D的纯度越高；

给定数据集D，属性a的基尼指数定义为：

在候选属性集合A中，选择那个使得划分后基尼指数最小的属性为最优划分属性。

我们使用下列代码来实现基尼指数划分数据集。

def calcGini(dataset):
    feature = [example[-1] for example in dataset]
    uniqueFeat = set(feature)
    sumProb =0.0
    for feat in uniqueFeat:
        prob = feature.count(feat)/len(uniqueFeat)
        sumProb += prob*prob
    sumProb = 1-sumProb
    return sumProb

2.1 递归构建决策树

递归构建决策树的工作原理如下：得到原始数据集，然后基于最好的属性值划分数据集，由于特征值尽可能多于两个，因此可能存在大于两个分支的数据集划分。第一次划分之后，数据将被向下传递到树分支的下一个节点，在这个节点上，我们可以再次划分数据。、

递归结束的条件是：程序遍历所有划分数据集的属性，或者每个分支下的所有实例都具有相同的分类。如果所有实例具有相同的分类，则得到一个叶子节点或者终止块。任何到达叶子节点的数据必然属于叶子节点的分类。

在增加构建函数之前，需要在trees.py文件顶部增加一行代码：import operator，然后添加下面的代码到trees.py文件中：

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

该函数使用分类名称的列表，然后创建键值为classList中唯一值的数据字典，字典对象存储了classList中每个类标签出现的频率，最后利用operator操作键值排序字典，并返回出现次数最多的分类名称。

现在我们在trees.py中添加创建树的函数代码：

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList): 
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree

上述代码首先创建了名为classList的列表变量，其中包含了数据集的所有类标签。递归函数的第一个停止条件是所有的类标签完全相同，则直接返回类标签。递归函数的第二个停止条件是用完了所有特征，仍然不能将数据集划分成仅包含唯一类别的分组。由于第二个条件无法简单地返回唯一的类标签，这里使用chooseBestFeatureToSplit(dataSet)函数挑选出现次数最多的类别作为返回值。

下一步开始创建树，这里使用Python语言的字典类型存储树的信息，当然也可以声明函特殊的数据类型存储树，但是这里完全没有必要。字典变量myTree存储了树的所有信息，这对于其后绘制树形图非常重要。当前数据集选取的最好特征存储在变量bestFeat中，得到列表包含的所有属性值。

最后代码遍历当前选择特征包含的所有属性值，在每个数据集划分上递归调用了函数createTree()，得到的返回值将被插入到字典变量myTree中，因此函数终止执行时，字典中将会嵌套很多代表叶子节点信息的字典数据。在解释这个嵌套函数之前，我们先看一下循环的第一行subLabels = labels[:]，这行代码复制了类标签，并将其存储在新列表变量subLabels中。之所以这样做，是因为在Python语言中函数参数是列表类型时，参数是按照引用方式传递的。为了保证每次调用createTree()时不改变原始列表的内容，使用新变量subLabels代替原始列表。

现在我们可以测试上面代码的实际输出结果，在Python命令提示符下输入下列命令：

importlib.reload(trees)
myDat,labels=trees.createDataSet()
myTree=trees.createTree(myDat,labels)
myTree

可以得到下面的结果：

变量myTree包含了很多代表树结构信息的嵌套字典，从左边开始，第一个关键字flipper是第一个划分数据集的特征名称，该关键字的值也是另一个数据字典。第二个关键字是no surfacing特征划分的数据集，这些关键字的值是no surfacing节点的子节点。这些值可能是类标签，也可能是可另一个数据字典。如果值是类标签，则该子节点是叶子节点；如果值是另一个数据字典，则子节点是一个判断节点，这种格式结构不断重复就构成了整棵树。

3.1 在Python中使用Matplotlib注解绘制树形图

我们可以使用Matplotlib库创建树形图。决策树的主要优点就是直观易于理解。我们将使用代码绘制如下图的决策树。

3.2 Matplotlib注解

我们创建名为treePlotter.py的新文件，然后输入下面的程序代码：

'''
Created on Oct 14, 2010

@author: Peter Harrington
'''
import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def getNumLeafs(myTree):
    numLeafs = 0
    firstSides = list(myTree.keys())  
    firstStr = firstSides[0]#找到输入的第一个元素  
    #firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictonaires, if not they are leaf nodes
            numLeafs += getNumLeafs(secondDict[key])
        else:   numLeafs +=1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstSides = list(myTree.keys())  
    firstStr = firstSides[0]#找到输入的第一个元素  
    #firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictonaires, if not they are leaf nodes
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:   thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt,  xycoords='axes fraction',
             xytext=centerPt, textcoords='axes fraction',
             va="center", ha="center", bbox=nodeType, arrowprops=arrow_args )
    
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):#if the first key tells you what feat was split on
    numLeafs = getNumLeafs(myTree)  #this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstSides = list(myTree.keys())  
    firstStr = firstSides[0]#找到输入的第一个元素  
    #firstStr = myTree.keys()[0]     #the text label for this node should be this
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':#test to see if the nodes are dictonaires, if not they are leaf nodes   
            plotTree(secondDict[key],cntrPt,str(key))        #recursion
        else:   #it's a leaf node print the leaf node
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD
#if you do get a dictonary you know it's a tree, and the first element will be another dict

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)    #no ticks
    #createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo puropses 
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
    plotTree(inTree, (0.5,1.0), '')
    plt.show()

#def createPlot():
#    fig = plt.figure(1, facecolor='white')
#    fig.clf()
#    createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo puropses 
#    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
#    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
#    plt.show()

def retrieveTree(i):
    listOfTrees =[{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                  {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}
                  ]
    return listOfTrees[i]

#createPlot(thisTree)

之后在Python命令提示符下输入下列命令：

importlib.reload(treePlotter)
myTree=treePlotter.retrieveTree(0)
treePlotter.createPlot(myTree)

然后可以得到这样的结果：

这样的结果没有坐标轴标签，我们按照如下命令变更字典，重新绘制树形图：

myTree['no surfacing'][3]='maybe'
myTree
treePlotter.createPlot(myTree)

4.1 示例：使用决策树预测井字棋选择的好坏

首先我们先创建一个main.py用于读取文件并生成决策树。

from pickle import load
import numpy as np
import matplotlib.pyplot as plt
import math
from treePlotter import *
from trees import *

if __name__ == '__main__':
    fr=open('C:/Users/Kano/Desktop/Study/vscode python/Test2/tic-tac-toe.txt')
    lenses=[inst.strip().split(',') for inst in fr.readlines()]
    lensesLabels=['top-left-square: {x,o,b}',
    'top-middle-square: {x,o,b}',
    'top-right-square: {x,o,b}',
    'middle-left-square: {x,o,b}',
    'middle-middle-square: {x,o,b}', 
    'middle-right-square: {x,o,b}', 
    'bottom-left-square: {x,o,b}', 
    'bottom-middle-square: {x,o,b}', 
    'bottom-right-square: {x,o,b}']
    lensesTree=createTree(lenses,lensesLabels)
    createPlot(lensesTree)

我们将使用井字棋的数据集来预测结果好坏（positive or negative）。部分数据集如下图：

对应的属性分别为：

'top-left-square: {x,o,b}',

'top-middle-square: {x,o,b}',

'top-right-square: {x,o,b}',

'middle-left-square: {x,o,b}',

'middle-middle-square: {x,o,b}',

'middle-right-square: {x,o,b}',

'bottom-left-square: {x,o,b}',

'bottom-middle-square: {x,o,b}',

'bottom-right-square: {x,o,b}'

这些标签为落子的位置。

之后我们分别使用三种划分数据集的方法来构造决策树：信息增益、信息增益率、基尼指数。

图一：信息增益

图二：信息增益率

图三：基尼指数

5.1 总结

通过这次实验，成功实现了根据数学公式，对于信息增益，信息增益率，基尼指数这三种划分方式的实现。并且也可以使用这些方式来生成决策树并可视化。了解了决策树的原理，我认为决策树的优点在于不用调整很多参数或者代码就可以训练数据集。但是决策树的缺点也很明显，当数据集并不适合决策树锻炼时，例如标签很多，或者标签属性为具体详细的数值的时候，决策树的可视化效果就会很差，并且很容易出现过拟合的现象。后续也会学习不同的方法来解决这些问题。