MALLET (MAchine Learning for LanguagE Toolkit): Getting-Started Tests



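Overview

MALLET is used mainly for text classification, so its design leans toward that task.

I need its maximum entropy and naive Bayes algorithms, so I had to dig into it.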

Home page: http://mallet.cs.umass.edu/index.php

References: http://mallet.cs.umass.edu/classifier-devel.php

http://mallet.cs.umass.edu/import-devel.php

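I searched online and found little material, so I had to work through the official guides and the API docs myself, and then read the source code.

My goal is to import MALLET into my own Java project (in Eclipse) and then use some of its algorithms, naive Bayes and maximum entropy, for text classification.

Importing into the project:

Download link: http://mallet.cs.umass.edu/download.php (the latest version at the time of writing was 2.0.7)

From the unpacked archive, copy the src folder and the jars in the lib folder into your project, then add all the jars to the project's build path.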


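My final project layout is as follows: src holds my own classes,

malletSrc holds the MALLET source code,

and the mallet folder holds the corresponding jars.

Here are my study notes:

The usage of each class can only be worked out from the API docs, the source code, and your own tests.

Some test examples follow.

To build an Instance you first have to sort out several pieces (see http://mallet.cs.umass.edu/import-devel.php).

There seem to be a lot of subclasses; I only studied the few I needed.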


A comment from the source code:

An instance contains four generic fields of predefined name:
"data", "target", "name", and "source". "Data" holds the data represented
by the instance, "target" is often a label associated with the instance,
"name" is a short identifying name for the instance (such as a filename),
and "source" is human-readable source information (such as the original text).

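About Data:

Data needs an Alphabet and a FeatureVector working together: the Alphabet stores the names of the attributes, and the FeatureVector stores one object's value for each attribute.

Test code 1: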

public static void main(String[] args) {
    // Attribute names: 长 = length, 宽 = width, 高 = height
    String[] attributeStr = new String[]{"长","宽","高"};
    Alphabet dict = new Alphabet(attributeStr);
    // Dense values: values[i] belongs to attribute i
    double[] values = new double[]{1,2,3};
    FeatureVector vector = new FeatureVector(dict, values);
    System.out.println(vector.toString());
}

Output:

长(0)=1.0
宽(1)=2.0
高(2)=3.0

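We can also specify which attribute each entry of values belongs to. Indices start at 0: 长 (length) is 0, 宽 (width) is 1, and 高 (height) is 2. A test: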

public static void main(String[] args) {
    String[] attributeStr = new String[]{"长","宽","高"};
    Alphabet dict = new Alphabet(attributeStr);
    double[] values = new double[]{1,2,3};
    // values[i] is assigned to the attribute with index indices[i]
    int[] indices = new int[]{2,0,1};
    FeatureVector vector = new FeatureVector(dict, indices, values);
    System.out.println(vector.toString());
}

Output:

长(0)=2.0
宽(1)=3.0
高(2)=1.0

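One thing to watch out for: if the indices given for values contain duplicates, for example 2 and 3 are both assigned to 长 (index 0), the values are accumulated rather than overwritten, so the result is 5. That is presumably the behavior you want for word counting. Note also that 宽 (index 1) is assigned no value, so it does not appear in the output.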

String[] attributeStr = new String[]{"长","宽","高"};
Alphabet dict = new Alphabet(attributeStr);
double[] values = new double[]{1,2,3};
// Index 0 appears twice, so its values (2 and 3) are summed
int[] indices = new int[]{2,0,0};
FeatureVector vector = new FeatureVector(dict, indices, values);
System.out.println(vector.toString());

Output:

长(0)=5.0
高(2)=1.0
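
To make the accumulation behavior concrete, here is a minimal plain-Java sketch that reproduces the observed result without MALLET. The class DuplicateIndexDemo and its accumulate helper are hypothetical names of mine, and this is only an illustration of the behavior seen in the test above, not how MALLET implements it internally. I use English attribute names here for readability.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not part of MALLET.
class DuplicateIndexDemo {

    // Sums all values that share the same index and keeps the indices sorted.
    static Map<Integer, Double> accumulate(int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        Map<Integer, Double> sums = new TreeMap<>();
        for (int i = 0; i < indices.length; i++) {
            // merge adds to the existing sum, or stores the value if the index is new
            sums.merge(indices[i], values[i], Double::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] names = {"length", "width", "height"};
        Map<Integer, Double> sums = accumulate(new int[]{2, 0, 0}, new double[]{1, 2, 3});
        // Only indices that received a value are printed, as in a sparse vector
        for (Map.Entry<Integer, Double> e : sums.entrySet()) {
            System.out.println(names[e.getKey()] + "(" + e.getKey() + ")=" + e.getValue());
        }
    }
}
```

Index 0 ends up with 2 + 3 = 5, index 2 keeps 1, and index 1 never appears, which matches what the FeatureVector test printed.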

OK, that takes care of Data. A FeatureVector is the data I need.

Source: I just leave it as null.

Label:

/** You should never call this directly. New Label objects are
created on-demand by calling LabelAlphabet.lookupIndex(obj). */

The above is a comment from the source code: a Label has to be created through a LabelAlphabet. So I looked at LabelAlphabet next and ran the following test:

public static void main(String[] args) {
    LabelAlphabet labels = new LabelAlphabet();
    // 桌子 = table; the Label is created on demand by the lookup
    Label label = labels.lookupLabel("桌子");
    System.out.println(label.toString());
}

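The output is 桌子 (table), so Label is sorted out too.

Name: this serves as the instance's ID, so I simply use an integer sequence number.

Now that all four pieces are sorted out, we can create Instances, add them to an InstanceList, and then classify by following http://mallet.cs.umass.edu/classifier-devel.php. The classification test code is as follows: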
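
The "created on demand by a lookup" idea can be sketched in plain Java as follows. TinyAlphabet is a hypothetical class of mine that only mimics the pattern (the first lookup of an entry assigns it the next index, later lookups return the same index). It is not MALLET's Alphabet or LabelAlphabet implementation, and the method names are borrowed purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of an on-demand index lookup, not MALLET code.
class TinyAlphabet {
    private final Map<Object, Integer> indexOf = new HashMap<>();
    private final List<Object> entries = new ArrayList<>();

    // Returns the index of the entry, adding the entry first if it is new.
    int lookupIndex(Object entry) {
        Integer existing = indexOf.get(entry);
        if (existing != null) {
            return existing;
        }
        int index = entries.size();
        entries.add(entry);
        indexOf.put(entry, index);
        return index;
    }

    // Returns the entry stored at the given index.
    Object lookupObject(int index) {
        return entries.get(index);
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        TinyAlphabet labels = new TinyAlphabet();
        System.out.println(labels.lookupIndex("table")); // new entry, gets index 0
        System.out.println(labels.lookupIndex("chair")); // new entry, gets index 1
        System.out.println(labels.lookupIndex("table")); // already known, still 0
    }
}
```

The point of the pattern is that callers never construct entries directly, so the same object always maps to the same index.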

import cc.mallet.classify.Classifier;
import cc.mallet.classify.ClassifierTrainer;
import cc.mallet.classify.MaxEntTrainer;
import cc.mallet.types.Alphabet;
import cc.mallet.types.FeatureVector;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import cc.mallet.types.LabelAlphabet;
import cc.mallet.types.Labeling;

public class test {
    String label;  // class of the instance
    double length; // length
    double width;  // width
    double high;   // height

    public test(String label, double length, double width, double high) {
        this.label = label;
        this.length = length;
        this.width = width;
        this.high = high;
    }

    public static void main(String[] args) {
        LabelAlphabet labels = new LabelAlphabet();
        // Attribute names: 长 = length, 宽 = width, 高 = height
        String[] attributeName = new String[]{"长","宽","高"};
        Alphabet dic = new Alphabet(attributeName);
        // Register the two classes: 桌子 = table, 椅子 = chair
        labels.lookupIndex("桌子");
        labels.lookupIndex("椅子");
        InstanceList list = new InstanceList(dic, labels);
        int id = 0;
        // Build the training set: 100 tables and 100 chairs
        for (int i = 0; i < 100; ++i) {
            test temp = new test("桌子", 4, 2, 3);
            test temp2 = new test("椅子", 0, 0, 0);
            double[] tempArray = new double[3];
            tempArray[0] = temp.length;
            tempArray[1] = temp.width;
            tempArray[2] = temp.high;
            FeatureVector vec = new FeatureVector(dic, tempArray);
            // Instance(data, target, name, source): name is a running id, source is null
            Instance ins = new Instance(vec, labels.lookupLabel(temp.label), ++id, null);
            list.add(ins);
            tempArray[0] = temp2.length;
            tempArray[1] = temp2.width;
            tempArray[2] = temp2.high;
            vec = new FeatureVector(dic, tempArray);
            ins = new Instance(vec, labels.lookupLabel(temp2.label), ++id, null);
            list.add(ins);
        }
        // Create a test sample (未知 = unknown); its target is left null
        test testTemp = new test("未知", 0, 0, 2);
        double[] tempArray = new double[3];
        tempArray[0] = testTemp.length;
        tempArray[1] = testTemp.width;
        tempArray[2] = testTemp.high;
        FeatureVector vec = new FeatureVector(dic, tempArray);
        Instance testIns = new Instance(vec, null, ++id, null);
        // Train a maximum entropy classifier and classify the test sample
        ClassifierTrainer trainer = new MaxEntTrainer();
        Classifier classifier = trainer.train(list);
        Labeling label = classifier.classify(testIns).getLabeling();
        System.out.println(label.getBestLabel().toString());
    }
}

Output: