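Overview
The "Text Data for Trading – Sentiment Analysis" series explains Chapter 14 of "Machine Learning for Algorithmic Trading" and reproduces the related code. Because text analysis differs considerably between Chinese and English, this series does not use Chinese-market material as data; it reproduces the source code that accompanies the book instead.
Part 2 of the code reproduction implements and explains several uses of TextBlob.
TextBlob is a Python library that provides a simple API (application programming interface) for common NLP tasks. It builds on the Natural Language Toolkit (NLTK) and the Pattern web-mining library. TextBlob simplifies part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation.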
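1. Setting up the environment
import warnings
warnings.filterwarnings('ignore')  # suppress warning messages
# Install the packages needed for sentiment analysis (run in a notebook cell)
%pip install textblob
%pip install nltk
# TextBlob also needs its corpora: python -m textblob.download_corpora
%matplotlib inline
from pathlib import Path
import numpy as np
import pandas as pd
import seaborn as sns
from nltk.stem.snowball import SnowballStemmer
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer  # scikit-learn, for feature extraction
sns.set_style('white')  # plot style
np.random.seed(42)  # fix the random seed so the sampled article is reproducible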
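To illustrate how TextBlob is used, we sample one article from the BBC corpus. As with spaCy and other libraries, the first step is to pass the document through the TextBlob pipeline, which assigns the annotations required by the various tasks.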
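path = Path('..', 'data', 'bbc')  # folder holding the BBC articles, one subfolder per topic
files = sorted(list(path.glob('**/*.txt')))
doc_list = []
for i, file in enumerate(files):
    topic = file.parts[-2]  # the parent folder name is the topic
    article = file.read_text(encoding='latin1').split('\n')
    heading = article[0].strip()  # first line is the headline
    body = ' '.join([l.strip() for l in article[1:]]).strip()
    doc_list.append([topic, heading, body])
# Column names follow the attributes used later (article.topic, article.heading, article.body)
docs = pd.DataFrame(doc_list, columns=['topic', 'heading', 'body'])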
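To make the expected directory layout concrete, here is a minimal sketch that builds a tiny stand-in corpus in a temporary directory and parses it with the same path-based logic, using only the standard library. The helper name load_articles, the file names, the topics, and the article text are all made up for illustration; the real BBC files are not reproduced here.

```python
import tempfile
from pathlib import Path


def load_articles(root: Path) -> list[dict]:
    """Read every .txt file under root; the parent folder name is the topic."""
    records = []
    for file in sorted(root.glob('**/*.txt')):
        lines = file.read_text(encoding='latin1').split('\n')
        records.append({
            'topic': file.parts[-2],      # e.g. 'business'
            'heading': lines[0].strip(),  # first line is the headline
            'body': ' '.join(l.strip() for l in lines[1:]).strip(),
        })
    return records


# Hypothetical two-file corpus, laid out as <root>/<topic>/<name>.txt
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp, 'bbc')
    samples = {
        ('business', '001.txt'): 'House prices dip\n\nPrices fell slightly.\nEconomists expect more.',
        ('sport', '001.txt'): 'Late goal wins match\n\nThe home side won 1-0.',
    }
    for (topic, name), text in samples.items():
        (root / topic).mkdir(parents=True, exist_ok=True)
        (root / topic / name).write_text(text, encoding='latin1')
    articles = load_articles(root)

for a in articles:
    print(a['topic'], '|', a['heading'], '|', a['body'])
```

Each record carries the three fields the later steps rely on: topic, heading, and body.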
2. Selecting a random article
article = docs.sample(1).squeeze()  # draw one article at random, as a Series
print(f'Topic:\t{article.topic.capitalize()}\n\n{article.heading}\n')
print(article.body.strip())
The sampled article is:
Topic: Business
UK house prices dip in November
UK house prices dipped slightly in November, the Office of the Deputy Prime Minister (ODPM) has said. The average house price fell marginally to £180,226, from £180,444 in October. Recent evidence has suggested that the UK housing market is slowing after interest rate increases, and economists forecast a drop in prices during 2005. (excerpt)
The article reports the November dip in UK house prices and economists' forecasts for the market.
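3. Tokenization and sentence segmentation
parsed_body = TextBlob(article.body)  # run the article body through TextBlob
parsed_body.words  # split the body into word tokens
Running this code returns a WordList: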
WordList(['UK', 'house', 'prices', 'dipped', 'slightly', 'in', 'November', 'the', 'Office', 'of', 'the', 'Deputy', 'Prime', 'Minister', 'ODPM', 'has', 'said', 'The', 'average', 'house', 'price', 'fell', 'marginally', 'to', '£180,226', 'from'
parsed_body.sentences  # sentence boundary detection: split the body into sentences
Sentence("UK house prices dipped slightly in November, the Office of the Deputy Prime Minister (ODPM) has said."),
Sentence("The average house price fell marginally to £180,226, from £180,444 in October."),
Sentence("Recent evidence has suggested that the UK housing market is slowing after interest rate increases, and economists forecast a drop in prices during 2005."),
Sentence("But while the monthly figures may hint at a cooling of the market, annual house price inflation is still strong, up 13.8% in the year to November."),
Sentence("Economists, however, forecast that ODPM figures are likely to show a weakening in annual house price growth in coming months."),
Sentence(""Overall, the housing market activity is slowing down and that is backed up by the mortgage lending and the mortgage approvals data," said Mark Miller, at HBOS Treasury Services.")
As with the words, the result is a list, this time of Sentence objects.
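TextBlob builds on NLTK for these two steps. As a rough intuition for what they do, here is a deliberately naive regular-expression sketch applied to the first two sentences of the article. The two patterns are assumptions chosen for this excerpt only; they are not TextBlob's algorithm and would mishandle abbreviations such as "Mr." and many other cases.

```python
import re

text = ("UK house prices dipped slightly in November, the Office of the "
        "Deputy Prime Minister (ODPM) has said. The average house price fell "
        "marginally to £180,226, from £180,444 in October.")

# Naive sentence split: break on whitespace that follows ., ! or ?
sentences = re.split(r'(?<=[.!?])\s+', text)

# Naive word tokens: runs of letters/digits, with an optional leading £
# and comma-grouped digits kept together (e.g. £180,226)
word_pattern = re.compile(r'£?\w+(?:,\d+)*')
words = [word_pattern.findall(s) for s in sentences]

for s, w in zip(sentences, words):
    print(len(w), 'tokens:', s)
```

On this excerpt the tokens agree with the start of the WordList shown earlier: punctuation is dropped and the price £180,226 stays in one piece.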
4. Stemming
stemmer = SnowballStemmer('english')  # initialize the Snowball stemmer for English
# Pair each word with its stem, keeping only the words that stemming changes
[(word, stemmer.stem(word)) for word in parsed_body.words
 if word.lower() != stemmer.stem(word)]
[('prices', 'price'),
('dipped', 'dip'),
('slightly', 'slight'),
('November', 'novemb'),
('Office', 'offic'),
('Deputy', 'deputi'),
('Minister', 'minist'),
('average', 'averag'),
('house', 'hous'),
('marginally', 'margin'),
('October', 'octob'),
('evidence', 'evid'),
('suggested', 'suggest'),
('housing', 'hous'),
('slowing', 'slow'),
('increases', 'increas'),
('economists', 'economist'),
('prices', 'price')]
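Stemming yields a two-column list: the original word on the left and its stem on the right. Note that a stem is not always a dictionary word. For example, increases is reduced to increas rather than increase. This is the expected behavior of a rule-based stemmer, not an error: stemming strips suffixes without consulting a dictionary. Recovering the dictionary form (the lemma) requires lemmatization instead.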
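To see why suffix stripping produces non-words, consider the toy stemmer below. It is a hypothetical illustration written for this article, not the Snowball algorithm: the suffix list and the minimum stem length of three characters are arbitrary assumptions, so its output differs from the table above for some words (for example, it gives pric and dipp where Snowball gives price and dip).

```python
# Toy suffix-stripping stemmer (illustrative only, NOT the Snowball algorithm)
SUFFIXES = ('ing', 'ed', 'ly', 'es', 's')  # checked in this order


def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


words = ['UK', 'house', 'prices', 'dipped', 'slightly', 'increases', 'housing']

# Same filter as in the article: keep only words that stemming changes
changed = [(w, toy_stem(w)) for w in words if w.lower() != toy_stem(w)]
for original, stem in changed:
    print(f'{original} -> {stem}')
```

Even this crude rule set reduces increases to increas and housing to hous, because it cuts characters mechanically and never checks whether the result is a real word.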
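5. Sentiment analysis
TextBlob's sentiment analysis returns a named tuple, Sentiment(polarity, subjectivity). polarity indicates whether the sentiment is positive or negative, and subjectivity indicates whether the text is subjective or objective. polarity ranges from -1.0 (most negative) to 1.0 (most positive); subjectivity ranges from 0.0 (most objective) to 1.0 (most subjective).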
parsed_body.sentiment  # the average of the token-level sentiment scores
Sentiment(polarity=0.10447845804988663, subjectivity=0.44258786848072557)
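The document's average token polarity is about 0.10, which is close to neutral: neither clearly positive nor clearly negative. The average subjectivity is about 0.44, also a middling value.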
parsed_body.sentiment_assessments  # also lists the individual score of each token that was scored
Sentiment(polarity=0.10447845804988663, subjectivity=0.44258786848072557, assessments=[(['slightly'], -0.16666666666666666, 0.16666666666666666, None), (['average'], -0.15, 0.39999999999999997, None), (['recent'], 0.0, 0.25, None), (['strong'], 0.4333333333333333, 0.7333333333333333, None), (['likely'], 0.0, 1.0, None), (['overall'], 0.0, 0.0, None), (['down'], -0.15555555555555559, 0.2888888888888889, None), (['fairly'], 0.7, 0.9, None), (['nearly'], 0.1, 0.4, None), (['last'], 0.0, 0.06666666666666667, None), (['first'], 0.25, 0.3333333333333333, None), (['rose'], 0.6, 0.95, None), (['whole'], 0.2, 0.4, None), (['only'], 0.0, 1.0, None), (['second'], 0.0, 0.0, None), (['half'], -0.16666666666666666, 0.16666666666666666, None), (['overall'], 0.0, 0.0, None), (['large'], 0.21428571428571427, 0.42857142857142855, None), (['recent'], 0.0, 0.25, None), (['rose'], 0.6, 0.95, None), (['same'], 0.0, 0.125, None), (['average'], -0.15, 0.39999999999999997, None), (['more'], 0.5, 0.5, None), (['average'], -0.15, 0.39999999999999997, None), (['rose'], 0.6, 0.95, None), (['only'], 0.0, 1.0, None), (['slightly'], -0.16666666666666666, 0.16666666666666666, None), (['previous'], -0.16666666666666666, 0.16666666666666666, None)])
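The relationship between the two outputs can be checked by hand: averaging the per-token scores listed in assessments reproduces the document-level values. The sketch below copies the (token, polarity, subjectivity) triples from the output above and uses only the standard library. That TextBlob computes a plain unweighted mean is inferred here from the numbers matching, not taken from its documentation.

```python
from statistics import fmean

# (token, polarity, subjectivity) copied from the assessments output
assessments = [
    ('slightly', -0.16666666666666666, 0.16666666666666666),
    ('average', -0.15, 0.39999999999999997),
    ('recent', 0.0, 0.25),
    ('strong', 0.4333333333333333, 0.7333333333333333),
    ('likely', 0.0, 1.0),
    ('overall', 0.0, 0.0),
    ('down', -0.15555555555555559, 0.2888888888888889),
    ('fairly', 0.7, 0.9),
    ('nearly', 0.1, 0.4),
    ('last', 0.0, 0.06666666666666667),
    ('first', 0.25, 0.3333333333333333),
    ('rose', 0.6, 0.95),
    ('whole', 0.2, 0.4),
    ('only', 0.0, 1.0),
    ('second', 0.0, 0.0),
    ('half', -0.16666666666666666, 0.16666666666666666),
    ('overall', 0.0, 0.0),
    ('large', 0.21428571428571427, 0.42857142857142855),
    ('recent', 0.0, 0.25),
    ('rose', 0.6, 0.95),
    ('same', 0.0, 0.125),
    ('average', -0.15, 0.39999999999999997),
    ('more', 0.5, 0.5),
    ('average', -0.15, 0.39999999999999997),
    ('rose', 0.6, 0.95),
    ('only', 0.0, 1.0),
    ('slightly', -0.16666666666666666, 0.16666666666666666),
    ('previous', -0.16666666666666666, 0.16666666666666666),
]

# Unweighted mean over all scored tokens (repeated tokens count each time)
polarity = fmean(p for _, p, _ in assessments)
subjectivity = fmean(s for _, _, s in assessments)
print(f'polarity={polarity:.4f}, subjectivity={subjectivity:.4f}')
```

Only 28 tokens carry a score, and several of them (such as average, only, and previous) are not sentiment-bearing in a housing-market report, which is one reason a general-purpose lexicon gives only a coarse reading of financial text.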