AI day09 (2020 8/8): Naive Bayes


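I am 凶狠时光, a blogger on 靠谱客. This article, collected during recent development work, covers AI day09 (2020 8/8) Naive Bayes; I found it useful and am sharing it here as a reference.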

Overview

Naive Bayes

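By default, every input value in the training set is assumed to occur at least once in each class, so no value ever gets a probability of zero.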
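A minimal sketch of what this default means in practice, assuming a toy two-class word-count table. The data and the helper name smoothed_likelihood are hypothetical and not part of the original notes. Adding 1 to every count (add-one smoothing) keeps a word that was never seen in a class from forcing that class's probability to zero.

```python
from collections import Counter

# Hypothetical toy training data: tokens observed per class.
train_tokens = {
    "sports": ["match", "goal", "team", "goal"],
    "finance": ["stock", "market", "stock", "bank"],
}

vocabulary = sorted({w for tokens in train_tokens.values() for w in tokens})
counts = {label: Counter(tokens) for label, tokens in train_tokens.items()}


def smoothed_likelihood(word, label):
    """P(word | label) with add-one smoothing over the shared vocabulary."""
    total = sum(counts[label].values())
    return (counts[label][word] + 1) / (total + len(vocabulary))


# "goal" never occurs in the finance class, yet its probability is not zero.
p_seen = smoothed_likelihood("goal", "sports")
p_unseen = smoothed_likelihood("goal", "finance")
print(p_seen, p_unseen)
```

The smoothed probabilities of all vocabulary words still sum to 1 within each class, because the denominator grows by the vocabulary size.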

News classification

Bayesian spelling corrector

import re, collections
 
def words(text): return re.findall('[a-z]+', text.lower()) 
 
def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model
 
NWORDS = train(words(open('big.txt').read()))
 
alphabet = 'abcdefghijklmnopqrstuvwxyz'
 
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion
 
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
 
def known(words): return set(w for w in words if w in NWORDS)
 
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: NWORDS[w])

#appl #appla #learw #tess #morw
correct('knon')

'know'

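Goal: argmax_c P(c|w) = argmax_c P(w|c) P(c) / P(w)

P(c): the probability that a correctly spelled word c appears in text, i.e., how often c occurs in English writing.
P(w|c): the probability that the user types w when they meant c, i.e., how likely the user is to mistype c as w.
argmax_c: enumerate every possible c and pick the one with the highest probability.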
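The decomposition above can be sketched with made-up numbers. Since P(w) is the same for every candidate c, it drops out of the comparison. The probabilities below are hypothetical placeholders for the typo "thew", not values estimated from big.txt.

```python
# Hypothetical language-model and error-model values for the typo w = "thew".
p_c = {"the": 0.05, "thaw": 0.0001, "thew": 0.000001}      # P(c)
p_w_given_c = {"the": 0.001, "thaw": 0.01, "thew": 0.95}   # P(w | c)


def score(c):
    # P(w) is omitted: it is constant across candidates and does not change the argmax.
    return p_w_given_c[c] * p_c[c]


best = max(p_c, key=score)
print(best)
```

With these numbers the very common word wins even though the error model alone would favour leaving the input unchanged.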

# Extract every word from the corpus in lowercase; any non-letter character acts as a separator
def words(text): return re.findall('[a-z]+', text.lower()) 
 
def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model
 
NWORDS = train(words(open('big.txt').read()))

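What if we meet a new word that we have never seen? Suppose a word is spelled perfectly correctly but the corpus does not contain it, so it never appears in the training set. We would then have to return a probability of 0 for it. That is not good: a probability of 0 means the event can never happen, whereas in our probability model we want a very small probability for this case. This is why the model's default count is lambda: 1.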
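A small sketch of how the lambda: 1 default behaves, using a three-word toy input instead of big.txt (the toy input is an assumption for illustration):

```python
import collections


def train(features):
    # Every word starts from a count of 1, so an unseen word is never "impossible".
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model


model = train(["the", "the", "cat"])
print(model["the"])      # seen twice, starting from the default of 1
print("lion" in model)   # a membership test does not create an entry
print(model["zebra"])    # never seen: falls back to the default count
```

Note that indexing an unseen word inserts it into the dictionary, whereas testing with in does not. That is why known() checks membership with in rather than indexing.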

NWORDS

             'ureters': 3,
         'displeasure': 9,
         'omitted': 10,
         'sparrow': 5,
         'tubercle': 66,
         'curse': 7,
         'pauncefote': 2,
         'updated': 5,
         'gloomier': 4,
         'foremost': 17,
         'wabash': 2,
         'anarchists': 4,
         'intermediacy': 2,
         'threadbare': 2,
         'endeavouring': 9,
         'freeholders': 11,
         'irreproachably': 3,
         'ignominious': 3,
         'illuminated': 9,
         'galitsyn': 2,
         'struthers': 3,
         'shuya': 2,
         'futile': 16,
         'each': 412,
         'district': 38,
         'acquiesced': 2,
         'staircase': 14,
         'shamelessly': 2,
         'doubter': 2,
         'plumage': 3,
         'worming': 2,
         'militiamen': 30,
         'tombstones': 2,
         'presupposable': 2,
         'notable': 6,
         'louise': 5,
         'overtook': 17,
         'abstraction': 8,
         'displeased': 20,
         'ranchmen': 2,
         'instal': 2,
         'kashmir': 3,
         'nay': 4,
         'wired': 5,
         'pencil': 11,
         'mustache': 46,
         'breast': 87,
         'dioxide': 9,
         'disappointments': 4,
         'impassive': 6,
         'though': 651,
         'floridas': 7,
         'torban': 2,
         'combine': 11,
         'yawning': 7,
         'homeless': 4,
         'cinema': 2,
         'subjects': 68,
         'rib': 9,
         'bin': 3,
         'cylinders': 18,
         'bijou': 2,
         'acted': 38,
         'accepted': 88,
         'attainment': 11,
         'mustered': 8,
         'audacious': 2,
         'respectable': 15,
         'bilateral': 10,
         'coraco': 2,
         'stuffs': 2,
         'reheat': 2,
         'roberts': 3,
         'trenton': 6,
         'sharpening': 5,
         'component': 6,
         'pat': 4,
         'animation': 32,
         'coincidently': 5,
         'cy': 2,
         'smoker': 2,
         'manes': 3,
         'adelaide': 2,
         'prayer': 43,
         'industries': 65,
         'advantageously': 5,
         'dissolute': 3,
         'tendon': 130,
         'barton': 2,
         'ablest': 2,
         'episode': 12,
         'barges': 3,
         'sipping': 4,
         'inoperative': 2,
         'soap': 8,
         'padlocks': 2,
         'vagaries': 2,
         'potemkins': 3,
         'blackguard': 5,
         'smashed': 11,
         'bursitis': 17,
         'goes': 61,
         'prefix': 3,
         'shops': 23,
         'basketful': 2,
         'stepfather': 22,
         'veil': 17,
         'adorers': 2,
         'overhauled': 6,
         'liquors': 3,
         'bottoms': 3,
         'plastun': 2,
         'surest': 4,
         'carlton': 5,
         'friedland': 6,
         'alice': 14,
         'unhealthy': 15,
         'cannula': 9,
         'eleven': 22,
         'persuasions': 3,
         'cawolla': 2,
         'elephants': 2,
         'mechanicks': 2,
         'kitten': 8,
         'promotes': 2,
         'venae': 2,
         'matt': 2,
         'private': 94,
         'essential': 93,
         'creating': 25,
         'exclaiming': 5,
         'extent': 100,
         'oxidising': 2,
         'dessicans': 3,
         'uplands': 4,
         'tops': 4,
         'jerky': 6,
         'irregularity': 6,
         'recruitment': 3,
         'fringes': 17,
         'shopkeepers': 7,
         'tendencies': 16,
         'unconditionally': 3,
         'brandy': 16,
         'camberwell': 3,
         'statue': 9,
         'metatarsal': 9,
         'measurement': 3,
         'enclosures': 2,
         'suspecting': 4,
         'noses': 7,
         'standard': 55,
         'inspection': 19,
         'enterprising': 6,
         'freak': 4,
         'liberating': 2,
         'ordeal': 3,
         'pancras': 2,
         'luxury': 9,
         'livery': 3,
         'anconeus': 2,
         'polypus': 4,
         'leapt': 3,
         'liberally': 2,
         'finish': 50,
         'previously': 56,
         'mccarthy': 38,
         'mallet': 6,
         'bluestocking': 3,
         'conveyance': 8,
         'transformer': 2,
         'compel': 10,
         'blasphemies': 3,
         'suggest': 25,
         'shares': 4,
         'dishonoured': 4,
         'hen': 7,
         'vols': 28,
         'narcotisation': 2,
         'speranski': 80,
         'cherished': 15,
         'overcoat': 27,
         'malbrook': 2,
         'nephroma': 2,
         'habeus': 2,
         'coward': 9,
         'widower': 5,
         'extremely': 52,
         'resembling': 53,
         'understood': 223,
         'impetus': 10,
         'actinomyces': 10,
         'eosinophile': 4,
         'pronounce': 10,
         'arrangements': 30,
         'inevitably': 33,
         'hochgeboren': 2,
         'crusted': 3,
         'weeks': 118,
         'slightest': 26,
         'fords': 2,
         'stimulatingly': 2,
         'economically': 3,
         'thrice': 9,
         'peg': 5,
         'adventurous': 4,
         'mountainous': 3,
         'potch': 2,
         'adults': 27,
         'kindled': 11,
         'have': 3494,
         'sedate': 3,
         'democrats': 94,
         'vaginitis': 2,
         'foo': 2,
         'headgear': 2,
         'gape': 8,
         'reassigned': 2,
         'incompletely': 2,
         'pharmacopoeial': 2,
         'feelings': 79,
         'phone': 3,
         'anger': 60,
         'improvisations': 2,
         'dethrone': 2,
         'toothed': 2,
         'sweetish': 2,
         'tack': 4,
         'unwinding': 3,
         'pediculosis': 2,
         'overfed': 2,
         'rabble': 8,
         'opsonins': 4,
         'ver': 3,
         'postures': 3,
         'entertainment': 8,
         'unkind': 5,
         'lightest': 3,
         'undergone': 10,
         'persons': 120,
         'mention': 47,
         'iodism': 2,
         'sterne': 2,
         'kolocha': 17,
         'wecollect': 2,
         'asked': 778,
         'augury': 2,
         'uhlans': 29,
         'fist': 9,
         'winner': 3,
         'praise': 17,
         'legislative': 32,
         'greediness': 3,
         'resembled': 9,
         'expense': 34,
         'despotic': 2,
         'storage': 4,
         'league': 54,
         'granular': 8,
         'differed': 11,
         'enlistment': 2,
         'authorizing': 17,
         'remembers': 7,
         'outlays': 4,
         'malignant': 89,
         'allowed': 87,
         'rang': 30,
         'wing': 33,
         'sped': 2,
         'map': 39,
         'might': 537,
         'disrupting': 2,
         'pyrexia': 7,
         'besprinkled': 3,
         'reddaway': 3,
         'intrusions': 2,
         'respectively': 11,
         'ilagin': 26,
         'portray': 2,
         'shenkii': 2,
         'drilled': 5,
         'devilish': 4,
         'abate': 6,
         'trophic': 21,
         'soil': 94,
         'smelters': 3,
         'people': 900,
         'minerals': 9,
         'site': 33,
         'ceremony': 18,
         'rostov': 777,
         'occupation': 53,
         'whale': 4,
         'definitely': 36,
         'steam': 31,
         'turbinate': 2,
         'recollection': 24,
         'sports': 3,
         'apportioned': 12,
         'ripon': 2,
         'representations': 4,
         'calcanean': 3,
         'demonetized': 2,
         'cloaks': 10,
         'river': 113,
         'industrial': 99,
         'foreman': 6,
         'girt': 4,
         'close': 220,
         'warne': 2,
         'blubberers': 2,
         'disruption': 5,
         'diligently': 4,
         'blamelessly': 2,
         'cornwall': 3,
         'edinburgh': 26,
         'denouncing': 5,
         'sasha': 3,
         'zeal': 26,
         'shouted': 255,
         'powerlessness': 2,
         'helpless': 23,
         'cheapest': 3,
         'homely': 9,
         'pyosalpynx': 2,
         'poorhouse': 2,
         'adhesive': 4,
         'lumber': 15,
         'cawing': 3,
         'splashed': 7,
         'interphalangeal': 3,
         'maintenance': 11,
         'pervade': 2,
         'beads': 5,
         'scarcity': 5,
         'anticipate': 5,
         'orthodox': 10,
         'sponge': 8,
         'infer': 4,
         'inverted': 4,
         'bleak': 5,
         'main': 110,
         'doddering': 2,
         'ligation': 53,
         'shiloh': 3,
         'haitian': 2,
         'hindrance': 12,
         'amputate': 5,
         'obstructed': 11,
         'extensively': 9,
         'raised': 213,
         'visualized': 2,
         'showed': 150,
         'awaits': 8,
         'specified': 10,
         'gush': 2,
         'partially': 11,
         'calibres': 2,
         'kettles': 2,
         'easy': 123,
         'confirming': 4,
         'hordes': 4,
         'saved': 52,
         'sneered': 3,
         'peevishly': 2,
         'unexpectedly': 58,
         'handedness': 2,
         'gigantic': 24,
         'bouts': 3,
         'hoosier': 2,
         'plural': 4,
         'disuse': 6,
         'flora': 10,
         'wriggles': 2,
         'reproving': 2,
         'unrepaired': 2,
         'repose': 8,
         'baltic': 2,
         'whistle': 29,
         'barbarous': 3,
         'warp': 3,
         'vozdvizhenka': 8,
         'intubation': 7,
         'dreaminess': 2,
         'creek': 4,
         'suffrage': 116,
         'gay': 36,
         'bagovut': 3,
         'dropping': 19,
         'poultry': 3,
         'sentiments': 15,
         'bivouacs': 4,
         'pobox': 5,
         'fuller': 13,
         'bombard': 2,
         'vanishes': 4,
         'independent': 74,
         'card': 31,
         'sheen': 2,
         'descends': 3,
         'cerebritis': 2,
         'sparrows': 2,
         'stab': 9,
         'coverlet': 2,
         'footmarks': 4,
         'homicidal': 3,
         'frustration': 2,
         'sicca': 4,
         'slowly': 134,
         'stifling': 3,
         'poetry': 11,
         'inadvertent': 2,
         'faints': 2,
         'tudor': 2,
         'hancocks': 2,
         'postulated': 2,
         'delegate': 10,
         'ratify': 8,
         'rag': 7,
         'yankovo': 4,
         'staked': 6,
         'leprosy': 5,
         'cheyenne': 2,
         'verbally': 2,
         're': 190,
         'pedunculated': 14,
         'splendidly': 12,
         'troubled': 18,
         'healed': 17,
         'tuning': 2,
         'farmer': 28,
         'bourbon': 3,
         'differently': 26,
         'descry': 2,
         'filter': 5,
         'sob': 20,
         'newfoundland': 2,
         'flattered': 18,
         'lotions': 9,
         'refinements': 6,
         'overtake': 13,
         'offerings': 3,
         'kent': 6,
         'pimple': 2,
         'carousals': 4,
         'intrigue': 12,
         'salut': 2,
         'pliant': 2,
         'zum': 2,
         'saw': 600,
         'load': 9,
         'engineers': 5,
         'imports': 12,
         'indifference': 27,
         'are': 3631,
         'wringing': 3,
         'kidney': 22,
         'nun': 4,
         'wool': 43,
         'utero': 2,
         'tag': 3,
         'hamlet': 2,
         'attain': 51,
         'ingested': 2,
         'longitudinal': 6,
         'detestable': 4,
         'inspector': 32,
         'hernia': 14,
         'circulates': 2,
         'adieu': 7,
         'annulment': 4,
         'tumour': 224,
         'elect': 13,
         'sidorov': 5,
         'eats': 9,
         'bond': 27,
         'messages': 7,
         'surgeon': 44,
         'dissecting': 6,
         'clarifying': 2,
         'stout': 63,
         'wrung': 18,
         'sincere': 23,
         'reverie': 8,
         'stampede': 3,
         'vindicate': 2,
         'cartload': 2,
         'semiopen': 2,
         'gainful': 2,
         'noting': 11,
         'marsh': 4,
         'interfering': 20,
         'sevres': 3,
         'charmee': 2,
         'cherish': 7,
         'beseech': 4,
         'transylvania': 4,
         'new': 1212,
         'weason': 2,
         'frontiersmen': 6,
         'feverish': 25,
         'establish': 46,
         'grassy': 2,
         'assent': 10,
         'muscularly': 4,
         'rire': 2,
         'mcmaster': 11,
         'pineapples': 3,
         'maritime': 6,
         'debauchery': 6,
         'disliked': 21,
         'relationships': 3,
         'swarmed': 5,
         'surpassed': 4,
         'discernible': 4,
         'thyreoid': 22,
         'paddle': 4,
         'zoology': 4,
         'tenn': 4,
         'wood': 89,
         'persuasiveness': 3,
         'bashful': 3,
         'uterine': 12,
         'convention': 152,
         'unshapely': 2,
         'homes': 36,
         'bicipital': 3,
         'admirable': 15,
         'nightshirt': 2,
         'fibrin': 17,
         'bartenstein': 3,
         'guess': 18,
         'circlet': 2,
         'adventitious': 9,
         'indemnify': 6,
         'typewriting': 5,
         'expunging': 2,
         'quand': 6,
         'exactions': 2,
         'receiving': 55,
         'incur': 3,
         'giants': 3,
         'tearing': 23,
         'probing': 4,
         'devise': 10,
         'hearth': 2,
         'placental': 2,
         'hammering': 6,
         'defeating': 3,
         'womanly': 8,
         'jewellery': 3,
         'kuragins': 4,
         'target': 6,
         'stevedore': 2,
         'safest': 3,
         'sounder': 3,
         'likes': 9,
         'appointments': 12,
         'speech': 83,
         'quaker': 3,
         'antonov': 2,
         'proofs': 15,
         'boasting': 6,
         'wegiment': 3,
         'disciplined': 4,
         'occupancy': 2,
         'flare': 2,
         'copenhagen': 3,
         'zenger': 4,
         'battalions': 27,
         'truth': 116,
         'research': 37,
         'recurvatum': 2,
         'notion': 17,
         'dishonour': 2,
         'glittered': 17,
         'kittenish': 2,
         'handed': 81,
         'acquiescence': 2,
         'waggled': 2,
         'forged': 5,
         'unjust': 11,
         'callous': 13,
         'mantles': 2,
         'election': 120,
         'drummed': 2,
         'salve': 2,
         'agrees': 3,
         'discharging': 5,
         'cannons': 2,
         'impure': 3,
         'probable': 28,
         'administration': 118,
         'considerable': 174,
         'forwards': 19,
         'waterloo': 10,
         'flail': 5,
         'claiming': 7,
         'phalanges': 12,
         'bondage': 16,
         'pelageya': 17,
         'bigger': 8,
         'southern': 197,
         'perch': 5,
         'enriched': 4,
         'metaphysis': 3,
         'protection': 64,
         'factotum': 2,
         'cavalryman': 6,
         'radiated': 6,
         'cheerfully': 14,
         'solid': 42,
         'vines': 2,
         'scarves': 4,
         'quick': 82,
         'notabilities': 4,
         'him': 5231,
         'sixteenth': 12,
         'ignoring': 2,
         'deserters': 2,
         'protege': 5,
         'indulgence': 10,
         'supreme': 75,
         'closes': 7,
         'shilling': 5,
         'footing': 16,
         'mission': 35,
         'madame': 44,
         'dissatisfied': 35,
         'signatures': 4,
         'helps': 6,
         'garlic': 2,
         'wart': 6,
         'won': 202,
         'overlooked': 14,
         'lanfrey': 2,
         'dulness': 2,
         'unnatural': 38,
         'supplier': 3,
         'harassed': 4,
         'hare': 37,
         'slide': 5,
         'necessitates': 6,
         'conceived': 8,
         'mode': 19,
         'chant': 2,
         'packing': 35,
         'tentacles': 2,
         'liberality': 2,
         'phantasm': 2,
         'gloat': 3,
         'promptitude': 2,
         'merchants': 55,
         'whatnots': 2,
         'spirited': 13,
         'rupia': 5,
         'succumbs': 2,
         'fondest': 3,
         'rusty': 5,
         'strapped': 2,
         'looking': 490,
         'numerical': 6,
         'jaws': 19,
         'mann': 2,
         'smelter': 2,
         'becher': 4,
         'comprehensive': 5,
         'vessel': 138,
         'code': 14,
         'twopence': 3,
         'semilunar': 9,
         'elected': 45,
         'tone': 167,
         'epithelium': 56,
         'steve': 2,
         'pinckney': 4,
         'knapsack': 8,
         'kneeling': 9,
         'strand': 6,
         'solitude': 19,
         'gentlemanly': 2,
         'thoughts': 126,
         'castanet': 2,
         'bushy': 10,
         'descried': 4,
         'aponeurotic': 2,
         'surrenders': 2,
         'ordered': 149,
         'emancipation': 24,
         'alley': 6,
         'blazers': 3,
         'servants': 89,
         'rests': 18,
         'tooth': 28,
         'risks': 15,
         'yoke': 6,
         'inflammation': 94,
         'blonde': 10,
         'album': 6,
         'duc': 12,
         'thorns': 5,
         'planter': 19,
         'log': 21,
         'swarm': 8,
         'trocar': 4,
         'injured': 56,
         'liquor': 14,
         'perforations': 7,
         'censuring': 2,
         'contracture': 29,
         'informer': 2,
         'switzerland': 9,
         'ponies': 2,
         'glass': 117,
         'burdensome': 5,
         'security': 22,
         'bitch': 13,
         'bacillary': 5,
         'transmuted': 2,
         'atrocious': 3,
         'elects': 2,
         'but': 5654,
         'muir': 3,
         'nikita': 5,
         'muddled': 4,
         'lifts': 3,
         'impresses': 2,
         'slim': 14,
         'maksim': 2,
         'garrulously': 2,
         'utterances': 2,
         'stammered': 6,
         'midsummer': 2,
         'trousseau': 5,
         'hogs': 3,
         'metacarpals': 3,
         'blindfold': 5,
         'nonmoral': 2,
         'moonlight': 19,
         'waddling': 6,
         'pointing': 89,
         'differentiating': 4,
         'silly': 14,
         'ingest': 2,
         'imply': 13,
         'enlarges': 3,
         'cart': 57,
         'differ': 16,
         'inflamed': 54,
         'bolding': 2,
         'shave': 4,
         'adroitly': 4,
         'served': 64,
         'straining': 19,
         'fall': 125,
         'autocrats': 3,
         'unreasonably': 2,
         'salter': 2,
         'ramshackle': 2,
         'eminent': 13,
         'molle': 2,
         'pained': 8,
         'loyal': 21,
         'chalk': 14,
         'willard': 3,
         'idleness': 13,
         'canceled': 4,
         'inoculation': 23,
         'sounding': 7,
         'loopholes': 2,
         'initial': 20,
         'vicar': 3,
         'oklahoma': 16,
         'precautions': 15,
         'commensurable': 2,
         'batch': 2,
         'push': 20,
         'sublimed': 2,
         'warned': 19,
         'meuse': 4,
         'snoring': 7,
         'manner': 136,
         'this': 4064,
         'household': 56,
         'millionaires': 6,
         'duodenum': 3,
         'dixon': 2,
         'liberated': 11,
         'arrangement': 36,
         'soulless': 2,
         'reserved': 16,
         'essen': 2,
         'whirlpool': 2,
         'duport': 9,
         'hellish': 4,
         'engaged': 88,
         'apropos': 2,
         'retuned': 2,
         'cancellation': 2,
         'crack': 21,
         'morrant': 2,
         'pleasurable': 2,
         'sunk': 28,
         'indorsed': 7,
         'wanted': 214,
         'integration': 3,
         'translations': 2,
         'framework': 24,
         'skill': 35,
         'upper': 131,
         'mufti': 2,
         'softened': 29,
         'callosity': 5,
         'thrombosis': 40,
         'septum': 4,
         'sont': 3,
         'sheaf': 2,
         'redistribution': 6,
         'clymer': 2,
         'fatal': 64,
         'laid': 187,
         'nares': 2,
         'archway': 2,
         'deterred': 2,
         'oscillates': 2,
         'ineffective': 2,
         'vacant': 11,
         'confide': 7,
         'nominal': 9,
         'hiram': 2,
         'resected': 12,
         'unwanted': 2,
         'pattern': 8,
         'cicatricial': 42,
         'bowl': 7,
         'nor': 281,
         'wobbly': 3,
         'sarcomas': 9,
         'solder': 2,
         'laceration': 10,
         'illustrating': 6,
         'psammoma': 3,
         'stuck': 21,
         'perversion': 2,
         'jewels': 6,
         'country': 424,
         'enforce': 30,
         'battlefields': 2,
         'speechless': 2,
         'ruin': 49,
         'grassland': 2,
         'mistrusting': 2,
         'bench': 27,
         'scurrying': 2,
         'rhetor': 25,
         'morton': 2,
         'suggestiveness': 2,
         'remorseless': 2,
         'divert': 7,
         'melyukovs': 8,
         'obviates': 2,
         'cooperative': 10,
         'started': 97,
         'employe': 3,
         'perfidiousness': 2,
         'replied': 321,
         'imagined': 49,
         'marvel': 2,
         'pitiable': 8,
         'genteel': 2,
         'unlocked': 9,
         'independence': 152,
         'appealed': 14,
         'flexible': 6,
         'xxiv': 9,
         'authority': 101,
         'frilled': 2,
         'thoroughbred': 6,
         'accuser': 2,
         'hesitating': 17,
         'pork': 4,
         'voltaires': 2,
         'bengal': 3,
         'tends': 49,
         'wastage': 2,
         'shrill': 14,
         'profunda': 4,
         'thanksgiving': 7,
         'enclose': 2,
         'seedy': 3,
         'furs': 11,
         'splash': 5,
         'supplementary': 4,
         'sedately': 3,
         'certainly': 120,
         'puppet': 4,
         'injustice': 14,
         'lanolin': 4,
         'thicknesses': 2,
         'excuses': 7,
         'faces': 163,
         'fourteenth': 28,
         'trimming': 2,
         'prick': 6,
         'economic': 121,
         'stands': 20,
         'programming': 2,
         'princes': 12,
         'lottery': 2,
         'exsiccated': 2,
         'parceled': 2,
         'musculo': 9,
         'standstill': 3,
         'consolation': 19,
         'uncle': 136,
         'federalism': 2,
         'intravenous': 4,
         'assumption': 32,
         'doors': 48,
         'lisa': 2,
         'disregard': 7,
         'appendix': 12,
         'myoma': 9,
         'cost': 66,
         'parrot': 6,
         'manoeuvre': 2,
         'achtung': 2,
         'acorn': 3,
         'aggrieved': 4,
         'cutlery': 2,
         'glorious': 12,
         'rotten': 9,
         'denoted': 2,
         'muster': 4,
         'hug': 2,
         'eccentricity': 3,
         'susquehanna': 4,
         'partake': 4,
         'nicely': 10,
         'thing': 304,
         'wages': 51,
         'dislike': 10,
         'beams': 10,
         'spree': 3,
         'antagonizing': 2,
         'advises': 4,
         'snuffling': 2,
         'rods': 6,
         'borisov': 2,
         'mayor': 16,
         'mathematics': 12,
         'invent': 11,
         'teaching': 19,
         'girdle': 4,
         'averting': 4,
         'elijah': 2,
         'platelets': 2,
         'uvula': 2,
         'mumbling': 5,
         'retreat': 95,
         'fwashing': 2,
         'bar': 26,
         'scudding': 2,
         'nowadays': 19,
         'loftiness': 3,
         'ceded': 9,
         'delirium': 35,
         'wiring': 3,
         'centre': 51,
         'frightfully': 2,
         'wag': 6,
         'sockets': 2,
         'fluctuates': 2,
         'concealment': 4,
         'intima': 11,
         'burnt': 9,
         'tiding': 3,
         'osteomas': 4,
         'tug': 5,
         'vis': 2,
         'fabrication': 2,
         'powers': 150,
         'sweeps': 2,
         'supervene': 7,
         'meal': 16,
         'briskly': 20,
         'reinforce': 3,
         'devriez': 2,
         'youngster': 5,
         'coast': 41,
         'sameness': 2,
         'about': 1498,
         'whither': 9,
         'tolstoy': 14,
         'certain': 362,
         'paunch': 2,
         'laurel': 4,
         'stamping': 4,
         'incorporate': 4,
         ...})

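Edit distance:

The edit distance between two words is the number of insertions (insert a single letter), deletions (remove a single letter), transpositions (swap two adjacent letters), and substitutions (change one letter to another) needed to turn one word into the other.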
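As an illustration of this definition, here is a sketch that computes the distance with dynamic programming. It uses the optimal string alignment variant, which is an assumption on my part. The original code never computes a distance directly, and this function need not agree with repeated edits1 calls in every corner case.

```python
def edit_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute, swap adjacent letters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i letters of a
    for j in range(cols):
        d[0][j] = j  # insert all j letters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(edit_distance("knon", "know"))
print(edit_distance("something", "soothing"))
```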

# Return the set of all strings at edit distance 1 from word
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion

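The number of strings at edit distance 2 from 'something' turns out to be as large as 114,324.

Optimization: among these strings at edit distance 2 or less, keep only the correctly spelled (known) words as candidates. That leaves just 3 words: 'smoothing', 'something', and 'soothing'.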

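# Return the set of all strings at edit distance 2 from word.
# This version is unfiltered; known_edits2 (defined earlier) keeps only the strings found in NWORDS.
def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))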
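To make the filtering step concrete, the sketch below generates all distance-2 strings for 'something' and intersects them with a small dictionary. The dictionary toy_dictionary is a hypothetical stand-in for NWORDS. The edit functions are repeated so that the block runs on its own.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'


def edits1(word):
    n = len(word)
    return set([word[0:i] + word[i+1:] for i in range(n)] +                          # deletion
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] +  # transposition
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +    # alteration
               [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])     # insertion


def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))


# Hypothetical stand-in for NWORDS; 'nothing' and 'sometimes' are more than 2 edits away.
toy_dictionary = {'something', 'soothing', 'smoothing', 'nothing', 'sometimes'}

candidates = edits2('something')
known_candidates = candidates & toy_dictionary
print(len(candidates), sorted(known_candidates))
```

Intersecting with the dictionary shrinks a candidate set of many thousands of strings down to a handful of real words, which is what makes the distance-2 search practical.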

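In reality, mistyping one vowel as another is more likely than mistyping a consonant (people often type hallo for hello), misspelling the first letter of a word is relatively unlikely, and so on. For simplicity, however, a simple rule is used instead: known words at edit distance 1 take priority over those at edit distance 2, and known words at edit distance 0 take priority over those at edit distance 1.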

def known(words): return set(w for w in words if w in NWORDS)

# If a known(...) set is non-empty, candidates takes that set and the later alternatives are never evaluated
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: NWORDS[w])
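The priority rule relies on Python's or returning the first truthy operand, and an empty set is falsy. A minimal sketch with hand-written candidate lists follows. The counts in toy_counts and the candidate lists are hypothetical stand-ins for NWORDS and for the output of edits1 / known_edits2.

```python
# Hypothetical toy counts standing in for NWORDS.
toy_counts = {'know': 50, 'knot': 5, 'knock': 80}


def known(words):
    return set(w for w in words if w in toy_counts)


distance0 = known(['knon'])                    # the input itself is not a known word
distance1 = known(['know', 'knot', 'knob'])    # pretend these came from edits1('knon')
distance2 = known(['knock'])                   # pretend this came from known_edits2('knon')

# An empty set is falsy, so `or` falls through to the first non-empty set.
candidates = distance0 or distance1 or distance2 or ['knon']
best = max(candidates, key=lambda w: toy_counts[w])
print(candidates, best)
```

Even though 'knock' has the highest count, it is never considered, because a non-empty set was already found at distance 1.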

News classification with Naive Bayes

import pandas as pd
import jieba
#pip install jieba

Data source: http://www.sogou.com/labs/resource/ca.php

df_news = pd.read_table('./data/val.txt',names=['category','theme','URL','content'],encoding='utf-8')
df_news = df_news.dropna()
df_news.head()

	category	theme	URL	content
0	汽车	新辉腾　４．２　Ｖ８　４座加长Ｉｎｄｉｖｉｄｕａｌ版２０１１款　最新报价	http://auto.data.people.com.cn/model_15782/	经销商　电话　试驾／订车Ｕ憬杭州滨江区江陵路１７８０号４００８－１１２２３３转５８６４＃保常...
1	汽车	９１８　Ｓｐｙｄｅｒ概念车	http://auto.data.people.com.cn/prdview_165423....	呼叫热线　４００８－１００－３００　服务邮箱　ｋｆ＠ｐｅｏｐｌｅｄａｉｌｙ．ｃｏｍ．ｃｎ
2	汽车	日内瓦亮相　ＭＩＮＩ性能版／概念车－１．６Ｔ引擎	http://auto.data.people.com.cn/news/story_5249...	ＭＩＮＩ品牌在二月曾经公布了最新的ＭＩＮＩ新概念车Ｃｌｕｂｖａｎ效果图，不过现在在日内瓦车展...
3	汽车	清仓大甩卖一汽夏利Ｎ５威志Ｖ２低至３．３９万	http://auto.data.people.com.cn/news/story_6144...	清仓大甩卖！一汽夏利Ｎ５、威志Ｖ２低至３．３９万＝日，启新中国一汽强势推出一汽夏利Ｎ５、威志...
4	汽车	大众敞篷家族新成员　高尔夫敞篷版实拍	http://auto.data.people.com.cn/news/story_5686...	在今年３月的日内瓦车展上，我们见到了高尔夫家族的新成员，高尔夫敞篷版，这款全新敞篷车受到了众...

df_news.shape

(5000, 4)

Word segmentation: using the jieba tokenizer

content = df_news.content.values.tolist()
print (content[1000])

阿里巴巴集团昨日宣布，将在集团管理层面设立首席数据官岗位（Ｃｈｉｅｆ　Ｄａｔａ　Ｏｆｆｉｃｅｒ），阿里巴巴Ｂ２Ｂ公司ＣＥＯ陆兆禧将会出任上述职务，向集团ＣＥＯ马云直接汇报。＞菹ぃ和６月初的首席风险官职务任命相同，首席数据官亦为阿里巴巴集团在完成与雅虎股权谈判，推进“ｏｎｅ　ｃｏｍｐａｎｙ”目标后，在集团决策层面新增的管理岗位。０⒗锛团昨日表示，“变成一家真正意义上的数据公司”已是战略共识。记者刘夏

content_S = []
for line in content:
    current_segment = jieba.lcut(line)
    if len(current_segment) > 1 and line != '\r\n':  # skip lines that contain only a line break
        content_S.append(current_segment)

content_S[1000]

['阿里巴巴',
 '集团',
 '昨日',
 '宣布',
 '，',
 '将',
 '在',
 '集团',
 '管理',
 '层面',
 '设立',
 '首席',
 '数据',
 '官',
 '岗位',
 '（',
 'Ｃ',
 'ｈ',
 'ｉ',
 'ｅ',
 'ｆ',
 '\u3000',
 'Ｄ',
 'ａ',
 'ｔ',
 'ａ',
 '\u3000',
 'Ｏ',
 'ｆ',
 'ｆ',
 'ｉ',
 'ｃ',
 'ｅ',
 'ｒ',
 '）',
 '，',
 '阿里巴巴',
 'Ｂ',
 '２',
 'Ｂ',
 '公司',
 'Ｃ',
 'Ｅ',
 'Ｏ',
 '陆兆禧',
 '将',
 '会',
 '出任',
 '上述',
 '职务',
 '，',
 '向',
 '集团',
 'Ｃ',
 'Ｅ',
 'Ｏ',
 '马云',
 '直接',
 '汇报',
 '。',
 '＞',
 '菹',
 'ぃ',
 '和',
 '６',
 '月初',
 '的',
 '首席',
 '风险',
 '官',
 '职务',
 '任命',
 '相同',
 '，',
 '首席',
 '数据',
 '官亦为',
 '阿里巴巴',
 '集团',
 '在',
 '完成',
 '与',
 '雅虎',
 '股权',
 '谈判',
 '，',
 '推进',
 '“',
 'ｏ',
 'ｎ',
 'ｅ',
 '\u3000',
 'ｃ',
 'ｏ',
 'ｍ',
 'ｐ',
 'ａ',
 'ｎ',
 'ｙ',
 '”',
 '目标',
 '后',
 '，',
 '在',
 '集团',
 '决策',
 '层面',
 '新增',
 '的',
 '管理',
 '岗位',
 '。',
 '０',
 '⒗',
 '锛',
 '团',
 '昨日',
 '表示',
 '，',
 '“',
 '变成',
 '一家',
 '真正',
 '意义',
 '上',
 '的',
 '数据',
 '公司',
 '”',
 '已',
 '是',
 '战略',
 '共识',
 '。',
 '记者',
 '刘夏']

df_content=pd.DataFrame({'content_S':content_S})
df_content.head()

	content_S
0	[经销商, 　, 电话, 　, 试驾, ／, 订车, Ｕ, 憬, 杭州, 滨江区, 江陵, ...
1	[呼叫, 热线, 　, ４, ０, ０, ８, －, １, ０, ０, －, ３, ０, ０...
2	[Ｍ, Ｉ, Ｎ, Ｉ, 品牌, 在, 二月, 曾经, 公布, 了, 最新, 的, Ｍ, Ｉ...
3	[清仓, 大, 甩卖, ！, 一汽, 夏利, Ｎ, ５, 、, 威志, Ｖ, ２, 低至, ...
4	[在, 今年, ３, 月, 的, 日内瓦, 车展, 上, ，, 我们, 见到, 了, 高尔夫...

stopwords=pd.read_csv("stopwords.txt",index_col=False,sep="\t",quoting=3,names=['stopword'], encoding='utf-8')
stopwords.head(20)

	stopword
0	!
1	"
2	#
3	$
4	%
5	&
6	'
7	(
8	)
9	*
10	+
11	,
12	-
13	--
14	.
15	..
16	...
17	......
18	...................
19	./

def drop_stopwords(contents,stopwords):
    contents_clean = []
    all_words = []
    for line in contents:
        line_clean = []
        for word in line:
            if word in stopwords:
                continue
            line_clean.append(word)
            all_words.append(str(word))
        contents_clean.append(line_clean)
    return contents_clean,all_words
    #print (contents_clean)
        

contents = df_content.content_S.values.tolist()    
stopwords = stopwords.stopword.values.tolist()
contents_clean,all_words = drop_stopwords(contents,stopwords)
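The filter above tests each word against a Python list, which scans the list on every lookup. A possible variant, offered as a sketch and not as the original author's code, converts the stop words to a set first. The sample documents and stop words below are hypothetical.

```python
def drop_stopwords(contents, stopwords):
    """Remove stop words from tokenized documents; return cleaned docs and a flat word list."""
    stopword_set = set(stopwords)  # set membership is O(1) on average
    contents_clean = []
    all_words = []
    for line in contents:
        line_clean = [word for word in line if word not in stopword_set]
        contents_clean.append(line_clean)
        all_words.extend(str(word) for word in line_clean)
    return contents_clean, all_words


# Hypothetical tokenized documents and stop-word list.
docs = [['the', 'cat', ',', 'sat'], ['a', 'dog', '!']]
clean, flat = drop_stopwords(docs, ['the', 'a', ',', '!'])
print(clean, flat)
```

The return values have the same shape as in the original function, so the rest of the notebook would not need to change.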

#df_content.content_S.isin(stopwords.stopword)
#df_content=df_content[~df_content.content_S.isin(stopwords.stopword)]
#df_content.head()

df_content=pd.DataFrame({'contents_clean':contents_clean})
df_content.head()

	contents_clean
0	[经销商, 电话, 试驾, 订车, Ｕ, 憬, 杭州, 滨江区, 江陵, 路, 号, 转, ...
1	[呼叫, 热线, 服务, 邮箱, ｋ, ｆ, ｐ, ｅ, ｏ, ｐ, ｌ, ｅ, ｄ, ａ,...
2	[Ｍ, Ｉ, Ｎ, Ｉ, 品牌, 二月, 公布, 最新, Ｍ, Ｉ, Ｎ, Ｉ, 新, 概念...
3	[清仓, 甩卖, 一汽, 夏利, Ｎ, 威志, Ｖ, 低至, 万, 启新, 中国, 一汽, ...
4	[日内瓦, 车展, 见到, 高尔夫, 家族, 新, 成员, 高尔夫, 敞篷版, 款, 全新,...

df_all_words=pd.DataFrame({'all_words':all_words})
df_all_words.head()

	all_words
0	经销商
1	电话
2	试驾
3	订车
4	Ｕ

# Count how often each word occurs and sort by frequency, highest first
words_count=df_all_words.groupby('all_words').size().reset_index(name='count')
words_count=words_count.sort_values(by=["count"],ascending=False)
words_count.head()

	all_words	count
4077	中	5199
4209	中国	3115
88255	说	3055
104747	Ｓ	2646
1373	万	2390
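For a quick check without pandas, the same frequency table can be sketched with the standard library. The word list below is a hypothetical sample, not the real all_words.

```python
from collections import Counter

# Hypothetical sample standing in for the all_words list.
all_words = ['中国', '说', '中', '中国', '中', '中']

word_counts = Counter(all_words)
top = word_counts.most_common(2)  # (word, count) pairs, most frequent first
print(top)
```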

from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)

wordcloud=WordCloud(font_path="./data/simhei.ttf",background_color="white",max_font_size=80)
word_frequence = {x[0]:x[1] for x in words_count.head(100).values}
wordcloud=wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)

<matplotlib.image.AxesImage at 0x186064c64e0>

[Figure: word cloud of the 100 most frequent words]

TF-IDF: keyword extraction

import jieba.analyse
index = 2400
print (df_news['content'][index])
content_S_str = "".join(content_S[index])  
print ("  ".join(jieba.analyse.extract_tags(content_S_str, topK=5, withWeight=False)))

法国ＶＳ西班牙、里贝里ＶＳ哈维，北京时间６月２４日凌晨一场的大战举世瞩目，而这场胜利不仅仅关乎两支顶级强队的命运，同时也是他们背后的球衣赞助商耐克和阿迪达斯之间的一次角逐。Ｔ谌胙”窘炫分薇的１６支球队之中，阿迪达斯和耐克的势力范围也是几乎旗鼓相当：其中有５家球衣由耐克提供，而阿迪达斯则赞助了６家，此外茵宝有３家，而剩下的两家则由彪马赞助。而当比赛进行到现在，率先挺进四强的两支球队分别被耐克支持的葡萄牙和阿迪达斯支持的德国占据，而由于最后一场１／４决赛是茵宝（英格兰）和彪马（意大利）的对决，这也意味着明天凌晨西班牙同法国这场阿迪达斯和耐克在１／４决赛的唯一一次直接交手将直接决定两家体育巨头在此次欧洲杯上的胜负。８据评估，在２０１２年足球商品的销售额能总共超过４０亿欧元，而单单是不足一个月的欧洲杯就有高达５亿的销售额，也就是说在欧洲杯期间将有７００万件球衣被抢购一空。根据市场评估，两大巨头阿迪达斯和耐克的市场占有率也是并驾齐驱，其中前者占据３８％，而后者占据３６％。体育权利顾问奥利弗－米歇尔在接受《队报》采访时说：“欧洲杯是耐克通过法国翻身的一个绝佳机会！”Ｃ仔尔接着谈到两大赞助商的经营策略：“竞技体育的成功会燃起球衣购买的热情，不过即便是水平相当，不同国家之间的欧洲杯效应却存在不同。在德国就很出色，大约１／４的德国人通过电视观看了比赛，而在西班牙效果则差很多，由于民族主义高涨的加泰罗尼亚地区只关注巴萨和巴萨的球衣，他们对西班牙国家队根本没什么兴趣。”因此尽管西班牙接连拿下欧洲杯和世界杯，但是阿迪达斯只为西班牙足协支付每年２６００万的赞助费＃相比之下尽管最近两届大赛表现糟糕法国足协将从耐克手中每年可以得到４０００万欧元。米歇尔解释道：“法国创纪录的４０００万欧元赞助费得益于阿迪达斯和耐克竞逐未来１５年欧洲市场的竞争。耐克需要笼络一个大国来打赢这场欧洲大陆的战争，而尽管德国拿到的赞助费并不太高，但是他们却显然牢牢掌握在民族品牌阿迪达斯手中。从长期投资来看，耐克给法国的赞助并不算过高。”
耐克  阿迪达斯  欧洲杯  球衣  西班牙
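
To make the ranking less of a black box, here is a rough sketch of the TF-IDF idea in plain Python. It is not jieba's implementation: as far as I know, `jieba.analyse.extract_tags` relies on a precomputed IDF table, whereas this sketch derives IDF from a tiny hypothetical corpus. The function name `tf_idf` and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(doc, corpus):
    """Return {token: tf-idf weight} for one tokenized document.

    tf  = occurrences of the token in doc / number of tokens in doc
    idf = log(N / number of documents that contain the token)
    The document is assumed to be part of the corpus, so df is never zero.
    """
    n_docs = len(corpus)
    counts = Counter(doc)
    weights = {}
    for token, count in counts.items():
        tf = count / len(doc)
        df = sum(1 for d in corpus if token in d)
        weights[token] = tf * math.log(n_docs / df)
    return weights

# Hypothetical toy corpus: three already-tokenized documents.
corpus = [
    ["nike", "adidas", "jersey", "the"],
    ["adidas", "sponsor", "the"],
    ["film", "director", "the"],
]
weights = tf_idf(corpus[0], corpus)

# Keywords are the tokens with the highest weights.
top_keywords = sorted(weights, key=weights.get, reverse=True)[:2]
print(top_keywords)
```

A token that appears in every document (here `"the"`) gets weight 0, which is why very common words drop out of the keyword list even when they are frequent in the document itself.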

LDA: topic model

Input format: a list of lists, i.e. the entire corpus already tokenized, with one list of tokens per document.

from gensim import corpora, models, similarities
import gensim
#http://radimrehurek.com/gensim/

C:\Anaconda3\lib\site-packages\gensim\utils.py:860: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

# Build the token-to-id mapping, then convert each document to a bag of words
dictionary = corpora.Dictionary(contents_clean)
corpus = [dictionary.doc2bow(sentence) for sentence in contents_clean]
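
For readers unfamiliar with the bag-of-words format, the sketch below imitates in plain Python what the mapping and `doc2bow` steps produce: each document becomes a list of `(token_id, count)` pairs. The helpers `build_vocab` and `to_bow` are hypothetical names, and the ids gensim assigns may differ from the first-seen order used here.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Turn a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical tokenized documents.
docs = [["电影", "导演", "电影"], ["比赛", "球队"]]
token2id = build_vocab(docs)
bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(bow_corpus)
```

Word order is discarded in this representation; only the count of each token per document is kept.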

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20) # num_topics is chosen by hand, like K in k-means

# Top five words of topic 1
print (lda.print_topic(1, topn=5))

0.007*"中" + 0.006*"说" + 0.004*"观众" + 0.002*"赛区" + 0.002*"岁"

for topic in lda.print_topics(num_topics=20, num_words=5):
    print (topic[1])

0.007*"女人" + 0.006*"男人" + 0.006*"Ｍ" + 0.004*"Ｓ" + 0.004*"说"
0.004*"中" + 0.004*"训练" + 0.003*"说" + 0.003*"学校" + 0.002*"研究生"
0.006*"戏" + 0.006*"导演" + 0.005*"该剧" + 0.004*"中" + 0.004*"演员"
0.007*"中" + 0.006*"说" + 0.004*"观众" + 0.002*"赛区" + 0.002*"岁"
0.004*"万" + 0.003*"号" + 0.003*"中" + 0.002*"Ｓ" + 0.002*"Ｒ"
0.014*"电影" + 0.009*"导演" + 0.007*"影片" + 0.006*"中国" + 0.005*"中"
0.006*"中" + 0.005*"比赛" + 0.004*"说" + 0.003*"撒" + 0.002*"时间"
0.006*"赛季" + 0.005*"中" + 0.003*"联赛" + 0.003*"中国" + 0.002*"航母"
0.005*"李小璐" + 0.004*"中" + 0.002*"贾乃亮" + 0.002*"Ｗ" + 0.002*"皮肤"
0.004*"万" + 0.003*"号" + 0.003*"Ｖ" + 0.003*"Ｔ" + 0.003*"刘涛"
0.021*"男人" + 0.008*"女人" + 0.007*"考生" + 0.004*"说" + 0.003*"中"
0.005*"中" + 0.005*"食物" + 0.004*"ｉ" + 0.004*"ａ" + 0.004*"吃"
0.006*"中" + 0.004*"电影" + 0.004*"说" + 0.002*"中国" + 0.002*"高考"
0.007*"中" + 0.006*"孩子" + 0.004*"说" + 0.003*"教育" + 0.003*"中国"
0.005*"中" + 0.005*"节目" + 0.004*"说" + 0.004*"表演" + 0.003*"岁"
0.007*"电视剧" + 0.004*"中" + 0.003*"说" + 0.003*"飞行" + 0.002*"飞机"
0.007*"中" + 0.006*"球队" + 0.005*"选手" + 0.004*"观众" + 0.004*"ｉ"
0.005*"中" + 0.005*"天籁" + 0.004*"产品" + 0.004*"肌肤" + 0.003*"职场"
0.008*"中国" + 0.008*"饰演" + 0.007*"中" + 0.004*"说" + 0.004*"节目"
0.021*"ｅ" + 0.021*"ａ" + 0.016*"ｏ" + 0.013*"ｉ" + 0.013*"ｎ"

df_train=pd.DataFrame({'contents_clean':contents_clean,'label':df_news['category']})
df_train.tail()

	contents_clean	label
4995	[天气, 炎热, 补水, 变得, 美国, 跑步, 世界, 杂志, 报道, 喝水, 身体, 补...	时尚
4996	[不想, 说, 话, 刺激, 说, 做, 只能, 走, 离开, 伤心地, 想起, 一句, 话...	时尚
4997	[岁, 刘晓庆, 最新, 嫩照, Ｏ, 衷, 诘, 牧跸, 庆, 看不出, 岁, 秒杀, 刘...	时尚
4998	[导语, 做, 爸爸, 一种, 幸福, 无论是, 领养, 亲生, 更何况, 影视剧, 中, ...	时尚
4999	[全球, 最美, 女人, 合成图, 国, 整形外科, 教授, 李承哲, 国际, 学术, 杂志...	时尚

df_train.label.unique()

array(['汽车', '财经', '科技', '健康', '体育', '教育', '文化', '军事', '娱乐', '时尚'], dtype=object)

label_mapping = {"汽车": 1, "财经": 2, "科技": 3, "健康": 4, "体育":5, "教育": 6,"文化": 7,"军事": 8,"娱乐": 9,"时尚": 0}
df_train['label'] = df_train['label'].map(label_mapping)
df_train.head()

	contents_clean	label
0	[经销商, 电话, 试驾, 订车, Ｕ, 憬, 杭州, 滨江区, 江陵, 路, 号, 转, ...	1
1	[呼叫, 热线, 服务, 邮箱, ｋ, ｆ, ｐ, ｅ, ｏ, ｐ, ｌ, ｅ, ｄ, ａ,...	1
2	[Ｍ, Ｉ, Ｎ, Ｉ, 品牌, 二月, 公布, 最新, Ｍ, Ｉ, Ｎ, Ｉ, 新, 概念...	1
3	[清仓, 甩卖, 一汽, 夏利, Ｎ, 威志, Ｖ, 低至, 万, 启新, 中国, 一汽, ...	1
4	[日内瓦, 车展, 见到, 高尔夫, 家族, 新, 成员, 高尔夫, 敞篷版, 款, 全新,...	1

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df_train['contents_clean'].values, df_train['label'].values, random_state=1)

#x_train = x_train.flatten()
x_train[0][1]

'上海'

# Join each tokenized document into one space-separated string, the input format CountVectorizer expects.
words = []
for line_index in range(len(x_train)):
    try:
        words.append(' '.join(x_train[line_index]))
    except TypeError:
        # join fails on a non-string token; report which document caused it
        print(line_index)
words[0]

'中新网 上海 日电 于俊 父亲节 网络 吃 一顿 电影 快餐 微 电影 爸 对不起 我爱你 定于 本月 父亲节 当天 各大 视频 网站 首映 葜 谱 鞣 剑 保慈 障蚣 钦 呓 樯 埽 ⒌ 缬 埃 ǎ 停 椋 悖 颍 铩 妫 椋 恚 称 微型 电影 新 媒体 平台 播放 状态 短时 休闲 状态 观看 完整 策划 系统 制作 体系 支持 显示 较完整 故事情节 电影 微 超短 放映 微 周期 制作 天 数周 微 规模 投资 人民币 几千 数万元 每部 内容 融合 幽默 搞怪 时尚 潮流 人文 言情 公益 教育 商业 定制 主题 单独 成篇 系列 成剧 唇 开播 微 电影 爸 对不起 我爱你 讲述 一对 父子 观念 缺少 沟通 导致 关系 父亲 传统 固执 钟情 传统 生活 方式 儿子 新派 音乐 达 习惯 晚出 早 生活 性格 张扬 叛逆 两种 截然不同 生活 方式 理念 差异 一场 父子 间 拉开序幕 子 失手 打破 父亲 心爱 物品 父亲 赶出 家门 剧情 演绎 父亲节 妹妹 哥哥 化解 父亲 这场 矛盾 映逋坏 嚼 斫 狻 ⒍ 粤 ⒌ 桨容 争执 退让 传统 尴尬 父子 尴尬 情 男人 表达 心中 那份 感恩 一杯 滤挂 咖啡 父亲节 变得 温馨 镁 缬 缮 虾 Ｎ 逄 煳 幕 传播 迪欧 咖啡 联合 出品 出品人 希望 观摩 扪心自问 父亲节 父亲 记得 父亲 生日 哪一天 父亲 爱喝 跨出 家门 那一刻 感觉 一颗 颤动 心 操劳 天下 儿女 父亲节 大声 喊出 父亲 家人 爱 完'

print (len(words))

from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)

print(cv.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(cv_fit.toarray())


print(cv_fit.toarray().sum(axis=0))

['bird', 'cat', 'dog', 'fish']
[[0 1 1 1]
 [0 2 1 0]
 [1 0 0 1]
 [1 0 0 0]]
[2 3 2 2]

from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer(ngram_range=(1,4))
cv_fit=cv.fit_transform(texts)

print(cv.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(cv_fit.toarray())


print(cv_fit.toarray().sum(axis=0))

['bird', 'cat', 'cat cat', 'cat fish', 'dog', 'dog cat', 'dog cat cat', 'dog cat fish', 'fish', 'fish bird']
[[0 1 0 1 1 1 0 1 1 0]
 [0 2 1 0 1 1 1 0 0 0]
 [1 0 0 0 0 0 0 0 1 1]
 [1 0 0 0 0 0 0 0 0 0]]
[2 3 1 1 2 2 1 1 2 1]
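
The n-gram matrix above can be reproduced by hand, which may help clarify what `ngram_range=(1,4)` does. The following is a simplified sketch, assuming plain whitespace tokenization; the real `CountVectorizer` uses a regular-expression token pattern and, by default, lowercases text and drops single-character tokens. The helper `ngrams` is a hypothetical name.

```python
from collections import Counter

def ngrams(tokens, n_min, n_max):
    """Yield every contiguous n-gram (joined by spaces) for n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# One Counter of n-gram occurrences per document.
doc_counts = [Counter(ngrams(t.split(), 1, 4)) for t in texts]

# Vocabulary: all distinct n-grams, sorted alphabetically.
vocab = sorted(set().union(*doc_counts))

# Document-term matrix: one row per document, one column per n-gram.
matrix = [[c[term] for term in vocab] for c in doc_counts]
print(vocab)
print(matrix)
```

No document here has more than three tokens, so no 4-gram is ever produced even though the range allows it.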

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer='word', max_features=4000,  lowercase = False)
vec.fit(words)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vec.transform(words), y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

# Apply the same join to the test documents.
test_words = []
for line_index in range(len(x_test)):
    try:
        test_words.append(' '.join(x_test[line_index]))
    except TypeError:
        # join fails on a non-string token; report which document caused it
        print(line_index)
test_words[0]

'国家 公务员 考试 申论 应用文 类 试题 实质 一道 集 概括 分析 提出 解决问题 一体 综合性 试题 说 一道 客观 凝练 申发 论述 文章 题目 分析 历年 国考 申论 真题 公文 类 试题 类型 多样 包括 公文 类 事务性 文书 类 题材 从题 干 作答 材料 内容 整合 分析 无需 太 创造性 发挥 纵观 历年 申论 真题 作答 应用文 类 试题 文种 格式 作出 特别 重在 内容 考查 行文 格式 考生 平常心 面对 应用文 类 试题 准确 把握 作答 领会 内在 含义 把握 题材 主旨 材料 结构 轻松 应对 应用文 类 试题 Ｒ 弧 ⒆ 钒 盐 展文 写作 原则 Ｔ 材料 中来 应用文 类 试题 材料 总体 把握 客观 考生 材料 中来 材料 中 把握 材料 准确 理解 题材 主旨 Ｔ 政府 角度 作答 应用文 类 试题 更应 注重 政府 角度 观点 政府 角度 出发 原则 表述 观点 提出 解决 之策 考生 作答 站 政府 人员 角度 看待 提出 解决问题 Ｔ 文体 结构 形式 考查 重点 文体 结构 大部分 评分 关键点 解答 方法 薄 ⒆ ス 丶 词 明 方向 作答 题目 题干 作答 作答 方向 作答 角度 关键 向导 考生 仔细阅读 题干 作答 抓住 关键词 作答 方向 相关 要点 整理 作答 思路 年国考 地市级 真 题为 例 潦惺姓 府 宣传 推进 近海 水域 污染 整治 工作 请 给定 资料 市政府 工作人员 身份 草拟 一份 宣传 纲要 Ｒ 求 保对 宣传 内容 要点 提纲挈领 陈述 玻 体现 政府 精神 全市 各界 关心 支持 污染 整治 工作 通俗易懂 超过 字 肮 丶 词 近海 水域 污染 整治 工作 市政府 工作人员 身份 宣传 纲要 提纲挈领 陈述 体现 政府 精神 全市 各界 关心 支持 污染 整治 工作 通俗易懂 提示 归结 作答 要点 包括 污染 情况 原因 解决 对策 作答 思路 情况 原因 对策 意义 逻辑 顺序 安排 文章 结构 病 ⒋ 缶殖 龇 ⅲ 明 结构 解答 应用文 类 试题 考生 材料 整体 出发 大局 出发 高屋建瓴 把握 材料 主题 思想 事件 起因 解决 对策 阅读文章 构建 文章 结构 直至 快速 解答 场 ⒗ 硭 乘悸 罚明 逻辑 应用文 类 试题 严密 逻辑思维 情况 原因 对策 意义 考生 作答 先 弄清楚 解答 思路 统筹安排 脉络 清晰 逻辑 表达 内容 表述 础 把握 明 详略 考生 仔细阅读 分析 揣摩 应用文 类 试题 内容 答题 时要 详略 得当 主次 分明 安排 内容 增加 文章 层次感 阅卷 老师 阅卷 时能 明白 清晰 一目了然 玻埃 保蹦旯 考 考试 申论 试卷 分为 省级 地市级 两套 试卷 能力 大有 省级 申论 试题 考生 宏观 角度看 注重 深度 广度 考生 深谋远虑 地市级 试题 考生 微观 视角 观察 侧重 考查 解决 能力 考生 贯彻执行 作答 区别对待'

classifier.score(vec.transform(test_words), y_test)

0.80400000000000005
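
The `alpha=1.0` shown in the `MultinomialNB` output is the smoothing parameter: every word is treated as if it had already been seen `alpha` times in every class, so an unseen word never forces a probability of zero. The sketch below is a simplified plain-Python illustration of that idea, not scikit-learn's implementation; the function names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha (Laplace) smoothing."""
    vocab = {t for doc in docs for t in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    # log P(c): share of training documents in each class
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    # log P(t|c): smoothed share of class c's tokens that are t
    log_likelihood = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; out-of-vocabulary tokens are ignored."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in doc if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical toy training set: two classes, two documents each.
train_docs = [
    ["match", "team", "season"],
    ["team", "league"],
    ["film", "director"],
    ["director", "actor", "film"],
]
train_labels = ["sports", "sports", "entertainment", "entertainment"]

log_prior, log_likelihood = train_nb(train_docs, train_labels)
prediction = predict_nb(["team", "match", "director"], log_prior, log_likelihood)
print(prediction)
```

Working in log space turns the product of many small probabilities into a sum, which avoids floating-point underflow on long documents.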

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer='word', max_features=4000,  lowercase = False)
vectorizer.fit(words)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=1.0, max_features=4000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\b\w\w+\b', tokenizer=None, use_idf=True,
        vocabulary=None)

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vectorizer.transform(words), y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

classifier.score(vectorizer.transform(test_words), y_test)

0.81520000000000004


Category: AI
Views: 78
Published: 2023-10-27 18:30:21
Link: https://www.kaopuke.com/article/k-p-k_13_u_23_o_22_f5_12__7_c1.html
