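I'm 花痴蜜粉, a blogger at 靠谱客. This article explains how to convert PPT and Word files into txt documents with Python; I'm sharing it here in the hope that it serves as a useful reference.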

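  I recently wrote a program that computes the keywords of a document, using the TF-IDF method to extract them. As part of that, Python has to read the contents of MS Word and MS PowerPoint files. Some of the explanation and code is posted here; corrections are welcome.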
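The post does not show the keyword-extraction step itself, so here is a minimal sketch of TF-IDF scoring on text that has already been extracted and tokenized. The function names `tf_idf` and `top_keywords` are made up for illustration, the formula is the plain `tf * log(N / df)` variant, and Chinese text would first need a word-segmentation step that is not shown here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(docs, k=3):
    """Return the k highest-scoring terms of each document (ties sorted by term)."""
    return [
        [term for term, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for s in tf_idf(docs)
    ]


if __name__ == "__main__":
    corpus = [
        "python reads word documents".split(),
        "python reads powerpoint slides".split(),
    ]
    print(top_keywords(corpus, k=2))
```

Terms that occur in every document get a score of zero, which is why the txt files produced later should be scored together as one corpus rather than one at a time.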

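  First, a word about win32com. It is a module for talking to Windows (it ships with the pywin32 package and drives applications through COM), and frankly it is very powerful: examples online show it opening Word, PPT, Excel and Access, simulating browser behavior, and more. Download: http://starship.python.net/~skippy/win32/Downloads.html. The feature examples below are pasted from other material found online.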

1. Word features: http://my.oschina.net/duxuefeng/blog/64137 (this one is written fairly clearly)

    import win32com.client
    from win32com.client import constants

    # Placeholder values; replace them with your own paths and strings.
    filenamein = r'C:\path\to\input.doc'
    filenameout = r'C:\path\to\output.html'
    OldStr = 'old text'
    NewStr = 'new text'

    w = win32com.client.Dispatch('Word.Application')
    # Or use the call below to start a separate process:
    # w = win32com.client.DispatchEx('Word.Application')

    # Run in the background: no window, no alerts
    w.Visible = 0
    w.DisplayAlerts = 0

    # Open an existing file
    doc = w.Documents.Open(FileName=filenamein)
    # doc = w.Documents.Add()  # or create a new document

    # Insert text
    myRange = doc.Range(0, 0)
    myRange.InsertBefore('Hello from Python!')

    # Apply a style. Range.Select() returns nothing, so set the style on the
    # range itself. The constants are only populated once the Word type library
    # has been generated, e.g. via win32com.client.gencache.EnsureDispatch.
    myRange.Style = constants.wdStyleHeading1

    # Replace text in the body
    # Arguments: FindText, MatchCase, MatchWholeWord, MatchWildcards,
    # MatchSoundsLike, MatchAllWordForms, Forward, Wrap (1 = wdFindContinue),
    # Format, ReplaceWith, Replace (2 = wdReplaceAll)
    w.Selection.Find.ClearFormatting()
    w.Selection.Find.Replacement.ClearFormatting()
    w.Selection.Find.Execute(OldStr, False, False, False, False, False, True, 1, True, NewStr, 2)

    # Replace text in the header
    w.ActiveDocument.Sections[0].Headers[0].Range.Find.ClearFormatting()
    w.ActiveDocument.Sections[0].Headers[0].Range.Find.Replacement.ClearFormatting()
    w.ActiveDocument.Sections[0].Headers[0].Range.Find.Execute(OldStr, False, False, False, False, False, True, 1, False, NewStr, 2)

    # Table operations
    doc.Tables[0].Rows[0].Cells[0].Range.Text = '123123'
    doc.Tables[0].Rows.Add()  # append a row

    # Convert to HTML
    wc = win32com.client.constants
    w.ActiveDocument.WebOptions.RelyOnCSS = 1
    w.ActiveDocument.WebOptions.OptimizeForBrowser = 1
    w.ActiveDocument.WebOptions.BrowserLevel = 0  # constants.wdBrowserLevelV4
    w.ActiveDocument.WebOptions.OrganizeInFolder = 0
    w.ActiveDocument.WebOptions.UseLongFileNames = 1
    w.ActiveDocument.WebOptions.RelyOnVML = 0
    w.ActiveDocument.WebOptions.AllowPNG = 1
    w.ActiveDocument.SaveAs(FileName=filenameout, FileFormat=wc.wdFormatHTML)

    # Print
    doc.PrintOut()

    # Close
    # doc.Close()
    w.Documents.Close(wc.wdDoNotSaveChanges)
    w.Quit()

2. Excel features: see the same URL as above

3. Access features:

    import win32com.client

    oAccess = win32com.client.Dispatch('Access.Application')
    DbFile = r'C:\Users\hans\Documents\NewDb.mdb'
    dbLangGeneral = ';LANGID=0x0409;CP=1252;COUNTRY=0'
    # dbVersion40 = 64
    dbVersion = 64
    oAccess.DBEngine.CreateDatabase(DbFile, dbLangGeneral, dbVersion)
    oAccess.Quit()
    del oAccess

4. Simulating browser behavior: http://www.cnblogs.com/chenzehe/archive/2010/09/01/1814397.html

    #!/usr/bin/env python
    # -*- coding: UTF-8 -*-
    '''
    Created on 2010-9-1
    @author: chenzehe
    '''
    import win32com.client
    from time import sleep

    loginurl = 'http://passport.cnblogs.com/login.aspx'
    loginouturl = 'http://passport.cnblogs.com/logout.aspx'
    username = 'XXX'
    password = 'XXX'

    ie = win32com.client.Dispatch("InternetExplorer.Application")
    ie.Visible = 0

    ie.Navigate(loginurl)
    print("Opening the login page")
    # Poll until the page has finished loading (ReadyState 4 = complete)
    while True:
        state = ie.ReadyState
        if state == 4:
            break
        sleep(1)
    print("Page loaded, entering user name and password")

    ie.Document.getElementById("tbUserName").value = username
    ie.Document.getElementById("tbPassword").value = password
    ie.Document.getElementById("btnLogin").click()

    # Wait until the browser has been redirected to the home page
    while True:
        state = ie.ReadyState
        print(state)
        if state == 4 and str(ie.LocationURL) == "http://home.cnblogs.com/":
            break
        sleep(1)
    print("Login succeeded")
    print("Your nickname is:")
    print(ie.Document.getElementById('lnk_current_user').title)

    # cnblogs only allows one login at a time, so log out
    print("Logging out!")
    ie.Navigate(loginouturl)

5. Playing an mp3 file: http://www.sharejs.com/codes/python/5733

    from win32com.client import Dispatch

    mp = Dispatch("WMPlayer.OCX")
    # use an mp3 file you have ...
    # tune = mp.newMedia("C:/Program Files/Common Files/HP/Memories Disc/2.0/audio/Swing.mp3")
    # or copy one to the working folder ...
    # tune = mp.newMedia("Bier1.mp3")
    # you can also play wma files, this cool sound came with XP ...
    tune = mp.newMedia("C:/WINDOWS/system32/oobe/images/title.wma")
    mp.currentPlaylist.appendItem(tune)
    mp.controls.play()
    # to stop playing use
    input("Press Enter to stop playing")
    mp.controls.stop()

  This thing really does feel powerful! Back to the point: here is the code that converts Word and PPT files to txt.

    # coding: utf-8
    import os

    import pythoncom
    import win32com.client
    import win32con
    import win32gui


    class MSOffice2txt(object):
        def __init__(self, fileType=('doc', 'ppt')):
            self.docCom = None
            self.pptCom = None
            # __init__ cannot return a value, so report a bad argument by raising
            if not isinstance(fileType, (list, tuple)):
                raise TypeError('fileType must be a list such as ["doc", "ppt"]')
            pythoncom.CoInitialize()
            for ft in fileType:
                if ft == 'doc':
                    self.docCom = self.docApplicationOpen()
                elif ft == 'ppt':
                    self.pptCom = self.pptApplicationOpen()

        def close(self):
            self.docApplicationClose(self.docCom)
            self.pptApplicationClose(self.pptCom)
            self.docCom = None
            self.pptCom = None

        def docApplicationOpen(self):
            docCom = win32com.client.Dispatch('Word.Application')
            docCom.Visible = 1
            docCom.DisplayAlerts = 0
            # Hide the application window if one with this title is found
            docHwnd = win32gui.FindWindow(None, 'Microsoft Word')
            if docHwnd:
                win32gui.ShowWindow(docHwnd, win32con.SW_HIDE)
            return docCom

        def docApplicationClose(self, docCom):
            if docCom is not None:
                docCom.Quit()

        def doc2Txt(self, docCom, docFile, txtFile):
            doc = docCom.Documents.Open(FileName=docFile, ReadOnly=1)
            doc.SaveAs(txtFile, 2)  # 2 = wdFormatText
            doc.Close()

        def pptApplicationOpen(self):
            pptCom = win32com.client.Dispatch('PowerPoint.Application')
            pptCom.Visible = 1
            pptCom.DisplayAlerts = 0
            # Hide the application window if one with this title is found
            pptHwnd = win32gui.FindWindow(None, 'Microsoft PowerPoint')
            if pptHwnd:
                win32gui.ShowWindow(pptHwnd, win32con.SW_HIDE)
            return pptCom

        def pptApplicationClose(self, pptCom):
            if pptCom is not None:
                pptCom.Quit()

        def ppt2txt(self, pptCom, pptFile, txtFile):
            ppt = pptCom.Presentations.Open(pptFile, ReadOnly=1, Untitled=0, WithWindow=0)
            try:
                with open(txtFile, 'w', encoding='gb18030') as f:
                    # Slides and Shapes are 1-based COM collections
                    for i in range(1, ppt.Slides.Count + 1):
                        for j in range(1, ppt.Slides(i).Shapes.Count + 1):
                            shape = ppt.Slides(i).Shapes(j)
                            if shape.HasTextFrame:
                                f.write(shape.TextFrame.TextRange.Text)
                                # Separate shapes so their words do not run together
                                f.write('\n')
            finally:
                ppt.Close()

        def translate(self, filename, txtFilename):
            # Office resolves relative paths against its own working directory,
            # so pass absolute paths to it.
            filename = os.path.abspath(filename)
            txtFilename = os.path.abspath(txtFilename)
            lower = filename.lower()
            if lower.endswith(('.doc', '.docx')):
                if self.docCom is None:
                    self.docCom = self.docApplicationOpen()
                self.doc2Txt(self.docCom, filename, txtFilename)
                return True
            elif lower.endswith(('.ppt', '.pptx')):
                if self.pptCom is None:
                    self.pptCom = self.pptApplicationOpen()
                self.ppt2txt(self.pptCom, filename, txtFilename)
                return True
            else:
                return False


    if __name__ == '__main__':
        msoffice = MSOffice2txt()
        filename = r'F:\study.docx'
        try:
            if msoffice.translate(filename, 'temp.txt'):
                print('Succeeded!')
            else:
                print('Failed!')
        finally:
            # Always shut down the Office applications that were started
            msoffice.close()
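The extension check in `translate` can also be written with `os.path.splitext`, which compares the whole extension and makes case-insensitive matching easy. The sketch below is only an illustration: `office_kind`, `txt_name_for` and `normalize_slide_text` are hypothetical helper names, not part of the class above. The last helper assumes, as far as I know, that PowerPoint text uses a carriage return between paragraphs and a vertical tab for manual line breaks.

```python
import os

WORD_EXTS = {".doc", ".docx"}
PPT_EXTS = {".ppt", ".pptx"}


def office_kind(filename):
    """Classify a path as 'word', 'ppt' or None by its full extension."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in WORD_EXTS:
        return "word"
    if ext in PPT_EXTS:
        return "ppt"
    return None


def txt_name_for(filename):
    """Build the output path: same base name, .txt extension."""
    return os.path.splitext(filename)[0] + ".txt"


def normalize_slide_text(text):
    """Replace carriage returns and vertical tabs with plain newlines."""
    return text.replace("\r\n", "\n").replace("\r", "\n").replace("\x0b", "\n")


if __name__ == "__main__":
    for name in ["study.docx", "Talk.PPT", "notes.mydoc"]:
        print(name, office_kind(name), txt_name_for(name))
```

Normalizing line breaks before tokenizing keeps words on adjacent slide lines from being glued together when the txt output is fed to the keyword extractor.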

 

   

Reposted from: https://www.cnblogs.com/AlgorithmDot/p/3386918.html
