Extracting text from documents

Posted by 怡然白昼 on 靠谱客.

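This tutorial series covers the main use cases of txtai, an AI-powered semantic search platform. Each chapter in the series has associated code and can also be run in Colab.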

This article shows how to extract and split text from documents to support similarity search.

Install dependencies

Install txtai and all dependencies. Since this article uses optional pipelines, we need to install the pipeline extras package.

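# Shell: install txtai with the pipeline extras (quoted so the shell does not expand the brackets)
pip install "txtai[pipeline]"

# Shell: get test data
wget -N https://github.com/neuml/txtai/releases/download/v2.0.0/tests.tar.gz
tar -xvzf tests.tar.gz

# Python: download the NLTK punkt tokenizer data
import nltk
nltk.download('punkt')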

Create a Textractor instance

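The Textractor instance is the main entry point for extracting text. It is backed by Apache Tika, a robust text extraction library written in Java. Apache Tika supports a large number of file formats: PDF, Word, Excel, HTML and more. The Python Tika package automatically installs Tika and starts a local REST API instance that is used to read the extracted data.
Note: this requires Java to be installed locally.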
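
Since Tika runs on Java, it can help to confirm that a Java runtime is reachable before creating the Textractor. The following is a minimal sketch using only the standard library; the helper name `java_available` is hypothetical and not part of txtai, and a PATH lookup is only a rough check.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


# Warn early instead of failing later inside the extraction call
if not java_available():
    print("Java not found on PATH; Tika-based extraction will likely fail.")
```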

from txtai.pipeline import Textractor

# Create textractor model
textractor = Textractor()

Extract text

textractor("txtai/article.pdf")

Introducing txtai, an AI-powered search engine built on Transformers Add Natural Language Understanding to any application Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation of the internet and an ever-growing challenge that is never solved or done. The field of Natural Language Processing (NLP) is rapidly evolving with a number of new developments. Large-scale general language models are an exciting new capability allowing us to add amazing functionality quickly with limited compute and people. Innovation continues with new models and advancements coming in at what seems a weekly basis. This article introduces txtai, an AI-powered search engine that enables Natural Language Understanding (NLU) based search in any application. Introducing txtai txtai builds an AI-powered index over sections of text. txtai supports building text indices to perform similarity searches and create extractive question-answering based systems. txtai also has functionality for zero-shot classification. txtai is open source and available on GitHub. txtai and/or the concepts behind it has already been used to power the Natural Language Processing (NLP) applications listed below: • paperai — AI-powered literature discovery and review engine for medical/scientific papers • tldrstory — AI-powered understanding of headlines and story text • neuspo — Fact-driven, real-time sports event and news site • codequestion — Ask coding questions directly from the terminal Build an Embeddings index For small lists of texts, the method above works. But for larger repositories of documents, it doesn’t make sense to tokenize and convert all embeddings for each query. txtai supports building pre- computed indices which significantly improves performance. Building on the previous example, the following example runs an index method to build and store the text embeddings. 
In this case, only the query is converted to an embeddings vector each search. https://github.com/neuml/codequestion https://neuspo.com/ https://github.com/neuml/tldrstory https://github.com/neuml/paperai Introducing txtai, an AI-powered search engine built on Transformers Add Natural Language Understanding to any application Introducing txtai Build an Embeddings index

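Note that the text of the article was extracted into a single string. Depending on the article, this may be acceptable. For longer articles, you will usually want to split the content into logical sections to build better downstream vectors.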
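
To make the idea of splitting a single string into sections concrete, here is a simplified stand-in written in plain Python. It is only a sketch: the helper `split_sentences` is hypothetical, the regular expression is a naive rule, and this is not how Textractor itself segments text.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = (
    "Search is the base of many applications. "
    "Once data starts to pile up, users want to be able to find it."
)
sections = split_sentences(text)
```

A rule this simple breaks on abbreviations and decimals, which is one reason to rely on a real tokenizer for anything beyond a quick experiment.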

Extract sentences

textractor = Textractor(sentences=True)
textractor("txtai/article.pdf")

['Introducing txtai, an AI-powered search engine built on Transformers Add Natural Language Understanding to any application Search is the base of many applications.',
 'Once data starts to pile up, users want to be able to find it.',
 'It’s the foundation of the internet and an ever-growing challenge that is never solved or done.',
 'The field of Natural Language Processing (NLP) is rapidly evolving with a number of new developments.',
 'Large-scale general language models are an exciting new capability allowing us to add amazing functionality quickly with limited compute and people.',
 'Innovation continues with new models and advancements coming in at what seems a weekly basis.',
 'This article introduces txtai, an AI-powered search engine that enables Natural Language Understanding (NLU) based search in any application.',
 'Introducing txtai txtai builds an AI-powered index over sections of text.',
 'txtai supports building text indices to perform similarity searches and create extractive question-answering based systems.',
 'txtai also has functionality for zero-shot classification.',
 'txtai is open source and available on GitHub.',
 'txtai and/or the concepts behind it has already been used to power the Natural Language Processing (NLP) applications listed below: • paperai — AI-powered literature discovery and review engine for medical/scientific papers • tldrstory — AI-powered understanding of headlines and story text • neuspo — Fact-driven, real-time sports event and news site • codequestion — Ask coding questions directly from the terminal Build an Embeddings index For small lists of texts, the method above works.',
 'But for larger repositories of documents, it doesn’t make sense to tokenize and convert all embeddings for each query.',
 'txtai supports building pre- computed indices which significantly improves performance.',
 'Building on the previous example, the following example runs an index method to build and store the text embeddings.',
 'In this case, only the query is converted to an embeddings vector each search.',
 'https://github.com/neuml/codequestion https://neuspo.com/ https://github.com/neuml/tldrstory https://github.com/neuml/paperai Introducing txtai, an AI-powered search engine built on Transformers Add Natural Language Understanding to any application Introducing txtai Build an Embeddings index']

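The document is now split at the sentence level. These sentences can be fed to a workflow that adds each sentence to an embeddings index. Depending on the task, this may work well. Alternatively, splitting at the paragraph level may work better.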
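
As a rough illustration of preparing extracted sentences for an embeddings index, the sketch below turns a list of sections into (id, text, tags) tuples and drops very short fragments. The helper `to_index_rows` and the `min_length` threshold are assumptions made for this example, not part of txtai.

```python
from typing import List, Optional, Tuple


def to_index_rows(
    sections: List[str], min_length: int = 20
) -> List[Tuple[int, str, Optional[str]]]:
    """Build (id, text, tags) tuples, skipping sections shorter than min_length."""
    rows = []
    for section in sections:
        if len(section) >= min_length:
            # Ids are assigned sequentially over the sections that are kept
            rows.append((len(rows), section, None))
    return rows


sentences = [
    "Once data starts to pile up, users want to be able to find it.",
    "txtai is open source and available on GitHub.",
    "Too short.",
]
rows = to_index_rows(sentences)

# Assumption: with txtai installed, rows in this shape could be indexed roughly as
#   from txtai.embeddings import Embeddings
#   embeddings = Embeddings({"path": MODEL})  # MODEL is a placeholder model name
#   embeddings.index(rows)
```

Filtering before indexing keeps fragments such as stray headings from taking up space in the index; the right threshold depends on the documents.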

Extract paragraphs

textractor = Textractor(paragraphs=True)
textractor("txtai/article.pdf")

['Introducing txtai, an AI-powered search engine built on Transformers',
 'Add Natural Language Understanding to any application',
 'Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation of the internet and an ever-growing challenge that is never solved or done.',
 'The field of Natural Language Processing (NLP) is rapidly evolving with a number of new developments. Large-scale general language models are an exciting new capability allowing us to add amazing functionality quickly with limited compute and people. Innovation continues with new models and advancements coming in at what seems a weekly basis.',
 'This article introduces txtai, an AI-powered search engine that enables Natural Language Understanding (NLU) based search in any application.',
 'Introducing txtai txtai builds an AI-powered index over sections of text. txtai supports building text indices to perform similarity searches and create extractive question-answering based systems. txtai also has functionality for zero-shot classification. txtai is open source and available on GitHub.',
 'txtai and/or the concepts behind it has already been used to power the Natural Language Processing (NLP) applications listed below:',
 '• paperai — AI-powered literature discovery and review engine for medical/scientific papers • tldrstory — AI-powered understanding of headlines and story text • neuspo — Fact-driven, real-time sports event and news site • codequestion — Ask coding questions directly from the terminal',
 'Build an Embeddings index For small lists of texts, the method above works. But for larger repositories of documents, it doesn’t make sense to tokenize and convert all embeddings for each query. txtai supports building pre- computed indices which significantly improves performance.',
 'Building on the previous example, the following example runs an index method to build and store the text embeddings. In this case, only the query is converted to an embeddings vector each search.',
 'https://github.com/neuml/codequestion https://neuspo.com/ https://github.com/neuml/tldrstory https://github.com/neuml/paperai',
 'Introducing txtai, an AI-powered search engine built on Transformers Add Natural Language Understanding to any application',
 'Introducing txtai Build an Embeddings index']
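
The paragraph output above can be approximated in plain Python by splitting on blank lines. This is a sketch under the assumption that paragraphs in the raw text are separated by one or more blank lines; the helper `split_paragraphs` is hypothetical and is not necessarily how Textractor implements paragraph splitting.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines and collapse whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


raw = (
    "Introducing txtai\n\n"
    "Search is the base of\nmany applications.\n\n\n"
    "Build an Embeddings index"
)
paragraphs = split_paragraphs(raw)
```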