[Python] Using pandas to read text data delimited by special characters


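I'm 坚强宝马, a blogger on 靠谱客. This post covers how to use pandas to read text data whose delimiter is made of special characters. I'm sharing it here in the hope that it serves as a useful reference.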

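This is a small problem that I, a rank-and-file finance worker, ran into in day-to-day work. Sharing it here (experts, feel free to skip).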

import pandas as pd

df = pd.read_table(file_path, sep='$$', engine='python', header=0)

Problem description:

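When read_csv or read_table is used directly on a text dataset whose delimiter is "$$", pandas may produce a DataFrame that is not what is actually needed: one column holds each line as a single very long string (the line is not split on the specified delimiter), and a second column named "Unnamed: 1" contains nothing but np.nan.
pandas also reports a DtypeWarning, saying that some columns have mixed types.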
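
A likely explanation for the single long column, offered here as an assumption rather than something the original post states: pandas treats any separator longer than one character as a regular expression, and in a regular expression "$" is the end-of-line anchor, so an unescaped "$$" never matches the two literal dollar signs. The sketch below uses made-up sample data and escapes the separator with re.escape so that the text can be read directly:

```python
import io
import re

import pandas as pd

# Made-up sample data that uses "$$" as the field delimiter.
sample = "id$$name$$amount\n1$$alice$$10.5\n2$$bob$$20.0\n"

# Unescaped, "$$" is a regex that can only match the empty string at the
# end of the line, so the line is never split on the literal characters.
print(re.split("$$", "1$$alice$$10.5"))

# re.escape turns "$$" into a pattern that matches the literal characters.
# A regex separator requires the Python parsing engine.
df = pd.read_csv(io.StringIO(sample), sep=re.escape("$$"),
                 engine="python", header=0)
print(df)
```

With the escaped separator, the header line should be split into the three column names instead of being kept as one string.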

Solution:

import pandas as pd

file_path = folder_path + file_new_name

# replace '$$' with the ordinary delimiter ','
# (this assumes that no field itself contains a ',')
with open(folder_path + file_original_name, "rt", encoding='utf-8') as file_input, \
        open(file_path, "wt", encoding='utf-8') as file_output:
    for line in file_input:
        file_output.write(line.replace('$$', ','))
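
The plain string replacement above yields a valid comma-separated file only if no field contains a comma or a double quote. If that cannot be guaranteed, one option is to split each line on '$$' and let the standard csv module quote the fields. The following is a minimal sketch; the function name convert_delimiter, the file names, and the sample data are all hypothetical:

```python
import csv
import os
import tempfile


def convert_delimiter(src_path, dst_path, old_delim="$$", encoding="utf-8"):
    """Rewrite a file delimited by old_delim as a properly quoted CSV file."""
    with open(src_path, "rt", encoding=encoding, newline="") as src, \
            open(dst_path, "wt", encoding=encoding, newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            # Drop the line ending, then split on the literal delimiter.
            writer.writerow(line.rstrip("\r\n").split(old_delim))


# Demonstration in a temporary directory with made-up data.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "original.txt")
    dst_path = os.path.join(tmp_dir, "converted.csv")
    with open(src_path, "wt", encoding="utf-8") as f:
        f.write("name$$note\nalice$$paid, in full\n")
    convert_delimiter(src_path, dst_path)
    with open(dst_path, "rt", encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f))

print(rows)
```

Because csv.writer quotes any field that contains a comma, the converted file can then be read with sep=',' without the embedded comma being mistaken for a delimiter.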

# read the new text file with delimiter ','
# converter_dict is not shown here; pandas expects a dict that maps
# column names (or indices) to conversion functions
data_chunks = pd.read_table(file_path, sep=',', converters=converter_dict, low_memory=False,
                            encoding='utf-8', header=0, chunksize=100000)
chunk_list = []
for index, chunk in enumerate(data_chunks):
    chunk_list.append(chunk)  # collect each DataFrame chunk yielded by the TextFileReader
    print(">>> chunk{0} loaded...".format(index+1))
print(">>> concatenating...")

df = pd.concat(chunk_list, ignore_index=True)
print(">>> DataFrame created successfully")
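
If the intermediate file is not wanted, the same chunked read can presumably be done on the original data by passing an escaped regular-expression separator (r"\$\$" matches the two literal dollar signs) together with engine="python". The sketch below is an illustration under assumptions, not the method used above: the sample data and the contents of converter_dict are made up, and low_memory is left out because that option is specific to the C engine.

```python
import io

import pandas as pd

# Made-up sample data; in practice this would be the original "$$" file.
sample = "code$$price\n000001$$10.5\n000002$$20.0\n600000$$7.25\n"

# Hypothetical converters: keep "code" as text so leading zeros survive.
converter_dict = {"code": str}

# chunksize makes read_csv return an iterator of DataFrame chunks.
data_chunks = pd.read_csv(io.StringIO(sample), sep=r"\$\$", engine="python",
                          converters=converter_dict, header=0, chunksize=2)

chunk_list = []
for index, chunk in enumerate(data_chunks):
    chunk_list.append(chunk)
    print(">>> chunk{0} loaded...".format(index + 1))

# Stitch the chunks back together into a single DataFrame.
df = pd.concat(chunk_list, ignore_index=True)
print(df)
```

The Python engine is generally slower than the C engine, so for very large files the replace-then-read approach may still be the faster of the two.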