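Posted by 疯狂宝马, a blogger on 靠谱客. This article, collected during recent development work, is titled "python随机抽取人名_python – 从文件中随机抽样" (randomly drawing names in Python: random sampling from a file) and is shared here as a reference.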

Overview

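The biggest advantage of having lines of equal length is that you don't need to find newlines to know where each line starts. With a file size of ~40GB containing ~1.8M lines, you have a line length of ~20KB/line. If you want to sample 10K lines, you have ~4MB between sampled lines on average (40GB / 10K). This is almost certainly about three orders of magnitude larger than the size of a block on your disk. Seeking to the next read position is therefore much more efficient than reading every byte in the file.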
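The arithmetic behind that estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the round figures quoted above; the 4096-byte block size is an assumption, so substitute the value for your own filesystem.

```python
# Round figures from the text; block_size is an assumed value.
file_size = 40 * 10**9      # ~40 GB
line_count = 1_800_000      # ~1.8M lines
selection_count = 10_000    # lines to sample
block_size = 4096           # assumed disk block size in bytes

# Average number of bytes per line.
line_size = file_size / line_count
# Average distance between two consecutive sampled lines.
average_gap = file_size / selection_count

print('bytes per line: {:,.0f}'.format(line_size))
print('average gap between samples: {:,.0f} bytes'.format(average_gap))
print('gap / block size: {:,.0f}'.format(average_gap / block_size))
```

With these numbers the gap between samples is a few megabytes while each line is a few tens of kilobytes, which is why seeking pays off.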

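Seeking also works with files whose lines have unequal lengths (for example, non-ASCII characters in UTF-8 encoding), but the method needs a small modification. If your lines are unequal, you can seek to an estimated position and then scan forward to the start of the next line. This is still very efficient, because you skip ~4MB for every ~20KB you actually need to read. Since you are choosing byte positions rather than line positions, the uniformity of your sampling is slightly compromised (a line that follows a long line is a little more likely to be picked), and you cannot tell which line number you are reading.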

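You can implement the solution directly in Python, using code that generates the line numbers. Here is an example of how to handle lines that all have the same number of bytes (typically with ASCII encoding):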

import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000

file_size = getsize(file_name)

# Note: the Python docs only define text-mode seek() for offsets returned by
# tell(), so this relies on a single-byte encoding such as ASCII. Open the
# file with 'rb' and decode manually if you want to be strict.
with open(file_name) as file:
    # Read the first line to get the length
    file.readline()
    line_size = file.tell()
    # You don't have to seek(0) here: if line #0 is selected,
    # the seek will happen regardless later.

    # Assuming you are 100% sure all lines are equal, this might
    # discard the last line if it doesn't have a trailing newline.
    # If that bothers you, use `round(file_size / line_size)`
    line_count = file_size // line_size

    # This is just a trivial example of how to generate the line numbers.
    # If it doesn't work for you, just use the method you already have.
    # By the way, this will just error out (ValueError) if you try to
    # select more lines than there are in the file, which is ideal
    selection_indices = random.sample(range(line_count), selection_count)
    selection_indices.sort()

    # Now skip to each line before reading it:
    prev_index = 0
    for line_index in selection_indices:
        # Conveniently, the default seek offset is relative to the start
        # of the file, not to the current position
        if line_index != prev_index + 1:
            file.seek(line_index * line_size)
        print('Line #{}: {}'.format(line_index, file.readline()), end='')
        # Small optimization to avoid seeking consecutive lines.
        # Might be unnecessary since seek probably already does
        # something like that for you
        prev_index = line_index
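For reuse, the same idea can be wrapped in a function that opens the file in binary mode, where byte offsets passed to `seek` are always well defined. The following is a minimal sketch, not the original author's code: the name `sample_fixed_width_lines` and the throwaway demo file are made up for illustration, and it assumes every line, including the last, occupies exactly the same number of bytes.

```python
import os
import random
import tempfile


def sample_fixed_width_lines(path, selection_count):
    """Return sorted (line_index, raw_bytes) pairs for randomly chosen lines.

    Hypothetical helper; assumes all lines have the same byte length.
    """
    file_size = os.path.getsize(path)
    with open(path, 'rb') as file:
        # The first line's length (newline included) is the record size.
        line_size = len(file.readline())
        if line_size == 0:
            raise ValueError('file is empty')
        line_count = file_size // line_size
        # random.sample raises ValueError if selection_count > line_count.
        indices = sorted(random.sample(range(line_count), selection_count))
        selected = []
        for line_index in indices:
            # Jump straight to the start of the requested line.
            file.seek(line_index * line_size)
            selected.append((line_index, file.readline()))
    return selected


# Small self-check on a throwaway file of 100 nine-byte lines.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    for i in range(100):
        tmp.write('row{:05d}\n'.format(i).encode('ascii'))
demo_path = tmp.name

demo_sample = sample_fixed_width_lines(demo_path, 10)
os.remove(demo_path)
for line_index, raw in demo_sample:
    print(line_index, raw.decode('ascii'), end='')
```

Returning raw bytes leaves the choice of encoding to the caller, at the cost of an explicit `decode` when printing.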

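If you are willing to sacrifice a (very) small amount of uniformity in the distribution of line numbers, you can easily apply a similar technique to files with unequal line lengths. You just generate random byte offsets and skip to the next full line after each offset. The following implementation assumes you know for certain that no line is longer than 40KB. You would have to do something like this if your CSV contained non-ASCII Unicode characters encoded in UTF-8, because even if all the lines had the same number of characters, they would contain different numbers of bytes. In that case you must open the file in binary mode, since otherwise you may hit a decoding error when you skip to a random byte that happens to fall in the middle of a character: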

import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000
# An upper bound on the line size in bytes, not chars
# This serves two purposes:
#   1. It determines the margin to use from the end of the file
#   2. It determines the closest two offsets are allowed to be and
#      still be 100% guaranteed to be in different lines
max_line_bytes = 40000

file_size = getsize(file_name)

# make_offsets is a function that returns `selection_count` monotonically
# increasing unique samples, at least `max_line_bytes` apart from each
# other, in the range [0, file_size - margin). Implementation not provided.
selection_offsets = make_offsets(selection_count, file_size, max_line_bytes)

with open(file_name, 'rb') as file:
    for offset in selection_offsets:
        # Skip to each offset
        file.seek(offset)
        # Read out to the next full line
        file.readline()
        # Print the next line. You don't know the number.
        # You also have to decode it yourself.
        print(file.readline().decode('utf-8'), end='')
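The original leaves `make_offsets` unimplemented. One possible way to write it, shown here purely as a sketch, is to draw distinct values from a shortened range and then shift the i-th smallest value by `i * max_line_bytes`, which forces the required spacing. The assumption that the margin equals `max_line_bytes` is mine, as is the choice of this particular spreading trick.

```python
import random


def make_offsets(selection_count, file_size, max_line_bytes):
    """Hypothetical helper: sorted byte offsets in
    [0, file_size - max_line_bytes), each more than max_line_bytes
    after the previous one."""
    # Assumed margin: one maximal line at the end of the file, so that a
    # complete line still follows the partial line that gets skipped.
    upper = file_size - max_line_bytes
    # Reserve room for the (selection_count - 1) mandatory gaps.
    span = upper - (selection_count - 1) * max_line_bytes
    if selection_count < 0 or span < selection_count:
        raise ValueError('file too small for this many separated offsets')
    # Draw distinct values from the shortened range, then spread them out:
    # adding i * max_line_bytes to the i-th smallest value enforces the gap.
    base = sorted(random.sample(range(span), selection_count))
    return [value + i * max_line_bytes for i, value in enumerate(base)]


# Example call with made-up sizes.
demo_offsets = make_offsets(5, 1_000_000, 40_000)
print(demo_offsets)
```

Because `random.sample` accepts a `range` object, this does not materialize a list of every byte position, so it should remain cheap even for a file of tens of gigabytes.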

All of the code here is Python 3.
