How to Read Large Text Files in Python

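The Python file object provides several ways to read a text file. A popular one is the readlines() method, which returns a list of all the lines in the file. Because it loads the entire file content into memory, it is not well suited to large text files.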
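As a small illustration of what readlines() returns, here is a minimal sketch. The three-line sample file and the names sample_path and all_lines are made up for this example; they are not part of the original program.

```python
import os
import tempfile

# Create a small, hypothetical sample file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first\nsecond\nthird\n")
    sample_path = tmp.name

# readlines() reads the whole file and returns every line as one list.
with open(sample_path, encoding="utf-8") as f:
    all_lines = f.readlines()

print(all_lines)       # ['first\n', 'second\n', 'third\n']
print(len(all_lines))  # 3

os.remove(sample_path)  # clean up the sample file
```

Every line is held in the list at the same time, so memory use grows with the size of the file.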

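Reading Large Text Files in Python

We can use the file object as an iterator. The iterator returns one line at a time, and each line can be processed as it arrives. This does not read the whole file into memory, so it is suitable for reading large files in Python.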

import resource
import os

file_name = "/Users/pankaj/abcdef.txt"

print(f'File Size is {os.stat(file_name).st_size / (1024 * 1024)} MB')

txt_file = open(file_name)

count = 0

for line in txt_file:
    # we can process the file line by line here; for simplicity, I am only counting the lines
    count += 1

txt_file.close()

print(f'Number of Lines in the file is {count}')

print('Peak Memory Usage =', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
print('User Mode Time =', resource.getrusage(resource.RUSAGE_SELF).ru_utime)
print('System Mode Time =', resource.getrusage(resource.RUSAGE_SELF).ru_stime)

When we run this program, it produces the following output:

File Size is 257.4920654296875 MB
Number of Lines in the file is 60000000
Peak Memory Usage = 5840896
User Mode Time = 11.46692
System Mode Time = 0.09655899999999999

I am using the os module to print the size of the file.
The resource module is used to check the memory and CPU time usage of the program.
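The raw ru_maxrss number is hard to read because its unit depends on the platform: macOS reports bytes, while Linux reports kilobytes. The sketch below normalizes it to MiB. The helper names maxrss_to_mib and peak_memory_mib are hypothetical, and the resource module itself is only available on Unix-like systems.

```python
import resource
import sys


def maxrss_to_mib(ru_maxrss, platform=sys.platform):
    """Convert an ru_maxrss reading to MiB.

    Assumption: macOS ("darwin") reports bytes; other Unix-like
    systems such as Linux report kilobytes.
    """
    if platform == "darwin":
        return ru_maxrss / (1024 * 1024)
    return ru_maxrss / 1024


def peak_memory_mib():
    """Return the peak resident set size of this process in MiB."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return maxrss_to_mib(usage.ru_maxrss)


print(f"Peak Memory Usage = {peak_memory_mib():.2f} MiB")
```

If the output above was produced on macOS, as the /Users/ path suggests, 5840896 bytes is roughly 5.6 MiB, far less than the 257 MB file that was read.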

We can also use the with statement to open the file. In that case, we don't have to close the file object explicitly.

with open(file_name) as txt_file:
    for line in txt_file:
        # process the line
        pass
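To show the with statement doing some actual per-line work, here is a minimal sketch that counts the lines containing a keyword. The function name count_matching_lines, the keyword, and the sample file are assumptions made for illustration only.

```python
import os
import tempfile


def count_matching_lines(path, keyword):
    """Count the lines that contain keyword, reading one line at a time."""
    with open(path, encoding="utf-8") as txt_file:
        # The generator expression consumes the file lazily, so only
        # one line is held in memory at any moment.
        return sum(1 for line in txt_file if keyword in line)


# Hypothetical sample data for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False, encoding="utf-8") as tmp:
    tmp.write("ok: started\nerror: disk full\nok: retry\nerror: timeout\n")
    demo_path = tmp.name

error_count = count_matching_lines(demo_path, "error")
print(error_count)  # 2

os.remove(demo_path)  # clean up the sample file
```

The file is closed as soon as the with block exits, even if an exception is raised while a line is being processed.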

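What If the Large File Doesn't Have Lines?

The code above works well when the content of a large file is divided into many lines. But if a single line holds a large amount of data, reading that line will use a lot of memory. In that case, we can read the file content into a fixed-size buffer and process it one chunk at a time.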

with open(file_name) as f:
    while True:
        data = f.read(1024)
        if not data:
            break
        print(data)

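The code above reads the file data in chunks of up to 1024 characters (the file is opened in text mode; open it with mode 'rb' to read 1024 bytes at a time) and prints each chunk to the console. Once the whole file has been read, read() returns an empty string, and the break statement terminates the while loop.

Here is a simple snippet that makes a copy of a file, in this case line by line: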

with open(destination_file_name, 'w') as out_file:
    with open(source_file_name) as in_file:
        for line in in_file:
            out_file.write(line)
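For a file with very long lines or binary content, copying line by line brings back the memory problem described in this section. A possible alternative is to copy in fixed-size chunks, as in the sketch below. The names read_in_chunks and copy_file_in_chunks and the 1 MiB default chunk size are assumptions for illustration, not part of the original code.

```python
def read_in_chunks(file_obj, chunk_size=1024 * 1024):
    """Yield successive chunks from file_obj until it is exhausted."""
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            # read() returns an empty value at end of file
            break
        yield chunk


def copy_file_in_chunks(source_path, destination_path, chunk_size=1024 * 1024):
    """Copy a file in binary mode, holding at most one chunk in memory."""
    with open(source_path, "rb") as in_file, open(destination_path, "wb") as out_file:
        for chunk in read_in_chunks(in_file, chunk_size):
            out_file.write(chunk)
```

Opening both files in binary mode means the copy does not depend on line breaks or on the text encoding. The standard library's shutil.copyfileobj() performs a similar buffered copy between two open file objects.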

**Reference**: StackOverflow question