Chapter 6: Data Loading, Storage, and File Formats (Day 12-14): Reading and Writing Data in Text Format

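I am 彪壮玉米, a blogger on 靠谱客. This article covers Chapter 6: Data Loading, Storage, and File Formats (Day 12-14), reading and writing data in text format. I am sharing it here in the hope that it serves as a useful reference.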

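Note: This article is a learning log for Python data processing. It records the errors encountered while reproducing the book's examples and the places where the behavior differs from the book; simple operations are not repeated here. The log is based mainly on the book 《利用Python进行数据分析》 (Python for Data Analysis) by Wes McKinney, published by 机械工业出版社 (China Machine Press).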

Reading and Writing Data in Text Format

read_csv()

Signature:
pd.read_csv(
filepath_or_buffer,
sep=',',
delimiter=None,
header='infer',
names=None,
index_col=None,
usecols=None,
squeeze=False,
prefix=None,
mangle_dupe_cols=True,
dtype=None,
engine=None,
converters=None,
true_values=None,
false_values=None,
skipinitialspace=False,
skiprows=None,
skipfooter=None,
nrows=None,
na_values=None,
keep_default_na=True,
na_filter=True,
verbose=False,
skip_blank_lines=True,
parse_dates=False,
infer_datetime_format=False,
keep_date_col=False,
date_parser=None,
dayfirst=False,
iterator=False,
chunksize=None,
compression='infer',
thousands=None,
decimal='.',
lineterminator=None,
quotechar='"',
quoting=0,
escapechar=None,
comment=None,
encoding=None,
dialect=None,
tupleize_cols=False,
error_bad_lines=True,
warn_bad_lines=True,
skip_footer=0,
doublequote=True,
delim_whitespace=False,
as_recarray=False,
compact_ints=False,
use_unsigned=False,
low_memory=True,
buffer_lines=None,
memory_map=False,
float_precision=None)

Docstring: Read CSV (comma-separated) file into DataFrame. Also supports optionally iterating or breaking of the file into chunks.
Additional help can be found in the online docs for IO Tools (http://pandas.pydata.org/pandas-docs/stable/io.html).

Parameters:
filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)
The string could be a URL. Valid URL schemes include http, ftp, s3, and
file. For file URLs, a host is expected. For instance, a local file could
be file://localhost/path/to/table.csv
sep : str, default ','
Delimiter to use. If sep is None, will try to automatically determine
this. Regular expressions are accepted and will force use of the python
parsing engine and will ignore quotes in the data.
delimiter : str, default None
Alternative argument name for sep.
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the data.
Default behavior is as if set to 0 if no names passed, otherwise
None. Explicitly pass header=0 to be able to replace existing
names. The header can be a list of integers that specify row locations for
a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not
specified will be skipped (e.g. 2 in this example is skipped). Note that
this parameter ignores commented lines and empty lines if
skip_blank_lines=True, so header=0 denotes the first line of data
rather than the first line of the file.
names : array-like, default None
List of column names to use. If file contains no header row, then you
should explicitly pass header=None
index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame. If a sequence is given, a
MultiIndex is used. If you have a malformed file with delimiters at the end
of each line, you might consider index_col=False to force pandas to not
use the first column as the index (row names)
usecols : array-like, default None
Return a subset of the columns.
Results in much faster parsing time and lower memory usage.
squeeze : boolean, default False
If the parsed data only contains one column then return a Series
prefix : str, default None
Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
(Unsupported with engine='python'). Use str or object to preserve and
not interpret dtype.
engine : {'c', 'python'}, optional
Parser engine to use. The C engine is faster while the python engine is
currently more feature-complete.
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either
be integers or column labels
true_values : list, default None
Values to consider as True
false_values : list, default None
Values to consider as False
skipinitialspace : boolean, default False
Skip spaces after delimiter.
skiprows : list-like or integer, default None
Line numbers to skip (0-indexed) or number of lines to skip (int)
at the start of the file
skipfooter : int, default 0
Number of lines at bottom of file to skip (Unsupported with engine='c')
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files
na_values : str or list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific
per-column NA values. By default the following values are interpreted as
NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A',
'NA', 'NULL', 'NaN', 'nan'.
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN
values are overridden, otherwise they’re appended to.
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In
data without any NAs, passing na_filter=False can improve the performance
of reading a large file
verbose : boolean, default False
Indicate number of NA values placed in non-numeric columns
skip_blank_lines : boolean, default True
If True, skip over blank lines rather than interpreting as NaN values
parse_dates : boolean or list of ints or names or list of lists or dict, default False
* boolean. If True -> try parsing the index.
* list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
each as a separate date column.
* list of lists. e.g.
If [[1, 3]] -> combine columns 1 and 3 and parse as
a single date column.
* dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result
'foo'
Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format : boolean, default False
If True and parse_dates is enabled for a column, attempt to infer
the datetime format to speed up the processing
keep_date_col : boolean, default False
If True and parse_dates specifies combining multiple columns then
keep the original columns.
date_parser : function, default None
Function to use for converting a sequence of string columns to an array of
datetime instances. The default uses dateutil.parser.parser to do the
conversion. Pandas will try to call date_parser in three different ways,
advancing to the next if an exception occurs: 1) Pass one or more arrays
(as defined by parse_dates) as arguments; 2) concatenate (row-wise) the
string values from the columns defined by parse_dates into a single array
and pass that; and 3) call date_parser once for each row using one or more
strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst : boolean, default False
DD/MM format dates, international and European format
iterator : boolean, default False
Return TextFileReader object for iteration or getting chunks with
get_chunk().
chunksize : int, default None
Return TextFileReader object for iteration. See the IO Tools docs (http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking) for more information on
iterator and chunksize.
compression : {'infer', 'gzip', 'bz2', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use gzip or
bz2 if filepath_or_buffer is a string ending in '.gz' or '.bz2',
respectively, and no decompression otherwise. Set to None for no
decompression.
thousands : str, default None
Thousands separator
decimal : str, default '.'
Character to recognize as decimal point (e.g. use ',' for European data).
lineterminator : str (length 1), default None
Character to break file into lines. Only valid with C parser.
quotechar : str (length 1), optional
The character used to denote the start and end of a quoted item. Quoted
items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default None
Control field quoting behavior per csv.QUOTE_* constants. Use one of
QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
Default (None) results in QUOTE_MINIMAL behavior.
escapechar : str (length 1), default None
One-character string used to escape delimiter when quoting is QUOTE_NONE.
comment : str, default None
Indicates remainder of line should not be parsed. If found at the beginning
of a line, the line will be ignored altogether. This parameter must be a
single character. Like empty lines (as long as skip_blank_lines=True),
fully commented lines are ignored by the parameter header but not by
skiprows. For example, if comment='#', parsing '#empty\na,b,c\n1,2,3'
with header=0 will result in 'a,b,c' being
treated as the header.
encoding : str, default None
Encoding to use for UTF when reading/writing (ex. 'utf-8'). List of Python standard encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
dialect : str or csv.Dialect instance, default None
If None defaults to Excel dialect. Ignored if sep longer than 1 char
See csv.Dialect documentation for more details
tupleize_cols : boolean, default False
Leave a list of tuples on columns as is (default is to convert to
a Multi Index on the columns)
error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned.
If False, then these “bad lines” will be dropped from the DataFrame that is
returned. (Only valid with C parser)
warn_bad_lines : boolean, default True
If error_bad_lines is False, and warn_bad_lines is True, a warning for each
“bad line” will be output. (Only valid with C parser).

Returns:
result : DataFrame or TextParser

read_table is similar to read_csv, except that its default sep is '\t'; otherwise the two are largely the same.
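
As a minimal sketch of that equivalence, the snippet below parses the same tab-separated text with both functions. The sample text is a made-up stand-in, not one of the book's files, and io.StringIO is used only so that no file on disk is needed.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample; not one of the book's example files.
text = "a\tb\tc\n1\t2\t3\n4\t5\t6\n"

# read_table defaults to sep='\t', so these two calls should parse identically.
from_table = pd.read_table(io.StringIO(text))
from_csv = pd.read_csv(io.StringIO(text), sep="\t")

print(from_table.equals(from_csv))
```

If the two frames compare equal, the only practical difference for this input is the default delimiter.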

to_csv()

Signature: data.to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True,
index_label=None, mode='w', encoding=None, compression=None,
quoting=None, quotechar='"', line_terminator='\n', chunksize=None,
tupleize_cols=False, date_format=None, doublequote=True,
escapechar=None, decimal=’.’, **kwds)

Docstring: Write DataFrame to a comma-separated values (csv) file

Parameters:
path_or_buf : string or file handle, default None
File path or object, if None is provided the result is returned as
a string.
sep : character, default ','
Field delimiter for the output file.
na_rep : string, default ''
Missing data representation
float_format : string, default None
Format string for floating point numbers
columns : sequence, optional
Columns to write
header : boolean or list of string, default True
Write out column names. If a list of string is given it is assumed
to be aliases for the column names
index : boolean, default True
Write row names (index)
index_label : string or sequence, or False, default None
Column label for index column(s) if desired. If None is given, and
header and index are True, then the index names are used. A
sequence should be given if the DataFrame uses MultiIndex. If
False do not print fields for index names. Use index_label=False
for easier importing in R
nanRep : None
deprecated, use na_rep
mode : str
Python write mode, default 'w'
encoding : string, optional
A string representing the encoding to use in the output file,
defaults to 'ascii' on Python 2 and 'utf-8' on Python 3.
compression : string, optional
a string representing the compression to use in the output file,
allowed values are 'gzip', 'bz2',
only used when the first argument is a filename
line_terminator : string, default '\n'
The newline character or character sequence to use in the output
file
quoting : optional constant from csv module
defaults to csv.QUOTE_MINIMAL
quotechar : string (length 1), default '"'
character used to quote fields
doublequote : boolean, default True
Control quoting of quotechar inside a field
escapechar : string (length 1), default None
character used to escape sep and quotechar when appropriate
chunksize : int or None
rows to write at a time
tupleize_cols : boolean, default False
write multi_index columns as a list of tuples (if True)
or in the new, expanded format (if False)
date_format : string, default None
Format string for datetime objects
decimal : string, default '.'
Character recognized as decimal separator. E.g. use ',' for
European data

Notes on the book

P164 header=None
The header row is displayed differently from the book:

pd.read_csv('ex2.csv',header=None)
Out[14]:
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
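
The docstring above notes that names can be passed when a file has no header row. A small sketch of both options follows; the inline text is a stand-in for ex2.csv rebuilt from the output shown here, and the column names are only illustrative.

```python
import io

import pandas as pd

# Stand-in for ex2.csv: three data rows and no header row.
text = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None makes pandas number the columns 0..4.
default_names = pd.read_csv(io.StringIO(text), header=None)

# names= supplies the labels explicitly (illustrative names).
named = pd.read_csv(io.StringIO(text), names=["a", "b", "c", "d", "message"])

print(list(default_names.columns))
print(list(named.columns))
```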

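P166 Reading a txt file
This no longer seems to be as complicated as the book makes it, since the functionality has been extended, although the output format differs slightly: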

list(open('ex3.txt'))
Out[21]:
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']
result = pd.read_csv('ex3.txt')
result
Out[23]:
              A         B         C
0  aaa -0.264438 -1.026059 -0.619500
1  bbb  0.927272  0.302904 -0.032399
2  ccc -0.264273 -0.386314 -0.217601
3  ddd -0.871858 -0.348382  1.100491
result = pd.read_table('ex3.txt')
result
Out[25]:
              A         B         C
0  aaa -0.264438 -1.026059 -0.619500
1  bbb  0.927272  0.302904 -0.032399
2  ccc -0.264273 -0.386314 -0.217601
3  ddd -0.871858 -0.348382  1.100491
result = pd.read_table('ex3.txt',sep='\s+')
result
Out[27]:
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
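
For reference, here is a self-contained sketch of the whitespace-delimited read. The inline text stands in for ex3.txt and is rebuilt from the listing above, so the exact spacing is an assumption. Writing the pattern as a raw string keeps the backslash in \s+ intact.

```python
import io

import pandas as pd

# Stand-in for ex3.txt, rebuilt from the listing above; spacing is approximate.
text = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
    "ccc -0.264273 -0.386314 -0.217601\n"
    "ddd -0.871858 -0.348382  1.100491\n"
)

# One or more whitespace characters act as the field delimiter.
result = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(result)
```

Because the header line has one fewer field than the data lines, pandas uses the first column (aaa, bbb, ...) as the index, which matches the last output above.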

P166 skiprows
Without skiprows, the comment lines are read as data, and each run of comment text ends up in a single cell:

pd.read_csv('ex4.csv')
Out[29]:
                                                                    # hey!
a                                                  b        c   d  message
# just wanted to make things more difficult for... NaN      NaN NaN    NaN
# who reads CSV files with computers               anyway?  NaN NaN    NaN
1                                                  2        3   4    hello
5                                                  6        7   8    world
9                                                  10       11  12     foo
pd.read_csv('ex4.csv',skiprows=[0,2,3])
Out[30]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
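
The comment parameter described in the docstring offers another way to drop such lines without counting row numbers. The sketch below assumes a stand-in for ex4.csv whose comment lines all begin with '#'; the exact wording of those lines is a guess based on the truncated output above.

```python
import io

import pandas as pd

# Stand-in for ex4.csv; the comment text is assumed, not copied from the file.
text = (
    "# hey!\n"
    "a,b,c,d,message\n"
    "# just wanted to make things more difficult for you\n"
    "# who reads CSV files with computers, anyway?\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# Skip the comment lines by position, as in the session above.
by_position = pd.read_csv(io.StringIO(text), skiprows=[0, 2, 3])

# Or let the parser ignore every line that starts with '#'.
by_marker = pd.read_csv(io.StringIO(text), comment="#")

print(by_position.equals(by_marker))
```

Note that comment also truncates a data line at the first '#', so it is only suitable when the data itself never contains that character.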

P166 The meaning of na_values

na_values : str or list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', 'N/A',
'NA', 'NULL', 'NaN', 'nan'.

na_values=['xxx'] means that every element equal to xxx is marked as NaN in the DataFrame:

result = pd.read_csv('ex5.csv')
result
Out[33]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
result = pd.read_csv('ex5.csv',na_values=['5'])
result
Out[40]:
  something    a   b     c   d message
0       one  1.0   2   3.0   4     NaN
1       two  NaN   6   NaN   8   world
2     three  9.0  10  11.0  12     foo
result = pd.read_csv('ex5.csv',na_values=['three'])
result
Out[42]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2       NaN  9  10  11.0  12     foo
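
The docstring also mentions a dict form of na_values for per-column sentinels. A brief sketch follows, with a stand-in for ex5.csv rebuilt from the outputs above; the particular sentinel choices are only illustrative.

```python
import io

import pandas as pd

# Stand-in for ex5.csv, rebuilt from the outputs above.
text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Illustrative sentinels: each list applies only to its own column.
sentinels = {"message": ["foo"], "something": ["two"]}
result = pd.read_csv(io.StringIO(text), na_values=sentinels)

print(result)
```

With keep_default_na left at True, the default markers such as 'NA' and the empty string should still be treated as missing in addition to the per-column sentinels.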

P168 Displaying detailed information

"""
Entering result on its own and pressing Enter displays the entire contents of result.
"""
result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
one      10000 non-null float64
two      10000 non-null float64
three    10000 non-null float64
four     10000 non-null float64
key      10000 non-null object
dtypes: float64(4), object(1)
memory usage: 390.7+ KB

P170 to_csv()
The parameter has been renamed from cols to columns; with cols, the column selection is not applied:

data.to_csv(sys.stdout,index=False,cols=list('abc'))
something,a,b,c,d,message
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
data.to_csv(sys.stdout,index=False,columns=list('abc'))
a,b,c
1,2,3.0
5,6,
9,10,11.0
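
A small sketch of columns together with na_rep is given below. The DataFrame is a stand-in built from a few of the ex5.csv columns, and 'NULL' is an arbitrary marker chosen for illustration. When no path is given, to_csv returns the text instead of writing a file.

```python
import pandas as pd

# Stand-in data using a few of the columns shown above.
data = pd.DataFrame({
    "something": ["one", "two", "three"],
    "a": [1, 5, 9],
    "b": [2, 6, 10],
    "c": [3.0, None, 11.0],
})

# Write only columns a, b, c and mark missing values with 'NULL'.
csv_text = data.to_csv(index=False, columns=["a", "b", "c"], na_rep="NULL")

print(csv_text)
```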

P172 diaect
The book has a typo here; it should be dialect:

reader = csv.reader(f,diaect=my_dialect)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-97-fff676b470d2> in <module>()
----> 1 reader = csv.reader(f,diaect=my_dialect)
TypeError: 'diaect' is an invalid keyword argument for this function

P172 Error when writing; the class seems to have been defined without inheritance

with open('mydata.csv','w') as f:
    writer = csv.writer(f,dialect=my_dialect)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))
    writer.writerow(('1','2','3'))
    writer.writerow(('1','2','3'))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-106-1f82c8cdfdb0> in <module>()
      1 with open('mydata.csv','w') as f:
----> 2     writer = csv.writer(f,dialect=my_dialect)
      3     writer.writerow(('one','two','three'))
      4     writer.writerow(('1','2','3'))
      5     writer.writerow(('1','2','3'))
TypeError: "quoting" must be an integer
"""
Redefine quoting in my_dialect. It must be an integer; its meaning is not clear to me, so it is set to 0 for now.
"""
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = 0
with open('mydata.csv','w') as f:
    writer = csv.writer(f,dialect=my_dialect)
    writer.writerow(('one','two','three'))
    writer.writerow(('1','2','3'))
    writer.writerow(('1','2','3'))
    writer.writerow(('1','2','3'))
!cat mydata.csv
one;two;three
1;2;3
1;2;3
1;2;3
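
On the meaning of quoting: the read_csv docstring above lists 0 as QUOTE_MINIMAL, and the csv module exposes these values as named constants. The sketch below shows how two of them change the output; SemicolonDialect and write_row are hypothetical names introduced only for this illustration, and an in-memory buffer replaces the file.

```python
import csv
import io


class SemicolonDialect(csv.Dialect):
    # Hypothetical dialect name; the settings mirror my_dialect above.
    lineterminator = "\n"
    delimiter = ";"
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL  # the named constant for 0


def write_row(row, quoting):
    """Write one row to an in-memory buffer with the given quoting mode."""
    buf = io.StringIO()
    writer = csv.writer(buf, dialect=SemicolonDialect, quoting=quoting)
    writer.writerow(row)
    return buf.getvalue()


row = ("one", "t;wo", 3)

# QUOTE_MINIMAL quotes only fields that contain special characters.
minimal = write_row(row, csv.QUOTE_MINIMAL)

# QUOTE_ALL quotes every field.
quote_all = write_row(row, csv.QUOTE_ALL)

print(minimal, quote_all, sep="")
```

Using the named constant rather than a bare 0 makes the intent of the dialect easier to read.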

P173 json.loads()和json.dumps()
After this round trip, the result still differs slightly from the original obj:

obj = """
{"name":"Wes",
"place_lived":["United States","Spain","Germany"],
"pet":null,
"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},
{"naem":"Katie","age":33,"pet":"Cisco"}]
}
"""
import json
obj
Out[117]: '\n{"name":"Wes",\n"place_lived":["United States","Spain","Germany"],\n"pet":null,\n"siblings":[{"name":"Scott","age":25,"pet":"Zuko"},\n{"naem":"Katie","age":33,"pet":"Cisco"}]\n}\n'
result = json.loads(obj)
result
Out[119]:
{u'name': u'Wes',
u'pet': None,
u'place_lived': [u'United States', u'Spain', u'Germany'],
u'siblings': [{u'age': 25, u'name': u'Scott', u'pet': u'Zuko'},
{u'age': 33, u'naem': u'Katie', u'pet': u'Cisco'}]}
asjson = json..dumps(result)
File "<ipython-input-120-b73195ced089>", line 1
asjson = json..dumps(result)
^
SyntaxError: invalid syntax
asjson = json.dumps(result)
asjson
Out[122]: '{"pet": null, "place_lived": ["United States", "Spain", "Germany"], "name": "Wes", "siblings": [{"pet": "Zuko", "age": 25, "name": "Scott"}, {"pet": "Cisco", "age": 33, "naem": "Katie"}]}'
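
The difference is in the text only (whitespace and, in this session, key order), not in the data. A minimal sketch of checking this, using a lightly adapted copy of the object above:

```python
import json

# Adapted from the object above; the key names are kept as in the session.
obj = """
{"name": "Wes",
 "place_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
              {"name": "Katie", "age": 33, "pet": "Cisco"}]
}
"""

result = json.loads(obj)
asjson = json.dumps(result)

# The serialized text is not character-for-character the original string...
same_text = asjson == obj

# ...but parsing it again yields an equal Python object.
same_data = json.loads(asjson) == result

# sort_keys=True gives a stable key order, which helps when comparing text.
canonical = json.dumps(result, sort_keys=True)

print(same_text, same_data)
print(canonical)
```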