In [21]: result = pd.read_table('examples/ex3.txt', sep='\s+')
In [22]: result
Out[22]:
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
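The whitespace-delimited parse above can be tried without the book's data files; a minimal sketch, assuming an in-memory buffer stands in for examples/ex3.txt:

```python
import io

import pandas as pd

# Fields separated by variable amounts of whitespace, as in ex3.txt;
# sep=r'\s+' is a regular expression matching one or more whitespace chars.
text = """A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
"""
result = pd.read_csv(io.StringIO(text), sep=r"\s+")

# Because the header row has one fewer field than the data rows,
# read_csv infers that the first column should be the index.
print(result.index.tolist())    # ['aaa', 'bbb']
print(result.columns.tolist())  # ['A', 'B', 'C']
```

The index inference is the same behavior the transcript shows: the row labels aaa/bbb/ccc/ddd become the index rather than a data column.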
In [23]: !cat examples/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

In [24]: pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])
Out[24]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
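The same skiprows call can be reproduced with an in-memory copy of the file contents (a stand-in for examples/ex4.csv):

```python
import io

import pandas as pd

# Same contents as ex4.csv: comment lines sit at rows 0, 2, and 3.
text = """# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
"""
# skiprows takes the zero-based row numbers to discard before parsing,
# leaving row 1 to serve as the header.
df = pd.read_csv(io.StringIO(text), skiprows=[0, 2, 3])
print(df['message'].tolist())  # ['hello', 'world', 'foo']
```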
In [25]: !cat examples/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

In [26]: result = pd.read_csv('examples/ex5.csv')
In [27]: result
Out[27]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
In [28]: pd.isnull(result)
Out[28]:
   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False
The na_values option can take either a list or set of strings to consider missing values:
In [29]: result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
In [30]: result
Out[30]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
Different NA sentinels can be specified for each column in a dict:
In [31]: sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
In [32]: pd.read_csv('examples/ex5.csv', na_values=sentinels)
Out[32]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
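The per-column sentinel behavior can be checked against an in-memory copy of ex5.csv (a stand-in, since the real file ships with the book's repository):

```python
import io

import pandas as pd

text = """something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
"""
# The dict maps column name -> strings treated as NaN in that column only;
# pandas' default NA markers (like the empty field in column c) still apply.
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
df = pd.read_csv(io.StringIO(text), na_values=sentinels)

print(df['something'].isnull().tolist())  # [False, True, False]
print(df['message'].isnull().tolist())    # [True, False, True]
```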
In [35]: result
Out[35]:
           one       two     three      four key
0     0.467976 -0.038649 -0.295344 -1.824726   L
1    -0.358893  1.404453  0.704965 -0.200638   B
2    -0.501840  0.659254 -0.421691 -0.057688   G
3     0.204886  1.074134  1.388361 -0.982404   R
4     0.354628 -0.133116  0.283763 -0.837063   Q
...        ...       ...       ...       ...  ..
9995  2.311896 -0.417070 -1.409599 -0.515821   L
9996 -0.479893 -0.650419  0.745152 -0.646038   E
9997  0.523331  0.787112  0.486066  1.093156   K
9998 -0.362559  0.598894 -1.843201  0.887292   G
9999 -0.096376 -1.012999 -0.657431 -0.573315   0

[10000 rows x 5 columns]
If you want to read only a small number of rows (avoiding reading the entire file), specify that with nrows:
In [36]: pd.read_csv('examples/ex6.csv', nrows=5)
Out[36]:
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q
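A quick way to see nrows at work without ex6.csv is to generate a stand-in file in memory (the column layout here is an assumption for illustration):

```python
import io

import pandas as pd

# A small stand-in for ex6.csv: 100 data rows plus a header.
lines = ["one,two,key"] + [f"{i},{i * 2},K" for i in range(100)]
buf = io.StringIO("\n".join(lines))

# nrows stops the parser after the given number of data rows,
# so the remainder of the file is never materialized.
df = pd.read_csv(buf, nrows=5)
print(len(df))  # 5
```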
To read a file in pieces, specify a chunksize as a number of rows:
In [874]: chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

In [875]: chunker
Out[875]:
<pandas.io.parsers.TextFileReader at 0x8398150>
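The object returned by read_csv here is iterable; each piece is a DataFrame of up to chunksize rows, which lets you aggregate without loading everything at once. A sketch, using a generated in-memory file in place of ex6.csv:

```python
import io

import pandas as pd

# Stand-in for ex6.csv: a 'key' column we want to tally chunk by chunk.
lines = ["value,key"] + [f"{i},{'ABC'[i % 3]}" for i in range(3000)]
buf = io.StringIO("\n".join(lines))

chunker = pd.read_csv(buf, chunksize=1000)

# Accumulate value counts across chunks; Series.add aligns on the
# index, and fill_value=0 handles keys absent from a given chunk.
tot = pd.Series([], dtype='float64')
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

print(tot.sort_index())  # A, B, C each occur 1000 times
```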
In [73]: tables = pd.read_html('examples/fdic_failed_bank_list.html')
In [74]: len(tables) Out[74]: 1
In [75]: failures = tables[0]
In [76]: failures.head()
Out[76]:
                      Bank Name             City  ST   CERT  \
0                   Allied Bank         Mulberry  AR     91
1  The Woodbury Banking Company         Woodbury  GA  11297
2        First CornerStone Bank  King of Prussia  PA  35312
3            Trust Company Bank          Memphis  TN   9956
4    North Milwaukee State Bank        Milwaukee  WI  20364

                 Acquiring Institution        Closing Date       Updated Date
0                         Today's Bank  September 23, 2016  November 17, 2016
1                          United Bank     August 19, 2016  November 17, 2016
2  First-Citizens Bank & Trust Company         May 6, 2016  September 6, 2016
3           The Bank of Fayette County      April 29, 2016  September 6, 2016
4  First-Citizens Bank & Trust Company      March 11, 2016      June 16, 2016
Because failures has many columns, pandas inserts a line break character \.
Here we could proceed to do some data cleaning and analysis (covered further in later chapters), like computing the number of bank failures by year:
In [77]: close_timestamps = pd.to_datetime(failures['Closing Date'])
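With the dates converted to timestamps, the yearly tally follows from the .dt accessor and value_counts. A self-contained sketch, using a tiny hypothetical stand-in for the failures table:

```python
import pandas as pd

# Hypothetical three-row stand-in for the FDIC failures DataFrame.
failures = pd.DataFrame({
    'Bank Name': ['Allied Bank', 'Trust Company Bank', 'First CornerStone Bank'],
    'Closing Date': ['September 23, 2016', 'April 29, 2016', 'May 6, 2016'],
})

close_timestamps = pd.to_datetime(failures['Closing Date'])

# .dt.year extracts the year from each timestamp; value_counts
# then counts how many closings fall in each year.
counts_by_year = close_timestamps.dt.year.value_counts()
print(counts_by_year)  # 2016: 3
```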
<INDICATOR>
  <INDICATOR_SEQ>373889</INDICATOR_SEQ>
  <PARENT_SEQ></PARENT_SEQ>
  <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
  <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
  <DESCRIPTION>Percent of the time that escalators are operational systemwide. The availability rate is based on physical observations performed the morning of regular business days only. This is a new indicator the agency began reporting in 2009.</DESCRIPTION>
  <PERIOD_YEAR>2011</PERIOD_YEAR>
  <PERIOD_MONTH>12</PERIOD_MONTH>
  <CATEGORY>Service Indicators</CATEGORY>
  <FREQUENCY>M</FREQUENCY>
  <DESIRED_CHANGE>U</DESIRED_CHANGE>
  <INDICATOR_UNIT>%</INDICATOR_UNIT>
  <DECIMAL_PLACES>1</DECIMAL_PLACES>
  <YTD_TARGET>97.00</YTD_TARGET>
  <YTD_ACTUAL></YTD_ACTUAL>
  <MONTHLY_TARGET>97.00</MONTHLY_TARGET>
  <MONTHLY_ACTUAL></MONTHLY_ACTUAL>
</INDICATOR>
for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
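The loop above uses lxml.objectify. An equivalent sketch with the standard library's xml.etree.ElementTree (a swap, not the book's code) parses one such record into a dict; note that plain ElementTree exposes raw text strings, where objectify's .pyval would give typed values:

```python
import xml.etree.ElementTree as ET

# An abbreviated copy of the INDICATOR record shown earlier.
xml_text = """<INDICATOR>
  <INDICATOR_SEQ>373889</INDICATOR_SEQ>
  <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
  <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
  <PERIOD_YEAR>2011</PERIOD_YEAR>
  <YTD_TARGET>97.00</YTD_TARGET>
</INDICATOR>"""

skip_fields = ['INDICATOR_SEQ']
root = ET.fromstring(xml_text)

el_data = {}
for child in root:  # iterating an Element yields its child elements
    if child.tag in skip_fields:
        continue
    # ElementTree gives the text content as a string; lxml.objectify
    # would convert '2011' to an int and '97.00' to a float.
    el_data[child.tag] = child.text

print(el_data['AGENCY_NAME'])  # Metro-North Railroad
```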
Lastly, convert this list of dicts to a DataFrame:
In [81]: perf = pd.DataFrame(data)
In [82]: perf.head()
Out[82]:
Empty DataFrame
Columns: []
Index: []
HDF5 is a well-regarded file format for storing large quantities of scientific array data. It is available as a C library, and it has interfaces in many other languages, including Java, Python, and MATLAB. The "HDF" in HDF5 stands for hierarchical data format. Each HDF5 file contains a filesystem-like node structure, enabling you to store multiple datasets along with supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compressors, so data with repeated patterns can be stored more efficiently. HDF5 can be a good choice for very large datasets that don't fit into memory, as it lets you efficiently read and write in chunks.
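A minimal sketch of the pandas interface to HDF5 (this assumes the PyTables package is installed, which backs pandas' HDF5 support):

```python
import os

import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': np.random.standard_normal(100)})

# HDFStore works like a dict of named datasets inside one .h5 file.
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame            # store a whole DataFrame
store['obj1_col'] = frame['a']   # store a single column as a Series
retrieved = store['obj1']        # read the DataFrame back
store.close()
os.remove('mydata.h5')           # clean up the demo file

print(retrieved.equals(frame))  # True
```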
In [119]: issues = pd.DataFrame(data, columns=['number', 'title',
   .....:                                      'labels', 'state'])
In [120]: issues
Out[120]:
    number                                              title  \
0    17666  Period does not round down for frequencies les...
1    17665           DOC: improve docstring of function where
2    17664               COMPAT: skip 32-bit test on int repr
3    17662                          implement Delegator class
4    17654  BUG: Fix series rename called with str alterin...
..     ...                                                ...
25   17603  BUG: Correctly localize naive datetime strings...
26   17599                     core.dtypes.generic --> cython
27   17596  Merge cdate_range functionality into bdate_range
28   17587  Time Grouper bug fix when applied for list gro...
29   17583  BUG: fix tz-aware DatetimeIndex + TimedeltaInd...

                                               labels state
0                                                  []  open
1   [{'id': 134699, 'url': 'https://api.github.com...  open
2   [{'id': 563047854, 'url': 'https://api.github....  open
3                                                  []  open
4   [{'id': 76811, 'url': 'https://api.github.com/...  open
..                                                ...   ...
25  [{'id': 76811, 'url': 'https://api.github.com/...  open
26  [{'id': 49094459, 'url': 'https://api.github.c...  open
27  [{'id': 35818298, 'url': 'https://api.github.c...  open
28  [{'id': 233160, 'url': 'https://api.github.com...  open
29  [{'id': 76811, 'url': 'https://api.github.com/...  open

[30 rows x 4 columns]
In [134]: pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
Out[134]:
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
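The columns= trick above pulls names from cursor.description, which DB-API cursors populate with one tuple per result column (element 0 is the name). A self-contained sketch with the standard library's sqlite3 module, using data values assumed to match the printed result:

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)')
rows_in = [('Atlanta', 'Georgia', 1.25, 6),
           ('Tallahassee', 'Florida', 2.6, 3),
           ('Sacramento', 'California', 1.7, 5)]
con.executemany('INSERT INTO test VALUES (?, ?, ?, ?)', rows_in)
con.commit()

cursor = con.execute('SELECT * FROM test')
rows = cursor.fetchall()

# cursor.description holds one 7-tuple per column; only the first
# element (the column name) is populated by sqlite3.
df = pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
print(df['a'].tolist())  # ['Atlanta', 'Tallahassee', 'Sacramento']
```

In practice pandas' read_sql function wraps exactly this pattern, but seeing it spelled out shows where the column names come from.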