How to use Python for data analysis

pandas is the primary library for the rest of this book. It meets the following needs:

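Data structures with labeled axes that support automatic or explicit data alignment. This prevents many common errors caused by misaligned data and by data from different sources that are indexed differently.

Integrated time series functionality

The same data structures handle both time series and non-time-series data

Arithmetic operations and reductions (such as summing along an axis) can be carried out according to the metadata (axis labels)

Flexible handling of missing data

Merges and other relational operations found in common databases (for example, SQL-based ones)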

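1. Introduction to pandas data structures

There are two main data structures: Series and DataFrame. A Series is an object similar to a one-dimensional NumPy array. It consists of a set of data (of any NumPy data type) and an associated set of data labels (its index). The index and values attributes give access to the labels and the data respectively, and an index can be supplied when the Series is created. If no index is specified, an integer index from 0 to N-1 is created automatically.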

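    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # A Series can be given an index; it then behaves a bit like a dict keyed by label
    obj = Series([1, 2, 3], index=['a', 'b', 'c'])
    # print(obj['a'])

    # In other words, a Series can be created directly from a dict
    dic = dict(key=['a', 'b', 'c'], value=[1, 2, 3])
    dic = Series(dic)

    # A list of labels can be used to set a new index
    key1 = ['a', 'b', 'c', 'd']
    # Passing an existing Series plus an index pulls out the values for the matching
    # labels ('d' has no match and becomes NaN); note that a plain dict cannot be
    # queried with a list of keys like this
    dic1 = Series(obj, index=key1)
    # print(dic)
    # print(dic1)

    # isnull and notnull detect missing data
    # print(pd.isnull(dic1))

    # A key feature of Series is automatic alignment on index labels
    dic2 = Series([10, 20, 30, 40], index=['a', 'b', 'c', 'e'])
    # print(dic1 + dic2)

    # The name attribute gives the Series and its index a name
    dic1.name = 's1'
    dic1.index.name = 'key1'

    # The index of a Series can be replaced in place by assignment
    dic1.index = ['x', 'y', 'z', 'w']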
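The print calls in the listing above are commented out, so the effect of alignment is not visible. The following is a small sketch of the same idea with the results spelled out. It uses `reindex` where the listing passes a Series plus an index to the constructor (the two are assumed here to be equivalent for this purpose), and the variable names are illustrative only.

```python
import pandas as pd

# Two Series whose indexes only partly overlap
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'e'])

# Reindexing to a longer label list introduces NaN for the new label 'd'
s1_ext = s1.reindex(['a', 'b', 'c', 'd'])
missing = pd.isnull(s1_ext)   # boolean mask, True only at 'd'

# Addition aligns on labels; a label present on one side only gives NaN
total = s1_ext + s2
print(total)
```

The result is indexed by the union of both label sets, with NaN at 'd' and 'e', where only one operand has a value.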

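A DataFrame is a tabular structure containing an ordered collection of columns, each of which can hold a different data type. It has both a row index and a column index, and can be thought of as a dict of Series that all share the same index. Compared with similar structures (such as R's data.frame), row-oriented and column-oriented operations on a DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks (not as a list, a dict, or some other collection).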

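    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # A DataFrame can be built directly from a dict of equal-length lists or Series
    # Unequal lengths raise an error
    data = {'a': [1, 2, 3],
            'c': [4, 5, 6],
            'b': [7, 8, 9]}

    # Older pandas sorted the columns by name; with Python 3.7+ and current
    # pandas the columns keep the dict's insertion order
    frame = DataFrame(data)
    # print(frame)

    # When columns is given, the columns appear in exactly that order
    frame = DataFrame(data, columns=['a', 'c', 'b'])
    print(frame)

    # A column with no data ('d') is allowed and is filled with NaN; index sets the row labels
    frame1 = DataFrame(data, columns=['a', 'b', 'c', 'd'],
                       index=['one', 'two', 'three'])
    print(frame1)

    # Retrieve a column dict-style or as an attribute
    print(frame['a'])
    print(frame.b)

    # To modify a column, select it and assign to it
    # A row can be selected by label (loc) or by position (iloc);
    # the original text used .ix, which was removed in pandas 1.0
    print(frame1.loc['two'])

    # When a Series is assigned to a column, its index is used to place each value exactly
    frame1['d'] = Series([100, 200, 300], index=['two', 'one', 'three'])
    print(frame1)

    # Delete a column with the del statement
    del frame1['d']

    # Warning: a column selected by name may be a view on the frame's data rather
    # than a copy; use the Series copy method to get an independent copy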

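Another common input is a nested dict, that is, a dict of dicts. By default the outer keys become the columns and the inner keys become the row index.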

    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # The keys of the inner dicts are combined (and, in older pandas, sorted)
    # to form the final row index
    pop = {'Nevada': {2001: 2.4, 2002: 2.9},
           'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

    frame3 = DataFrame(pop)
    # print(frame3)

    # The row index and the columns of a DataFrame each have a name attribute,
    # and the DataFrame has a values attribute
    frame3.index.name = 'year'
    frame3.columns.name = 'state'
    print(frame3)
    print(frame3.values)

The kinds of data that the DataFrame constructor accepts are listed below.
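As a hedged illustration of a few inputs the constructor accepts, the sketch below builds DataFrames from a 2D ndarray, a dict of Series, and a list of dicts. It is not a complete list, and all labels and values are made up for the example.

```python
import numpy as np
import pandas as pd

# 1. A 2D ndarray, with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['r1', 'r2'], columns=['a', 'b', 'c'])

# 2. A dict of Series: the Series indexes are combined into the row index,
#    and positions without a value become NaN
from_series = pd.DataFrame({
    'x': pd.Series([1.0, 2.0], index=['r1', 'r2']),
    'y': pd.Series([3.0], index=['r2']),
})

# 3. A list of dicts: each dict becomes one row, and the keys become columns
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'c': 4}])

print(from_array.shape, from_series.shape, from_records.shape)
```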

Index objects

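    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # pandas Index objects hold the axis labels and other metadata. Any array or
    # other sequence of labels used when building a Series or DataFrame is
    # converted to an Index
    obj = Series(range(3), index=['a', 'b', 'c'])
    index = obj.index
    # print(index)

    # Index objects are immutable. This matters because it lets an Index be
    # shared safely among several data structures
    index1 = pd.Index(np.arange(3))
    obj2 = Series([1.5, -2.5, 0], index=index1)
    print(obj2.index is index1)

    # Besides looking like an array, an Index also behaves like a fixed-size set
    # (frame3 is the DataFrame built in the nested-dict example above)
    print('Ohio' in frame3.columns)
    print(2003 in frame3.index)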

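Index is a class in pandas. The main kinds of Index object in pandas, and when each is used, are as follows.

Below are the methods and attributes of Index. Note that an Index is not an array.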
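A short sketch of some set-like Index methods, plus a check that an Index cannot be modified in place. The labels are arbitrary and only a handful of methods are shown.

```python
import pandas as pd

idx1 = pd.Index(['a', 'b', 'c'])
idx2 = pd.Index(['b', 'c', 'd'])

# Set-like operations return new Index objects
common = idx1.intersection(idx2)     # labels in both
combined = idx1.union(idx2)          # labels in either
only_first = idx1.difference(idx2)   # labels in idx1 but not in idx2

# Membership tests work like a set
has_b = 'b' in idx1
mask = idx1.isin(['a', 'z'])         # element-wise membership as a boolean array

# An Index cannot be modified in place
try:
    idx1[0] = 'z'
    mutated = True
except TypeError:
    mutated = False

print(list(common), list(combined), list(only_first), has_b, mutated)
```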

2. Essential functionality

This section covers the basic ways of working with data in a Series or DataFrame. First, reindexing:

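    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # Series has a reindex method that rearranges the data to match a new index
    obj = Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
    # reindex does not change obj; it returns a new object
    # fill_value is the value used for labels that were not present before
    # print(obj.reindex(['a', 'c', 'd', 'b', 'e'], fill_value=0))
    # print(obj)

    obj2 = Series(['red', 'blue'], index=[0, 4])
    # method='ffill' fills each gap with the last preceding value (forward fill)
    obj3 = obj2.reindex(range(6), method='ffill')
    # print(obj3)

    # For a DataFrame, reindex can change the rows, the columns, or both
    frame = DataFrame(np.arange(9).reshape((3, 3)),
                      index=['a', 'c', 'd'],
                      columns=['Ohio', 'Texas', 'California'])
    # Passing a single sequence reindexes the rows, presumably because the
    # row argument is the one called index (my guess)
    frame2 = frame.reindex(['a', 'b', 'c', 'd'])
    # print(frame2)

    # A label that did not exist before yields NaN
    # frame3 = frame.reindex(['e'])
    # print(frame3)

    states = ['Texas', 'Utah', 'California']
    # Reindex rows and columns together. The original passed index, columns and
    # method='ffill' in one call and filled along the rows only. In current pandas
    # method applies to every axis being reindexed and needs monotonic labels,
    # so the rows are forward-filled first and the columns selected afterwards
    frame4 = frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill').reindex(columns=states)
    # print(frame4)

    # The original used frame.ix[[...], states] as a concise way to reindex.
    # .ix was removed in pandas 1.0 and .loc raises KeyError for missing labels,
    # so reindex is used here instead
    print(frame.reindex(index=['a', 'd', 'c', 'b'], columns=states))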

Dropping entries from an axis

    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # drop removes entries (rows or columns) from an axis
    obj = Series(np.arange(3.), index=['a', 'b', 'c'])
    # The original Series is left unchanged; drop returns a new object
    obj.drop('b')
    # print(obj)

    # Rows are dropped by default; dropping columns needs axis=1
    # (frame is the DataFrame from the reindexing example above)
    print(frame.drop(['a']))
    print(frame.drop(['Ohio'], axis=1))

Indexing, selection, and filtering

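    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    obj = Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
    frame = DataFrame(np.arange(9).reshape((3, 3)),
                      index=['a', 'c', 'd'],
                      columns=['Ohio', 'Texas', 'California'])

    # Slicing and indexing a Series
    # print(obj[obj < 2])
    # Note: slicing by label differs from ordinary Python slicing, since both
    # endpoints are included
    print(obj['b':'c'])

    # For a DataFrame, a column can be selected directly by name
    print(frame['Ohio'])

    # Special case: a slice or a boolean array selects rows
    print(frame[:2])
    print(frame[frame['Ohio'] != 0])

    # A boolean DataFrame applies to every element rather than to rows or columns.
    # The result is a DataFrame (not a numpy.ndarray) with NaN where the
    # condition is False
    print(frame[frame < 5], type(frame[frame < 5]))
    frame[frame < 5] = 0
    print(frame)

    # Label-based indexing on a DataFrame. The original used .ix, which was
    # removed in pandas 1.0; .loc (labels) and .iloc (positions) replace it
    print(frame.loc[['a', 'd'], ['Ohio', 'Texas']])
    print(frame.iloc[2])                            # a single integer selects a row
    print(frame.loc[frame.Ohio > 0])                # a boolean mask selects rows
    print(frame.loc[frame.Ohio > 0].iloc[:, :2])    # after the comma come the columns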

The commonly used indexing options are as follows:
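The sketch below shows several indexing forms side by side, assuming a current pandas release in which `.loc` and `.iloc` are the label-based and position-based indexers (the older `.ix` indexer was removed in pandas 1.0). It is illustrative rather than exhaustive.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

col = frame['Ohio']                                # one column, as a Series
row_by_label = frame.loc['c']                      # one row selected by label
row_by_pos = frame.iloc[2]                         # one row selected by position
block = frame.loc[['a', 'd'], ['Ohio', 'Texas']]   # rows and columns by label
label_slice = frame.loc['a':'c']                   # label slices include both ends
pos_slice = frame.iloc[:2, :2]                     # positional slices exclude the end
filtered = frame.loc[frame['Ohio'] > 0]            # a boolean mask selects rows
cell = frame.at['d', 'Texas']                      # a single scalar by label

print(block)
print(cell)
```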

Arithmetic and data alignment

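    import numpy as np
    from pandas import Series, DataFrame

    # An important pandas feature is automatic alignment on the index; labels
    # that do not overlap produce NaN in the result
    s1 = Series([1, 2, 3], ['a', 'b', 'c'])
    s2 = Series([4, 5, 6], ['b', 'c', 'd'])
    # print(s1 + s2)

    df1 = DataFrame(np.arange(12.).reshape(3, 4), columns=list('abcd'))
    df2 = DataFrame(np.arange(20.).reshape(4, 5), columns=list('abcde'))
    # print(df1 + df2)

    # Use the add method and pass a fill value. Note that the fill_value argument
    # fills in the missing operand first and then adds; it does not add to get
    # NaN and fill afterwards
    # print(df1.add(df2, fill_value=1000))
    # df1.reindex(columns=df2.columns, fill_value=0)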

Besides add, there are other arithmetic methods:
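As a sketch of the other arithmetic methods, the example below uses `sub`, `mul`, `div` and the reversed variant `rsub` on two small frames. The frames are made up for the illustration, and only a few of the available methods are shown.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.).reshape(2, 2), columns=list('ab'))
df2 = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))

plain = df1 - df2                     # column 'c' is all NaN: df1 has no 'c'
diff = df1.sub(df2, fill_value=0)     # missing df1['c'] treated as 0 before subtracting
prod = df1.mul(df2, fill_value=1)     # missing df1['c'] treated as 1 before multiplying
ratio = df1.div(df2)                  # same as df1 / df2, so 'c' stays NaN
rdiff = df1.rsub(df2, fill_value=0)   # reversed operands: df2 - df1

print(diff)
print(rdiff)
```

In each case `fill_value` stands in for the operand that is missing on one side; a position missing on both sides would still be NaN.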

Operations between DataFrame and Series

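    import numpy as np
    from pandas import Series, DataFrame

    # How arithmetic between a DataFrame and a Series is carried out
    arr = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

    # The result shows that the Series is subtracted from each row in turn;
    # this is called broadcasting
    # Note: by default the index of the Series is matched against the columns of
    # the DataFrame, and the operation is then broadcast down the rows
    # Note: arr - arr[0] would be wrong here, because arr[0] looks for a column
    # labelled 0; a row is selected by position with iloc (the original used .ix)
    print(arr - arr.iloc[0])

    Series2 = Series(range(3), index=list('cdf'))
    # As the rule implies, labels that do not match produce NaN columns
    print(arr + Series2)

    # To match on the rows and broadcast across the columns, use one of the
    # arithmetic methods
    Series3 = arr['d']
    # axis is the axis to match on
    print(arr.sub(Series3, axis=0))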
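To make the two matching directions concrete, this sketch compares each pandas result with the equivalent plain NumPy broadcast. The NumPy expressions are an illustration added here and do not come from the text.

```python
import numpy as np
import pandas as pd

arr = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

# Default: the Series index is matched against the columns, row by row
by_row = arr - arr.iloc[0]
np_by_row = arr.values - arr.values[0]              # shape (3, 4) minus shape (4,)

# axis=0: the Series index is matched against the row index instead
by_col = arr.sub(arr['d'], axis=0)
np_by_col = arr.values - arr['d'].values[:, None]   # shape (3, 4) minus shape (3, 1)

print(np.array_equal(by_row.values, np_by_row))
print(np.array_equal(by_col.values, np_by_col))
```

The difference from raw NumPy is that pandas matches by label rather than by position, so the shapes line up this neatly only because the labels already agree.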

Function application and mapping

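    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # NumPy's element-wise array functions also work on pandas objects
    frame = DataFrame(np.random.randn(4, 3), columns=list('abc'),
                      index=['Ut', 'Oh', 'Te', 'Or'])
    print(frame)
    # Absolute value:
    # print(np.abs(frame))

    # Another common operation is applying a function to each row or column
    # with the apply method, much as in R
    fun = lambda x: x.max() - x.min()
    # By default the function is applied to each column
    print(frame.apply(fun))
    # With axis=1 it is applied to each row
    print(frame.apply(fun, axis=1))

    # Many statistics do not need apply at all; just call the method
    print(frame.sum())

    # Besides scalars, the function passed to apply may return a Series holding
    # several values
    def f(x):
        return Series([x.min(), x.max()], index=['min', 'max'])
    # print(frame.apply(f))

    # Element-wise Python functions can be used too, via applymap
    # (pandas 2.1 renamed this method to DataFrame.map, and newer releases may
    # no longer provide applymap)
    fmt = lambda x: '%.2f' % x
    print(frame.applymap(fmt))

    # It is called applymap because Series has a map method for applying an
    # element-wise function
    print(frame['b'].map(fmt))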

Sorting and ranking

    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # sort_index sorts by the row or column index
    obj = Series(range(4), index=['d', 'a', 'b', 'c'])
    print(obj.sort_index())

    frame = DataFrame(np.arange(8).reshape((2, 4)),
                      index=['three', 'one'],
                      columns=['d', 'a', 'b', 'c'])
    # By default the row index is sorted; pass axis=1 to sort the column index
    print(frame.sort_index())
    print(frame.sort_index(axis=1))
    print(frame.sort_index(axis=1, ascending=False))

    # To sort by value use sort_values (the original used Series.order(), which
    # has since been removed); any missing values are placed at the end
    print(obj.sort_values())
    # NumPy's sort can also be used
    print(np.sort(obj))

    # To sort a DataFrame by the values in one or more columns, pass by to
    # sort_values (the original used sort_index(by=...), also since removed)
    frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
    print(frame.sort_values(by=['a', 'b']))

    # rank returns each value's rank in ascending order; ties are broken by
    # assigning every member of the group the mean rank
    # Ranks start at 1
    obj = Series([7, -5, 7, 4, 2, 0, 4])
    print(obj.rank())

    # NumPy's argsort is different: it returns the positions (starting at 0)
    # that would put the data in sorted order
    order = np.argsort(obj.values)
    print(order)  # prints 1,5,4,3,6,0,2; taking values in this order gives them ascending
    print(obj.iloc[order])

    # rank has a method option that controls how ties are ranked
    print(obj.rank(method='first', ascending=False))
    print(obj.rank(method='max', ascending=False))
    print(obj.rank(method='min', ascending=False))

    # For a DataFrame, rank by default ranks within each column
    print(frame.rank())
    print(frame.rank(axis=1))
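As a worked example of how the `method` option treats ties, the sketch below ranks the same Series in ascending order under four methods and collects the results in one table. The `summary` frame is just a convenience for the comparison.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

avg = obj.rank()                   # ties share the mean of their positions
first = obj.rank(method='first')   # ties ranked in order of appearance
low = obj.rank(method='min')       # ties all get the lowest rank in the group
high = obj.rank(method='max')      # ties all get the highest rank in the group

summary = pd.DataFrame({'value': obj, 'average': avg, 'first': first,
                        'min': low, 'max': high})
print(summary)
```

The two 7s occupy sorted positions 6 and 7, so they receive 6.5 under the default, 6 and 7 under 'first', 6 under 'min', and 7 under 'max'.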

Axis indexes with duplicate labels

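    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # Many pandas functions (such as reindex) require unique labels, but
    # uniqueness is not enforced
    obj = Series(range(5), index=list('aabbc'))
    print(obj)

    # is_unique tells whether the index labels are unique
    print(obj.index.is_unique)

    # Selecting a duplicated label returns a Series; a unique label returns a scalar
    print(obj['a'])

    # The same holds for a DataFrame (.loc replaces the removed .ix)
    df = DataFrame(np.random.randn(4, 3), index=list('aabb'))
    print(df)
    print(df.loc['b'])

    # When importing your own data, check index uniqueness before processing it;
    # when building a DataFrame yourself, take care not to create duplicates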
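One possible way to carry out the uniqueness check suggested above is sketched here. Whether to keep the first occurrence or to combine the duplicates depends on the data, so both choices are assumptions made for illustration.

```python
import pandas as pd

obj = pd.Series(range(5), index=list('aabbc'))

is_unique = obj.index.is_unique       # False: 'a' and 'b' repeat
dup_mask = obj.index.duplicated()     # True for every repeat after the first
deduped = obj[~dup_mask]              # keep the first occurrence of each label
summed = obj.groupby(level=0).sum()   # or combine the duplicates instead

print(deduped)
print(summed)
```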

3. Summarizing and computing descriptive statistics

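    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # pandas objects come with a set of common mathematical and statistical
    # methods. Most are reductions, which extract a single value from a Series
    # or a Series from the rows or columns of a DataFrame
    # Note: unlike the NumPy array equivalents, these methods are built to cope
    # with missing data, that is, they skip missing values automatically
    df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                   index=list('abcd'), columns=['one', 'two'])
    print(df.sum())
    print(df.sum(axis=1))

    # idxmin and idxmax return the index label at which the minimum or maximum occurs
    print(df.idxmin())
    # Row 'c' is entirely NaN; recent pandas versions warn or raise for all-NaN
    # rows, so that row is dropped first
    print(df.dropna(how='all').idxmin(axis=1))

    # Accumulating methods
    print(df.cumsum())

    # describe, which is much like the describe function in R
    print(df.describe())

    # For non-numeric data, see the result below
    obj = Series(['c', 'a', 'a', 'b', 'd'] * 4)
    print(obj.describe())
    '''Result:
    count     20
    unique     4
    top        a
    freq       8
    where freq is the number of occurrences of the most frequent letter'''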
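The skipping of missing values described above can be switched off with the `skipna` argument. The sketch below reuses the same `df` to show the difference; `count` is included as one way to see how many values actually went into each reduction.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=list('abcd'), columns=['one', 'two'])

col_sum = df.sum()                                # NaN values are skipped
row_mean = df.mean(axis=1)                        # row 'c' is all NaN, so its mean is NaN
row_mean_strict = df.mean(axis=1, skipna=False)   # any NaN in a row gives NaN
non_null = df.count()                             # non-NaN values per column

print(row_mean)
print(row_mean_strict)
```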

    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    # Now look at cummin
    # Note: cummin gives the smallest value seen so far down each column,
    # not the minimum of a running sum
    frame = DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [-10, 11, 12, -13]],
                      index=list('abc'),
                      columns=['one', 'two', 'three', 'four'])
    print(frame.cummin())
    print(frame)

    >>>
       one  two  three  four
    a    1    2      3     4
    b    1    2      3     4
    c  -10    2      3   -13
       one  two  three  four
    a    1    2      3     4
    b    5    6      7     8
    c  -10   11     12   -13

Correlation and covariance

Some summaries
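As a minimal sketch of the pandas methods for correlation and covariance, the example below uses a small made-up frame (not data from the text) in which `y` is an exact multiple of `x` and `z` decreases as `x` increases.

```python
import pandas as pd

# Made-up illustrative data, not taken from the text
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                   'y': [2.0, 4.0, 6.0, 8.0],    # y = 2 * x
                   'z': [4.0, 3.0, 2.0, 1.0]})   # z = 5 - x

r_xy = df['x'].corr(df['y'])    # Pearson correlation of two Series
c_xy = df['x'].cov(df['y'])     # sample covariance of two Series
corr_matrix = df.corr()         # pairwise correlations of all columns
cov_matrix = df.cov()           # pairwise covariances of all columns
with_x = df.corrwith(df['x'])   # correlation of each column with one Series

print(corr_matrix)
```

Both the Series methods and the DataFrame methods align on the index and leave out missing values before computing.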