It turns out Zhihu does not support Markdown, so apologies in advance: this post is written in raw Markdown.
# DataFrame
A DataFrame is a **tabular** data structure.
A DataFrame has both a row index (`index`) and a column index (`columns`).
It can be viewed as a dictionary of Series that all share the same index.
## 1. Create DataFrame
* Create a DataFrame from a dictionary
```python
import numpy as np
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'London', 'London'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame = pd.DataFrame(data)  # each dictionary key becomes a column
frame
```
![Aaron Swartz](https://github.com/tsheng0315/markdown-image/raw/master/Screenshot%20from%202020-03-18%2016-09-04.png)
* Set the index--specify the index values when creating the DataFrame
```python
frame2 = pd.DataFrame(data,index=['one','two','three','four','five'],columns=['year','state','pop','debt'])
frame2
```
![Aaron Swartz](https://github.com/tsheng0315/markdown-image/raw/master/Screenshot%20from%202020-03-18%2016-47-08.png)
* Set the index--build the DataFrame from a nested dictionary
The outer keys become the column labels and the inner keys become the row index
```python
pop = {'London':{2001:2.4,2002:2.9},'Ohio':{2000:1.5,2001:1.7,2002:3.6},'NewYork':{2000:3.4,2001:3.3,2002:4.6},'Beijing':{2000:5.6,2002:3.6}}
frame3 = pd.DataFrame(pop)
frame3
```
![Aaron Swartz](https://github.com/tsheng0315/markdown-image/raw/master/Screenshot%20from%202020-03-18%2017-12-40.png)
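To check which level of a nested dictionary ends up on which axis, the sketch below rebuilds a trimmed-down version of the dictionary and inspects both axes. The names `pop_small`, `by_city`, and `by_year` are made up for this illustration.

```python
import pandas as pd

# Hypothetical, shortened version of the nested dictionary used above
pop_small = {
    'London': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6},
}

by_city = pd.DataFrame(pop_small)   # outer keys -> columns, inner keys -> row index
by_year = by_city.T                 # transpose to put the cities on the rows instead

print(by_city.columns.tolist())     # the outer keys
print(sorted(by_city.index))        # the union of the inner keys
print(by_city.loc[2000, 'London'])  # London has no entry for 2000, so this is NaN
```

Transposing with `.T` is the simplest way to get the outer keys onto the rows if that layout is preferred.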
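* **Accessing a DataFrame**
Use the row index, the column index, and `values` to access a single row, a single column, and all the values in the table, respectively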
```python
frame3.values  # 2-D NumPy array of all values; frame3.to_numpy() is the recommended equivalent
```
This returns a two-dimensional `ndarray` (a NumPy structure):
![Aaron Swartz](https://github.com/tsheng0315/markdown-image/raw/master/Screenshot%20from%202020-03-18%2017-11-52.png)
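As a rough sketch of the three access paths mentioned above (one row, one column, all values), assuming a small frame named `demo` that is not part of the original post:

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame({'Ohio': [1.5, 1.7, 3.6], 'London': [np.nan, 2.4, 2.9]},
                    index=[2000, 2001, 2002])

row = demo.loc[2001]     # one row, selected by its index label -> Series
col = demo['Ohio']       # one column, selected by its column label -> Series
arr = demo.to_numpy()    # every value in the table -> 2-D ndarray

print(row)
print(col)
print(arr.shape, arr.dtype)
```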
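* **How arrays print**
A one-dimensional array prints as a row, a two-dimensional array as a matrix, and a three-dimensional array as a list of matrices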
```python
[1 3 5]
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
```
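The printout above can be reproduced with a few lines of NumPy. The array names are chosen for the example; the original post does not show how the arrays were built.

```python
import numpy as np

a1 = np.array([1, 3, 5])              # 1-D: prints as a single row
a2 = np.arange(12).reshape(3, 4)      # 2-D: prints as a matrix
a3 = np.arange(24).reshape(2, 3, 4)   # 3-D: prints as a list of matrices

for arr in (a1, a2, a3):
    print(arr.ndim, arr.shape)
    print(arr)
```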
## 2. Reading files
The most common ways to build a DataFrame from a file are the `read_csv` and `read_table` functions.
| parameter | description |
|-----------|--------------|
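| header | By default the first row supplies the column names; `header=None` means the file has no header row, so the first row is data |
| index_col | Column(s) to use as the row index; by default (`None`) no column is used and a default integer index is created, and `index_col=False` forces pandas not to use the first column as the index |
| nrows | Number of rows to read |
* Other ways to create a DataFrame: read from **MySQL** or **JSON**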
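A small sketch of how the three parameters in the table behave, reading from an in-memory string via `io.StringIO` instead of a real file. The CSV text and the variable names are invented for the example.

```python
import io
import pandas as pd

csv_text = "state,year,pop\nOhio,2000,1.5\nOhio,2001,1.7\nLondon,2001,2.4\n"

# Default: the first row supplies the column names, the index is 0, 1, 2
default = pd.read_csv(io.StringIO(csv_text))

# header=None: the first row is treated as data, columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# index_col=0: use the first column as the row index; nrows=2: read two data rows
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0, nrows=2)

print(default)
print(no_header)
print(indexed)
```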
## 3. DataFrame axis
![Aaron Swartz](https://github.com/tsheng0315/markdown-image/raw/master/axis.png)
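As a concrete reading of the axis diagram: `axis=0` runs down the rows, so a reduction yields one result per column, while `axis=1` runs across the columns and yields one result per row. A minimal sketch with a made-up frame `grid`:

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=['r0', 'r1'], columns=['a', 'b', 'c'])

down_columns = grid.sum(axis=0)   # collapse the rows: one result per column
across_rows = grid.sum(axis=1)    # collapse the columns: one result per row

print(down_columns)
print(across_rows)
```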
## 4. DataFrame properties
### Slice
* Select a single column by name; this returns a Series:
![Aaron Swartz](https://github.com/tsheng0315/markdown-image/raw/master/Screenshot%20from%202020-03-18%2016-47-08.png)
```python
frame2['year']
```
![Aaron Swartz](https://github.com/tsheng0315/markdown-image/raw/master/slice.png)
* Select multiple columns or rows
* Create a 4x4 DataFrame
```python
data = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
data
```
![Aaron Swartz](https://github.com/tsheng0315/markdown-image/raw/master/create4by4.png)
```python
data2 = pd.DataFrame(np.arange(16).reshape(4,4),index=list('abcd'),columns=list('wxyz'))
data2
```
![Aaron Swartz](https://github.com/tsheng0315/markdown-image/raw/master/create4*4array-index%26col.png)
* Select two columns
```python
data[['two','three']]  # double brackets: the inner list holds the column labels
```
![Aaron Swartz](https://github.com/tsheng0315/markdown-image/raw/master/select%202%20column.png)
* Select the first two rows
```python
data[:2]
```
![Aaron Swartz](https://github.com/tsheng0315/markdown-image/raw/master/select%202%20lines.png)
```python
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
data = DataFrame(np.arange(16).reshape(4,4),index=list('abcd'),columns=list('wxyz'))
```
```python
import numpy as np
import pandas as pd
data = pd.DataFrame(np.arange(16).reshape(4,4),index=list('abcd'),columns=list('wxyz'))
```
* Get values from a DataFrame
```python
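data['w']                # select column 'w' with dictionary-style access; returns a Series
data.w                   # select column 'w' with attribute access; returns a Series
data[['w']]              # select column 'w'; returns a DataFrame
data[['w','z']]          # select columns 'w' and 'z'; returns a DataFrame
data[0:2]                # rows 1 and 2 by position; half-open slice (start included, end excluded)
data[1:2]                # row 2 only, written as a slice;
                         # data[1] raises a KeyError because it is looked up as a column label
data['a':'b']            # slice by index label; **both ends are included**
data.iloc[0]             # first row of data (replaces the removed data.irow(0))
data.iloc[:, 0]          # first column of data (replaces the removed data.icol(0))
data.head()              # first rows of data, five by default; use data.head(10) for ten
data.tail()              # last rows of data, five by default; use data.tail(10) for ten
data.iloc[-1]            # by position: last row, returned as a Series
data.iloc[-1:]           # by position: last row, returned as a DataFrame
data.loc['a',['w','x']]  # by label: row 'a', columns 'w' and 'x'; use when the labels are known
data.iat[1,1]            # second row, second column; use when the integer positions are known
ser = data['w']          # example Series (assumed here; the original does not define ser)
ser.iloc[0]              # first element of ser (replaces the removed ser.iget_value(0))
ser.iloc[-1]             # last element of ser; with an integer index ser[-1] is ambiguous (label or position), so use iloc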
```
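Label-based and position-based selection are easy to mix up, so here is a short sketch on the same 4x4 frame. It uses only `loc`, `iloc`, `at`, and `iat`; the variable names on the left are chosen for the example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=list('abcd'), columns=list('wxyz'))

by_label = data.loc['a':'b']       # label slice: both ends included -> rows a, b
by_position = data.iloc[0:2]       # position slice: end excluded -> rows 0, 1
first_row = data.iloc[0]           # first row as a Series
first_col = data.iloc[:, 0]        # first column as a Series
cell_by_label = data.at['b', 'x']  # single value addressed by labels
cell_by_pos = data.iat[1, 1]       # the same cell addressed by integer position

print(by_label.equals(by_position))
print(cell_by_label, cell_by_pos)
```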
### Modify Data
* **Modify Data**--assign a scalar to a column; the scalar is broadcast to every row of the DataFrame
```python
data = {
'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]
}
frame2 = pd.DataFrame(data,index=['one','two','three','four','five'],columns=['year','state','pop','debt'])
print(frame2)
print()
print('change debt:')
frame2['debt']=16.5 # this would change all the debt values into 16.5
print(frame2)
```
Output:
``` python
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
change debt:
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
```
* **Modify Data**--assign a list or array; its length must equal the number of rows in the DataFrame
```python
frame2.debt = np.arange(5) # np.arange(5)--return: numpy.ndarray
print(frame2)
```
Output:
```python
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
```
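What happens when the lengths differ can be checked directly. In this sketch (the frame `small` is invented for the example) a list with the wrong number of elements is rejected with a `ValueError`:

```python
import pandas as pd

small = pd.DataFrame({'year': [2000, 2001, 2002]}, index=['one', 'two', 'three'])

small['debt'] = [10, 20, 30]      # three rows, three values: accepted

try:
    small['debt'] = [10, 20]      # three rows, two values: rejected
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(small)
print(type(mismatch_error).__name__)
```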
* **Modify Data**--assign a Series; values are matched exactly by index label, and rows without a matching label become NaN
```python
val = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])
print(val)
print()
frame2['debt'] = val
print(frame2)
```
Output:
```python
two -1.2
four -1.5
five -1.7
dtype: float64
year state pop debt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
```
* **Modify Index--reindex**
Use `reindex` to conform a DataFrame to a new index.
Rows, columns, or both can be reindexed; if only one positional argument is passed, the rows are reindexed:
```python
frame = pd.DataFrame(np.arange(9).reshape((3,3)),index=[1,4,5],columns=['Ohio','Texas','California'])
print(frame)
print()
frame2 = frame.reindex([1,2,4,5]) # enlarge the dataframe, the absent values are filled with NaN
print(frame2)
frame3=frame.reindex(index=[1,2,4,5],columns=['Texas','Utah','California'])
print(frame3)
states = ['Texas','Utah','California']
frame4=frame.reindex(columns=states)
print(frame4)
```
Output:
```python
Ohio Texas California
1 0 1 2
4 3 4 5
5 6 7 8
Ohio Texas California
1 0.0 1.0 2.0
2 NaN NaN NaN
4 3.0 4.0 5.0
5 6.0 7.0 8.0
Texas Utah California
1 1.0 NaN 2.0
2 NaN NaN NaN
4 4.0 NaN 5.0
5 7.0 NaN 8.0
Texas Utah California
1 1 NaN 2
4 4 NaN 5
5 7 NaN 8
```
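When a fill `method` is given, missing rows are filled from neighboring rows. In this example only the rows can be reindexed this way, because the method requires a sorted (monotonic) index and the column labels are not sorted: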
```python
frame = pd.DataFrame(np.arange(9).reshape((3,3)),index = ['a','c','d'],columns = ['Ohio','Texas','California'])
print(frame)
frame.reindex(['a','b','c','d'],method = 'bfill')
#states = ['Texas','Utah','California']
#frame.reindex(['a','b','c','d'],method = 'bfill',columns=states)  # raises an error here: only the rows can be reindexed with a fill method
```
Output:
```python
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
Ohio Texas California
a 0 1 2
b 3 4 5 # back-filled from row c below
c 3 4 5
d 6 7 8
```
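For comparison, a sketch of the two fill directions on the same frame. `ffill` copies the previous existing row into the new label and `bfill` copies the next one; the variable names are chosen for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

new_index = ['a', 'b', 'c', 'd']
forward = frame.reindex(new_index, method='ffill')    # row b copied from row a
backward = frame.reindex(new_index, method='bfill')   # row b copied from row c

print(forward)
print(backward)
```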
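* **drop()** labels from a given axis
`drop` removes the given labels from an axis and returns a new object; the original DataFrame is not affected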
```python
frame = pd.DataFrame(np.arange(9).reshape((3,3)),index = ['a','c','d'],columns = ['Ohio','Texas','California'])
print(frame)
frame2=frame.drop('a') # drop a row by its index label
print(frame2)
frame3=frame.drop(['Ohio'],axis=1) # drop a column by its label
print(frame3)
```
Output:
```python
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
Ohio Texas California
c 3 4 5
d 6 7 8
Texas California
a 1 2
c 4 5
d 7 8
```
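A quick way to confirm that `drop` leaves the original untouched, here using the `index=` and `columns=` keywords as an alternative to `axis`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

without_row = frame.drop(index='a')          # same result as frame.drop('a')
without_col = frame.drop(columns=['Ohio'])   # same result as frame.drop(['Ohio'], axis=1)

print(frame.shape)          # the original still has every row and column
print(without_row.shape)
print(without_col.shape)
```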
### * Arithmetic operations
Arithmetic between DataFrames aligns on both row and column labels; positions that do not overlap are filled with NaN:
```python
df1 = pd.DataFrame(np.arange(9).reshape((3,3)),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
df2 = pd.DataFrame(np.arange(12).reshape((4,3)),columns = list('bde'),index=['Utah','Ohio','Texas','Oregon'])
print(df1 + df2)
```
Output:
```python
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
```
### * fill in value
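The `fill_value` parameter of `add` substitutes a value for a cell that is missing in one of the two frames; a cell that is missing in both `df1` and `df2` is not filled and stays NaN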
```python
a=df1.add(df2,fill_value=0)
print(a)
```
Output:
```python
b c d e
Colorado 6.0 7.0 8.0 NaN
Ohio 3.0 1.0 6.0 5.0
Oregon 9.0 NaN 10.0 11.0
Texas 9.0 4.0 12.0 8.0
Utah 0.0 NaN 1.0 2.0
```
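To see which cells `fill_value` reaches, the sketch below rebuilds `df1` and `df2` and looks at three representative cells. The name `total` is chosen for the example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

total = df1.add(df2, fill_value=0)

print(total.loc['Ohio', 'b'])   # present in both frames: 0 + 3
print(total.loc['Ohio', 'c'])   # only in df1: 1 + fill_value
print(total.loc['Utah', 'c'])   # missing in both frames: stays NaN
```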
### • Functions
```python
frame = pd.DataFrame(np.random.randn(3,3),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
print(frame)
frame2=np.abs(frame)
print(frame2)
```
Output:
```python
b c d
Ohio 1.080812 0.484312 0.579140
Texas -0.181583 1.410205 -0.374472
Colorado 0.275198 -0.960755 0.376927
b c d
Ohio 1.080812 0.484312 0.579140
Texas 0.181583 1.410205 0.374472
Colorado 0.275198 0.960755 0.376927
```
### * Apply function to dataframe
`apply` calls a function on the one-dimensional array formed by each column or each row
```python
f = lambda x:x.max() - x.min() # define f function:max value- min value
a=frame2.apply(f) # by columns
print(a)
print()
b=frame2.apply(f,axis=1) # by rows
print(b)
def f(x):
return pd.Series(data=[x.min(),x.max()],index=['min','max'])
c=frame2.apply(f)
print(c)
```
Output
```python
b 0.899229
c 0.925892
d 0.204669
dtype: float64
Ohio 0.596500
Texas 1.228622
Colorado 0.685556
dtype: float64
b c d
min 0.181583 0.484312 0.374472
max 1.080812 1.410205 0.579140
```
* Element-wise Python functions
```python
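fmt = lambda x:'%.2f'%x  # named fmt to avoid shadowing the built-in format()
frame2.applymap(fmt)     # element-wise; deprecated since pandas 2.1 in favor of frame2.map(fmt)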
```
Output:
```python
b c d
Ohio 1.08 0.48 0.58
Texas 0.18 1.41 0.37
Colorado 0.28 0.96 0.38
```
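Because `applymap` has been renamed in recent pandas releases, one version-independent way to apply an element-wise function is to map over each column with `Series.map`. A minimal sketch, with a made-up frame `values` and helper `two_decimals`:

```python
import pandas as pd

values = pd.DataFrame({'b': [1.080812, 0.181583], 'c': [0.484312, 1.410205]},
                      index=['Ohio', 'Texas'])

def two_decimals(x):
    """Format a single number with two decimal places."""
    return '%.2f' % x

# apply() hands each column (a Series) to the lambda; Series.map then works element by element
formatted = values.apply(lambda col: col.map(two_decimals))
print(formatted)
```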
### sort & rank
### •sort according to label
For a DataFrame, `sort_index` sorts by the labels of either axis, in ascending or descending order
```python
frame = pd.DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],columns=['d','a','b','c'])
print(frame)
a=frame.sort_index() # sort the rows by index label, ascending
print(a)
b=frame.sort_index(axis=1,ascending=False) # sort the columns by label, descending
print(b)
```
Output:
```python
d a b c
three 0 1 2 3
one 4 5 6 7
d a b c
one 4 5 6 7
three 0 1 2 3
d c b a
three 0 3 2 1
one 4 7 6 5
```
### * sort according to value
```python
# sort by the values of one or more columns
print('frame:')
print(frame)
print('sorted:')
a=frame.sort_values(by=['b','a'])
print(a)
```
Output:
```python
frame:
d a b c
three 0 1 2 3
one 4 5 6 7
sorted:
d a b c
three 0 1 2 3
one 4 5 6 7
```
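The frame above is already in sorted order, so `sort_values` appears to do nothing. The sketch below uses a made-up frame `scores` where the order does change, and also shows `rank`, which the section heading mentions but the post does not demonstrate. By default `rank` gives tied values the average of their ranks.

```python
import pandas as pd

scores = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Ascending sort on a single column
by_b = scores.sort_values(by='b')

# Sort on 'a' ascending, then on 'b' descending within equal 'a' values
by_a_then_b = scores.sort_values(by=['a', 'b'], ascending=[True, False])

# Rank within each column; ties share the average rank
ranks = scores.rank()

print(by_b)
print(by_a_then_b)
print(ranks)
```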
### sum & describe
Methods such as `sum`, `mean`, and `max` accept the axis to aggregate along; `describe()` shows all the summary statistics at once:
```python
df = pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index=['a','b','c','d'],columns=['one','two'])
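print('original:')
print(df)
print('sum:')
a=df.sum(axis=1)
print(a)
print('mean:')
# NaN values are skipped by default; skipna=False turns that off
b=df.mean(axis=1,skipna=False)
print(b)
print('maxId:')
# idxmax is an indirect statistic: it returns the index label at which the maximum occurs
c=df.idxmax()
print(c)
# describe returns a DataFrame of summary statistics;
# non-numeric data yields a different set of statistics than numeric data
print('describe:')
d=df.describe()
print(d)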
```
Output:
```python
original:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
sum:
a 1.40
b 2.60
c 0.00
d -0.55
dtype: float64
mean:
a NaN
b 1.300
c NaN
d -0.275
dtype: float64
maxId:
one b
two d
dtype: object
describe:
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
```
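A comment in the code above notes that `describe` reports different statistics for non-numeric data. A small sketch with two made-up Series, one numeric and one of strings:

```python
import pandas as pd

numeric = pd.Series([1.4, 7.1, 0.75])
labels = pd.Series(['a', 'a', 'b', 'c'])

numeric_summary = numeric.describe()   # count, mean, std, min, quartiles, max
label_summary = labels.describe()      # count, unique, top (most frequent value), freq

print(numeric_summary)
print(label_summary)
```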
### **corr & cov**
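DataFrame also provides `corr` and `cov` to compute the correlation matrix and the covariance matrix of its columns; `corrwith` computes the correlation between each column and a Series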
```python
frame1 = pd.DataFrame(np.random.randn(3,3),index=list('123'),columns=list('abc'))
print(frame1)
a=frame1.corr()
print(a)
print('cov:')
b=frame1.cov() # covariance
print(b)
# corrwith computes the correlation of each column with a Series
frame1.corrwith(frame1['a']) # frame1['a'] is a series structure
```
output:
```python
original:
a b c
1 0.555027 0.196642 0.635890
2 1.771331 -0.222909 -0.741340
3 0.237062 0.754904 -0.259094
corr:
a b c
a 1.000000 -0.918033 -0.627437
b -0.918033 1.000000 0.267263
c -0.627437 0.267263 1.000000
cov:
a b c
a 0.655747 -0.364673 -0.355075
b -0.364673 0.240633 0.091622
c -0.355075 0.091622 0.488386
corrwith:
a 1.000000
b -0.918033
c -0.627437
dtype: float64
```
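The relationship between the two matrices can be verified numerically: each correlation is the covariance divided by the product of the two standard deviations. The sketch below uses a seeded random frame (the seed and the names are arbitrary choices for the example) and also runs `corrwith` against one column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
frame = pd.DataFrame(rng.standard_normal((50, 3)), columns=list('abc'))

corr = frame.corr()
cov = frame.cov()
std = frame.std()

# corr[i, j] == cov[i, j] / (std[i] * std[j])
rebuilt = cov / np.outer(std, std)
print(np.allclose(corr, rebuilt))

# corrwith: correlation of every column with a single Series
with_a = frame.corrwith(frame['a'])
print(with_a)
```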
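### • Handling missing data
* `isnull` tests whether values are missing;
* `fillna` fills in missing values;
* `dropna` discards missing values.
`fillna` and `dropna` return a new Series or DataFrame and leave the original data unchanged.
To modify the original data directly, use the **inplace** parameter: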
* dropna:
```python
data = pd.DataFrame([[1,6.5,3],[1,np.nan,np.nan],[np.nan,np.nan,np.nan],[np.nan,6.5,3]])
print(data)
print('dropna:')
a=data.dropna() # drop every row that contains at least one NaN
print(a)
```
Output:
```python
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
dropna:
0 1 2
0 1.0 6.5 3.0
```
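For a DataFrame, `dropna()` by default removes every row that contains a NaN. The behavior can be changed: `how='all'` removes a row only when every value in it is NaN, and `axis` selects whether rows or columns are dropped.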
``` python
data.dropna(how='all',axis=1,inplace=True) # inplace=True modifies the original data directly
print(data)
data.dropna(how='all',axis=0,inplace=True)
print(data)
```
Output:
```python
axis=1
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
axis=0
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
```
* Missing values in a DataFrame can be filled with one value everywhere, with a different value per column, or with a fill method:
```python
data = pd.DataFrame([[1,6.5,3],[1,np.nan,np.nan],[np.nan,np.nan,np.nan],[np.nan,6.5,3]])
a=data.fillna({1:2,2:3}) # column 1: fill with 2, column 2: fill with 3
print(a)
print(data.ffill()) # forward fill: carry the last valid value down each column (same as the deprecated fillna(method='ffill'))
```
Output:
```python
fillna({1:2,2:3}):
0 1 2
0 1.0 6.5 3.0
1 1.0 2.0 3.0
2 NaN 2.0 3.0
3 NaN 6.5 3.0
method='ffill'
0 1 2
0 1.0 6.5 3.0
1 1.0 6.5 3.0
2 1.0 6.5 3.0
3 1.0 6.5 3.0
```
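`isnull` is listed above but not demonstrated. A brief sketch on the same data, which also shows that `inplace=True` changes the frame itself; the names `mask` and `missing_per_column` are chosen for the example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 6.5, 3], [1, np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3]])

mask = data.isnull()               # boolean frame: True where a value is missing
missing_per_column = mask.sum()    # True counts as 1, so this counts NaNs per column

data.dropna(how='all', inplace=True)   # removes the all-NaN row from data itself

print(missing_per_column)
print(data)
```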