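This post covers the basic part of the 50 pandas exercises, questions 1 to 22, which are the easier ones to follow. Later parts will come in subsequent updates.
1. Import the pandas library under the alias pd and print its version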
import pandas as pd
import numpy as np
pd.__version__
'1.1.3'
2. Create a Series from a list
arr = [0,1,2,3,4]
df = pd.Series(arr) # without an explicit index, labels default to 0, 1, 2, ...
df
0 0
1 1
2 2
3 3
4 4
dtype: int64
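As a small complement to the default index shown above, the sketch below passes an explicit `index` to `pd.Series`. The labels `'v'` to `'z'` and the variable names are made up for illustration and are not part of the original exercise.

```python
import pandas as pd

values = [0, 1, 2, 3, 4]

# Default: pandas assigns integer labels 0..n-1
s_default = pd.Series(values)

# Explicit labels (hypothetical names chosen for this sketch)
s_labeled = pd.Series(values, index=['v', 'w', 'x', 'y', 'z'])

print(s_default.index.tolist())  # [0, 1, 2, 3, 4]
print(s_labeled['x'])            # 2
```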
3. Create a Series from a dict
d = {'a':1,'b':2,'c':3,'d':4,'e':5}
df = pd.Series(d)
df
a 1
b 2
c 3
d 4
e 5
dtype: int64
4. Create a DataFrame from a NumPy array
class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
Parameters
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order.
Changed in version 0.25.0: If data is a list of dicts, column order follows insertion-order.
index : Index or array-like
Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.
columns : Index or array-like
Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided.
dtype : dtype, default None
Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input.
dates = pd.date_range('today',periods = 6) # a date sequence to use as the index
num_arr = np.random.randn(6,4) # a 6-row by 4-column NumPy array of random numbers
columns = ['A','B','C','D'] # a list to use as the column names
df1 = pd.DataFrame(num_arr, index = dates, columns = columns)
df1
| | A | B | C | D |
---|---|---|---|---|
2021-04-17 23:03:54.397660 | -1.520920 | 0.092495 | -0.487495 | -0.466914 |
2021-04-18 23:03:54.397660 | -0.289537 | 0.108166 | 0.192073 | -0.013956 |
2021-04-19 23:03:54.397660 | 0.693032 | -0.445103 | -0.425715 | 0.944692 |
2021-04-20 23:03:54.397660 | 0.403142 | 0.062311 | 0.544867 | 0.554797 |
2021-04-21 23:03:54.397660 | 1.535514 | -0.539361 | -0.096690 | 0.197693 |
2021-04-22 23:03:54.397660 | -0.676061 | 0.951746 | 0.312777 | 0.724948 |
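5. Create a DataFrame from a CSV file, using a semicolon (;) as the separator and gbk encoding
# df = pd.read_csv('test.csv', encoding='gbk', sep=';') # uncomment once a local test.csv exists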
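Since `test.csv` is not provided, here is a minimal round-trip sketch that can be run as-is: it writes a small gbk-encoded, semicolon-separated file to a temporary directory and reads it back. The sample rows and the names `sample`, `csv_path` and `loaded` are assumptions made for this illustration only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data; the original test.csv is not available
sample = pd.DataFrame({'name': ['张三', '李四'], 'score': [90, 85]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'test.csv')
    # Write with the same separator and encoding that will be used for reading
    sample.to_csv(csv_path, sep=';', encoding='gbk', index=False)
    loaded = pd.read_csv(csv_path, sep=';', encoding='gbk')

print(loaded)
```

If `sep` or `encoding` does not match how the file was written, the columns are typically not split or the text fails to decode, so both arguments should mirror the file's actual format.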
6. Create a DataFrame from the dict object data, using labels as the index
import numpy as np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index = labels)
df
| | animal | age | visits | priority |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
d | dog | NaN | 3 | yes |
e | dog | 5.0 | 2 | no |
f | cat | 2.0 | 3 | no |
g | snake | 4.5 | 1 | no |
h | cat | NaN | 1 | yes |
i | dog | 7.0 | 2 | no |
j | dog | 3.0 | 1 | no |
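7. Show basic information about the DataFrame, including the number of rows, the column names, and the non-null count and dtype of each column
df.info() # info is a method and must be called; a bare df.info only returns the bound method
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   animal    10 non-null     object
 1   age       8 non-null      float64
 2   visits    10 non-null     int64
 3   priority  10 non-null     object
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes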
df.shape
(10, 4)
df.describe()
| | age | visits |
---|---|---|
count | 8.000000 | 10.000000 |
mean | 3.437500 | 1.900000 |
std | 2.007797 | 0.875595 |
min | 0.500000 | 1.000000 |
25% | 2.375000 | 1.000000 |
50% | 3.000000 | 2.000000 |
75% | 4.625000 | 2.750000 |
max | 7.000000 | 3.000000 |
8. Show the first 3 rows of df
df.iloc[:3] # df.head(3) gives the same result
| | animal | age | visits | priority |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
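9. Select the animal and age columns of df
Using loc: df.loc[rows, columns], where rows and columns can each be a single label, a list of labels, or a slice (a list is used here)
df.loc[:,['animal','age']]
| | animal | age |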
---|---|---|
a | cat | 2.5 |
b | cat | 3.0 |
c | snake | 0.5 |
d | dog | NaN |
e | dog | 5.0 |
f | cat | 2.0 |
g | snake | 4.5 |
h | cat | NaN |
i | dog | 7.0 |
j | dog | 3.0 |
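To make the `loc[rows, columns]` pattern above more concrete, the following sketch contrasts label-based `loc` with position-based `iloc` on a tiny frame. The frame `df_demo` and its values are made up for this example and are not the exercise data.

```python
import pandas as pd

# Hypothetical three-row frame used only for this comparison
df_demo = pd.DataFrame(
    {'animal': ['cat', 'snake', 'dog'], 'age': [2.5, 0.5, 5.0], 'visits': [1, 2, 3]},
    index=['a', 'b', 'c'],
)

# loc works on labels; a label slice includes its end label ('b')
by_label = df_demo.loc['a':'b', ['animal', 'age']]

# iloc works on integer positions; a position slice excludes its end (row 2)
by_position = df_demo.iloc[0:2, [0, 1]]

# A single column label (not a list) returns a Series instead of a DataFrame
single_column = df_demo.loc[:, 'age']

print(by_label.equals(by_position))  # True
print(type(single_column).__name__)  # Series
```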
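10. Select the animal and age columns for the rows at positions [3, 4, 8]
df.loc[df.index[[3,4,8]],['animal','age']] # df.index[[3,4,8]] converts the positions into the labels d, e, i
| | animal | age |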
---|---|---|
d | dog | NaN |
e | dog | 5.0 |
i | dog | 7.0 |
11. Select the rows where age is greater than 3
df[df['age'] > 3]
| | animal | age | visits | priority |
---|---|---|---|---|
e | dog | 5.0 | 2 | no |
g | snake | 4.5 | 1 | no |
i | dog | 7.0 | 2 | no |
12. Select the rows where age is missing
df[df['age'].isnull()]
| | animal | age | visits | priority |
---|---|---|---|---|
d | dog | NaN | 3 | yes |
h | cat | NaN | 1 | yes |
13. Select the rows where age is between 2 and 4 (exclusive)
df[(df['age'] > 2)&(df['age'] < 4)]
| | animal | age | visits | priority |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
j | dog | 3.0 | 1 | no |
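df[df['age'].between(2, 4)] # between() includes both endpoints by default, so row f (age 2.0) also appears; this variant does not meet the exclusive requirement
| | animal | age | visits | priority |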
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
f | cat | 2.0 | 3 | no |
j | dog | 3.0 | 1 | no |
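The two results above differ because `Series.between` includes both endpoints by default. The sketch below shows the difference on a small hypothetical Series (the values and labels are invented for illustration). The strict version is built from `gt` and `lt` rather than from the `inclusive` argument of `between`, because that argument took a boolean in older pandas releases and string values such as `'neither'` from pandas 1.3 onward.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including both endpoints and a missing value
age = pd.Series([2.5, 3.0, 2.0, 4.0, np.nan], index=['a', 'b', 'f', 'x', 'h'])

inclusive_mask = age.between(2, 4)      # 2 <= age <= 4
exclusive_mask = age.gt(2) & age.lt(4)  # 2 < age < 4

print(age[inclusive_mask].index.tolist())  # ['a', 'b', 'f', 'x']
print(age[exclusive_mask].index.tolist())  # ['a', 'b']
```

In both masks the missing value compares as False, so row h is excluded either way.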
14. Change the age in row f to 1.5
df.loc['f','age'] = 1.5
df
| | animal | age | visits | priority |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
d | dog | NaN | 3 | yes |
e | dog | 5.0 | 2 | no |
f | cat | 1.5 | 3 | no |
g | snake | 4.5 | 1 | no |
h | cat | NaN | 1 | yes |
i | dog | 7.0 | 2 | no |
j | dog | 3.0 | 1 | no |
15. Compute the sum of visits
df['visits'].sum()
19
16. Compute the mean age for each kind of animal
df.groupby('animal')['age'].mean()
animal
cat 2.333333
dog 5.000000
snake 2.500000
Name: age, dtype: float64
17. Count how many of each kind of animal are in df
df['animal'].value_counts()
dog 4
cat 4
snake 2
Name: animal, dtype: int64
18. Sort by age in descending order, then by visits in ascending order
df.sort_values(by = ['age','visits'],ascending=[False,True]) # age descending, ties broken by visits ascending
| | animal | age | visits | priority |
---|---|---|---|---|
i | dog | 7.0 | 2 | no |
e | dog | 5.0 | 2 | no |
g | snake | 4.5 | 1 | no |
j | dog | 3.0 | 1 | no |
b | cat | 3.0 | 3 | yes |
a | cat | 2.5 | 1 | yes |
f | cat | 1.5 | 3 | no |
c | snake | 0.5 | 2 | no |
h | cat | NaN | 1 | yes |
d | dog | NaN | 3 | yes |
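In the output above the rows with a missing age end up at the bottom even though age is sorted in descending order. That is the default `na_position='last'` behaviour of `sort_values`. The sketch below, on a made-up four-row frame, shows how `na_position='first'` moves those rows to the top instead.

```python
import numpy as np
import pandas as pd

# Hypothetical subset with one missing age and one tie on age
df_demo = pd.DataFrame(
    {'age': [3.0, np.nan, 7.0, 3.0], 'visits': [3, 1, 2, 1]},
    index=['b', 'h', 'i', 'j'],
)

# Default: NaN rows go last regardless of the sort direction
nan_last = df_demo.sort_values(by=['age', 'visits'], ascending=[False, True])

# na_position='first' puts NaN rows at the top
nan_first = df_demo.sort_values(
    by=['age', 'visits'], ascending=[False, True], na_position='first'
)

print(nan_last.index.tolist())   # ['i', 'j', 'b', 'h']
print(nan_first.index.tolist())  # ['h', 'i', 'j', 'b']
```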
19. Replace yes and no in the priority column with the booleans True and False
df['priority'] = df['priority'].map({'yes':True,'no':False})
df
| | animal | age | visits | priority |
---|---|---|---|---|
a | cat | 2.5 | 1 | True |
b | cat | 3.0 | 3 | True |
c | snake | 0.5 | 2 | False |
d | dog | NaN | 3 | True |
e | dog | 5.0 | 2 | False |
f | cat | 1.5 | 3 | False |
g | snake | 4.5 | 1 | False |
h | cat | NaN | 1 | True |
i | dog | 7.0 | 2 | False |
j | dog | 3.0 | 1 | False |
20. Replace snake with python in the animal column
df['animal'] = df['animal'].replace('snake','python')
df
| | animal | age | visits | priority |
---|---|---|---|---|
a | cat | 2.5 | 1 | True |
b | cat | 3.0 | 3 | True |
c | python | 0.5 | 2 | False |
d | dog | NaN | 3 | True |
e | dog | 5.0 | 2 | False |
f | cat | 1.5 | 3 | False |
g | python | 4.5 | 1 | False |
h | cat | NaN | 1 | True |
i | dog | 7.0 | 2 | False |
j | dog | 3.0 | 1 | False |
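Exercises 19 and 20 use `map` and `replace` respectively, and the two behave differently for values that are not listed. The sketch below illustrates this on a hypothetical Series containing an extra value `'maybe'`. The data and the replacement strings `'Y'` and `'N'` are assumptions for this example only.

```python
import pandas as pd

# Hypothetical column with one value ('maybe') that is not in the mapping
priority = pd.Series(['yes', 'no', 'maybe'])

# map: values missing from the dict become NaN
mapped = priority.map({'yes': True, 'no': False})

# replace: values missing from the dict are left unchanged
replaced = priority.replace({'yes': 'Y', 'no': 'N'})

print(mapped.isna().tolist())  # [False, False, True]
print(replaced.tolist())       # ['Y', 'N', 'maybe']
```

So `map` suits a full recoding where every value is expected to be covered, while `replace` suits changing only a few values, as with snake to python.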
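21. For each kind of animal and each distinct number of visits, compute the mean age; that is, return a table whose rows are the animal kinds, whose columns are the visit counts, and whose values are the mean age for that animal and visit count
# check the data types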
df.dtypes
animal object
age float64
visits int64
priority bool
dtype: object
df.age = df.age.astype(float) # age is already float64 here, so this cast only acts as a safeguard
df.pivot_table(index = 'animal',columns = 'visits',values = 'age',aggfunc = 'mean')
animal (rows) / visits (columns) | 1 | 2 | 3 |
---|---|---|---|
cat | 2.5 | NaN | 2.25 |
dog | 3.0 | 6.0 | NaN |
python | 4.5 | 0.5 | NaN |
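One way to see what `pivot_table` computes here is to rebuild the same table with `groupby` followed by `unstack`. The sketch below does this on a small made-up frame (not the exercise data); `df_demo`, `pivoted` and `grouped` are names chosen for this illustration.

```python
import pandas as pd

# Hypothetical data: two animals, several visit counts
df_demo = pd.DataFrame({
    'animal': ['cat', 'cat', 'cat', 'dog', 'dog', 'dog'],
    'visits': [1, 3, 3, 2, 2, 1],
    'age': [2.5, 3.0, 1.5, 5.0, 7.0, 3.0],
})

# Rows: animal, columns: visits, values: mean age
pivoted = df_demo.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')

# Same idea: group by both keys, average, then move 'visits' into the columns
grouped = df_demo.groupby(['animal', 'visits'])['age'].mean().unstack()

print(pivoted.loc['cat', 3])  # 2.25
print(pivoted.loc['dog', 2])  # 6.0
```

Combinations that never occur, such as cat with 2 visits in this sample, show up as NaN in both tables.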
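22. Insert a new row k into df, then delete it
df.loc['k'] = [5.5,'dog','no',2] # caution: a list is assigned in column order (animal, age, visits, priority), so these values land in the wrong columns, as the output shows; ['dog', 5.5, 2, False] would match the column order. The mismatched types also change how age and priority are displayed (3 instead of 3.0, 1/0 instead of True/False)
df
| | animal | age | visits | priority |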
---|---|---|---|---|
a | cat | 2.5 | 1 | 1 |
b | cat | 3 | 3 | 1 |
c | python | 0.5 | 2 | 0 |
d | dog | NaN | 3 | 1 |
e | dog | 5 | 2 | 0 |
f | cat | 1.5 | 3 | 0 |
g | python | 4.5 | 1 | 0 |
h | cat | NaN | 1 | 1 |
i | dog | 7 | 2 | 0 |
j | dog | 3 | 1 | 0 |
k | 5.5 | dog | no | 2 |
df = df.drop('k')
df
| | animal | age | visits | priority |
---|---|---|---|---|
a | cat | 2.5 | 1 | 1 |
b | cat | 3 | 3 | 1 |
c | python | 0.5 | 2 | 0 |
d | dog | NaN | 3 | 1 |
e | dog | 5 | 2 | 0 |
f | cat | 1.5 | 3 | 0 |
g | python | 4.5 | 1 | 0 |
h | cat | NaN | 1 | 1 |
i | dog | 7 | 2 | 0 |
j | dog | 3 | 1 | 0 |
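Because a plain list is assigned by position, the order of its values has to match the column order exactly. A less error-prone sketch is to assign a `pd.Series` keyed by column name, which pandas aligns to the columns. The frame and the row values below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical two-row frame with the same columns as the exercise
df_demo = pd.DataFrame(
    {'animal': ['cat', 'dog'], 'age': [2.5, 3.0], 'visits': [1, 1], 'priority': [True, False]},
    index=['a', 'j'],
)

# Keyed by column name, so the order of the entries does not matter
new_row = pd.Series({'age': 5.5, 'animal': 'dog', 'priority': False, 'visits': 2})
df_demo.loc['k'] = new_row

# Keep the inserted values for inspection before removing the row again
inserted_animal = df_demo.loc['k', 'animal']
inserted_age = df_demo.loc['k', 'age']
print(inserted_animal, inserted_age)

# drop returns a new frame without row k
df_demo = df_demo.drop('k')
print('k' in df_demo.index)  # False
```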