
pandas_Sample-Foundational

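This post covers the basic part of the 50 pandas exercises, namely questions 1 to 22, which are the easier ones to follow. Later parts will be added in future updates.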

1. Import the pandas library as pd and print its version number

import pandas as pd
import numpy as np
pd.__version__
'1.1.3'

2. Create a Series from a list

arr = [0,1,2,3,4]
df = pd.Series(arr)  # unless specified otherwise, the index defaults to starting from 0
df
0    0
1    1
2    2
3    3
4    4
dtype: int64
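
The default 0-based index can be replaced when the Series is built. A minimal sketch, assuming letter labels that are purely illustrative and not part of the original exercise:

```python
import pandas as pd

arr = [0, 1, 2, 3, 4]

# Passing index= replaces the default 0..4 index with custom labels
# (the letters are an assumption made for this illustration)
s = pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])

by_label = s['c']        # label-based lookup
by_position = s.iloc[2]  # position-based lookup of the same element
```

With custom labels in place, s['c'] and s.iloc[2] refer to the same element.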

3. Create a Series from a dictionary

d = {'a':1,'b':2,'c':3,'d':4,'e':5}
df = pd.Series(d)
df
a    1
b    2
c    3
d    4
e    5
dtype: int64

4. Create a DataFrame from a NumPy array

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

Parameters
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order.

    Changed in version 0.25.0: If data is a list of dicts, column order follows insertion-order.

index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.

columns : Index or array-like
    Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided.

dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.

copy : bool, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input.

dates = pd.date_range('today', periods=6)  # a date range to use as the index
num_arr = np.random.randn(6, 4)  # a 6-row × 4-column array of random numbers from NumPy
columns = ['A', 'B', 'C', 'D']  # a list to use as the column names
df1 = pd.DataFrame(num_arr, index = dates, columns = columns)
df1
                                   A         B         C         D
2021-04-17 23:03:54.397660 -1.520920  0.092495 -0.487495 -0.466914
2021-04-18 23:03:54.397660 -0.289537  0.108166  0.192073 -0.013956
2021-04-19 23:03:54.397660  0.693032 -0.445103 -0.425715  0.944692
2021-04-20 23:03:54.397660  0.403142  0.062311  0.544867  0.554797
2021-04-21 23:03:54.397660  1.535514 -0.539361 -0.096690  0.197693
2021-04-22 23:03:54.397660 -0.676061  0.951746  0.312777  0.724948
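
Because the example uses 'today' and unseeded random numbers, every run prints different values. A reproducible variant might look like the sketch below. The fixed start date and the seed 0 are assumptions chosen for illustration, and the values it generates will not match the table above.

```python
import numpy as np
import pandas as pd

# A fixed start date (assumption) instead of 'today', so the index is stable
dates = pd.date_range('2021-04-17', periods=6)

# A seeded generator (seed 0 is arbitrary) so the random values are repeatable
rng = np.random.default_rng(0)
num_arr = rng.standard_normal((6, 4))  # 6 rows x 4 columns

df1 = pd.DataFrame(num_arr, index=dates, columns=['A', 'B', 'C', 'D'])
```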

5. Create a DataFrame from a CSV file, with ; as the separator and gbk as the encoding

# df = pd.read_csv('test.csv', encoding='gbk', sep=';')
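
The call above is commented out because it needs a test.csv file on disk. To try the same read_csv arguments without a file, one option is an in-memory buffer. In this sketch the file contents are hypothetical and only stand in for a gbk-encoded, semicolon-separated file:

```python
import io
import pandas as pd

# Hypothetical file contents: ';' as the separator, with Chinese text
text = "name;age\n小明;3\n小红;5\n"

# Encode to gbk bytes so the buffer behaves like a gbk-encoded file on disk
raw = io.BytesIO(text.encode('gbk'))

df_csv = pd.read_csv(raw, encoding='gbk', sep=';')
```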

6. Create a DataFrame from the dictionary data, using labels as the index

import numpy as np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index = labels)
df
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no

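7. Show basic information about the DataFrame, including the number of rows, the column names, and the non-null count and dtype of each column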

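df.info  # note: without parentheses this only echoes the bound method; call df.info() to print the row count, column names, non-null counts and dtypes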
<bound method DataFrame.info of   animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no>
df.shape
(10, 4)
df.describe()
            age     visits
count  8.000000  10.000000
mean   3.437500   1.900000
std    2.007797   0.875595
min    0.500000   1.000000
25%    2.375000   1.000000
50%    3.000000   2.000000
75%    4.625000   2.750000
max    7.000000   3.000000
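
Writing df.info without parentheses only evaluates to the bound method, which is why the output above begins with "<bound method". The summary the exercise asks for comes from calling df.info(). A minimal sketch using the same data, capturing the report in a string buffer so it can be inspected (the buffer is an illustrative choice, since df.info() with no arguments simply prints):

```python
import io
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)

buf = io.StringIO()
df.info(buf=buf)          # info() writes its report to buf instead of stdout
summary = buf.getvalue()
print(summary)            # column names, non-null counts, dtypes
```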

8. Show the first 3 rows of df

df.iloc[:3]
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no

9. Select the animal and age columns of df

Using loc: loc[rows, columns], where rows and columns can each be a list of labels (or a slice such as : to take everything)

df.loc[:,['animal','age']]
  animal  age
a    cat  2.5
b    cat  3.0
c  snake  0.5
d    dog  NaN
e    dog  5.0
f    cat  2.0
g  snake  4.5
h    cat  NaN
i    dog  7.0
j    dog  3.0

10. Select the animal and age columns for the rows at positions [3, 4, 8]

df.loc[df.index[[3,4,8]],['animal','age']]
  animal  age
d    dog  NaN
e    dog  5.0
i    dog  7.0
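
The solution above turns row positions into labels with df.index[...] so that loc can be used. The same selection can also be written purely by position with iloc. The sketch below is one possible alternative rather than the exercise's own answer: it converts the column names to positions with Index.get_indexer.

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

# Label-based: positions -> labels, then loc
via_loc = df.loc[df.index[[3, 4, 8]], ['animal', 'age']]

# Position-based: column names -> positions, then iloc
col_pos = df.columns.get_indexer(['animal', 'age'])
via_iloc = df.iloc[[3, 4, 8], col_pos]
```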

11. Select the rows where age is greater than 3

df[df['age'] > 3]
  animal  age  visits priority
e    dog  5.0       2       no
g  snake  4.5       1       no
i    dog  7.0       2       no

12. Select the rows where age is missing

df[df['age'].isnull()]
  animal  age  visits priority
d    dog  NaN       3      yes
h    cat  NaN       1      yes

13. Select the rows where age is between 2 and 4 (exclusive)

df[(df['age'] > 2)&(df['age'] < 4)]
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
j    dog  3.0       1       no
df[df['age'].between(2, 4)]  # note: between() includes both endpoints by default, so row f (age 2.0) also appears
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
f    cat  2.0       3       no
j    dog  3.0       1       no
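
The two results above differ by one row because between(2, 4) keeps the endpoints, while the explicit comparison does not. A small sketch on the age column alone (a standalone Series rebuilt here for illustration) makes the difference visible:

```python
import numpy as np
import pandas as pd

age = pd.Series([2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
                index=list('abcdefghij'), name='age')

inclusive_mask = age.between(2, 4)   # behaves like (age >= 2) & (age <= 4)
strict_mask = (age > 2) & (age < 4)  # endpoints excluded, as the exercise asks

# Rows kept by between() but rejected by the strict comparison
only_in_between = age[inclusive_mask & ~strict_mask]
```

Here only_in_between holds just row f, whose age is exactly 2.0. Missing ages compare as False in both masks, so rows d and h never appear.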

14. Change the age in row f to 1.5

df.loc['f','age'] = 1.5
df
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  1.5       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no

15. Compute the sum of visits

df['visits'].sum()
19

16. Compute the mean age for each type of animal

df.groupby('animal')['age'].mean()
animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64
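
A group mean silently skips missing ages, so it can help to see how many values each mean is based on. One way, sketched here on just the animal and age columns (with row f already set to 1.5), is to pass several function names to agg:

```python
import numpy as np
import pandas as pd

# Same animal/age values as the exercise, after row f was changed to 1.5
df = pd.DataFrame(
    {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
     'age': [2.5, 3, 0.5, np.nan, 5, 1.5, 4.5, np.nan, 7, 3]},
    index=list('abcdefghij'))

# agg() accepts a list of function names; both 'mean' and 'count' ignore NaN
stats = df.groupby('animal')['age'].agg(['mean', 'count'])
```

There are four cats and four dogs, but each group has a count of 3, because one age is missing in each.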

17. Count the number of each type of animal in df

df['animal'].value_counts()
dog      4
cat      4
snake    2
Name: animal, dtype: int64

18. Sort by age in descending order, then by visits in ascending order

df.sort_values(by=['age', 'visits'], ascending=[False, True])  # age descending, then visits ascending
  animal  age  visits priority
i    dog  7.0       2       no
e    dog  5.0       2       no
g  snake  4.5       1       no
j    dog  3.0       1       no
b    cat  3.0       3      yes
a    cat  2.5       1      yes
f    cat  1.5       3       no
c  snake  0.5       2       no
h    cat  NaN       1      yes
d    dog  NaN       3      yes
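
In the result above the rows with a missing age end up last, even though age is sorted in descending order. That placement is controlled by the na_position parameter of sort_values. A short sketch, using the age and visits columns only, that moves the missing rows to the top instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'age': [2.5, 3, 0.5, np.nan, 5, 1.5, 4.5, np.nan, 7, 3],
     'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1]},
    index=list('abcdefghij'))

# Same sort keys as the exercise, but NaN ages are placed first
nan_first = df.sort_values(by=['age', 'visits'],
                           ascending=[False, True],
                           na_position='first')
order = list(nan_first.index)
```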

19. Replace yes and no in the priority column with the booleans True and False

df['priority'] = df['priority'].map({'yes':True,'no':False})
df
  animal  age  visits  priority
a    cat  2.5       1      True
b    cat  3.0       3      True
c  snake  0.5       2     False
d    dog  NaN       3      True
e    dog  5.0       2     False
f    cat  1.5       3     False
g  snake  4.5       1     False
h    cat  NaN       1      True
i    dog  7.0       2     False
j    dog  3.0       1     False
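
One thing to keep in mind with map and a dictionary: any value that has no key in the dictionary becomes NaN rather than being left unchanged. The sketch below shows this on a small standalone Series. The value 'maybe' is hypothetical and does not occur in the exercise data.

```python
import pandas as pd

# 'maybe' is a made-up value with no entry in the mapping
priority = pd.Series(['yes', 'no', 'maybe'])

mapped = priority.map({'yes': True, 'no': False})
missing_after_map = mapped.isna().tolist()
```

Checking mapped.isna() after such a conversion is a cheap way to catch unexpected values in the column.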

20. Replace snake with python in the animal column

df['animal'] = df['animal'].replace('snake','python')
df
   animal  age  visits  priority
a     cat  2.5       1      True
b     cat  3.0       3      True
c  python  0.5       2     False
d     dog  NaN       3      True
e     dog  5.0       2     False
f     cat  1.5       3     False
g  python  4.5       1     False
h     cat  NaN       1      True
i     dog  7.0       2     False
j     dog  3.0       1     False

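21. For each type of animal and each distinct number of visits, compute the mean age. That is, return a table whose rows are the animal types, whose columns are the visit counts, and whose values are the mean age for that animal type and visit count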

# check the column dtypes
df.dtypes
animal       object
age         float64
visits        int64
priority       bool
dtype: object
df.age = df.age.astype(float)
df.pivot_table(index = 'animal',columns = 'visits',values = 'age',aggfunc = 'mean')
visits    1    2     3
animal
cat     2.5  NaN  2.25
dog     3.0  6.0   NaN
python  4.5  0.5   NaN
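
It can help to read the pivot table as a group-by followed by a reshape. The sketch below builds a comparable table with groupby and unstack on the same animal, age and visits values. It is offered as an illustration of the idea, not as a description of how pivot_table is implemented.

```python
import numpy as np
import pandas as pd

# Data as it stands at this point: snake -> python, row f age = 1.5
df = pd.DataFrame(
    {'animal': ['cat', 'cat', 'python', 'dog', 'dog', 'cat', 'python', 'cat', 'dog', 'dog'],
     'age': [2.5, 3, 0.5, np.nan, 5, 1.5, 4.5, np.nan, 7, 3],
     'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1]},
    index=list('abcdefghij'))

pivot = df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')

# Mean age per (animal, visits) pair, then move the visits level into columns
alt = df.groupby(['animal', 'visits'])['age'].mean().unstack()
```

A cell is NaN either because the combination never occurs (cat with 2 visits) or because every age in that group is missing (dog with 3 visits).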

22. Insert a new row k into df, then delete that row

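df.loc['k'] = ['dog', 5.5, 2, False]  # values must follow the column order: animal, age, visits, priority
df
   animal  age  visits  priority
a     cat  2.5       1      True
b     cat  3.0       3      True
c  python  0.5       2     False
d     dog  NaN       3      True
e     dog  5.0       2     False
f     cat  1.5       3     False
g  python  4.5       1     False
h     cat  NaN       1      True
i     dog  7.0       2     False
j     dog  3.0       1     False
k     dog  5.5       2     False
df = df.drop('k')
df
   animal  age  visits  priority
a     cat  2.5       1      True
b     cat  3.0       3      True
c  python  0.5       2     False
d     dog  NaN       3      True
e     dog  5.0       2     False
f     cat  1.5       3     False
g  python  4.5       1     False
h     cat  NaN       1      True
i     dog  7.0       2     False
j     dog  3.0       1     False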
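
Assigning a plain list through loc fills the columns by position, so the values have to be given in the order animal, age, visits, priority, and a value in the wrong slot can change a column's dtype. An alternative that names every column explicitly is to build a one-row DataFrame and concatenate it. The sketch below uses a shortened three-row frame for brevity and shows one possible approach, not the exercise's own answer:

```python
import pandas as pd

# Shortened stand-in for df (first three rows only)
df = pd.DataFrame(
    {'animal': ['cat', 'cat', 'python'],
     'age': [2.5, 3.0, 0.5],
     'visits': [1, 3, 2],
     'priority': [True, True, False]},
    index=['a', 'b', 'c'])

# Each value is tied to its column by name, so order cannot be mixed up
new_row = pd.DataFrame(
    {'animal': ['dog'], 'age': [5.5], 'visits': [2], 'priority': [False]},
    index=['k'])

extended = pd.concat([df, new_row])  # df plus row k
restored = extended.drop('k')        # row k removed again
```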