SC.Pandas 02 | How to Compute and Summarize Earth Science Data with Pandas

Introduction

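In the previous installment, we covered Pandas' basic data structures and indexing. Building on how Pandas organizes data and how we select the data we need from it, this installment introduces how to use Pandas for computation and statistics.

The importance of these topics goes without saying, so let's get right to it!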

Computing with Pandas

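Pandas is built on NumPy, so its computational core follows NumPy's. Once a block of data has been selected by indexing, operations on it are vectorized by default, which makes computation fast.

Most NumPy functions can also be applied directly to Pandas objects.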
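As a minimal sketch of what "vectorized" means here (the Series name `temp_c` and its values are made up for illustration), the same arithmetic is written once for the whole column, and a NumPy function is applied to the Series directly:

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures in Celsius, for illustration only
temp_c = pd.Series([22.0, 23.0, 24.0, 25.0], name='temp_c')

# Vectorized: one expression operates on every element at once
temp_k = temp_c + 273.15

# Equivalent explicit loop, shown only for comparison
temp_k_loop = pd.Series([t + 273.15 for t in temp_c], name='temp_c')

# NumPy ufuncs accept a Series and return a Series with the same index
temp_sqrt = np.sqrt(temp_c)

print(temp_k.equals(temp_k_loop))
print(type(temp_sqrt).__name__)
```

The loop gives the same values, but the vectorized form is shorter and avoids per-element Python overhead.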

Below, we build a pseudo meteorological dataset to demonstrate Pandas' computing features.

import pandas as pd
import numpy as np

# Create a pseudo meteorological dataset
df = pd.DataFrame({'temperature': [22, 23, 24, 25], 
                   'humidity': [60, 65, 70, 75], 
                   'pressure': [1013, 1015, 1017, 1019], 
                   'wind': [5, 10, 15, 20],
                   'precipitation': [0.5, 0.7, 0.9, 1.1],}, 
                   index=['Moscow', 'Boston', 'Rome', 'Tokyo'])
                   
print(df)

        temperature  humidity  pressure  wind  precipitation
Moscow           22        60      1013     5            0.5
Boston           23        65      1015    10            0.7
Rome             24        70      1017    15            0.9
Tokyo            25        75      1019    20            1.1

# Simple arithmetic on an indexed column, e.g. for unit conversion
print(df['temperature'] + 273.15, '\n')       # Celsius to Kelvin
print(df['temperature'] * 9/5 + 32, '\n')     # Celsius to Fahrenheit
print(df['precipitation'] / 86400, '\n')      # mm/day to mm/s
print(df['wind'] * 1.60934, '\n')             # mph to km/h
print(df['pressure'] * 0.750062)              # hPa to mmHg

Moscow    295.15
Boston    296.15
Rome      297.15
Tokyo     298.15
Name: temperature, dtype: float64 

Moscow    71.6
Boston    73.4
Rome      75.2
Tokyo     77.0
Name: temperature, dtype: float64 

Moscow    0.000006
Boston    0.000008
Rome      0.000010
Tokyo     0.000013
Name: precipitation, dtype: float64 

Moscow     8.0467
Boston    16.0934
Rome      24.1401
Tokyo     32.1868
Name: wind, dtype: float64 

Moscow    759.812806
Boston    761.312930
Rome      762.813054
Tokyo     764.313178
Name: pressure, dtype: float64
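A few other conversions that come up often can be sketched the same way. This assumes precipitation in mm/day, wind in m/s and pressure in hPa; the frame `obs`, the constant names and the output column names are illustrative, not part of the original example:

```python
import pandas as pd

# Hypothetical station values; units assumed: mm/day, m/s, hPa
obs = pd.DataFrame({'precipitation': [0.5, 0.7, 0.9, 1.1],
                    'wind': [5, 10, 15, 20],
                    'pressure': [1013, 1015, 1017, 1019]},
                   index=['Moscow', 'Boston', 'Rome', 'Tokyo'])

# Conversion factors
HOURS_PER_DAY = 24
MPH_PER_MS = 3600 / 1609.344        # 1 mile = 1609.344 m
INHG_PER_HPA = 1 / 33.8639          # 1 inHg is about 33.8639 hPa

converted = pd.DataFrame({
    'precip_mm_per_hour': obs['precipitation'] / HOURS_PER_DAY,
    'wind_mph': obs['wind'] * MPH_PER_MS,
    'pressure_inHg': obs['pressure'] * INHG_PER_HPA,
})
print(converted.round(4))
```

Keeping the factors as named constants makes it easier to check each one against its definition.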

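# Besides column arithmetic, row arithmetic is also supported. Arithmetic across rows is not
# meaningful for this dataset, so we transpose the DataFrame first and then operate on rows
df0 = df.T
print(df0, '\n')

print(df0.loc['temperature'] + 273.15, '\n')
print(df0.loc['precipitation', 'Tokyo'] * 30, '\n')               # compute on a specific row and column

print(df0['Moscow'] - df0['Tokyo'], '\n')                         # arithmetic between different columns also works

# Likewise, we can compute on a slice at any position, although it has no practical meaning here
print(df0.loc['humidity':'wind', ['Boston', 'Tokyo']] + 10000)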

               Moscow  Boston    Rome   Tokyo
temperature      22.0    23.0    24.0    25.0
humidity         60.0    65.0    70.0    75.0
pressure       1013.0  1015.0  1017.0  1019.0
wind              5.0    10.0    15.0    20.0
precipitation     0.5     0.7     0.9     1.1 

Moscow    295.15
Boston    296.15
Rome      297.15
Tokyo     298.15
Name: temperature, dtype: float64 

33.0 

temperature      -3.0
humidity        -15.0
pressure         -6.0
wind            -15.0
precipitation    -0.6
dtype: float64 

           Boston    Tokyo
humidity  10065.0  10075.0
pressure  11015.0  11019.0
wind      10010.0  10020.0

# Likewise, common NumPy functions work on Pandas objects
df = pd.DataFrame({'lon': [-105, -90, 0, 10, 120], 'lat': [30, 40, 50, 60, 70]})
print(df, '\n')

# Convert longitude and latitude to radians
df['lon_rad'] = np.deg2rad(df['lon'])
df['lat_rad'] = np.deg2rad(df['lat'])
print(df, '\n')

# Length of each circle of latitude in km, using an Earth radius of 6371 km
df['lat_len'] = 2 * np.pi * 6371 * np.cos(df['lat_rad'])
print(df)

   lon  lat
0 -105   30
1  -90   40
2    0   50
3   10   60
4  120   70 

   lon  lat   lon_rad   lat_rad
0 -105   30 -1.832596  0.523599
1  -90   40 -1.570796  0.698132
2    0   50  0.000000  0.872665
3   10   60  0.174533  1.047198
4  120   70  2.094395  1.221730 

   lon  lat   lon_rad   lat_rad       lat_len
0 -105   30 -1.832596  0.523599  34667.147249
1  -90   40 -1.570796  0.698132  30664.892037
2    0   50  0.000000  0.872665  25730.899599
3   10   60  0.174533  1.047198  20015.086796
4  120   70  2.094395  1.221730  13691.125709

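By combining Pandas with arithmetic operators and functions, we can easily reproduce many operations that are usually done in Excel.

One thing to note: because Pandas objects carry row and column labels, an operation between two Pandas objects aligns them on those labels automatically. Any label present in only one of the operands yields missing values (NaN) in the result.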

df0 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df1 = pd.DataFrame({'A': [10, 20, 30], 'C': [40, 50, 60]}, index=['a', 'b', 'd'])

print(df0, '\n', df1, '\n', df0 + df1)

   A  B
a  1  4
b  2  5
c  3  6 
     A   C
a  10  40
b  20  50
d  30  60 
       A   B   C
a  11.0 NaN NaN
b  22.0 NaN NaN
c   NaN NaN NaN
d   NaN NaN NaN
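If the NaN values produced by alignment are not wanted, one option is the `fill_value` argument of `DataFrame.add`. The sketch below reuses the two frames from the example; the result name `total` is illustrative:

```python
import pandas as pd

df0 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df1 = pd.DataFrame({'A': [10, 20, 30], 'C': [40, 50, 60]}, index=['a', 'b', 'd'])

# Treat a value missing on one side as 0 before adding;
# cells missing from BOTH frames still come out as NaN
total = df0.add(df1, fill_value=0)
print(total)
```

Whether 0 is a sensible stand-in depends on the variable, so this is a choice to make deliberately rather than a default.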

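# But when we take data from another DataFrame and assign it into a DataFrame,
# only the rows whose index exists in the target are kept
df0['C'] = df1['C']
print(df0, '\n')            # no row with index 'd' in the result

# Assigning a row works similarly: columns that do not exist in the target are not created
df0 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df1 = pd.DataFrame({'A': [10, 20, 30], 'C': [40, 50, 60]}, index=['a', 'b', 'd'])

df0.loc['d', :] = df1.loc['d', :]
print(df0)            # no column 'C' in the result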

   A  B     C
a  1  4  40.0
b  2  5  50.0
c  3  6   NaN 

      A    B
a   1.0  4.0
b   2.0  5.0
c   3.0  6.0
d  30.0  NaN

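In addition, Pandas' built-in arithmetic methods take an axis parameter that specifies whether a list of values is matched against rows or columns, so each row or each column can be given its own value.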

df = pd.DataFrame({'temperature': [22, 23, 24, 25], 
                   'humidity': [60, 65, 70, 75], 
                   'pressure': [1013, 1015, 1017, 1019], 
                   'wind': [5, 10, 15, 20],
                   'precipitation': [0.5, 0.7, 0.9, 1.1],}, 
                   index=['Moscow', 'Boston', 'Rome', 'Tokyo'])

print(df, '\n')

print(df.add([1000, 2000, 3000, 4000, 5000], axis=1), '\n')             # add a different value to each column
print(df.sub([1000, 2000, 3000, 4000, 5000]), '\n')                     # axis=1 is the default, so it can be omitted
print(df.mul([10, 20, 30, 40], axis=0), '\n')                           # multiply each row by a different value
print(df.div([1000, 2000, 3000, 4000], axis=0), '\n')                   # divide each row by a different value
print(df.pow([.2, .3, .4, .5], axis=0))                                 # raise each row to a different power

        temperature  humidity  pressure  wind  precipitation
Moscow           22        60      1013     5            0.5
Boston           23        65      1015    10            0.7
Rome             24        70      1017    15            0.9
Tokyo            25        75      1019    20            1.1 

        temperature  humidity  pressure  wind  precipitation
Moscow         1022      2060      4013  4005         5000.5
Boston         1023      2065      4015  4010         5000.7
Rome           1024      2070      4017  4015         5000.9
Tokyo          1025      2075      4019  4020         5001.1 

        temperature  humidity  pressure  wind  precipitation
Moscow         -978     -1940     -1987 -3995        -4999.5
Boston         -977     -1935     -1985 -3990        -4999.3
Rome           -976     -1930     -1983 -3985        -4999.1
Tokyo          -975     -1925     -1981 -3980        -4998.9 

        temperature  humidity  pressure  wind  precipitation
Moscow          220       600     10130    50            5.0
Boston          460      1300     20300   200           14.0
Rome            720      2100     30510   450           27.0
Tokyo          1000      3000     40760   800           44.0 

        temperature  humidity  pressure   wind  precipitation
Moscow      0.02200  0.060000   1.01300  0.005       0.000500
Boston      0.01150  0.032500   0.50750  0.005       0.000350
Rome        0.00800  0.023333   0.33900  0.005       0.000300
Tokyo       0.00625  0.018750   0.25475  0.005       0.000275 

        temperature  humidity   pressure      wind  precipitation
Moscow     1.855601  2.267933   3.991369  1.379730       0.870551
Boston     2.561642  3.498437   7.978841  1.995262       0.898523
Rome       3.565205  5.470654  15.956160  2.954177       0.958732
Tokyo      5.000000  8.660254  31.921779  4.472136       1.048809
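The same methods also accept a Series, in which case the Series index is matched against the labels of the chosen axis. A minimal sketch on a two-column subset of the dataset (the names `col_mean`, `anomaly`, `offset` and `shifted`, and the offset values, are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'temperature': [22, 23, 24, 25],
                   'humidity': [60, 65, 70, 75]},
                  index=['Moscow', 'Boston', 'Rome', 'Tokyo'])

# Column means: a Series indexed by column name
col_mean = df.mean()
# axis=1 ('columns') matches the Series index against column labels
anomaly = df.sub(col_mean, axis=1)

# Per-city offsets: a Series indexed by row label (hypothetical values)
offset = pd.Series({'Moscow': 1, 'Boston': 2, 'Rome': 3, 'Tokyo': 4})
# axis=0 ('index') matches the Series index against row labels
shifted = df.add(offset, axis=0)

print(anomaly)
print(shifted)
```

Subtracting the column mean this way is a common first step when computing anomalies.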

Statistics with Pandas

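Statistics is one of the main reasons tables exist. Summarizing large volumes of data with measures such as the mean, variance and median helps us understand the data in depth and uncover the patterns in it.

Pandas provides a rich set of statistical functions that let us quickly compute summary statistics for a dataset.

Again, we start from a pseudo meteorological dataset: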

df = pd.DataFrame({'temperature': [-12, 23, 34, 30], 
                   'humidity': [90, 85, 60, 75], 
                   'pressure': [998, 1005, 1017, 1019], 
                   'wind': [10, 8, 2, 0.5],
                   'precipitation': [2, 1.7, 0, 0.1],}, 
                   index=['Stockholm', 'Vienna', 'Barcelona', 'San Francisco'])

# A single function gives an overview of the dataset
print(df.describe(), '\n')

# Individual statistics of interest can also be requested
print(df['temperature'].mean())                         # mean
print(df['humidity'].median())                          # median
print(df['pressure'].max())                             # maximum
print(df['wind'].min())                                 # minimum
print(df['precipitation'].std())                        # standard deviation
print(df['temperature'].quantile([0.9]))                # 90th percentile
print(df['humidity'].sum())                             # sum (not meaningful here)

print(df.corr())                                        # correlation coefficient r

       temperature   humidity     pressure      wind  precipitation
count     4.000000   4.000000     4.000000   4.00000       4.000000
mean     18.750000  77.500000  1009.750000   5.12500       0.950000
std      20.998016  13.228757     9.979145   4.58939       1.047219
min     -12.000000  60.000000   998.000000   0.50000       0.000000
25%      14.250000  71.250000  1003.250000   1.62500       0.075000
50%      26.500000  80.000000  1011.000000   5.00000       0.900000
75%      31.000000  86.250000  1017.500000   8.50000       1.775000
max      34.000000  90.000000  1019.000000  10.00000       2.000000 

18.75
80.0
1019
0.5
1.0472185381603338
0.9    32.8
Name: temperature, dtype: float64
310
               temperature  humidity  pressure      wind  precipitation
temperature       1.000000 -0.777000  0.884070 -0.821071      -0.805687
humidity         -0.777000  1.000000 -0.839572  0.816698       0.902306
pressure          0.884070 -0.839572  1.000000 -0.992579      -0.977639
wind             -0.821071  0.816698 -0.992579  1.000000       0.983127
precipitation    -0.805687  0.902306 -0.977639  0.983127       1.000000
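One detail worth checking when comparing these numbers with NumPy: pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, while `np.std` defaults to the population form (`ddof=0`). A minimal sketch using the precipitation values from the dataset above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

precip = pd.Series([2, 1.7, 0, 0.1])

sample_std = precip.std()               # pandas default: ddof=1 (sample)
population_std = precip.std(ddof=0)     # same convention as NumPy's default
numpy_std = np.std(precip.to_numpy())   # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```

With only four values the two conventions differ noticeably, so it helps to state which one a reported value uses.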

When we have a large number of records, we can use the groupby() function to split the data into groups and aggregate each group.

# Generate random population and GDP data for a set of countries
country=['United States', 'United Kingdom', 'Switzerland', 'Finland', 'Russia']

df = pd.DataFrame({'Country': [country[x] for x in np.random.randint(0,len(country),2000)],
                   'Population':np.random.randint(1, 100, 2000),
                   'GDP':np.random.randint(1000, 10000, 2000),
                   'Season': np.random.choice(['Spring', 'Summer', 'Fall', 'Winter'], 2000),
                   })

print(df)

             Country  Population   GDP  Season
0        Switzerland           3  4883  Summer
1     United Kingdom          59  3654  Winter
2             Russia          14  3451  Summer
3            Finland          55  5472  Summer
4            Finland          85  8517  Summer
...              ...         ...   ...     ...
1995   United States          26  2834  Spring
1996  United Kingdom          64  8023    Fall
1997   United States          87  7304  Spring
1998     Switzerland          87  6711    Fall
1999          Russia          65  9695  Spring

[2000 rows x 4 columns]

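# numeric_only=True skips the text column 'Season'; recent pandas versions raise an error without it
print(df.groupby('Country').mean(numeric_only=True))    # mean Population and GDP per country, over all seasons
print(df.groupby('Country').median(numeric_only=True))  # median
print(df.groupby('Country').min())                      # minimum
print(df.groupby('Country').max())                      # maximum
print(df.groupby('Season').count())                     # number of records per season
print(df.groupby(['Season', 'Country']).count())        # grouping by several keys at once is supported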

                Population          GDP
Country                                
Finland          50.799007  5500.885856
Russia           50.808824  5407.105392
Switzerland      49.598958  5706.822917
United Kingdom   49.260759  5486.015190
United States    51.509756  5458.958537
                Population     GDP
Country                           
Finland               51.0  5448.0
Russia                51.0  5480.0
Switzerland           49.5  5743.0
United Kingdom        46.0  5366.0
United States         52.0  5435.0
                Population   GDP Season
Country                                
Finland                  1  1033   Fall
Russia                   1  1000   Fall
Switzerland              1  1002   Fall
United Kingdom           1  1025   Fall
United States            1  1016   Fall
                Population   GDP  Season
Country                                 
Finland                 99  9948  Winter
Russia                  99  9989  Winter
Switzerland             99  9953  Winter
United Kingdom          99  9965  Winter
United States           99  9952  Winter
        Country  Population  GDP
Season                          
Fall        488         488  488
Spring      510         510  510
Summer      483         483  483
Winter      519         519  519
                       Population  GDP
Season Country                        
Fall   Finland                109  109
       Russia                  92   92
       Switzerland             93   93
       United Kingdom         105  105
       United States           89   89
Spring Finland                107  107
       Russia                 118  118
       Switzerland             97   97
       United Kingdom          87   87
       United States          101  101
Summer Finland                 88   88
       Russia                  91   91
       Switzerland             94   94
       United Kingdom          96   96
       United States          114  114
Winter Finland                 99   99
       Russia                 107  107
       Switzerland            100  100
       United Kingdom         107  107
       United States          106  106
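When several statistics are needed at once, `agg` can compute them in one call instead of repeating `groupby`. The sketch below uses a smaller, seeded random dataset so the run is repeatable; the sample size, the seed and the output names `summary`, `by_season`, `mean_gdp` and `n_rows` are illustrative choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n = 200
df = pd.DataFrame({
    'Country': rng.choice(['Finland', 'Russia', 'Switzerland'], n),
    'Season': rng.choice(['Spring', 'Summer', 'Fall', 'Winter'], n),
    'GDP': rng.integers(1000, 10000, n),
})

# Several statistics for one column in a single call
summary = df.groupby('Country')['GDP'].agg(['mean', 'median', 'min', 'max'])
print(summary)

# Named aggregation: choose the output column names explicitly
by_season = df.groupby('Season').agg(mean_gdp=('GDP', 'mean'),
                                     n_rows=('GDP', 'size'))
print(by_season)
```

Selecting the `GDP` column before aggregating also sidesteps the question of how text columns should be handled.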

Afterword

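That covers the basic operations for computing and summarizing data with Pandas, many of which are indispensable in everyday data processing. By operating on DataFrame rows and columns, metrics such as the root mean square error and the mean absolute error also become simple to compute.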
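As a minimal sketch of those error metrics, assuming a DataFrame with an observed column and a modelled column (the column names `obs` and `model` and all values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical observed vs. modelled temperatures (values made up)
df = pd.DataFrame({'obs': [22.0, 23.0, 24.0, 25.0],
                   'model': [21.0, 23.5, 26.0, 25.0]})

error = df['model'] - df['obs']

rmse = np.sqrt((error ** 2).mean())   # root mean square error
mae = error.abs().mean()              # mean absolute error
bias = error.mean()                   # mean error (bias)

print(rmse, mae, bias)
```

Each metric is a single line because the subtraction, squaring and averaging all act on whole columns.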

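What's more, once you can write the code, you are free of the tedium of dragging cells around in Excel. A one-off calculation is manageable, but a mistake made in a hurry, or a fresh batch of data, means repeating all the work by hand.

An adult's breakdown often comes in exactly that kind of moment.

So, let's code!

See you next time!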

Manuscript: RitasCake

Proof: Philero; RitasCake

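For more updates, subscribe to the WeChat official account: Westerlies

Run the examples from this article in the cloud on the Heywhale (和鲸) community: https://www.heywhale.com/mw/project/66221ce2e584e69fbfef87ba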