
Python Data Analysis Example 5: US Election Donation Data Analysis

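The US Federal Election Commission (FEC) publishes data on contributions to political campaigns. This includes contributor names, occupation and employer, address, and contribution amount. The contribution data for the 2012 US presidential election was provided as a single 150 MB CSV file, P00000001-ALL.csv, which can be loaded with pandas.read_csv: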

import pandas as pd

fec = pd.read_csv("datasets/fec/P00000001-ALL.csv", low_memory=False)
fec.info()  # info() prints the summary itself and returns None
print(fec.iloc[123])  # a sample record

After the file is loaded into a DataFrame, info() prints a summary of the object:

[Image: output of fec.info()]

A sample record from this DataFrame looks like this:

[Image: output of fec.iloc[123]]
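For readers without the 150 MB file at hand, the same loading and inspection steps can be tried on a tiny in-memory CSV. This is only a sketch: the three rows are made up, and only the column names cand_nm, contbr_occupation and contb_receipt_amt come from the code in this article.

```python
import io
import pandas as pd

# Made-up rows; only the column names are taken from the article's code
csv_text = """cand_nm,contbr_occupation,contb_receipt_amt
"Obama, Barack",RETIRED,250.0
"Romney, Mitt",ATTORNEY,100.0
"Paul, Ron",,50.0
"""

# read_csv accepts any file-like object, so StringIO stands in for the real file
demo = pd.read_csv(io.StringIO(csv_text), low_memory=False)

demo.info()            # prints column names, non-null counts and dtypes
record = demo.iloc[1]  # one row, selected by position, returned as a Series
print(record)
```

The empty occupation field in the last row is read as a missing value, which is why the non-null count for that column is lower than the row count.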

The data contains no party affiliations, so it is useful to add them. You can get a list of all the unique political candidates with unique:

import pandas as pd

fec = pd.read_csv("datasets/fec/P00000001-ALL.csv", low_memory=False)

unique_cands = fec["cand_nm"].unique()
print(unique_cands)
print(unique_cands[2])

Output of unique_cands:

['Bachmann, Michelle' 'Romney, Mitt' 'Obama, Barack'
 "Roemer, Charles E. 'Buddy' III" 'Pawlenty, Timothy' 'Johnson, Gary Earl'
 'Paul, Ron' 'Santorum, Rick' 'Cain, Herman' 'Gingrich, Newt'
 'McCotter, Thaddeus G' 'Huntsman, Jon' 'Perry, Rick']

Output of unique_cands[2]: Obama, Barack
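As a small illustration of what unique does, the sketch below runs it on a five-element toy Series instead of the full cand_nm column. The names are taken from the output above; the Series itself is an assumption made for the example.

```python
import pandas as pd

# Toy stand-in for fec["cand_nm"]
cand_nm = pd.Series([
    "Obama, Barack",
    "Romney, Mitt",
    "Obama, Barack",
    "Paul, Ron",
    "Romney, Mitt",
])

# unique() drops repeats and keeps values in order of first appearance
unique_cands = cand_nm.unique()
print(unique_cands)
print(len(unique_cands))
```

Because the order of first appearance is preserved, indexing the result (as in unique_cands[2]) depends on the order of the rows in the file.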

We use a dictionary to represent party affiliation:

# Map each candidate to a party with a dictionary
parties = {"Bachmann, Michelle": "Republican", 
           "Cain, Herman": "Republican", 
           "Gingrich, Newt": "Republican", 
           "Huntsman, Jon": "Republican", 
           "Johnson, Gary Earl": "Republican",
           "McCotter, Thaddeus G": "Republican",
           "Obama, Barack": "Democrat",
           "Paul, Ron": "Republican",
           "Pawlenty, Timothy": "Republican",
           "Perry, Rick": "Republican",
           "Roemer, Charles E. 'Buddy' III": "Republican",
           "Romney, Mitt": "Republican",
           "Santorum, Rick": "Republican"}

Now, using this mapping and the map method on Series objects, you can compute an array of political parties from the candidate names:

import pandas as pd

fec = pd.read_csv("datasets/fec/P00000001-ALL.csv", low_memory=False)
fec.info()  # info() prints the summary itself and returns None
print(fec.iloc[123])  # a sample record

unique_cands = fec["cand_nm"].unique()

# Map each candidate to a party with a dictionary
parties = {"Bachmann, Michelle": "Republican", 
           "Cain, Herman": "Republican", 
           "Gingrich, Newt": "Republican", 
           "Huntsman, Jon": "Republican", 
           "Johnson, Gary Earl": "Republican",
           "McCotter, Thaddeus G": "Republican",
           "Obama, Barack": "Democrat",
           "Paul, Ron": "Republican",
           "Pawlenty, Timothy": "Republican",
           "Perry, Rick": "Republican",
           "Roemer, Charles E. 'Buddy' III": "Republican",
           "Romney, Mitt": "Republican",
           "Santorum, Rick": "Republican"}

print(fec["cand_nm"][123456:123461])
print(fec["cand_nm"][123456:123461].map(parties))

# Add the mapped party affiliation to fec as a "party" column
fec["party"] = fec["cand_nm"].map(parties)
fec_party_count = fec["party"].value_counts()
print(fec_party_count)

Output of print(fec["cand_nm"][123456:123461]):

[Image: candidate names for rows 123456 to 123460]

Output of print(fec["cand_nm"][123456:123461].map(parties)):

[Image: the same rows mapped to party names]

Output of print(fec_party_count):

[Image: number of records per party]
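One detail of Series.map with a dictionary is worth seeing on a small example: any value that is not a key in the dictionary becomes a missing value (NaN) rather than raising an error. The sketch below uses a two-entry dictionary and a made-up name ("Doe, Jane") that is deliberately absent from it; both are assumptions for illustration.

```python
import pandas as pd

# Shortened, illustrative version of the parties dictionary
parties_demo = {"Obama, Barack": "Democrat", "Romney, Mitt": "Republican"}

# "Doe, Jane" is a made-up name with no entry in parties_demo
names = pd.Series(["Obama, Barack", "Romney, Mitt", "Doe, Jane"])

# Each value is looked up in the dictionary; a missing key yields NaN
mapped = names.map(parties_demo)
print(mapped)

# Counting missing values is a quick check that every candidate was covered
print(mapped.isna().sum())
```

On the real data, a nonzero count from fec["party"].isna().sum() would mean some candidate name is missing from parties.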

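A couple of data preparation points. The data includes both contributions and refunds (negative contribution amounts), so to simplify the analysis we restrict the dataset to positive contributions. Since Barack Obama and Mitt Romney were the two main candidates, we also prepare a subset that contains only contributions to their campaigns: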

import pandas as pd

fec = pd.read_csv("datasets/fec/P00000001-ALL.csv", low_memory=False)
fec.info()  # info() prints the summary itself and returns None
print(fec.iloc[123])  # a sample record

unique_cands = fec["cand_nm"].unique()

# Map each candidate to a party with a dictionary
parties = {"Bachmann, Michelle": "Republican", 
           "Cain, Herman": "Republican", 
           "Gingrich, Newt": "Republican", 
           "Huntsman, Jon": "Republican", 
           "Johnson, Gary Earl": "Republican",
           "McCotter, Thaddeus G": "Republican",
           "Obama, Barack": "Democrat",
           "Paul, Ron": "Republican",
           "Pawlenty, Timothy": "Republican",
           "Perry, Rick": "Republican",
           "Roemer, Charles E. 'Buddy' III": "Republican",
           "Romney, Mitt": "Republican",
           "Santorum, Rick": "Republican"}

# Add the mapped party affiliation to fec as a "party" column
fec["party"] = fec["cand_nm"].map(parties)
fec_party_count = fec["party"].value_counts()

temp = (fec["contb_receipt_amt"] > 0).value_counts()
print(temp)
fec = fec[fec["contb_receipt_amt"] > 0]
fec_mrbo = fec[fec["cand_nm"].isin(["Obama, Barack", "Romney, Mitt"])]
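The two filtering steps above can be sketched on a four-row toy DataFrame. The rows and amounts are made up; the column names and the boolean-mask and isin calls mirror the code above.

```python
import pandas as pd

# Made-up rows: one refund (negative amount) and three donations
demo = pd.DataFrame({
    "cand_nm": ["Obama, Barack", "Romney, Mitt", "Paul, Ron", "Obama, Barack"],
    "contb_receipt_amt": [250.0, -100.0, 50.0, 35.0],
})

# Boolean mask: True for donations, False for refunds
is_positive = demo["contb_receipt_amt"] > 0
print(is_positive.value_counts())

# Keep only the rows where the mask is True
donations = demo[is_positive]

# isin keeps rows whose candidate is one of the listed names
main_two = donations[donations["cand_nm"].isin(["Obama, Barack", "Romney, Mitt"])]
print(main_two)
```

In this toy data the only Romney row is a refund, so it is dropped by the first filter and main_two ends up with Obama rows only. The order of the two filters does not change the final subset.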

1. Donation Statistics by Occupation and Employer

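Donations by occupation is a frequently studied statistic. For example, attorneys tend to donate more money to Democrats, while business executives tend to donate more to Republicans. First, the total number of donations by occupation can be computed with value_counts: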

import pandas as pd

fec = pd.read_csv("datasets/fec/P00000001-ALL.csv", low_memory=False)
#print(fec.info())
#print(fec.iloc[123])  # a sample record

unique_cands = fec["cand_nm"].unique()

# Map each candidate to a party with a dictionary
parties = {"Bachmann, Michelle": "Republican", 
           "Cain, Herman": "Republican", 
           "Gingrich, Newt": "Republican", 
           "Huntsman, Jon": "Republican", 
           "Johnson, Gary Earl": "Republican",
           "McCotter, Thaddeus G": "Republican",
           "Obama, Barack": "Democrat",
           "Paul, Ron": "Republican",
           "Pawlenty, Timothy": "Republican",
           "Perry, Rick": "Republican",
           "Roemer, Charles E. 'Buddy' III": "Republican",
           "Romney, Mitt": "Republican",
           "Santorum, Rick": "Republican"}

# Add the mapped party affiliation to fec as a "party" column
fec["party"] = fec["cand_nm"].map(parties)
fec_party_count = fec["party"].value_counts()

temp = (fec["contb_receipt_amt"] > 0).value_counts()
#print(temp)
fec = fec[fec["contb_receipt_amt"] > 0]
fec_mrbo = fec[fec["cand_nm"].isin(["Obama, Barack", "Romney, Mitt"])]

# Count donations by occupation; the data is large, so look at only the top 10
temp10 = fec["contbr_occupation"].value_counts()[:10]
print(temp10)

Number of donations by occupation (top 10):

[Image: output of print(temp10)]

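The occupation output above shows that many entries have different names but refer to the same basic job type. The code below cleans up some of them by mapping one occupation to another. Note the "trick" of using dict.get, which lets occupations with no mapping "pass through" unchanged: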

import pandas as pd

fec = pd.read_csv("datasets/fec/P00000001-
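The dict.get pass-through technique described above can be sketched independently of the full dataset. The mapping, the helper name get_occ and the occupation strings below are illustrative assumptions and are not taken from this article's own mapping.

```python
import pandas as pd

# Hypothetical mapping that collapses two variants into one label
occ_mapping_demo = {
    "INFORMATION REQUESTED PER BEST EFFORTS": "NOT PROVIDED",
    "INFORMATION REQUESTED": "NOT PROVIDED",
}

# Made-up occupation values standing in for fec["contbr_occupation"]
occupations = pd.Series([
    "RETIRED",
    "INFORMATION REQUESTED",
    "ATTORNEY",
    "INFORMATION REQUESTED PER BEST EFFORTS",
])


def get_occ(x):
    # dict.get(key, default): with x as its own default,
    # values without a mapping are returned unchanged
    return occ_mapping_demo.get(x, x)


cleaned = occupations.map(get_occ)
print(cleaned.value_counts())
```

Passing the dictionary directly to map would turn every unmapped occupation into NaN. Wrapping the lookup in a function with dict.get(x, x) avoids that, so only the listed variants are rewritten.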