
Scraping Zhaopin (智联招聘) with a Python Crawler (Advanced Edition)


At this point, fetching and saving the detailed job information is done. Here is the main function as it now stands:



    import re


    def main(city, keyword, region, pages):
        """Main function."""
        csv_filename = 'zl_' + city + '_' + keyword + '.csv'
        txt_filename = 'zl_' + city + '_' + keyword + '.txt'
        headers = ['job', 'years', 'education', 'salary', 'company', 'scale', 'job_url']
        write_csv_headers(csv_filename, headers)
        # Matches runs of Chinese characters (一 is U+4E00, 龥 is U+9FA5)
        pattern = re.compile(r'[一-龥]+')
        for i in range(pages):
            # Fetch every job listing on this page and write it to the CSV file
            html = get_one_page(city, keyword, region, i)
            items = parse_one_page(html)
            for item in items:
                detail_html = get_detail_page(item.get('job_url'))
                job_detail = get_job_detail(detail_html)

                # Build a fresh row for each listing
                job_dict = {}
                job_dict['job'] = item.get('job')
                job_dict['years'] = job_detail.get('years')
                job_dict['education'] = job_detail.get('education')
                job_dict['salary'] = item.get('salary')
                job_dict['company'] = item.get('company')
                job_dict['scale'] = job_detail.get('scale')
                job_dict['job_url'] = item.get('job_url')

                # Clean the data: drop punctuation and anything else that
                # would skew the word-frequency count
                filterdata = pattern.findall(job_detail.get('requirement'))
                write_txt_file(txt_filename, ''.join(filterdata))
                write_csv_rows(csv_filename, headers, job_dict)
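To make the cleaning step concrete, here is a minimal sketch of what the regular expression keeps and what it drops. The sample string and the helper name `keep_chinese` are made up for illustration; only the character range comes from `main` above.

```python
import re

# Same character range as in main: U+4E00 (一) through U+9FA5 (龥)
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def keep_chinese(text):
    """Return text with everything except Chinese characters removed."""
    return ''.join(CHINESE_RUN.findall(text))


# Hypothetical requirement text, not taken from a real posting
sample = '任职要求:1、熟悉Python;2、了解Linux。'
print(keep_chinese(sample))  # 任职要求熟悉了解
```

Note that Latin-script keywords such as `Python` are dropped along with the punctuation, so a word-frequency count built on this text only sees Chinese terms.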


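## 4. Data Analysis


This section is the main focus of this edition.


### 4.1 Salary Statistics


We compute the share of postings in each salary band to see how pay is distributed in this field. The data is already saved in the CSV file, so the next step is to read the `salary` column: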



    import csv


    def read_csv_column(path, column):
        """Read one column of a CSV file, given its zero-based index."""
        with open(path, 'r', encoding='gb18030', newline='') as f:
            reader = csv.reader(f)
            return [row[column] for row in reader]

Add the following to the `main` function:

    print(read_csv_column(csv_filename, 3))

The printed result:

    ['salary', '7000', '5000', '25000', '12500', '25000', '20000', '32500', '20000', '15000', '9000', '5000', '5000', '12500', '24000', '15000', '18000', '25000', '20000', '0', '20000', '12500', '17500', '17500', '20000', '11500', '25000', '12500', '17500', '25000', '22500', '22500', '25000', '17500', '7000', '25000', '3000', '22500', '15000', '25000', '20000', '22500', '15000', '15000', '25000', '17500', '22500', '10500', '20000', '17500', '22500', '17500', '25000', '20000', '11500', '11250', '12500', '14000', '12500', '17500', '15000']
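Reading by index means remembering that `salary` is column 3 and handling the header entry separately. One alternative, sketched here with the standard library's `csv.DictReader`, looks the column up by its header name and returns only the data rows. The function name `read_csv_column_by_name` is hypothetical, and the sketch assumes the file was written with the `gb18030` encoding and has a header row like the one written in `main`.

```python
import csv


def read_csv_column_by_name(path, name, encoding='gb18030'):
    """Return the values of the column called `name`, excluding the header row."""
    with open(path, 'r', encoding=encoding, newline='') as f:
        # DictReader treats the first row as the field names
        return [row[name] for row in csv.DictReader(f)]
```

Under those assumptions, `read_csv_column_by_name(csv_filename, 'salary')` would return the same strings as the list above, minus the leading `'salary'` entry.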


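As the output shows, every entry except the first (the header) is an average salary, but the values are still strings. To make the statistics easier, we convert them to integers: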



    salaries = []
    sal = read_csv_column(csv_filename, 3)
    # Skip the first entry (the header) and convert the rest to integers
    for value in sal[1:]:
        # A salary of '0' means the posting said "negotiable"; leave it out
        if value != '0':
            salaries.append(int(value))
    print(salaries)

The printed result from the original run is shown below. Note that it still contains a `0`; postings listed as negotiable should be excluded before plotting:

    [7000, 5000, 25000, 12500, 25000, 20000, 32500, 20000, 15000, 9000, 5000, 5000, 12500, 24000, 15000, 18000, 25000, 20000, 0, 20000, 12500, 20000, 11500, 17500, 25000, 12500, 17500, 25000, 25000, 22500, 22500, 17500, 17500, 7000, 25000, 3000, 22500, 15000, 25000, 20000, 22500, 15000, 22500, 10500, 20000, 15000, 17500, 17500, 25000, 17500, 22500, 25000, 12500, 20000, 11250, 11500, 14000, 12500, 15000, 17500]


We show the distribution as a histogram:



    import matplotlib.pyplot as plt

    plt.hist(salaries, bins=10)
    plt.show()


The resulting plot:


![Salary histogram](https://img-blog.csdn.net/20180425203353549?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3podXNvbmd6aXll/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve)
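The histogram shows counts per bin. To get the share of each salary band, which is what this section set out to measure, something like the following could work. The band width of 5000 and the function name `salary_band_shares` are assumptions chosen for illustration, not values from the original analysis.

```python
from collections import Counter


def salary_band_shares(salaries, width=5000):
    """Map each band's lower bound to the fraction of salaries in [lower, lower + width)."""
    if not salaries:
        return {}
    # Integer division assigns each salary to the band that contains it
    counts = Counter((s // width) * width for s in salaries)
    total = len(salaries)
    return {lower: counts[lower] / total for lower in sorted(counts)}


# Small made-up sample, only to show the shape of the result
width = 5000
for lower, share in salary_band_shares([7000, 5000, 25000, 12500], width).items():
    print(f'{lower}-{lower + width - 1}: {share:.0%}')
```

Passing the `salaries` list built earlier instead of the sample would give the proportions for the scraped data; a different `width` changes how coarse the bands are.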