Bootstrap

【聚类分析】《数学建模算法与应用》第十章 多元分析 第一节 聚类分析 python实现

第十章 多元分析

第一节 聚类分析

介绍
这里是司守奎教授的《数学建模算法与应用》全书案例代码python实现,欢迎加入此项目将其案例代码用python实现
GitHub项目地址:Mathematical-modeling-algorithm-and-Application
CSDN专栏:数学建模
知乎专栏:数学建模算法与应用
联系作者
作者:STL_CC
邮箱:[email protected]

由于作者还是大一学生,才疏学浅,难免会有错误,欢迎指正
同时作者精力有限,希望更多大佬加入此项目,一来可以提高建模水平,二来可以分享建模经验

系统聚类法

例10.1
设有5个销售员 w 1 w_1 w1, w 2 w_2 w2, w 3 w_3 w3, w 4 w_4 w4, w 5 w_5 w5,他们的销售业绩由二维变量( v 1 v_1 v1, v 2 v_2 v2)描述,见表1 销售员业绩表

销售员 v 1 v_1 v1(销售量)百件 v 2 v_2 v2(回收款项)万元
w 1 w_1 w110
w 2 w_2 w211
w 3 w_3 w332
w 4 w_4 w443
w 5 w_5 w525

方法一 系统聚类法中的最长距离法

import numpy as np
import numpy.matlib
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist,squareform
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
a = np.array([[1,0],[1,1],[3,2],[4,3],[2,5]])
(m,n)=a.shape
d = np.zeros((m,m))
for i in range(m):
    for j in range(i+1,m):
        d[i,j]=sum(map(abs,(a[i,:]-a[j,:])))
nd=d.ravel()[np.flatnonzero(d)]
nd=np.unique(nd)
l=len(nd)
for i in range(0,l):
    nd_min=min(nd)
    (row,col)=np.where(d==nd_min)
    tm=np.union1d(row,col)
    print('第{:d}次合成,平台高度为{}时的分类结果为:{}'.format(i+1,nd_min,tm))
    nd=np.delete(nd,np.where(nd==min(nd)))
    if(not len(nd)):
        break
第1次合成,平台高度为1.0时的分类结果为:[0 1]
第2次合成,平台高度为2.0时的分类结果为:[2 3]
第3次合成,平台高度为3.0时的分类结果为:[1 2]
第4次合成,平台高度为4.0时的分类结果为:[0 2 3 4]
第5次合成,平台高度为5.0时的分类结果为:[1 3 4]
第6次合成,平台高度为6.0时的分类结果为:[0 3 4]

方法二 使用python的scipy库

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist,squareform
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
a = np.array([[1,0],[1,1],[3,2],[4,3],[2,5]])
y=pdist(a,'cityblock')
yc=squareform(y)
z=linkage(y)
dendrogram(z)
{'icoord': [[15.0, 15.0, 25.0, 25.0],
  [35.0, 35.0, 45.0, 45.0],
  [20.0, 20.0, 40.0, 40.0],
  [5.0, 5.0, 30.0, 30.0]],
 'dcoord': [[0.0, 1.0, 1.0, 0.0],
  [0.0, 2.0, 2.0, 0.0],
  [1.0, 3.0, 3.0, 2.0],
  [0.0, 4.0, 4.0, 3.0]],
 'ivl': ['4', '0', '1', '2', '3'],
 'leaves': [4, 0, 1, 2, 3],
 'color_list': ['g', 'r', 'b', 'b']}

在这里插入图片描述

T = fcluster(z,3,'maxclust')
for i in range(1,3+1):
    tm=np.where(T==i)
    print('第{}类的有'.format(i),tm)
第1类的有 (array([0, 1], dtype=int64),)
第2类的有 (array([2, 3], dtype=int64),)
第3类的有 (array([4], dtype=int64),)

变量聚类法

例10.2 服装标准制定中的变量聚类法。
在服装标准制定中,对某地成年女子的各部位尺寸进行了统计,通过14个部位的测量资料,获得各因素之间的相关系数表(见表5)。
注:具体资料见教材,部分数据将在程序中给出

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist,squareform
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
a=pd.read_csv('ch.txt',header=None,sep='	')
a
012345678910111213
01.000NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
10.3661.000NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
20.2420.2331.000NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
30.2800.1940.5901.000NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
40.3600.3240.4760.4351.000NaNNaNNaNNaNNaNNaNNaNNaNNaN
50.2820.2620.4830.4700.4521.000NaNNaNNaNNaNNaNNaNNaNNaN
60.2450.2650.5400.4780.5350.6631.000NaNNaNNaNNaNNaNNaNNaN
70.4480.3450.4520.4040.4310.3220.2661.000NaNNaNNaNNaNNaNNaN
80.4860.3670.3650.3570.4290.2830.2870.8201.000NaNNaNNaNNaNNaN
90.6480.6620.2160.0320.4290.2830.2630.5270.5471.000NaNNaNNaNNaN
100.6890.6710.2430.3130.4300.3020.2940.5200.5580.9571.000NaNNaNNaN
110.4860.6360.1740.2430.3750.2960.2550.4030.4170.8570.8521.000NaNNaN
120.1330.1530.7320.4770.3390.3920.4460.2660.2410.0540.0990.0551.000NaN
130.3760.2520.6760.5810.4410.4470.4400.4240.3720.3630.3760.3210.6271.0
d=1-abs(a)
for i in range(len(d)):
    for j in range(i+1,len(d)):
        d.iloc[i,j]=d.iloc[j,i]
z=linkage(d,'complete')
y=fcluster(z,2,'maxclust')
dendrogram(z)
{'icoord': [[25.0, 25.0, 35.0, 35.0],
  [15.0, 15.0, 30.0, 30.0],
  [5.0, 5.0, 22.5, 22.5],
  [55.0, 55.0, 65.0, 65.0],
  [45.0, 45.0, 60.0, 60.0],
  [13.75, 13.75, 52.5, 52.5],
  [85.0, 85.0, 95.0, 95.0],
  [75.0, 75.0, 90.0, 90.0],
  [125.0, 125.0, 135.0, 135.0],
  [115.0, 115.0, 130.0, 130.0],
  [105.0, 105.0, 122.5, 122.5],
  [82.5, 82.5, 113.75, 113.75],
  [33.125, 33.125, 98.125, 98.125]],
 'dcoord': [[0.0, 0.29809729955167313, 0.29809729955167313, 0.0],
  [0.0, 0.4028870809544531, 0.4028870809544531, 0.29809729955167313],
  [0.0, 0.7535436284648688, 0.7535436284648688, 0.4028870809544531],
  [0.0, 0.28955310393777517, 0.28955310393777517, 0.0],
  [0.0, 0.9263082640244553, 0.9263082640244553, 0.28955310393777517],
  [0.7535436284648688,
   1.158768311613672,
   1.158768311613672,
   0.9263082640244553],
  [0.0, 0.49697082409332655, 0.49697082409332655, 0.0],
  [0.0, 0.8415325305655154, 0.8415325305655154, 0.49697082409332655],
  [0.0, 0.5600321419347287, 0.5600321419347287, 0.0],
  [0.0, 0.7182241989796779, 0.7182241989796779, 0.5600321419347287],
  [0.0, 0.850338756026091, 0.850338756026091, 0.7182241989796779],
  [0.8415325305655154,
   1.2178604189314965,
   1.2178604189314965,
   0.850338756026091],
  [1.158768311613672,
   2.1077737070188536,
   2.1077737070188536,
   1.2178604189314965]],
 'ivl': ['1',
  '11',
  '9',
  '10',
  '0',
  '7',
  '8',
  '4',
  '5',
  '6',
  '12',
  '3',
  '2',
  '13'],
 'leaves': [1, 11, 9, 10, 0, 7, 8, 4, 5, 6, 12, 3, 2, 13],
 'color_list': ['g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'r',
  'r',
  'r',
  'r',
  'r',
  'r',
  'b']}

在这里插入图片描述

说明
1.在教材原来的matlab程序的数据预处理中,原始数据被去掉了0,向量长度不一。然而这样长度不一的向量是不能在python中被处理的。笔者将其转换为了一个对称矩阵再行操作,更能适应python操作。
2.在matlab中,相关系数矩阵计算使用的corrcoef列指标是变量,然而np.corrcoef行指标是向量,所以要先转置
参考文献
1.python中scipy库实现计算各种距离
2.Python之向量(Vector)距离矩阵计算
3.scipy.spatial.distance.pdist官方文档
4.使用Python进行层次聚类(一)——基本使用+主成分分析绘图观察结果+绘制热图
5.python的scipy层次聚类参数详解
6.皮尔逊积矩相关系数–NumPy–corrcoef()

案例研究

聚类分析案例—我国各地区普通高等教育发展状况分析
具体资料见教材,部分数据将在程序中给出

R型聚类分析

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist,squareform
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
gj=pd.read_csv('gj.txt',header=None,sep='	').T
r=np.corrcoef(gj)
d=1-r
d=d.T
z=linkage(d,'average')
dendrogram(z)
D:\Program Files\anaconda\lib\site-packages\ipykernel_launcher.py:4: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
  after removing the cwd from sys.path.





{'icoord': [[5.0, 5.0, 15.0, 15.0],
  [55.0, 55.0, 65.0, 65.0],
  [85.0, 85.0, 95.0, 95.0],
  [75.0, 75.0, 90.0, 90.0],
  [60.0, 60.0, 82.5, 82.5],
  [45.0, 45.0, 71.25, 71.25],
  [35.0, 35.0, 58.125, 58.125],
  [25.0, 25.0, 46.5625, 46.5625],
  [10.0, 10.0, 35.78125, 35.78125]],
 'dcoord': [[0.0, 0.8525484864006131, 0.8525484864006131, 0.0],
  [0.0, 0.024863305626884467, 0.024863305626884467, 0.0],
  [0.0, 0.02757546216575882, 0.02757546216575882, 0.0],
  [0.0, 0.04009972160627508, 0.04009972160627508, 0.02757546216575882],
  [0.024863305626884467,
   0.14222971620143948,
   0.14222971620143948,
   0.04009972160627508],
  [0.0, 0.3129510501346711, 0.3129510501346711, 0.14222971620143948],
  [0.0, 0.4732207840475411, 0.4732207840475411, 0.3129510501346711],
  [0.0, 1.0379457746449512, 1.0379457746449512, 0.4732207840475411],
  [0.8525484864006131,
   1.7688040008982475,
   1.7688040008982475,
   1.0379457746449512]],
 'ivl': ['6', '7', '9', '8', '0', '4', '5', '1', '2', '3'],
 'leaves': [6, 7, 9, 8, 0, 4, 5, 1, 2, 3],
 'color_list': ['g', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'b']}

在这里插入图片描述

T=fcluster(z,6,'maxclust')
for i in range(1,6+1):
    tm=np.where(T==i)
    print('第{}类的有'.format(i),tm)

Q型聚类分析

第1类的有 (array([ 1,  2,  5,  6,  9, 14, 16, 20, 22, 27, 28], dtype=int64),)
第2类的有 (array([11, 17, 19], dtype=int64),)
第3类的有 (array([ 3,  4,  7,  8, 10, 12, 13, 15, 18, 23, 25, 26], dtype=int64),)
第4类的有 (array([ 0, 29], dtype=int64),)
第5类的有 (array([24], dtype=int64),)
第6类的有 (array([21], dtype=int64),)
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist,squareform
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
from scipy.stats import zscore
gj=pd.read_csv('gj.txt',header=None,sep='	')
gj.drop([2,3,4,5],axis=1,inplace=True)
gj=zscore(gj)
y=pdist(gj)
z=linkage(y,'average')
dendrogram(z)
{'icoord': [[25.0, 25.0, 35.0, 35.0],
  [55.0, 55.0, 65.0, 65.0],
  [45.0, 45.0, 60.0, 60.0],
  [115.0, 115.0, 125.0, 125.0],
  [105.0, 105.0, 120.0, 120.0],
  [95.0, 95.0, 112.5, 112.5],
  [135.0, 135.0, 145.0, 145.0],
  [175.0, 175.0, 185.0, 185.0],
  [165.0, 165.0, 180.0, 180.0],
  [155.0, 155.0, 172.5, 172.5],
  [140.0, 140.0, 163.75, 163.75],
  [205.0, 205.0, 215.0, 215.0],
  [195.0, 195.0, 210.0, 210.0],
  [151.875, 151.875, 202.5, 202.5],
  [103.75, 103.75, 177.1875, 177.1875],
  [245.0, 245.0, 255.0, 255.0],
  [235.0, 235.0, 250.0, 250.0],
  [225.0, 225.0, 242.5, 242.5],
  [285.0, 285.0, 295.0, 295.0],
  [275.0, 275.0, 290.0, 290.0],
  [265.0, 265.0, 282.5, 282.5],
  [233.75, 233.75, 273.75, 273.75],
  [140.46875, 140.46875, 253.75, 253.75],
  [85.0, 85.0, 197.109375, 197.109375],
  [75.0, 75.0, 141.0546875, 141.0546875],
  [52.5, 52.5, 108.02734375, 108.02734375],
  [30.0, 30.0, 80.263671875, 80.263671875],
  [15.0, 15.0, 55.1318359375, 55.1318359375],
  [5.0, 5.0, 35.06591796875, 35.06591796875]],
 'dcoord': [[0.0, 2.165069616081611, 2.165069616081611, 0.0],
  [0.0, 1.2583609996700658, 1.2583609996700658, 0.0],
  [0.0, 1.8417771551214015, 1.8417771551214015, 1.2583609996700658],
  [0.0, 0.5102264872107092, 0.5102264872107092, 0.0],
  [0.0, 0.6832038743184952, 0.6832038743184952, 0.5102264872107092],
  [0.0, 1.0534720455396414, 1.0534720455396414, 0.6832038743184952],
  [0.0, 0.5781192315086323, 0.5781192315086323, 0.0],
  [0.0, 0.39279767533924836, 0.39279767533924836, 0.0],
  [0.0, 0.6920728341799272, 0.6920728341799272, 0.39279767533924836],
  [0.0, 0.8587702247881565, 0.8587702247881565, 0.6920728341799272],
  [0.5781192315086323,
   0.951351143405573,
   0.951351143405573,
   0.8587702247881565],
  [0.0, 0.5422575852550412, 0.5422575852550412, 0.0],
  [0.0, 0.9931260197301512, 0.9931260197301512, 0.5422575852550412],
  [0.951351143405573,
   1.1337053131831187,
   1.1337053131831187,
   0.9931260197301512],
  [1.0534720455396414,
   1.3091848224406866,
   1.3091848224406866,
   1.1337053131831187],
  [0.0, 0.7003307018910611, 0.7003307018910611, 0.0],
  [0.0, 0.8455264592324916, 0.8455264592324916, 0.7003307018910611],
  [0.0, 1.183614496022609, 1.183614496022609, 0.8455264592324916],
  [0.0, 0.7408663414479949, 0.7408663414479949, 0.0],
  [0.0, 0.7928025480682006, 0.7928025480682006, 0.7408663414479949],
  [0.0, 1.43244903784588, 1.43244903784588, 0.7928025480682006],
  [1.183614496022609,
   1.4564970761510314,
   1.4564970761510314,
   1.43244903784588],
  [1.3091848224406866,
   1.7995913531962198,
   1.7995913531962198,
   1.4564970761510314],
  [0.0, 2.389167635724778, 2.389167635724778, 1.7995913531962198],
  [0.0, 2.4533056587669138, 2.4533056587669138, 2.389167635724778],
  [1.8417771551214015,
   2.801906683718659,
   2.801906683718659,
   2.4533056587669138],
  [2.165069616081611, 4.23103702123038, 4.23103702123038, 2.801906683718659],
  [0.0, 5.540472787063805, 5.540472787063805, 4.23103702123038],
  [0.0, 8.531300289191663, 8.531300289191663, 5.540472787063805]],
 'ivl': ['0',
  '24',
  '1',
  '2',
  '29',
  '27',
  '28',
  '3',
  '9',
  '20',
  '16',
  '14',
  '22',
  '13',
  '19',
  '21',
  '26',
  '18',
  '25',
  '17',
  '15',
  '23',
  '5',
  '6',
  '4',
  '7',
  '12',
  '10',
  '8',
  '11'],
 'leaves': [0,
  24,
  1,
  2,
  29,
  27,
  28,
  3,
  9,
  20,
  16,
  14,
  22,
  13,
  19,
  21,
  26,
  18,
  25,
  17,
  15,
  23,
  5,
  6,
  4,
  7,
  12,
  10,
  8,
  11],
 'color_list': ['g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'b']}

在这里插入图片描述

for k in range(3,6):
    print('划分成{}类的结果如下'.format(k))
    T=fcluster(z,k,'maxclust')
    for i in range(1,k+1):
        tm=np.where(T==i)
        print('第{}类的有{}'.format(i,tm))
划分成3类的结果如下
第1类的有(array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29], dtype=int64),)
第2类的有(array([24], dtype=int64),)
第3类的有(array([0], dtype=int64),)
划分成4类的结果如下
第1类的有(array([1, 2], dtype=int64),)
第2类的有(array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       20, 21, 22, 23, 25, 26, 27, 28, 29], dtype=int64),)
第3类的有(array([24], dtype=int64),)
第4类的有(array([0], dtype=int64),)
划分成5类的结果如下
第1类的有(array([1, 2], dtype=int64),)
第2类的有(array([27, 28, 29], dtype=int64),)
第3类的有(array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       20, 21, 22, 23, 25, 26], dtype=int64),)
第4类的有(array([24], dtype=int64),)
第5类的有(array([0], dtype=int64),)
;