2025-1-10-sklearn学习(36、37) 数据集转换-无监督降维+随机投影沙上并禽池上暝。云破月来花弄影。

文章目录

sklearn学习(36、37) 数据集转换-无监督降维+随机投影
sklearn学习(36) 数据集转换-无监督降维
sklearn学习(37) 数据集转换-随机投影

sklearn学习(36、37) 数据集转换-无监督降维+随机投影

文章参考网站：
https://sklearn.apachecn.org/
和
https://scikit-learn.org/stable/

sklearn学习(36) 数据集转换-无监督降维

如果你的特征数量很多, 在监督步骤之前, 可以通过无监督的步骤来减少特征. 很多的无监督学习方法实现了一个名为 transform 的方法, 它可以用来降低维度. 下面我们将讨论大量使用这种模式的两个具体示例.

Pipelining 非监督数据约简和监督估计器可以链接起来。请看 Pipeline: 链式评估器.

36.1 PCA: 主成份分析

decomposition.PCA 寻找能够捕捉原始特征的差异的特征的组合. 请参阅分解成分中的信号（矩阵分解问题）.

示例 * Faces recognition example using eigenfaces and SVMs

36.2 随机投影

模块: random_projection 提供了几种用于通过随机投影减少数据的工具. 请参阅文档的相关部分: 随机投影.

示例 * The Johnson-Lindenstrauss bound for embedding with random projections

36.3 特征聚集

cluster.FeatureAgglomeration 应用层次聚类将行为类似的特征分组在一起.

示例 * Feature agglomeration vs. univariate selection * Feature agglomeration

特征缩放

请注意，如果功能具有明显不同的缩放或统计属性，则 cluster.FeatureAgglomeration 可能无法捕获相关特征之间的关系.使用一个 preprocessing.StandardScaler 可以在这些设置中使用.

sklearn学习(37) 数据集转换-随机投影

sklearn.random_projection 模块实现了一个简单且高效率的计算方式来减少数据维度，通过牺牲一定的精度（作为附加变量）来加速处理时间及更小的模型尺寸。这个模型实现了两类无结构化的随机矩阵: Gaussian random matrix 和 sparse random matrix.

随机投影矩阵的维度和分布是受控制的，所以可以保存任意两个数据集的距离。因此随机投影适用于基于距离的方法。

参考资料:

Sanjoy Dasgupta. 2000. Experiments with random projection. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence (UAI‘00), Craig Boutilier and Moisés Goldszmidt (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 143-151.
Ella Bingham and Heikki Mannila. 2001. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ‘01). ACM, New York, NY, USA, 245-250.

37.1 Johnson-Lindenstrauss 辅助定理

支撑随机投影效率的主要理论成果是Johnson-Lindenstrauss lemma (quoting Wikipedia):

在数学中，johnson - lindenstrauss 引理是一种将高维的点从高维到低维欧几里得空间的低失真嵌入的方案。引理阐释了高维空间下的一小部分的点集可以内嵌到非常低维的空间，这种方式下点之间的距离几乎全部被保留。内嵌所用到的映射至少符合 Lipschitz 条件,甚至可以被当做正交投影。

有了样本数量， sklearn.random_projection.johnson_lindenstrauss_min_dim 会保守估计随机子空间的最小大小来保证随机投影导致的变形在一定范围内：

>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
>>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=0.5)
663
>>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=[0.5, 0.1, 0.01])
array([    663,   11841, 1112658])
>>> johnson_lindenstrauss_min_dim(n_samples=[1e4, 1e5, 1e6], eps=0.1)
array([ 7894,  9868, 11841])`

示例:

查看 The Johnson-Lindenstrauss bound for embedding with random projections 里面有Johnson-Lindenstrauss引理的理论说明和使用稀疏随机矩阵的经验验证。

参考资料:

Sanjoy Dasgupta and Anupam Gupta, 1999. An elementary proof of the Johnson-Lindenstrauss Lemma.

37.2 高斯随机投影

The sklearn.random_projection.GaussianRandomProjection 通过将原始输入空间投影到随机生成的矩阵（该矩阵的组件由以下分布中抽取）: $\frac{1}{n_{components}})$ 降低维度。

以下小片段演示了任何使用高斯随机投影转换器:

>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100, 10000)
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)

37.3 稀疏随机矩阵

sklearn.random_projection.SparseRandomProjection 使用稀疏随机矩阵，通过投影原始输入空间来降低维度。

稀疏矩阵可以替换高斯随机投影矩阵来保证相似的嵌入质量，且内存利用率更高、投影数据的计算更快。

如果我们定义 s = 1 / density, 随机矩阵的元素由下式抽取。
$\begin{split}\left\{ \begin{array}{c c l} -\sqrt{\frac{s}{n_{\text{components}}}} & & 1 / 2s\\ 0 &\text{with probability} & 1 - 1 / s \\ +\sqrt{\frac{s}{n_{\text{components}}}} & & 1 / 2s\\ \end{array} \right.\end{split}$
其中 $n_{\text{components}}$ 是投影后的子空间大小。默认非零元素的浓密度设置为最小浓密度，该值由Ping Li et al.:推荐，根据公式: $\sqrt{n_{\text{features}}}$ 计算。

以下小片段演示了如何使用稀疏随机投影转换器:

>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100,10000)
>>> transformer = random_projection.SparseRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)

参考资料:

D. Achlioptas. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66 (2003) 671–687
Ping Li, Trevor J. Hastie, and Kenneth W. Church. 2006. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ‘06). ACM, New York, NY, USA, 287-296.