Machine Learning
Machine: not a piece of physical equipment, but "an organized system of computer hardware and software"
Learning: to learn
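Taxonomy:
- By whether labels are available:
  - Supervised learning
    - Samples: features X together with the corresponding labels y
    - Classification algorithms: predict a discrete quantity
      - Logistic regression
      - KNN classification
      - Naive Bayes
      - Decision tree
      - Support vector machine
      - Ensemble learning
    - Regression algorithms: predict a continuous quantity
      - Linear regression
      - KNN regression
      - Decision tree regression
      - Support vector regression (SVR)
      - Ensemble learning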
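  - Unsupervised learning
    - Samples: features X but no labels y
    - Clustering algorithms
      - KMeans
    - Preprocessing algorithms:
      - Dimensionality reduction
        - PCA
      - Centering
        - x - mu
      - Standardization
        - (x - mu) / sigma
      - Normalization (min-max)
        - (x - _min) / (_max - _min)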
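The three preprocessing formulas above can be checked on a tiny array. This is a minimal sketch with NumPy; the toy matrix and the variable names are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Toy feature matrix: 4 samples, 2 features (values are illustrative only).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

mu = X.mean(axis=0)       # per-feature mean
sigma = X.std(axis=0)     # per-feature population standard deviation
x_min = X.min(axis=0)     # per-feature minimum
x_max = X.max(axis=0)     # per-feature maximum

X_centered = X - mu                             # centering
X_standardized = (X - mu) / sigma               # standardization (z-score)
X_normalized = (X - x_min) / (x_max - x_min)    # min-max normalization

print(X_centered.mean(axis=0))       # each feature now has mean 0
print(X_standardized.std(axis=0))    # each feature now has unit standard deviation
print(X_normalized.min(axis=0), X_normalized.max(axis=0))  # range is [0, 1]
```

In practice `sklearn.preprocessing.StandardScaler` and `MinMaxScaler` apply the same per-feature formulas and remember the statistics so they can be reused on test data.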
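Ensemble Learning
The idea behind the algorithms so far:
- Build a series of independent, powerful algorithms
- One algorithm carries the whole task
- A single algorithm solves the problem on its own
- Akin to individual heroism: the lone hero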
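Ensemble Learning
- The strategy of a wolf pack bringing down a tiger
- Essentially a management philosophy
  - Recruit a group of second-rate players and organize them into a strong team that beats the first-rate one
  - Like a foreman managing a crew of workers
  - Three cobblers together outdo Zhuge Liang
- Use a set of weak estimators (classifiers or regressors) whose combined prediction beats a single strong estimator
- Ingredients:
  - A set of weak estimators
  - A combination strategy
- Variants of the ensemble idea:
  - Dropout
  - MoE (Mixture of Experts)
- Strategies:
  - Voting: majority rule
  - Bagging: resample the data
  - Stacking: a two-stage strategy
  - Boosting: the "mistake notebook" idea (keep working on what was gotten wrong)
- Algorithms:
  - Random forest
  - XGBoost (a mainstay of Kaggle competitions)
  - LightGBM (maintained by Microsoft and very popular; designed to scale to large datasets)
- Structured / tabular data
  - Classical machine learning is still the go-to here
    - Ensemble learning
  - Deep learning can be applied, but on tabular data it often underperforms ensemble methods
  - Rule of thumb in these notes: random forest for small problems, LightGBM for nearly everything else; XGBoost is still widely used, but LightGBM covers most of the same cases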
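Why a team of weak estimators can beat a strong one is easy to see with a little arithmetic. The sketch below assumes binary classification and, importantly, that the voters make independent errors; real classifiers are correlated, so the gain in practice is smaller. The function name `majority_vote_accuracy` is a hypothetical helper introduced for illustration.

```python
from math import comb


def majority_vote_accuracy(n_voters: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is correct,
    when each voter is correct with probability p (binomial tail sum)."""
    k_min = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )


# Each voter is only 60% accurate; the majority improves as the team grows.
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

With three voters at 60% accuracy the majority is right 64.8% of the time, and the figure keeps rising with more voters, which is the "three cobblers" effect in numbers.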
1. Voting
"""
    Voting
        different algorithms
        same data
"""
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import VotingRegressor
"""
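    Soft Voting/Majority Rule classifier for unfitted estimators.
    unfitted: not yet fitted (the estimators are passed in untrained; VotingClassifier fits clones of them itself)
        - build a set of deliberately simple (weak) classifiers
        - combine them with an ensemble strategy
    overfit: trained too hard, so the model also memorizes the noise and errors in the training set
        - very good on the training set, very poor on the test set
        - the model does not generalize and should not be used as is
    underfit: not trained enough, so the model has not learned all the useful patterns
        - mediocre on the training set and mediocre on the test set
        - cause: insufficient training (or a model that is too simple)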
"""
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# Three different algorithms that will all see the same data.
# `multi_class` is omitted: it is deprecated in recent scikit-learn releases,
# and the default already handles this problem.
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

# Hard voting: each classifier casts one vote and the majority label wins.
eclf1 = VotingClassifier(estimators=[('lr', clf1),
                                     ('rf', clf2),
                                     ('gnb', clf3)],
                         voting='hard')
eclf1 = eclf1.fit(X, y)
print(eclf1.predict(X))
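The cell above uses hard voting. As a complement, here is a minimal sketch of soft voting on the same toy data, together with a manual check that soft voting is just the argmax of the averaged class probabilities. The names `soft_clf`, `manual_avg` and `manual_pred` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

soft_clf = VotingClassifier(
    estimators=[("lr", LogisticRegression(random_state=1)),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
                ("gnb", GaussianNB())],
    voting="soft",  # average predicted probabilities instead of counting votes
)
soft_clf.fit(X, y)

# Reproduce soft voting by hand: average the fitted estimators' probabilities,
# then pick the class with the highest averaged probability.
manual_avg = np.mean([est.predict_proba(X) for est in soft_clf.estimators_], axis=0)
manual_pred = soft_clf.classes_[np.argmax(manual_avg, axis=1)]

print(soft_clf.predict(X))
print(manual_pred)
```

Soft voting requires every base estimator to implement `predict_proba`; hard voting only needs `predict`.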
2. Bagging
"""
    2. Bagging
        bootstrap aggregating (resample first, then aggregate the results)
        same algorithm
        different data
"""
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import BaggingRegressor
"""
A Bagging classifier is an ensemble meta-estimator that fits base
classifiers each on random subsets of the original dataset and then
aggregate their individual predictions (either by voting or by averaging)
to form a final prediction.
"""
# base estimator
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100,
                           n_features=4,
                           n_informative=2,
                           n_redundant=0,
                           random_state=0,
                           shuffle=False)
# Ten SVCs, each trained on its own random sample of (X, y).
clf = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=0).fit(X, y)
clf.predict([[0, 0, 0, 0]])
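To make "same algorithm, different data" concrete, the following sketch does bagging by hand: draw bootstrap samples, fit one model per sample, then take a majority vote. It assumes shallow decision trees as the base model (the cell above uses SVC) and binary labels 0/1; all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
rng = np.random.default_rng(0)
n_samples = X.shape[0]

models = []
for _ in range(10):
    # Bootstrap: sample row indices with replacement, same size as the data.
    idx = rng.integers(0, n_samples, size=n_samples)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    models.append(tree.fit(X[idx], y[idx]))

# Aggregate: stack the 10 prediction vectors and take the majority label.
votes = np.stack([m.predict(X) for m in models])          # shape (10, n_samples)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)      # ties go to class 1
accuracy = (bagged_pred == y).mean()
print(accuracy)
```

`BaggingClassifier` automates exactly this loop and adds options such as sampling features as well as rows.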
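The essence of research: standing on other people's shoulders and finding the flaws in their work?
Read widely, go through every existing approach, pinpoint where each one falls short, and then fix those shortcomings. That fix is your contribution.
Voting keeps the data the same and varies the algorithm; Bagging keeps the algorithm the same and varies the data. Both are rule-based strategies, and both share the same weakness: the combination rule is fixed by hand rather than learned.
Data science is about handing the decisions to the data. The fewer decisions a human hard-codes, the better the algorithm tends to be.
Deep learning takes this to its conclusion: essentially all decisions are left to the data.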
3. Stacking: a two-stage strategy
"""
    3. Stacking: a two-stage strategy
        a learned combination strategy
"""
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import StackingRegressor
"""
Stacked generalization consists in stacking the output of individual
estimator and use a classifier to compute the final prediction.
"""
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Stage 1: the base estimators whose predictions get stacked.
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svc', make_pipeline(StandardScaler(),
                          LinearSVC(dual="auto", random_state=42)))]

# Stage 2: the final estimator learns how to combine the stage-1 outputs.
clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression()
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
clf.fit(X_train, y_train).score(X_test, y_test)
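The two stages can also be written out by hand, which shows what the final estimator actually sees. This is a rough sketch, not how `StackingClassifier` is implemented internally: it assumes a random forest and a KNN classifier as base models and uses out-of-fold probabilities as meta-features. The names `meta_train`, `meta_test` and `final_model` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

base_models = [RandomForestClassifier(n_estimators=10, random_state=42),
               KNeighborsClassifier(n_neighbors=5)]

# Stage 1: out-of-fold class probabilities become the meta-features, so the
# final estimator never sees predictions made on a model's own training rows.
meta_train = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")
    for m in base_models
])

# Stage 2: the final estimator learns how to weigh the base predictions.
final_model = LogisticRegression(max_iter=1000).fit(meta_train, y_train)

# At prediction time the base models are refit on the full training set.
meta_test = np.hstack([
    m.fit(X_train, y_train).predict_proba(X_test) for m in base_models
])
stacked_accuracy = final_model.score(meta_test, y_test)
print(meta_train.shape, stacked_accuracy)
```

Unlike Voting and Bagging, the combination rule here is itself fitted to data, which is the point made in the notes above.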
4. Boosting
"""
    4. Boosting
        the "mistake notebook": keep revisiting what was gotten wrong
        "I examine myself three times a day" (Analects)
"""
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
"""
    An AdaBoost classifier is a meta-estimator that begins by fitting a
classifier on the original dataset and then fits additional copies of the
classifier on the same dataset but where the weights of incorrectly
classified instances are adjusted such that subsequent classifiers focus
more on difficult cases.
"""
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
# `algorithm="SAMME"` is omitted: the parameter is deprecated in recent
# scikit-learn releases. On older releases, passing algorithm="SAMME"
# explicitly avoids the SAMME.R deprecation warning.
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
clf.predict([[0, 0, 0, 0]])
clf.score(X, y)
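The "mistake notebook" is the sample-weight update. Below is a sketch of one round of the textbook discrete AdaBoost update for binary labels in {-1, +1}; the predictions of the weak learner are made up for illustration, and scikit-learn's implementation differs in some details.

```python
import numpy as np

y_true = np.array([1, 1, 1, -1, -1, -1, 1, -1, 1, -1])
# Hypothetical weak learner output: wrong on samples 2 and 5.
y_pred = np.array([1, 1, -1, -1, -1, 1, 1, -1, 1, -1])

w = np.full(len(y_true), 1 / len(y_true))   # start with uniform weights
miss = y_pred != y_true

err = w[miss].sum()                          # weighted error rate
alpha = 0.5 * np.log((1 - err) / err)        # say of this weak learner

# Misclassified samples (y_true * y_pred = -1) are scaled up,
# correctly classified ones (y_true * y_pred = +1) are scaled down.
w_new = w * np.exp(-alpha * y_true * y_pred)
w_new /= w_new.sum()                         # renormalize to sum to 1

print(err, alpha)
print(w_new)
```

After the update the misclassified samples carry half of the total weight, so the next weak learner is forced to concentrate on them.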
5. Core ensemble learning algorithms
"""
    5. Core ensemble learning algorithms
        RandomForestXX (RandomForestClassifier / RandomForestRegressor)
"""
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
"""
A random forest is a meta estimator that fits a number of decision tree
classifiers on various sub-samples of the dataset and uses averaging to
improve the predictive accuracy and control over-fitting.
Trees in the forest use the best split strategy, i.e. equivalent to passing
`splitter="best"` to the underlying :class:`~sklearn.tree.DecisionTreeClassifier`.
The sub-sample size is controlled with the `max_samples` parameter if
`bootstrap=True` (default), otherwise the whole dataset is used to build
each tree.
"""
RandomForestClassifier()
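A bare `RandomForestClassifier()` only constructs the model. The sketch below fits one on synthetic data and reads two things the bootstrap design gives for free: the out-of-bag accuracy and the feature importances. The dataset and hyperparameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,       # number of trees
    max_features="sqrt",    # features considered at each split
    oob_score=True,         # score each sample with trees that never saw it
    random_state=0,
)
forest.fit(X, y)

oob_accuracy = forest.oob_score_
importances = forest.feature_importances_
print(oob_accuracy)
print(importances)
```

Because each tree is trained on a bootstrap sample, the rows it did not draw act as a built-in validation set, which is what `oob_score_` reports.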
Install xgboost (e.g. pip install xgboost)
from xgboost import XGBClassifier
from xgboost import XGBRegressor
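Install lightgbm (e.g. pip install lightgbm)
"""
    Top recommendation: LightGBM
        - training speed
        - inference speed
        - data volume it can handle
        - quality of results
        - taken together, the strongest overall choice in these notes
"""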
from lightgbm import LGBMClassifier
from lightgbm import LGBMRegressor
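LightGBM's estimators follow the scikit-learn interface, so they drop into the same fit/score workflow as the cells above. This is a minimal sketch on synthetic data; the hyperparameter values are illustrative assumptions, and the import is guarded so the snippet still runs where lightgbm is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

try:
    from lightgbm import LGBMClassifier
except ImportError:  # lightgbm is an optional third-party dependency
    LGBMClassifier = None

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

if LGBMClassifier is not None:
    model = LGBMClassifier(
        n_estimators=100,     # number of boosting rounds
        learning_rate=0.1,    # shrinkage applied to each tree
        random_state=0,
        verbose=-1,           # silence training logs
    )
    model.fit(X_train, y_train)
    lgbm_accuracy = model.score(X_test, y_test)
else:
    lgbm_accuracy = None

print(lgbm_accuracy)
```

`LGBMRegressor` is used the same way for continuous targets, and `XGBClassifier` / `XGBRegressor` expose the same fit/predict pattern.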