
Linear Regression (Simple and Multiple)

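Contents

Introduction

1. Simple Linear Regression

1.1 Data Preparation

1.2 Modeling

2. Multiple Linear Regression

2.1 Data Preparation

2.2 Splitting the Data into Training and Test Sets

2.3 Modeling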

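Introduction

Linear regression is a statistical method for predicting a numeric output. It makes predictions by modeling a linear relationship between the independent variables (also called features) and the dependent variable. With a single independent variable, this relationship can be drawn as a straight line.

The goal of linear regression is to find the best-fitting line, the one that minimizes the difference between the predicted and the actual values. The usual way to solve this is ordinary least squares, which chooses the line's parameters by minimizing the sum of squared differences between the predicted and the actual values.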
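For the single-feature case, the least-squares criterion and its standard closed-form solution can be written out as below. The notation is introduced here for illustration only: (x_i, y_i) are the n observations, \bar{x} and \bar{y} their means, \beta_1 the slope and \beta_0 the intercept.

```latex
\[
\min_{\beta_0,\,\beta_1} \; \sum_{i=1}^{n} \bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2
\]
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]
```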

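A linear regression model has the following form:

Y = β0 + β1X1 + β2X2 + ... + βnXn + ε

Here Y is the dependent variable, X1 to Xn are the independent variables, β0 is the intercept, β1 to βn are the coefficients of the independent variables, and ε is the error term.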
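As a minimal sketch of how the parameters of this model can be estimated, the snippet below builds synthetic data from known coefficients and recovers them with NumPy's least-squares solver. The data, the coefficient values, and the variable names are assumptions made for illustration; they are not taken from the datasets used later in this article.

```python
import numpy as np

# Synthetic, noise-free data generated from known (hypothetical) coefficients:
# y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
true_beta = np.array([3.0, 2.0, -1.0])  # [beta0, beta1, beta2]
y = true_beta[0] + X @ true_beta[1:]

# Prepend a column of ones so the intercept beta0 is estimated with the slopes.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta ≈ y.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)
```

Because the synthetic data contains no noise, the estimated vector should match the coefficients used to generate it up to floating-point error.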

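The advantages of linear regression include a simple, easily interpreted model and low computational cost. Its main limitation is the assumption that the relationship between the independent and dependent variables is linear: if the true relationship is nonlinear, linear regression may not give accurate predictions. Linear regression is also sensitive to outliers and to multicollinearity.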
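The sensitivity to outliers can be illustrated with a small experiment. The sketch below uses made-up points that lie exactly on a line, then corrupts a single observation and refits; `np.polyfit` with `deg=1` stands in for a least-squares line fit here.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0  # points lying exactly on y = 2x + 1

# np.polyfit returns coefficients from the highest degree down: [slope, intercept].
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

y_outlier = y.copy()
y_outlier[-1] += 50.0  # corrupt one observation at the right-hand end

slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)
print(slope_clean, slope_out)
```

A single bad point out of ten is enough to more than double the estimated slope in this setup, because squared errors give large deviations a disproportionate weight.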

1. Simple Linear Regression

1.1 Data Preparation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression  # LinearRegression is a class
x = np.array([10, 20, 30, 40, 50, 60])  # independent variable
y = np.array([15, 45, 50, 70, 110, 130])  # dependent variable
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features)
'''Output:
X array([[10],
       [20],
       [30],
       [40],
       [50],
       [60]])
y array([ 15,  45,  50,  70, 110, 130])
'''

1.2 Modeling

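linreg = LinearRegression()  # create the model instance
linreg.fit(X, y)  # fit the model to X and y
y_predict = linreg.predict(X)  # fitted values
plt.scatter(X, y)  # observed points
plt.plot(X, y_predict, color='red')  # fitted regression line
plt.show()

print(linreg.coef_)  # slope
# [2.25714286]

print(linreg.intercept_)  # intercept (where the line crosses the y-axis)
# -9.000000000000014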
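The slope and intercept printed above can be cross-checked by hand with the closed-form formulas for one feature. The sketch below uses only NumPy and the same six data points; the new input `x_new = 70` is a hypothetical value chosen to show how the fitted line is used for prediction.

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60])
y = np.array([15, 45, 50, 70, 110, 130])

# Closed-form least-squares estimates for a single feature.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # approximately 2.2571 and -9.0

# Predict the target for a new, hypothetical input.
x_new = 70
y_new = intercept + slope * x_new
print(y_new)  # approximately 149.0
```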

2. Multiple Linear Regression

2.1 Data Preparation

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('50_Startups.csv')
dataset
'''Output:
 	R&D Spend 	Administration 	Marketing Spend 	State 	Profit
0 	165349.20 	136897.80 	471784.10 	New York 	192261.83
1 	162597.70 	151377.59 	443898.53 	California 	191792.06
2 	153441.51 	101145.55 	407934.54 	Florida 	191050.39
3 	144372.41 	118671.85 	383199.62 	New York 	182901.99
4 	142107.34 	91391.77 	366168.42 	Florida 	166187.94
5 	131876.90 	99814.71 	362861.36 	New York 	156991.12
6 	134615.46 	147198.87 	127716.82 	California 	156122.51
7 	130298.13 	145530.06 	323876.68 	Florida 	155752.60
8 	120542.52 	148718.95 	311613.29 	New York 	152211.77
9 	123334.88 	108679.17 	304981.62 	California 	149759.96
10 	101913.08 	110594.11 	229160.95 	Florida 	146121.95
11 	100671.96 	91790.61 	249744.55 	California 	144259.40
12 	93863.75 	127320.38 	249839.44 	Florida 	141585.52
13 	91992.39 	135495.07 	252664.93 	California 	134307.35
14 	119943.24 	156547.42 	256512.92 	Florida 	132602.65
15 	114523.61 	122616.84 	261776.23 	New York 	129917.04
16 	78013.11 	121597.55 	264346.06 	California 	126992.93
17 	94657.16 	145077.58 	282574.31 	New York 	125370.37
18 	91749.16 	114175.79 	294919.57 	Florida 	124266.90
19 	86419.70 	153514.11 	0.00 	New York 	122776.86
20 	76253.86 	113867.30 	298664.47 	California 	118474.03
21 	78389.47 	153773.43 	299737.29 	New York 	111313.02
22 	73994.56 	122782.75 	303319.26 	Florida 	110352.25
23 	67532.53 	105751.03 	304768.73 	Florida 	108733.99
24 	77044.01 	99281.34 	140574.81 	New York 	108552.04
25 	64664.71 	139553.16 	137962.62 	California 	107404.34
26 	75328.87 	144135.98 	134050.07 	Florida 	105733.54
27 	72107.60 	127864.55 	353183.81 	New York 	105008.31
28 	66051.52 	182645.56 	118148.20 	Florida 	103282.38
29 	65605.48 	153032.06 	107138.38 	New York 	101004.64
30 	61994.48 	115641.28 	91131.24 	Florida 	99937.59
31 	61136.38 	152701.92 	88218.23 	New York 	97483.56
32 	63408.86 	129219.61 	46085.25 	California 	97427.84
33 	55493.95 	103057.49 	214634.81 	Florida 	96778.92
34 	46426.07 	157693.92 	210797.67 	California 	96712.80
35 	46014.02 	85047.44 	205517.64 	New York 	96479.51
36 	28663.76 	127056.21 	201126.82 	Florida 	90708.19
37 	44069.95 	51283.14 	197029.42 	California 	89949.14
38 	20229.59 	65947.93 	185265.10 	New York 	81229.06
39 	38558.51 	82982.09 	174999.30 	California 	81005.76
40 	28754.33 	118546.05 	172795.67 	California 	78239.91
41 	27892.92 	84710.77 	164470.71 	Florida 	77798.83
42 	23640.93 	96189.63 	148001.11 	California 	71498.49
43 	15505.73 	127382.30 	35534.17 	New York 	69758.98
44 	22177.74 	154806.14 	28334.72 	California 	65200.33
45 	1000.23 	124153.04 	1903.93 	New York 	64926.08
46 	1315.46 	115816.21 	297114.46 	Florida 	49490.75
47 	0.00 	135426.92 	0.00 	California 	42559.73
48 	542.05 	51743.15 	0.00 	New York 	35673.41
49 	0.00 	116983.80 	45173.06 	California 	14681.40
'''
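X = dataset.iloc[:, 0:4]  # the first four columns are the features
y = dataset.iloc[:, 4]  # the fifth column (Profit) is the target y

X["State"].unique()  # list the distinct values
# Output: array(['New York', 'California', 'Florida'], dtype=object)

# State is categorical, so it has to be encoded as numbers
pd.get_dummies(X['State'])
'''Output: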
 	California 	Florida 	New York
0 	0 	0 	1
1 	1 	0 	0
2 	0 	1 	0
3 	0 	0 	1
4 	0 	1 	0
5 	0 	0 	1
6 	1 	0 	0
7 	0 	1 	0
8 	0 	0 	1
9 	1 	0 	0
10 	0 	1 	0
11 	1 	0 	0
12 	0 	1 	0
13 	1 	0 	0
14 	0 	1 	0
15 	0 	0 	1
16 	1 	0 	0
17 	0 	0 	1
18 	0 	1 	0
19 	0 	0 	1
20 	1 	0 	0
21 	0 	0 	1
22 	0 	1 	0
23 	0 	1 	0
24 	0 	0 	1
25 	1 	0 	0
26 	0 	1 	0
27 	0 	0 	1
28 	0 	1 	0
29 	0 	0 	1
30 	0 	1 	0
31 	0 	0 	1
32 	1 	0 	0
33 	0 	1 	0
34 	1 	0 	0
35 	0 	0 	1
36 	0 	1 	0
37 	1 	0 	0
38 	0 	0 	1
39 	1 	0 	0
40 	1 	0 	0
41 	0 	1 	0
42 	1 	0 	0
43 	0 	0 	1
44 	1 	0 	0
45 	0 	0 	1
46 	0 	1 	0
47 	1 	0 	0
48 	0 	0 	1
49 	1 	0 	0
'''

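# drop_first=True drops the first level: Florida = 0 and New York = 0 together
# already mean California, so a third column would be redundant.
# Note: pandas 2.0 and later return True/False here unless dtype=int is passed.
statesdump = pd.get_dummies(X['State'], drop_first=True)
'''Output: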
 	Florida 	New York
0 	0 	1
1 	0 	0
2 	1 	0
3 	0 	1
4 	1 	0
5 	0 	1
6 	0 	0
7 	1 	0
8 	0 	1
9 	0 	0
10 	1 	0
11 	0 	0
12 	1 	0
13 	0 	0
14 	1 	0
15 	0 	1
16 	0 	0
17 	0 	1
18 	1 	0
19 	0 	1
20 	0 	0
21 	0 	1
22 	1 	0
23 	1 	0
24 	0 	1
25 	0 	0
26 	1 	0
27 	0 	1
28 	1 	0
29 	0 	1
30 	1 	0
31 	0 	1
32 	0 	0
33 	1 	0
34 	0 	0
35 	0 	1
36 	1 	0
37 	0 	0
38 	0 	1
39 	0 	0
40 	0 	0
41 	1 	0
42 	0 	0
43 	0 	1
44 	0 	0
45 	0 	1
46 	1 	0
47 	0 	0
48 	0 	1
49 	0 	0
'''
X = X.drop('State', axis=1)  # remove the State column
X = pd.concat([X, statesdump], axis=1)  # append the dummy columns
'''Output:
 	R&D Spend 	Administration 	Marketing Spend 	Florida 	New York
0 	165349.20 	136897.80 	471784.10 	0 	1
1 	162597.70 	151377.59 	443898.53 	0 	0
2 	153441.51 	101145.55 	407934.54 	1 	0
3 	144372.41 	118671.85 	383199.62 	0 	1
4 	142107.34 	91391.77 	366168.42 	1 	0
5 	131876.90 	99814.71 	362861.36 	0 	1
6 	134615.46 	147198.87 	127716.82 	0 	0
7 	130298.13 	145530.06 	323876.68 	1 	0
8 	120542.52 	148718.95 	311613.29 	0 	1
9 	123334.88 	108679.17 	304981.62 	0 	0
10 	101913.08 	110594.11 	229160.95 	1 	0
11 	100671.96 	91790.61 	249744.55 	0 	0
12 	93863.75 	127320.38 	249839.44 	1 	0
13 	91992.39 	135495.07 	252664.93 	0 	0
14 	119943.24 	156547.42 	256512.92 	1 	0
15 	114523.61 	122616.84 	261776.23 	0 	1
16 	78013.11 	121597.55 	264346.06 	0 	0
17 	94657.16 	145077.58 	282574.31 	0 	1
18 	91749.16 	114175.79 	294919.57 	1 	0
19 	86419.70 	153514.11 	0.00 	0 	1
20 	76253.86 	113867.30 	298664.47 	0 	0
21 	78389.47 	153773.43 	299737.29 	0 	1
22 	73994.56 	122782.75 	303319.26 	1 	0
23 	67532.53 	105751.03 	304768.73 	1 	0
24 	77044.01 	99281.34 	140574.81 	0 	1
25 	64664.71 	139553.16 	137962.62 	0 	0
26 	75328.87 	144135.98 	134050.07 	1 	0
27 	72107.60 	127864.55 	353183.81 	0 	1
28 	66051.52 	182645.56 	118148.20 	1 	0
29 	65605.48 	153032.06 	107138.38 	0 	1
30 	61994.48 	115641.28 	91131.24 	1 	0
31 	61136.38 	152701.92 	88218.23 	0 	1
32 	63408.86 	129219.61 	46085.25 	0 	0
33 	55493.95 	103057.49 	214634.81 	1 	0
34 	46426.07 	157693.92 	210797.67 	0 	0
35 	46014.02 	85047.44 	205517.64 	0 	1
36 	28663.76 	127056.21 	201126.82 	1 	0
37 	44069.95 	51283.14 	197029.42 	0 	0
38 	20229.59 	65947.93 	185265.10 	0 	1
39 	38558.51 	82982.09 	174999.30 	0 	0
40 	28754.33 	118546.05 	172795.67 	0 	0
41 	27892.92 	84710.77 	164470.71 	1 	0
42 	23640.93 	96189.63 	148001.11 	0 	0
43 	15505.73 	127382.30 	35534.17 	0 	1
44 	22177.74 	154806.14 	28334.72 	0 	0
45 	1000.23 	124153.04 	1903.93 	0 	1
46 	1315.46 	115816.21 	297114.46 	1 	0
47 	0.00 	135426.92 	0.00 	0 	0
48 	542.05 	51743.15 	0.00 	0 	1
49 	0.00 	116983.80 	45173.06 	0 	0
'''
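Dropping one dummy column is more than a space saving: with an intercept in the model, keeping all three state columns makes the design matrix linearly dependent (often called the dummy variable trap). The sketch below illustrates this on a five-row toy `State` column, which is an assumption for illustration rather than the real dataset, by comparing matrix ranks.

```python
import numpy as np
import pandas as pd

# Toy categorical column (hypothetical rows, same three states as the dataset).
states = pd.Series(
    ['New York', 'California', 'Florida', 'New York', 'Florida'], name='State'
)

full = pd.get_dummies(states, dtype=int)                      # 3 dummy columns
reduced = pd.get_dummies(states, drop_first=True, dtype=int)  # 2 dummy columns

# Add the intercept column of ones and compare ranks.
ones = np.ones((len(states), 1))
rank_full = np.linalg.matrix_rank(np.hstack([ones, full.to_numpy()]))
rank_reduced = np.linalg.matrix_rank(np.hstack([ones, reduced.to_numpy()]))

print(full.shape[1] + 1, rank_full)        # 4 columns, but rank 3
print(reduced.shape[1] + 1, rank_reduced)  # 3 columns, rank 3
```

With all three dummies the columns sum to the intercept column, so one of them carries no extra information. After `drop_first=True` the matrix has full column rank.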

2.2 Splitting the Data into Training and Test Sets

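from sklearn.model_selection import train_test_split  # splits the data into training and test sets
# 30% of the rows go to the test set; random_state=0 fixes the random shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.count()
'''Output: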
R&D Spend          35
Administration     35
Marketing Spend    35
Florida            35
New York           35
dtype: int64
'''

X_test.count()
'''Output:
R&D Spend          15
Administration     15
Marketing Spend    15
Florida            15
New York           15
dtype: int64
'''
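The role of `random_state` can be checked with a short experiment. The sketch below uses placeholder arrays of 50 rows (an assumption standing in for the real features) and splits them twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 rows with 2 features each, plus a target.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

print(a_train.shape, a_test.shape)     # (35, 2) (15, 2)
print(np.array_equal(a_test, b_test))  # True: same seed, same split
```

Using a fixed seed keeps scores comparable between runs. Changing or omitting `random_state` generally produces a different split and therefore a somewhat different test score.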

2.3 Modeling

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
regressor = LinearRegression()
model = regressor.fit(X_train, y_train)  # fit the model on the training set
y_predict = regressor.predict(X_test)  # predicted target values for the test set

score1 = r2_score(y_test, y_predict)  # compare the test-set targets with the predictions
score1  # R² score; the closer to 1, the better
# 0.9358680970046241

model.coef_  # one coefficient per feature: five features, so five coefficients
# Output: array([7.90840255e-01, 3.01968165e-02, 3.10148566e-02, 4.63028992e+02, 3.04799573e+02])

model.score(X, y)  # R² on the full dataset (training and test rows together)
# Output: 0.9489303683771293

model.intercept_  # intercept
# Output: 42403.8708705279
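To make the connection between `coef_`, `intercept_`, and the model equation explicit, the sketch below reproduces `predict` by hand and shows that `score` is the same quantity as `r2_score`. It runs on synthetic data with made-up coefficients, since `50_Startups.csv` is not bundled here; the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: 50 rows, 3 features, known coefficients plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = 5.0 + X_demo @ np.array([0.8, 0.03, 0.03]) + rng.normal(scale=1.0, size=50)

model = LinearRegression().fit(X_demo, y_demo)
predicted = model.predict(X_demo)

# Y = beta0 + beta1*X1 + ... + betan*Xn, evaluated manually.
manual = model.intercept_ + X_demo @ model.coef_

print(np.allclose(manual, predicted))
print(model.score(X_demo, y_demo), r2_score(y_demo, predicted))
```

In practice the R² on held-out test data (as in `score1` above) is the more informative number, because a score computed on rows the model was trained on tends to be optimistic.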
