This is Part-2 of a two-part series. Please read Part-1 here: https://medium.com/@kala.shagun/stock-market-prediction-using-news-sentiments-f9101e5ee1f4
Modeling
The ML models used here were chosen with production in mind, since we want to deploy the model. A time series model has to be retrained in production whenever new data points arrive to keep its predictions accurate, so we will use only models that are cheap to train, i.e. models that retrain quickly on new data.
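To make the retraining requirement concrete, here is a minimal sketch of walk-forward evaluation: the model is refit on all data seen so far before each one-step forecast. The names `walk_forward` and `naive_last_value` are hypothetical, and the naive forecaster is only a stand-in for whichever model is actually deployed.

```python
def walk_forward(series, n_test, fit_and_forecast):
    """Refit on the growing history before every one-step-ahead forecast."""
    if not 0 < n_test < len(series):
        raise ValueError("n_test must be between 1 and len(series) - 1")
    history = list(series[:-n_test])
    predictions = []
    for actual in series[-n_test:]:
        # Retrain on everything observed so far, then predict the next point.
        predictions.append(fit_and_forecast(history))
        # The true value becomes available and joins the training data.
        history.append(actual)
    return predictions


def naive_last_value(history):
    """Placeholder model (an assumption): forecast the last observed value."""
    return history[-1]
```

Because the fit happens inside the loop, training cost is paid once per new data point, which is why fast-training models are attractive here.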
1. ARIMA
An ARIMA model is a class of statistical models for analyzing and forecasting time series data.
ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a generalization of the simpler AutoRegressive Moving Average and adds the notion of integration.
This acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are:
AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.
I: Integrated. The use of differencing of raw observations (e.g. subtracting the observation at the previous time step from the current observation) in order to make the time series stationary.
MA: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.
Each of these components is explicitly specified in the model as a parameter. The parameters of the ARIMA model are defined as follows:
p: The number of lag observations included in the model, also called the lag order.
d: The number of times that the raw observations are differenced, also called the degree of differencing.
q: The size of the moving average window, also called the order of moving average.
Let’s plot the autocorrelation (ACF) and partial autocorrelation (PACF) of the series to identify the above parameter values.
p: The lag at which the PACF chart first crosses the upper confidence bound, that is, where the partial autocorrelation first falls inside the confidence band. Looking closely, in this case p = 2.
q: The lag at which the ACF chart first crosses the upper confidence bound, that is, where the autocorrelation first falls inside the confidence band. Looking closely, in this case q = 2.
d: With the differencing method, a shift of 1 period produced a stationary time series, so we will use d = 1.
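The rule for reading q off the ACF chart can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the author's code: the helper names are hypothetical, the band is the usual large-sample approximation of ±1.96/√n, and "crossing" is read as the first lag whose autocorrelation falls inside that band. The same idea applies to p, with the PACF in place of the ACF.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)  # zero for a constant series
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def first_lag_inside_band(acf, n):
    """First lag >= 1 whose autocorrelation lies inside +/-1.96/sqrt(n)."""
    bound = 1.96 / math.sqrt(n)
    for lag in range(1, len(acf)):
        if abs(acf[lag]) < bound:
            return lag
    return None  # no lag up to max_lag falls inside the band
```

In practice the lag is usually read from the plot rather than computed, and it is worth treating the result as a starting point to validate against forecast error.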
We use ARIMA to forecast the stationary series obtained by differencing, then transform the forecasts back to the scale of the original time series.
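As a minimal sketch of that round trip with d = 1 (the function names are hypothetical): differencing replaces each price with its change from the previous step, and the inverse transform rebuilds price levels by cumulatively adding forecast changes to the last observed price.

```python
def difference(series):
    """First-order differencing: value at t minus value at t-1."""
    return [series[i] - series[i - 1] for i in range(1, len(series))]


def invert_difference(last_observed, diffs):
    """Turn forecast differences back into levels, starting from the last known value."""
    levels = []
    level = last_observed
    for d in diffs:
        level += d
        levels.append(level)
    return levels


prices = [100, 102, 101, 105]
changes = difference(prices)  # [2, -1, 4]
rebuilt = invert_difference(prices[0], changes)  # [102, 101, 105]
```

When forecasting, `last_observed` would be the final training price and `diffs` the model's predicted changes.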
RMSE from ARIMA = 1707.77
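For reference, RMSE is the square root of the mean squared difference between the forecasts and the actual values, so it is expressed in the same units as the price. A minimal sketch of the metric (the function name `rmse` is ours, not taken from the original code):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.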
Let’s try to improve the prediction using more advanced methods.
2. SARIMAX
A plain ARIMA model captures trend (through differencing) and short-term autocorrelation, but it has no seasonal terms. SARIMAX is a variation of the ARIMA model that adds seasonal components and, as the X in its name indicates, optional exogenous regressors. Our data does not show strong seasonality, but it is still worth a try.
RMSE from SARIMAX = 964.97
Whoa! RMSE dropped from 1707.77 to 964.97. SARIMAX works much better than plain ARIMA on this data.
3. Facebook Prophet
Prophet is an open-source library published by Facebook that is based on decomposable (trend + seasonality + holidays) models. It lets us make time-series predictions with good accuracy using simple, intuitive parameters, and it supports custom seasonality and holiday effects.
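As a rough sketch of how the library is typically used: Prophet expects a DataFrame with a `ds` column (timestamps) and a `y` column (values). The helper names below are hypothetical, and the monthly custom seasonality is an illustrative assumption rather than a setting taken from this article. The package is published as `prophet` (older releases used the name `fbprophet`).

```python
import pandas as pd


def to_prophet_frame(dates, prices):
    """Build the two-column frame Prophet expects: 'ds' and 'y'."""
    return pd.DataFrame({"ds": pd.to_datetime(list(dates)), "y": list(prices)})


def prophet_forecast(frame, periods):
    """Fit Prophet and forecast `periods` days past the end of `frame`."""
    # Imported here so the data-prep helper works without Prophet installed.
    from prophet import Prophet

    model = Prophet()
    # Custom seasonality; the monthly period is an assumption for illustration.
    model.add_seasonality(name="monthly", period=30.5, fourier_order=5)
    model.fit(frame)
    future = model.make_future_dataframe(periods=periods)
    forecast = model.predict(future)  # includes 'yhat' and its uncertainty bounds
    return model, forecast
```

The returned model can then draw the trend and seasonality panels with `model.plot_components(forecast)`.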
RMSE from Facebook Prophet = 709.70
Nice! RMSE has dropped further, from 964.97 to 709.70. That is still far from an acceptable prediction, so let’s try deep learning models next.
Before going ahead, let’s look at some useful plots Facebook Prophet provides:
Our data does contain some seasonal information, which is why SARIMAX also performed well.
The following points can be observed from the above graphs:
- Our data shows an upward trend.
- Stock price gets up on Saturday and remains almost fl