Simulink makes it easy to build all kinds of dynamics and control models. By replacing the existing controller with an AI controller, we can reuse an existing model directly and obtain an incremental improvement.
This section focuses on how to bring a Simulink model in as the env; everything else has been covered in the previous articles.
Take the water tank model watertank as an example, shown in the figure below:
With a PI controller, the control performance looks like this:
After replacing the PI controller with a neural-network controller, the system architecture becomes:
The replacement procedure is as follows:
(1) Delete the PID controller;
(2) Add the RL Agent block;
(3) Observation block:
The observation vector is [∫e dt; e; h], where h is the measured water level, e = r − h is the level error, and r is the desired water level.
(4) Reward function, defined as shown below (see the sketch after this list);
(5) Termination condition (see the sketch after this list).
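For reference, in the shipped rlwatertank example both the reward and the early-termination condition are built from Simulink blocks inside the model. Expressed as MATLAB code, the logic is roughly the following sketch; the thresholds (0.1, 0 and 20) are those of the shipped example and are an assumption for any other model.
% Sketch of the reward and termination logic of the shipped rlwatertank example.
% e = r - h is the level error, h is the measured water level.
reward = 10*(abs(e) < 0.1) - 1*(abs(e) >= 0.1) - 100*(h <= 0 || h >= 20);
% The episode terminates early once the level leaves the valid range.
isDone = (h <= 0) || (h >= 20);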
Creating the interface
Observation specification:
obsInfo = rlNumericSpec([3 1],...
'LowerLimit',[-inf -inf 0 ]',...
'UpperLimit',[ inf inf inf]');
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, and measured height';
numObservations = obsInfo.Dimension(1);
Action specification:
actInfo = rlNumericSpec([1 1]);
actInfo.Name = 'flow';
numActions = actInfo.Dimension(1);
Build the interface (the most critical step):
env = rlSimulinkEnv('rlwatertank','rlwatertank/RL Agent',...
obsInfo,actInfo);
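Optionally (this check is not part of the original text), the interface can be sanity-checked with validateEnvironment, which runs a short simulation to confirm that the observation and action signals are wired to the RL Agent block correctly:
% Optional: run a brief simulation to verify the environment wiring.
validateEnvironment(env)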
Define the reset function:
env.ResetFcn = @(in)localResetFcn(in);
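The code above references localResetFcn without showing it. A minimal sketch, following the shipped rlwatertank example (the block paths and the 3*randn + 10 distribution come from that example and must be adapted for any other model), looks like this:
function in = localResetFcn(in)
% Randomize the desired water level at the start of each episode.
blk = sprintf('rlwatertank/Desired \nWater Level');
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
in = setBlockParameter(in,blk,'Value',num2str(h));
% Randomize the initial water level of the tank.
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
blk = 'rlwatertank/Water-Tank System/H';
in = setBlockParameter(in,blk,'InitialCondition',num2str(h));
end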
Set the total simulation time and the sample time:
Ts = 1.0;
Tf = 200;
Fix the random seed:
rng(0)
Creating the DDPG agent
See the previous article for details.
State path of the critic network:
statePath = [
featureInputLayer(numObservations,'Normalization','none','Name','State')
fullyConnectedLayer(50,'Name','CriticStateFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(25,'Name','CriticStateFC2')];
Action path of the critic network:
actionPath = [
featureInputLayer(numActions,'Normalization','none','Name','Action')
fullyConnectedLayer(25,'Name','CriticActionFC1')];
Common path that merges the two:
commonPath = [
additionLayer(2,'Name','add')
reluLayer('Name','CriticCommonRelu')
fullyConnectedLayer(1,'Name','CriticOutput')];
Assemble the critic network:
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
Configure the critic representation options:
criticOpts = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'State'},'Action',{'Action'},criticOpts);
Actor network:
actorNetwork = [
featureInputLayer(numObservations,'Normalization','none','Name','State')
fullyConnectedLayer(3, 'Name','actorFC')
tanhLayer('Name','actorTanh')
fullyConnectedLayer(numActions,'Name','Action')
];
Configure the actor representation options:
actorOptions = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'State'},'Action',{'Action'},actorOptions);
Build the final DDPG agent:
agentOpts = rlDDPGAgentOptions(...
'SampleTime',Ts,...
'TargetSmoothFactor',1e-3,...
'DiscountFactor',1.0, ...
'MiniBatchSize',64, ...
'ExperienceBufferLength',1e6);
agentOpts.NoiseOptions.StandardDeviation = 0.3;
agentOpts.NoiseOptions.StandardDeviationDecayRate = 1e-5;
agent = rlDDPGAgent(actor,critic,agentOpts);
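As a quick sanity check (an optional step, not in the original), getAction can be called on a random observation to confirm that the untrained agent already produces a scalar flow command:
% Query the agent with a random observation; the result should be a
% 1-by-1 action (the commanded flow).
getAction(agent,{rand(obsInfo.Dimension)})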
Training the network
maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes, ...
'MaxStepsPerEpisode',maxsteps, ...
'ScoreAveragingWindowLength',20, ...
'Verbose',false, ...
'Plots','training-progress',...
'StopTrainingCriteria','AverageReward',...
'StopTrainingValue',800);
Start training:
trainingStats = train(agent,env,trainOpts);
During training, you can see that the main program keeps calling the Simulink model.
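Because training can take a long time, a common pattern (not part of the original code) is to save the trained agent and reload it later instead of retraining:
% Save the trained agent for later reuse (the file name is arbitrary).
save('WaterTankDDPGAgent.mat','agent')
% In a later session, skip training and load it instead:
% load('WaterTankDDPGAgent.mat','agent')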
Finally, validate the training result:
simOpts = rlSimulationOptions('MaxSteps',maxsteps,'StopOnError','on');
experiences = sim(env,agent,simOpts);
The result is shown below:
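To inspect the closed-loop response numerically rather than only in the Simulink scopes, the logged experiences can be examined as below. This is a sketch that assumes the sim output exposes the observation channel named 'observations' as a timeseries, with the measured height as its third component:
% Extract the measured water level from the logged observations and plot it.
obs = experiences.Observation.observations;   % timeseries of [integral of e; e; h]
h = squeeze(obs.Data(3,1,:));                 % measured water level over time
plot(obs.Time,h); grid on
xlabel('Time (s)'); ylabel('Water level')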