
[MATLAB Reinforcement Learning Toolbox] Study Notes -- Training an Agent in a Simulink Environment (Create Simulink Environment and Train Agent)

Simulink makes it easy to build all kinds of dynamics and control models. By replacing the existing controller with an AI (RL) controller, you can reuse an existing model and improve it incrementally.

The focus of this section is how to bring a Simulink model in as the environment (env); the remaining steps were covered in the earlier posts.

Take the water tank model watertank as the example. (The model diagrams and response plots from the original figures are omitted here.)

The baseline model regulates the water level with a PI controller.

This PI controller is then replaced with a neural-network controller driven by an RL Agent block, which changes the system architecture.

The replacement procedure is as follows:

(1) Delete the PID Controller block;

(2) Add an RL Agent block;

(3) Build the observation signal:

The observation vector is [\int e\,\mathrm{d}t \ \ e \ \ h]^T, where h is the measured water level, e = r - h is the tracking error, and r is the reference (setpoint) level.

(4) The reward is defined as

\mathrm{reward}=10(|e|<0.1)-1(|e|\geq 0.1)-100(h\leq 0||h\geq 20)

(In the Simulink model the reward is computed by a dedicated subsystem; the figure is omitted. A MATLAB sketch of the reward and termination logic is given after item (5).)

(5) Termination condition: the episode ends when

h\leq 0 || h\geq 20
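As noted above, these two signals are wired up inside the Simulink model rather than written in MATLAB. A minimal MATLAB sketch of the same logic, with hypothetical example values for r and h, would be:

% Sketch of the reward and termination logic that the Simulink blocks implement;
% the numeric values of r and h here are hypothetical examples.
r = 10;                                 % reference water level (example value)
h = 9.95;                               % measured water level (example value)
e = r - h;                              % tracking error
reward = 10*(abs(e) <  0.1) ...         % bonus while the error is small
        - 1*(abs(e) >= 0.1) ...         % small penalty otherwise
      - 100*(h <= 0 || h >= 20);        % large penalty on empty tank / overflow
isDone = (h <= 0) || (h >= 20);         % terminate the episode on empty/overflow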

Creating the environment interface

Observation specification:

obsInfo = rlNumericSpec([3 1],...
    'LowerLimit',[-inf -inf 0  ]',...
    'UpperLimit',[ inf  inf inf]');
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, and measured height';
numObservations = obsInfo.Dimension(1);

Action specification:

actInfo = rlNumericSpec([1 1]);
actInfo.Name = 'flow';
numActions = actInfo.Dimension(1);

Build the environment interface (this is the key step):

env = rlSimulinkEnv('rlwatertank','rlwatertank/RL Agent',...
    obsInfo,actInfo);
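An optional extra check, not part of the original note: validateEnvironment runs a short simulation of the model to confirm that the signals wired into the RL Agent block match obsInfo and actInfo.

% Optional check (not in the original note): verify that the RL Agent block's
% observation and action signals are consistent with obsInfo and actInfo.
validateEnvironment(env)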

Define the reset function:

env.ResetFcn = @(in)localResetFcn(in);
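The localResetFcn used here is not listed in this note. A minimal sketch of what such a reset function can look like, assuming block paths 'rlwatertank/Desired Water Level' and 'rlwatertank/Water-Tank System/H' (these paths are assumptions, not taken from this note):

function in = localResetFcn(in)
% Randomize the reference level and the initial tank level for each episode,
% keeping both inside the valid range (0, 20).
% NOTE: the block paths below are assumptions about the rlwatertank model.
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
in = setBlockParameter(in,'rlwatertank/Desired Water Level','Value',num2str(h));
h = 3*randn + 10;
while h <= 0 || h >= 20
    h = 3*randn + 10;
end
in = setBlockParameter(in,'rlwatertank/Water-Tank System/H','InitialCondition',num2str(h));
end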

Set the sample time and the total simulation time:

Ts = 1.0;
Tf = 200;

Fix the random seed for reproducibility:

rng(0)

Creating the DDPG agent

See the previous post for the details.

State (observation) path of the critic:

statePath = [
    featureInputLayer(numObservations,'Normalization','none','Name','State')
    fullyConnectedLayer(50,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(25,'Name','CriticStateFC2')];

Action path of the critic:

actionPath = [
    featureInputLayer(numActions,'Normalization','none','Name','Action')
    fullyConnectedLayer(25,'Name','CriticActionFC1')];

Common path that merges the two:

commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','CriticOutput')];

Assemble the critic network:

criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
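An optional step, not in the original note: plotting the assembled layer graph makes it easy to confirm that the state and action paths meet at the 'add' layer.

% Optional: visualize the critic layer graph and its connections.
figure
plot(criticNetwork)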

Critic representation options:

criticOpts = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'State'},'Action',{'Action'},criticOpts);

Actor network:

actorNetwork = [
    featureInputLayer(numObservations,'Normalization','none','Name','State')
    fullyConnectedLayer(3, 'Name','actorFC')
    tanhLayer('Name','actorTanh')
    fullyConnectedLayer(numActions,'Name','Action')
    ];

Actor representation options:

actorOptions = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);

actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'State'},'Action',{'Action'},actorOptions);

Construct the final DDPG agent:

agentOpts = rlDDPGAgentOptions(...
    'SampleTime',Ts,...
    'TargetSmoothFactor',1e-3,...
    'DiscountFactor',1.0, ...
    'MiniBatchSize',64, ...
    'ExperienceBufferLength',1e6); 
agentOpts.NoiseOptions.StandardDeviation = 0.3;
agentOpts.NoiseOptions.StandardDeviationDecayRate = 1e-5;
agent = rlDDPGAgent(actor,critic,agentOpts);
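A quick sanity check that is not part of the original note: asking the untrained agent for an action given a random observation confirms that the specs and network dimensions are consistent.

% Query one action for a random 3-by-1 observation (dimensions from obsInfo).
getAction(agent,{rand(obsInfo.Dimension)})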

Training the agent

maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes, ...
    'MaxStepsPerEpisode',maxsteps, ...
    'ScoreAveragingWindowLength',20, ...
    'Verbose',false, ...
    'Plots','training-progress',...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',800);

Start training:

trainingStats = train(agent,env,trainOpts);

During training you can see the main program repeatedly invoking the Simulink model for each episode.

 

Finally, validate the trained agent:

simOpts = rlSimulationOptions('MaxSteps',maxsteps,'StopOnError','on');
experiences = sim(env,agent,simOpts);
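The logged experiences returned by sim can also be inspected directly, for instance to plot the measured water level. A sketch, assuming the default field layout (the Observation field is named after obsInfo.Name, i.e. 'observations', and obsLog.Data is 3-by-1-by-N):

% Plot the measured height from the logged experiences (field names assumed).
obsLog = experiences.Observation.observations;   % timeseries of [int(e); e; h]
plot(obsLog.Time,squeeze(obsLog.Data(3,1,:)))    % third channel is the height h
xlabel('Time (s)')
ylabel('Water level h')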

The resulting response plots, shown as figures in the original post, are omitted here.