Contents
1. Background
2. Listening for client long polling
3. Server-side push of configuration changes
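1. Background
In the previous article we looked at the client side of dynamic refresh in the Nacos configuration center: the long-polling listener, how configuration-change events are published, and how the client integrates with Spring Boot.
This article covers the server side: how the Nacos server handles the client's long-polling listener requests and how it actively pushes configuration changes.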
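2. Listening for client long polling
As the previous article showed, the Nacos client's long-polling loop calls checkUpdateDataIds(), which sends an HTTP request to the server to probe for configuration changes.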
// check server config
List<String> changedGroupKeys = checkUpdateDataIds(cacheDatas, inInitializingCacheList);
if (!CollectionUtils.isEmpty(changedGroupKeys)) {
LOGGER.info("get changedGroupKeys:" + changedGroupKeys);
}
Following checkUpdateDataIds() down, we find that the server endpoint it calls is /v1/cs/configs/listener.
HttpRestResult<String> result = agent
.httpPost(Constants.CONFIG_CONTROLLER_PATH + "/listener", headers, params, agent.getEncode(),
readTimeoutMs);
So we go straight to the server-side method mapped to /v1/cs/configs/listener.
@PostMapping("/listener")
@Secured(action = ActionTypes.READ, parser = ConfigResourceParser.class)
public void listener(HttpServletRequest request, HttpServletResponse response) throws IOException {
request.setAttribute("org.apache.catalina.ASYNC_SUPPORTED", true);
String probeModify = request.getParameter("Listening-Configs");
if (StringUtils.isBlank(probeModify)) {
throw new IllegalArgumentException("invalid probeModify");
}
probeModify = URLDecoder.decode(probeModify, Constants.ENCODE);
Map<String, String> clientMd5Map;
try {
// Parse the listening string into a groupKey -> MD5 map
clientMd5Map = MD5Util.getClientMd5Map(probeModify);
} catch (Throwable e) {
throw new IllegalArgumentException("invalid probeModify");
}
// Enter the long-polling logic
inner.doPollingConfig(request, response, clientMd5Map, probeModify.length());
}
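The first part of this method validates the request and builds clientMd5Map, which maps each listened configuration key to the MD5 value the client currently holds; it is used again below. The core is inner.doPollingConfig().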
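To make the shape of clientMd5Map concrete, here is a minimal sketch of parsing a listening string into a groupKey -> MD5 map. The class name, the separators (char 2 between fields, char 1 between entries), the dataId/group/md5 field order and the "+" join in the key are all assumptions for illustration; the real parsing lives in MD5Util.getClientMd5Map() and may differ in detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ProbeParser {
    // Assumed separators, for illustration only.
    static final String FIELD_SEP = String.valueOf((char) 2);
    static final String ENTRY_SEP = String.valueOf((char) 1);

    /** Parses "dataId<2>group<2>md5<1>..." into a groupKey -> md5 map. */
    static Map<String, String> parse(String probe) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : probe.split(ENTRY_SEP)) {
            if (entry.isEmpty()) {
                continue;
            }
            String[] fields = entry.split(FIELD_SEP, -1);
            if (fields.length != 3) {
                // Same failure mode as the controller: reject malformed input.
                throw new IllegalArgumentException("invalid probeModify");
            }
            // Hypothetical key format: dataId and group joined with '+'.
            result.put(fields[0] + "+" + fields[1], fields[2]);
        }
        return result;
    }
}
```

With the map in hand, the flow continues in doPollingConfig():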
public void doPollingConfig(HttpServletRequest request, HttpServletResponse response,
Map<String, String> clientMd5Map, int probeRequestSize) throws IOException {
// Long polling
if (LongPollingService.isSupportLongPolling(request)) {
longPollingService.addLongPollingClient(request, response, clientMd5Map, probeRequestSize);
return;
}
// Short-polling compatibility logic.
List<String> changedGroups = MD5Util.compareMd5(request, response, clientMd5Map);
// Short-polling compatibility result.
String oldResult = MD5Util.compareMd5OldResult(changedGroups);
String newResult = MD5Util.compareMd5ResultString(changedGroups);
String version = request.getHeader(Constants.CLIENT_VERSION_HEADER);
if (version == null) {
version = "2.0.0";
}
int versionNum = Protocol.getVersionNumber(version);
// Before version 2.0.4, the return value is put into the header.
if (versionNum < START_LONG_POLLING_VERSION_NUM) {
response.addHeader(Constants.PROBE_MODIFY_RESPONSE, oldResult);
response.addHeader(Constants.PROBE_MODIFY_RESPONSE_NEW, newResult);
} else {
request.setAttribute("content", newResult);
}
Loggers.AUTH.info("new content:" + newResult);
// Disable cache.
response.setHeader("Pragma", "no-cache");
response.setDateHeader("Expires", 0);
response.setHeader("Cache-Control", "no-cache,no-store");
response.setStatus(HttpServletResponse.SC_OK);
}
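This method first checks whether the request is a long poll or a short poll. For a long poll it runs the long-polling logic and returns; the rest of the method exists for compatibility with short polling.
The short-polling path compares the MD5 values in clientMd5Map with the server's to find the configurations that have changed. For clients older than 2.0.4 the result is returned in the response headers, in both the old and the new format; for newer clients the new-format result is set as the request attribute "content". Either way, the request is answered right away.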
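The comparison itself can be sketched as follows. This is a simplified illustration, not the code of MD5Util.compareMd5(): the class and method names are hypothetical, the server side is modeled as a plain map of groupKey to content, and treating a missing configuration as an empty MD5 is an assumption.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class Md5Compare {
    /** Returns the lowercase hex MD5 of the content, padded to 32 characters. */
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the keys whose client-side MD5 differs from the server's. */
    static List<String> changedGroups(Map<String, String> clientMd5Map,
                                      Map<String, String> serverContent) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> e : clientMd5Map.entrySet()) {
            String content = serverContent.get(e.getKey());
            // Assumption: a config missing on the server has an empty MD5.
            String serverMd5 = content == null ? "" : md5Hex(content);
            if (!serverMd5.equals(e.getValue())) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }
}
```

Only the keys are returned, which keeps the probe response small regardless of how large the configurations are.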
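Whether a request is a long poll is decided simply by whether the Long-Pulling-Timeout request header is set; as we saw earlier, ClientWorker sets this header.
Since ours is a long poll, we go straight to addLongPollingClient(), which registers the client's long-polling request.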
public void addLongPollingClient(HttpServletRequest req, HttpServletResponse rsp, Map<String, String> clientMd5Map, int probeRequestSize) {
......
int delayTime = SwitchService.getSwitchInteger(SwitchService.FIXED_DELAY_TIME, 500);
// Add a delay for load balancing and respond 500ms early so the client does not time out (timeout = client timeout minus 500ms)
long timeout = Math.max(10000, Long.parseLong(str) - delayTime);
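// Fixed polling: hold for the fixed polling interval (at least 10s); otherwise hold for the client timeout minus the delay (29.5s by default)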
if (isFixedPolling()) {
timeout = Math.max(10000, getFixedPollingInterval());
// Do nothing but set the fixed-polling timeout
} else {
long start = System.currentTimeMillis();
// Compare MD5 values with the server's data; if anything changed, respond immediately
List<String> changedGroups = MD5Util.compareMd5(req, rsp, clientMd5Map);
if (changedGroups.size() > 0) {
generateResponse(req, rsp, changedGroups);
LogUtil.CLIENT_LOG.info("{}|{}|{}|{}|{}|{}|{}", System.currentTimeMillis() - start, "instant",
RequestUtil.getRemoteIp(req), "polling", clientMd5Map.size(), probeRequestSize,
changedGroups.size());
return;
} else if (noHangUpFlag != null && noHangUpFlag.equalsIgnoreCase(TRUE_STR)) {
LogUtil.CLIENT_LOG.info("{}|{}|{}|{}|{}|{}|{}", System.currentTimeMillis() - start, "nohangup",
RequestUtil.getRemoteIp(req), "polling", clientMd5Map.size(), probeRequestSize,
changedGroups.size());
return;
}
}
String ip = RequestUtil.getRemoteIp(req);
// Must be called by the HTTP thread; otherwise the response is sent as soon as the request leaves the container
final AsyncContext asyncContext = req.startAsync();
// AsyncContext.setTimeout() is incorrect, Control by oneself
asyncContext.setTimeout(0L);
// Hand the ClientLongPolling task to the long-polling executor
ConfigExecutor.executeLongPolling(new ClientLongPolling(asyncContext, clientMd5Map, ip, probeRequestSize, timeout, appName, tag));
}
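This method has three core steps:
① Respond 500ms early so the client does not time out
The client sets a request timeout of 30s, so the server responds after at most 29.5s, before the client would time out and end the exchange abnormally.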
private void init(Properties properties) {
timeout = Math.max(ConvertUtils.toInt(properties.getProperty(PropertyKeyConst.CONFIG_LONG_POLL_TIMEOUT),
Constants.CONFIG_LONG_POLL_TIMEOUT), Constants.MIN_CONFIG_LONG_POLL_TIMEOUT);
}
/**
* millisecond.
*/
public static final int CONFIG_LONG_POLL_TIMEOUT = 30000;
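② Compare MD5 values with the server's data and respond immediately on change
This comparison is largely the same as the one we saw earlier on the path from the controller.
③ Start an asynchronous context and hand the request thread back to Tomcat. Once the asynchronous context is started, no response is sent to the client until asyncContext.complete() is called; this is the key to how the server holds the client's request open.
The started asynchronous context is handed straight to a thread pool for execution.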
ConfigExecutor.executeLongPolling(new ClientLongPolling(asyncContext, clientMd5Map, ip, probeRequestSize, timeout, appName, tag));
public static void executeLongPolling(Runnable runnable) {
LONG_POLLING_EXECUTOR.execute(runnable);
}
private static final ScheduledExecutorService LONG_POLLING_EXECUTOR = ExecutorFactory.Managed
.newSingleScheduledExecutorService(ClassUtils.getCanonicalName(Config.class),
new NameThreadFactory("com.alibaba.nacos.config.LongPolling"));
Note that it is best not to start the asynchronous work directly like this:
asyncContext.start(new Runnable() {
@Override
public void run() {
}
});
The reason is that a thread started through asyncContext.start() still comes from Tomcat's thread pool.
What is handed to the thread pool is a ClientLongPolling, which implements Runnable, so next we look at its run() method.
@Override
public void run() {
// The server does not respond right away; if nothing changes, it returns the result to the client after (30 - 0.5)s
asyncTimeoutFuture = ConfigExecutor.scheduleLongPolling(new Runnable() {
@Override
public void run() {
try {
getRetainIps().put(ClientLongPolling.this.ip, System.currentTimeMillis());
// Delete subscriber's relations.
allSubs.remove(ClientLongPolling.this);
// Check whether fixed polling is enabled
if (isFixedPolling()) {
LogUtil.CLIENT_LOG
.info("{}|{}|{}|{}|{}|{}", (System.currentTimeMillis() - createTime), "fix",
RequestUtil.getRemoteIp((HttpServletRequest) asyncContext.getRequest()),
"polling", clientMd5Map.size(), probeRequestSize);
// Compare MD5 values to detect changes
List<String> changedGroups = MD5Util
.compareMd5((HttpServletRequest) asyncContext.getRequest(),
(HttpServletResponse) asyncContext.getResponse(), clientMd5Map);
if (changedGroups.size() > 0) {
// Return the changed keys to the client through the response
sendResponse(changedGroups);
} else {
sendResponse(null);
}
} else {
LogUtil.CLIENT_LOG
.info("{}|{}|{}|{}|{}|{}", (System.currentTimeMillis() - createTime), "timeout",
RequestUtil.getRemoteIp((HttpServletRequest) asyncContext.getRequest()),
"polling", clientMd5Map.size(), probeRequestSize);
sendResponse(null);
}
} catch (Throwable t) {
LogUtil.DEFAULT_LOG.error("long polling error:" + t.getMessage(), t.getCause());
}
}
}, timeoutTime, TimeUnit.MILLISECONDS);
allSubs.add(this);
}
Here ConfigExecutor.scheduleLongPolling() uses a single-threaded scheduled executor to run the task after a delay.
public static ScheduledFuture<?> scheduleLongPolling(Runnable runnable, long period, TimeUnit unit) {
return LONG_POLLING_EXECUTOR.schedule(runnable, period, unit);
}
private static final ScheduledExecutorService LONG_POLLING_EXECUTOR = ExecutorFactory.Managed
.newSingleScheduledExecutorService(ClassUtils.getCanonicalName(Config.class),
new NameThreadFactory("com.alibaba.nacos.config.LongPolling"));
The timeoutTime used in run() is the timeout passed in where the asynchronous context was started, that is, 29.5s.
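As a quick check of where 29.5s comes from, the sketch below reproduces the arithmetic from addLongPollingClient(). The class and method names are hypothetical; the values are the defaults shown in the listings above (a 30000ms client timeout and a 500ms delay).

```java
class HoldTime {
    /** Server-side hold time: client timeout minus the delay, but never below 10s. */
    static long holdMillis(long clientTimeoutMs, long delayMs) {
        return Math.max(10000, clientTimeoutMs - delayMs);
    }

    public static void main(String[] args) {
        // Default client timeout of 30000ms and default delay of 500ms.
        System.out.println(holdMillis(30000, 500)); // prints 29500
    }
}
```

The 10s floor means that a client sending a very short timeout is still held for 10s.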
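Next, isFixedPolling() checks whether fixed polling is enabled. This is a server-side switch that defaults to false, so the else branch runs: when the timeout fires, sendResponse(null) completes the request without reporting any change.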
private static boolean isFixedPolling() {
return SwitchService.getSwitchBoolean(SwitchService.FIXED_POLLING, false);
}
public static final String FIXED_POLLING = "isFixedPolling";
Finally, allSubs.add(this) stores the subscriber in Queue<ClientLongPolling> allSubs, which is used later when the server actively pushes changes.
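The hold-then-time-out pattern can be sketched without a servlet container. In this simplified illustration a CompletableFuture stands in for the AsyncContext, and every name (HeldRequests, Held, hold, respond) is hypothetical. Unlike the original, the sketch registers the request before scheduling the timeout, so the timeout task can always find the entry.

```java
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class HeldRequests {
    static final class Held {
        final String groupKey;
        // Stand-in for the AsyncContext: completing it "sends the response".
        final CompletableFuture<List<String>> response = new CompletableFuture<>();
        volatile ScheduledFuture<?> timeoutFuture;

        Held(String groupKey) {
            this.groupKey = groupKey;
        }
    }

    final Queue<Held> allSubs = new ConcurrentLinkedQueue<>();
    // Single daemon thread, so the sketch never blocks JVM exit.
    final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "long-polling-sketch");
        t.setDaemon(true);
        return t;
    });

    /** Holds a request; if nothing changes, it is answered with an empty list on timeout. */
    Held hold(String groupKey, long timeoutMs) {
        Held held = new Held(groupKey);
        allSubs.add(held);
        held.timeoutFuture = scheduler.schedule(() -> {
            allSubs.remove(held);
            held.response.complete(Collections.emptyList());
        }, timeoutMs, TimeUnit.MILLISECONDS);
        return held;
    }

    /** Answers a held request early and cancels its pending timeout task. */
    void respond(Held held, List<String> changed) {
        if (held.timeoutFuture != null) {
            held.timeoutFuture.cancel(false);
        }
        allSubs.remove(held);
        held.response.complete(changed);
    }
}
```

The request thread returns as soon as hold() does; only the scheduler thread or a later respond() call completes the response.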
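3. Server-side push of configuration changes
The server notifies Nacos clients of configuration changes through a subscriber that is registered in the constructor of LongPollingService.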
public LongPollingService() {
......
// Register a subscriber for LocalDataChangeEvent
NotifyCenter.registerSubscriber(new Subscriber() {
@Override
public void onEvent(Event event) {
if (isFixedPolling()) {
// Ignore.
} else {
// Run a DataChangeTask on the thread pool
if (event instanceof LocalDataChangeEvent) {
LocalDataChangeEvent evt = (LocalDataChangeEvent) event;
ConfigExecutor.executeLongPolling(new DataChangeTask(evt.groupKey, evt.isBeta, evt.betaIps));
}
}
}
@Override
public Class<? extends Event> subscribeType() {
return LocalDataChangeEvent.class;
}
});
}
The LocalDataChangeEvent is wrapped in a DataChangeTask and handed to the thread pool. DataChangeTask implements Runnable, so we look at its run() method.
@Override
public void run() {
try {
ConfigCacheService.getContentBetaMd5(groupKey);
// Iterate over the client long-polling requests in allSubs
for (Iterator<ClientLongPolling> iter = allSubs.iterator(); iter.hasNext(); ) {
ClientLongPolling clientSub = iter.next();
// If this client's request listens to the changed groupKey, respond to it right away
if (clientSub.clientMd5Map.containsKey(groupKey)) {
// Beta release: skip clients whose IP is not in the beta list
if (isBeta && !CollectionUtils.contains(betaIps, clientSub.ip)) {
continue;
}
// Tag release: skip clients whose tag does not match
if (StringUtils.isNotBlank(tag) && !tag.equals(clientSub.tag)) {
continue;
}
getRetainIps().put(clientSub.ip, System.currentTimeMillis());
iter.remove(); // Delete subscribers' relationships.
LogUtil.CLIENT_LOG
.info("{}|{}|{}|{}|{}|{}|{}", (System.currentTimeMillis() - changeTime), "in-advance",
RequestUtil
.getRemoteIp((HttpServletRequest) clientSub.asyncContext.getRequest()),
"polling", clientSub.clientMd5Map.size(), clientSub.probeRequestSize, groupKey);
// Send the response
clientSub.sendResponse(Collections.singletonList(groupKey));
}
}
} catch (Throwable t) {
LogUtil.DEFAULT_LOG.error("data change error: {}", ExceptionUtil.getStackTrace(t));
}
}
}
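The core logic is to iterate over allSubs, the Queue<ClientLongPolling> that each client long-polling request was added to when it arrived. Every ClientLongPolling clientSub that listens to the changed groupKey, and that passes the beta and tag filters, is removed from allSubs and answered through clientSub.sendResponse().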
void sendResponse(List<String> changedGroups) {
// Cancel time out task.
if (null != asyncTimeoutFuture) {
asyncTimeoutFuture.cancel(false);
}
generateResponse(changedGroups);
}
void generateResponse(List<String> changedGroups) {
if (null == changedGroups) {
// Tell web container to send http response.
asyncContext.complete();
return;
}
HttpServletResponse response = (HttpServletResponse) asyncContext.getResponse();
try {
final String respString = MD5Util.compareMd5ResultString(changedGroups);
// Disable cache.
response.setHeader("Pragma", "no-cache");
response.setDateHeader("Expires", 0);
response.setHeader("Cache-Control", "no-cache,no-store");
response.setStatus(HttpServletResponse.SC_OK);
response.getWriter().println(respString);
asyncContext.complete();
} catch (Exception ex) {
PULL_LOG.error(ex.toString(), ex);
asyncContext.complete();
}
}
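sendResponse() cancels the pending timeout task, then uses the asyncContext stored earlier to respond to the client's request.
At this point the client knows which configurations have changed and can fetch the new data.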
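The push side can be sketched in the same spirit. This is a simplified illustration of the loop in DataChangeTask, without the beta and tag filters; the class and member names are hypothetical, and a CompletableFuture again stands in for the held HTTP response.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

class ChangePush {
    static final class Sub {
        final Set<String> watchedKeys;
        // Stand-in for the held response of one long-polling client.
        final CompletableFuture<List<String>> response = new CompletableFuture<>();

        Sub(Set<String> watchedKeys) {
            this.watchedKeys = watchedKeys;
        }
    }

    final Queue<Sub> allSubs = new ConcurrentLinkedQueue<>();

    /** Answers every held subscriber that watches groupKey; returns how many were notified. */
    int onDataChange(String groupKey) {
        int notified = 0;
        for (Iterator<Sub> it = allSubs.iterator(); it.hasNext(); ) {
            Sub sub = it.next();
            if (sub.watchedKeys.contains(groupKey)) {
                // Remove first so the same request is never answered twice.
                it.remove();
                sub.response.complete(Collections.singletonList(groupKey));
                notified++;
            }
        }
        return notified;
    }
}
```

Subscribers that do not watch the changed key stay in the queue and keep waiting for their own change or timeout.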
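Next, a brief look at how the nodes of a Nacos cluster synchronize changed data.
From the Nacos console we can see that the following endpoint is called when configuration data is changed.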
@PostMapping
@Secured(action = ActionTypes.WRITE, parser = ConfigResourceParser.class)
public Boolean publishConfig(HttpServletRequest request, HttpServletResponse response,
@RequestParam(value = "dataId") String dataId, @RequestParam(value = "group") String group,
@RequestParam(value = "tenant", required = false, defaultValue = StringUtils.EMPTY) String tenant,
@RequestParam(value = "content") String content, @RequestParam(value = "tag", required = false) String tag,
@RequestParam(value = "appName", required = false) String appName,
@RequestParam(value = "src_user", required = false) String srcUser,
@RequestParam(value = "config_tags", required = false) String configTags,
@RequestParam(value = "desc", required = false) String desc,
@RequestParam(value = "use", required = false) String use,
@RequestParam(value = "effect", required = false) String effect,
@RequestParam(value = "type", required = false) String type,
@RequestParam(value = "schema", required = false) String schema) throws NacosException {
final String srcIp = RequestUtil.getRemoteIp(request);
final String requestIpApp = RequestUtil.getAppName(request);
// check tenant
ParamUtils.checkTenant(tenant);
ParamUtils.checkParam(dataId, group, "datumId", content);
ParamUtils.checkParam(tag);
Map<String, Object> configAdvanceInfo = new HashMap<String, Object>(10);
MapUtils.putIfValNoNull(configAdvanceInfo, "config_tags", configTags);
MapUtils.putIfValNoNull(configAdvanceInfo, "desc", desc);
MapUtils.putIfValNoNull(configAdvanceInfo, "use", use);
MapUtils.putIfValNoNull(configAdvanceInfo, "effect", effect);
MapUtils.putIfValNoNull(configAdvanceInfo, "type", type);
MapUtils.putIfValNoNull(configAdvanceInfo, "schema", schema);
ParamUtils.checkParam(configAdvanceInfo);
if (AggrWhitelist.isAggrDataId(dataId)) {
LOGGER.warn("[aggr-conflict] {} attemp to publish single data, {}, {}", RequestUtil.getRemoteIp(request),
dataId, group);
throw new NacosException(NacosException.NO_RIGHT, "dataId:" + dataId + " is aggr");
}
final Timestamp time = TimeUtils.getCurrentTime();
String betaIps = request.getHeader("betaIps");
ConfigInfo configInfo = new ConfigInfo(dataId, group, tenant, appName, content);
configInfo.setType(type);
if (StringUtils.isBlank(betaIps)) {
if (StringUtils.isBlank(tag)) {
persistService.insertOrUpdate(srcIp, srcUser, configInfo, time, configAdvanceInfo, true);
ConfigChangePublisher.notifyConfigChange(new ConfigDataChangeEvent(false, dataId, group, tenant, time.getTime()));
} else {
persistService.insertOrUpdateTag(configInfo, tag, srcIp, srcUser, time, true);
ConfigChangePublisher.notifyConfigChange(new ConfigDataChangeEvent(false, dataId, group, tenant, tag, time.getTime()));
}
} else {
// beta publish
persistService.insertOrUpdateBeta(configInfo, betaIps, srcIp, srcUser, time, true);
ConfigChangePublisher.notifyConfigChange(new ConfigDataChangeEvent(true, dataId, group, tenant, time.getTime()));
}
ConfigTraceService.logPersistenceEvent(dataId, group, tenant, requestIpApp, time.getTime(), InetUtils.getSelfIp(), ConfigTraceService.PERSISTENCE_EVENT_PUB, content);
return true;
}
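The betaIps header marks a beta release. In our case it is absent and the request carries no tag, so we go straight to the core call, ConfigChangePublisher.notifyConfigChange(), to see how the change is propagated.
The calls that follow are straightforward, so we jump to where they end up.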
private static boolean publishEvent(final Class<? extends Event> eventType, final Event event) {
final String topic = ClassUtils.getCanonicalName(eventType);
if (ClassUtils.isAssignableFrom(SlowEvent.class, eventType)) {
return INSTANCE.sharePublisher.publish(event);
}
if (INSTANCE.publisherMap.containsKey(topic)) {
EventPublisher publisher = INSTANCE.publisherMap.get(topic);
return publisher.publish(event);
}
LOGGER.warn("There are no [{}] publishers for this event, please register", topic);
return false;
}
ClassUtils.isAssignableFrom(SlowEvent.class, eventType) checks whether the event is a SlowEvent. We know from the call site that the event is a ConfigDataChangeEvent, so execution reaches the following call, which is implemented in DefaultPublisher.
publisher.publish(event)
@Override
public boolean publish(Event event) {
checkIsStart();
boolean success = this.queue.offer(event);
if (!success) {
LOGGER.warn("Unable to plug in due to interruption, synchronize sending time, event : {}", event);
receiveEvent(event);
return true;
}
return true;
}
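As we can see, the event is simply offered to a blocking queue; only if the offer fails is it handled synchronously through receiveEvent().
Since the configuration-change event goes into a blocking queue, some thread must take it out for further processing. DefaultPublisher extends Thread, so we look at its overridden run() method.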
@Override
public void run() {
openEventHandler();
}
void openEventHandler() {
try {
// This variable is defined to resolve the problem which message overstock in the queue.
int waitTimes = 60;
// To ensure that messages are not lost, enable EventHandler when
// waiting for the first Subscriber to register
for (; ; ) {
if (shutdown || hasSubscriber() || waitTimes <= 0) {
break;
}
ThreadUtils.sleep(1000L);
waitTimes--;
}
for (; ; ) {
if (shutdown) {
break;
}
final Event event = queue.take();
receiveEvent(event);
updater.compareAndSet(this, lastEventSequence, Math.max(lastEventSequence, event.sequence()));
}
} catch (Throwable ex) {
LOGGER.error("Event listener exception : {}", ex);
}
}
The part that matters here is queue.take(), which blocks until a configuration-change event is available, followed by receiveEvent(event), which processes it.
void receiveEvent(Event event) {
...... // traversal of subscribers omitted
// Because unifying smartSubscriber and subscriber, so here need to think of compatibility.
// Remove original judge part of codes.
notifySubscriber(subscriber, event);
}
}
@Override
public void notifySubscriber(final Subscriber subscriber, final Event event) {
LOGGER.debug("[NotifyCenter] the {} will received by {}", event, subscriber);
final Runnable job = new Runnable() {
@Override
public void run() {
subscriber.onEvent(event);
}
};
final Executor executor = subscriber.executor();
if (executor != null) {
executor.execute(job);
} else {
try {
job.run();
} catch (Throwable e) {
LOGGER.error("Event callback exception : {}", e);
}
}
}
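As we can see, the publisher iterates over all subscribers, skips expired events, and notifies each subscriber of the event.
In the end subscriber.onEvent(event) runs on the subscriber's own executor if it provides one, and otherwise directly on the calling thread.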
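The queue-plus-consumer-thread pattern described above can be sketched as follows. This is a simplified illustration and not the code of DefaultPublisher: the class name and members are hypothetical, events are plain strings, and subscribers are always called inline rather than through their own executor.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

class QueuePublisher extends Thread {
    private final BlockingQueue<String> queue;
    private final List<Consumer<String>> subscribers = new CopyOnWriteArrayList<>();
    private volatile boolean stopped;

    QueuePublisher(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        setDaemon(true);
    }

    void addSubscriber(Consumer<String> subscriber) {
        subscribers.add(subscriber);
    }

    /** Queues the event; if the queue is full, handles it on the caller's thread. */
    boolean publish(String event) {
        if (!queue.offer(event)) {
            receiveEvent(event);
        }
        return true;
    }

    /** Delivers one event to every subscriber, isolating subscriber failures. */
    void receiveEvent(String event) {
        for (Consumer<String> subscriber : subscribers) {
            try {
                subscriber.accept(event);
            } catch (RuntimeException e) {
                System.err.println("Event callback exception: " + e);
            }
        }
    }

    @Override
    public void run() {
        try {
            while (!stopped) {
                // Blocks until an event is available.
                receiveEvent(queue.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void shutdown() {
        stopped = true;
        interrupt();
    }
}
```

Falling back to synchronous delivery when the queue is full trades publisher latency for not dropping events.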
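Since a subscriber handles the event, it must have been registered somewhere. Debugging, or looking at the implementing classes, shows that it is registered in AsyncNotifyService.
There, each change notification is wrapped in a task and put into a queue for asynchronous execution.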
queue.add(new NotifySingleTask(dataId, group, tenant, tag, dumpTs, member.getAddress(), evt.isBeta));
ConfigExecutor.executeAsyncNotify(new AsyncTask(httpclient, queue));
So we go straight to the run() method of AsyncTask.
@Override
public void run() {
executeAsyncInvoke();
}
private void executeAsyncInvoke() {
while (!queue.isEmpty()) {
NotifySingleTask task = queue.poll();
String targetIp = task.getTargetIP();
if (memberManager.hasMember(targetIp)) {
// start the health check and there are ips that are not monitored, put them directly in the notification queue, otherwise notify
boolean unHealthNeedDelay = memberManager.isUnHealth(targetIp);
if (unHealthNeedDelay) {
// target ip is unhealthy, then put it in the notification list
ConfigTraceService.logNotifyEvent(task.getDataId(), task.getGroup(), task.getTenant(), null,
task.getLastModified(), InetUtils.getSelfIp(), ConfigTraceService.NOTIFY_EVENT_UNHEALTH,
0, task.target);
// get delay time and set fail count to the task
asyncTaskExecute(task);
} else {
HttpGet request = new HttpGet(task.url);
request.setHeader(NotifyService.NOTIFY_HEADER_LAST_MODIFIED, String.valueOf(task.getLastModified()));
request.setHeader(NotifyService.NOTIFY_HEADER_OP_HANDLE_IP, InetUtils.getSelfIp());
if (task.isBeta) {
request.setHeader("isBeta", "true");
}
// Notify the node of the data change
httpclient.execute(request, new AsyncNotifyCallBack(httpclient, task));
}
}
}
}
}
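Here tasks are taken from the queue and the health of the target cluster node is checked. If the node is unhealthy, the notification is deferred for later execution; if it is healthy, it is notified of the configuration change right away.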
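The branching in executeAsyncInvoke() boils down to draining a queue and splitting targets by health. The sketch below illustrates only that control flow: the class name is hypothetical, two lists stand in for the delayed retry and for the asynchronous HTTP call, and the health check is passed in as a predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

class ClusterNotifySketch {
    final List<String> notified = new ArrayList<>();
    final List<String> delayed = new ArrayList<>();

    /** Drains the queue, notifying healthy targets and deferring unhealthy ones. */
    void drain(Queue<String> targets, Predicate<String> isUnhealthy) {
        while (!targets.isEmpty()) {
            String target = targets.poll();
            if (isUnhealthy.test(target)) {
                // Stand-in for scheduling a delayed retry.
                delayed.add(target);
            } else {
                // Stand-in for the asynchronous HTTP notification.
                notified.add(target);
            }
        }
    }
}
```

A caller would pass the pending targets and a health check, for example `sketch.drain(queue, ip -> unhealthyIps.contains(ip))`, where `unhealthyIps` is likewise a hypothetical set.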