Source code reference: Android 14 (AOSP code search)
1. Android Watchdog Overview
Watchdog is a key component of the Android system. It watches for prolonged unresponsiveness of core system services (similar in spirit to ANR, Application Not Responding, for apps) by checking whether key threads in system_server, such as the main thread and the Binder threads, are still making progress. Once prolonged unresponsiveness is detected, Watchdog takes corrective action, such as recording diagnostic logs and restarting the system server, in order to keep the system responsive and stable.
1.1 How It Works
- Initialization and start
  - Initialization: during SystemServer startup, Watchdog.getInstance().init(context, mActivityManagerService) completes initialization and registers the necessary receiver and monitored threads.
  - Start: Watchdog.getInstance().start() starts the watchdog thread (historically this happened after AMS's systemReady(); since Android 10 both steps live in startBootstrapServices, see Section 2.1).
- Monitored threads
  - Watchdog monitors specific threads through HandlerChecker objects. Each HandlerChecker is bound to a particular Handler (and thus a Looper and thread) and checks whether that thread is still making progress.
  - The monitored threads include key system threads such as the foreground thread (FgThread), the main thread, the UI thread, the I/O thread, and the display thread.
- Timeout detection
  - Watchdog periodically checks each monitored thread's Handler message queue; if no progress is made within the configured timeout, the thread is assumed to be deadlocked or otherwise stuck.
  - The timeout can be specified when the HandlerChecker is created and defaults to 60 seconds.
- Exception handling
  - Once a timeout is detected, Watchdog runs its failure-handling flow: it records diagnostic logs, attempts to characterize the blocked threads, and finally kills the system_server process.
  - Killing (and thereby restarting) system_server is the watchdog's last-resort recovery step, intended to clear whatever deadlock or abnormal state has built up. A minimal sketch of the overall schedule-and-wait cycle follows this list.
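The sketch below is plain Java, not Android code, and every name in it (WatchdogSketch, SimpleChecker, TIMEOUT_MS) is invented for illustration. It condenses the cycle described above: a checker posts a marker task onto the watched thread's queue, waits half the timeout, and treats the thread as blocked if the marker still has not run by the deadline.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WatchdogSketch {
    static final long TIMEOUT_MS = 2_000;            // stand-in for DEFAULT_TIMEOUT (60s on Android)

    /** Watches one single-threaded executor, loosely mimicking a HandlerChecker. */
    static class SimpleChecker {
        private final ExecutorService watched;
        private volatile boolean completed = true;
        private volatile long startTime;

        SimpleChecker(ExecutorService watched) { this.watched = watched; }

        void scheduleCheck() {
            if (!completed) return;                   // a check is already in flight
            completed = false;
            startTime = System.nanoTime();
            watched.execute(() -> completed = true);  // marker task: runs only if the thread is alive
        }

        boolean isOverdue() {
            return !completed
                    && TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime) >= TIMEOUT_MS;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        SimpleChecker checker = new SimpleChecker(worker);

        // Simulate a blocked worker thread.
        worker.execute(() -> { try { Thread.sleep(10_000); } catch (InterruptedException ignored) { } });

        checker.scheduleCheck();
        Thread.sleep(TIMEOUT_MS / 2);                 // wait half the timeout, like the real loop
        System.out.println("overdue at half-time? " + checker.isOverdue());
        Thread.sleep(TIMEOUT_MS / 2);
        System.out.println("overdue at full timeout? " + checker.isOverdue());

        worker.shutdownNow();
    }
}

Run as-is, it prints false at the half-timeout mark and true once the full (shortened, 2s) timeout has elapsed, because the worker is stuck in a long sleep and never gets to the marker task.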
2. Watchdog Initialization
2.1 SystemServer.startBootstrapServices
private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
...
t.traceBegin("StartWatchdog");
// Create the watchdog instance [see Section 2.2]
final Watchdog watchdog = Watchdog.getInstance();
// Start the watchdog thread [see Section 3.1]
watchdog.start();
mDumper.addDumpable(watchdog);
t.traceEnd();
....
t.traceBegin("InitWatchdog");
// Register the reboot broadcast receiver [see Section 2.4]
watchdog.init(mSystemContext, mActivityManagerService);
t.traceEnd();
}
The Watchdog is initialized while the system_server process starts up. The main steps are:
- Create the watchdog object; note that in Android 14 Watchdog no longer extends Thread but instead owns a dedicated "watchdog" thread created in its constructor.
- Call start() to begin checking.
- Register the reboot broadcast receiver.
Looking at the source, since Android 10 both Watchdog creation and start() live in startBootstrapServices(), with start() now invoked before the reboot receiver is registered in init().
2.2 getInstance
Watchdog.java
public static Watchdog getInstance() {
if (sWatchdog == null) {
//Singleton: create the instance [see Section 2.3]
sWatchdog = new Watchdog();
}
return sWatchdog;
}
2.3 Creating the Watchdog
public class Watchdog implements Dumpable {
//List of all HandlerChecker objects being checked [see Section 2.3.1]
/* This handler will be used to post message back onto the main thread */
private final ArrayList<HandlerCheckerAndTimeout> mHandlerCheckers = new ArrayList<>();
.....
private Watchdog() {
mThread = new Thread(this::run, "watchdog");
// Initialize handler checkers for each common thread we want to check. Note
// that we are not currently checking the background thread, since it can
// potentially hold longer running operations with no guarantees about the timeliness
// of operations there.
//
// Use a custom thread to check monitors to avoid lock contention from impacted other
// threads.
ServiceThread t = new ServiceThread("watchdog.monitor",
android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
t.start();
mMonitorChecker = new HandlerChecker(new Handler(t.getLooper()), "monitor thread");
mHandlerCheckers.add(withDefaultTimeout(mMonitorChecker));
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(FgThread.getHandler(), "foreground thread")));
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(new Handler(Looper.getMainLooper()), "main thread")));
// Add checker for shared UI thread.
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(UiThread.getHandler(), "ui thread")));
// And also check IO thread.
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(IoThread.getHandler(), "i/o thread")));
// And the display thread.
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(DisplayThread.getHandler(), "display thread")));
// And the animation thread.
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(AnimationThread.getHandler(), "animation thread")));
// And the surface animation thread.
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(SurfaceAnimationThread.getHandler(),
"surface animation thread")));
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor());
mInterestingJavaPids.add(Process.myPid());
// See the notes on DEFAULT_TIMEOUT.
assert DB ||
DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
mTraceErrorLogger = new TraceErrorLogger();
}
}
The mHandlerCheckers list thus contains HandlerChecker objects for the monitor thread plus the main, foreground (fg), ui, i/o, display, animation, and surface animation threads.
2.3.1 HandlerChecker
public final class HandlerChecker implements Runnable {
private final Handler mHandler; // the Handler (thread) being monitored
private final String mName; // descriptive name of the thread
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
private final ArrayList<Monitor> mMonitorQueue = new ArrayList<Monitor>();
private long mWaitMaxMillis; // maximum time to wait before the checker is overdue
private boolean mCompleted; // set to false when a check is scheduled
private Monitor mCurrentMonitor;
private long mStartTimeMillis; // time at which the current check was scheduled
private int mPauseCount;
HandlerChecker(Handler handler, String name) {
mHandler = handler;
mName = name;
mCompleted = true;
}
}
2.3.2 addMonitor
public class Watchdog implements Dumpable {
public void addMonitor(Monitor monitor) {
synchronized (mLock) {
//mMonitorChecker is a HandlerChecker (bound to the "watchdog.monitor" thread)
mMonitorChecker.addMonitorLocked(monitor);
}
}
public final class HandlerChecker implements Runnable {
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
void addMonitorLocked(Monitor monitor) {
// We don't want to update mMonitors when the Handler is in the middle of checking
// all monitors. We will update mMonitors on the next schedule if it is safe
mMonitorQueue.add(monitor);
}
...
}
}
To monitor the Binder threads, the monitor is handed to the HandlerChecker: addMonitorLocked() first places it in mMonitorQueue, and it is merged into mMonitors on the next schedule. Here the BinderThreadMonitor object is the one being added.
private static final class BinderThreadMonitor implements Watchdog.Monitor {
@Override
public void monitor() {
Binder.blockUntilThreadAvailable();
}
}
blockUntilThreadAvailable ultimately calls into IPCThreadState and waits until a Binder thread becomes free:
void IPCThreadState::blockUntilThreadAvailable()
{
pthread_mutex_lock(&mProcess->mThreadCountLock);
mProcess->mWaitingForThreads++;
while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
static_cast<unsigned long>(mProcess->mMaxThreads));
//Wait until the number of executing binder threads drops below the process's binder thread limit (16)
pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
}
mProcess->mWaitingForThreads--;
pthread_mutex_unlock(&mProcess->mThreadCountLock);
}
So addMonitor(new BinderThreadMonitor()) registers the Binder-thread check with mMonitorChecker, which runs on the dedicated "watchdog.monitor" thread (in older releases this was the android.fg thread), to verify that Binder threads are still available.
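As a plain-Java analogy (again not Android code; ThreadPoolMonitorSketch and all of its names are invented), the snippet below mirrors what BinderThreadMonitor checks: monitor() only returns once at least one worker in a fixed-size pool is free, so a pool that stays saturated past the timeout keeps the watchdog check from ever completing.

import java.util.concurrent.Semaphore;

public class ThreadPoolMonitorSketch {
    static final int MAX_WORKERS = 16;                        // cf. the binder thread limit above
    static final Semaphore freeWorkers = new Semaphore(MAX_WORKERS);

    // Called by a worker when it starts/finishes handling a request.
    static void onRequestStart() throws InterruptedException { freeWorkers.acquire(); }
    static void onRequestEnd() { freeWorkers.release(); }

    // Analogue of BinderThreadMonitor.monitor(): block until a worker is available.
    static void monitor() throws InterruptedException {
        freeWorkers.acquire();
        freeWorkers.release();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < MAX_WORKERS; i++) onRequestStart(); // saturate the pool
        Thread watchdog = new Thread(() -> {
            try {
                monitor();
                System.out.println("pool has a free worker");
            } catch (InterruptedException e) {
                System.out.println("still saturated, gave up"); // what a real watchdog timeout means
            }
        });
        watchdog.start();
        Thread.sleep(1_000);
        watchdog.interrupt();                                   // simulate the watchdog giving up
    }
}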
2.4 init
[-> Watchdog.java]
public void init(Context context, ActivityManagerService activity) {
mActivity = activity;
//Register the reboot broadcast receiver [see Section 2.4.1]
context.registerReceiver(new RebootRequestReceiver(),
new IntentFilter(Intent.ACTION_REBOOT),
android.Manifest.permission.REBOOT, null);
}
2.4.1 RebootRequestReceiver
final class RebootRequestReceiver extends BroadcastReceiver {
@Override
public void onReceive(Context c, Intent intent) {
if (intent.getIntExtra("nowait", 0) != 0) {
//[see Section 2.4.2]
rebootSystem("Received ACTION_REBOOT broadcast");
return;
}
Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
}
}
2.4.2 rebootSystem
void rebootSystem(String reason) {
Slog.i(TAG, "Rebooting system because: " + reason);
IPowerManager pms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
try {
//Perform the reboot via PowerManager
pms.reboot(false, reason, false);
} catch (RemoteException ex) {
}
}
The reboot is ultimately carried out by PowerManagerService; the detailed reboot flow will be covered separately.
3. Watchdog Detection Mechanism
When Watchdog.getInstance().start() is called, the "watchdog" thread enters its run() method, which has two parts:
- the first part [Section 3.1] detects whether a timeout has been triggered;
- the second part [Section 4.1] dumps diagnostic information once a timeout is triggered.
3.1 run
private void run() {
boolean waitedHalf = false;
while (true) {
List<HandlerChecker> blockedCheckers = Collections.emptyList();
String subject = "";
boolean allowRestart = true;
int debuggerWasConnected = 0;
boolean doWaitedHalfDump = false;
// The value of mWatchdogTimeoutMillis might change while we are executing the loop.
// We store the current value to use a consistent value for all handlers.
final long watchdogTimeoutMillis = mWatchdogTimeoutMillis;
final long checkIntervalMillis = watchdogTimeoutMillis / 2;
final ArrayList<Integer> pids;
synchronized (mLock) {
long timeout = checkIntervalMillis;
// Make sure we (re)spin the checkers that have become idle within
// this wait-and-check interval
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerCheckerAndTimeout hc = mHandlerCheckers.get(i);
// We pick the watchdog to apply every time we reschedule the checkers. The
// default timeout might have changed since the last run.
//Schedule a check on every HandlerChecker; each records its mStartTimeMillis [see Section 3.2]
hc.checker().scheduleCheckLocked(hc.customTimeoutMillis()
.orElse(watchdogTimeoutMillis * Build.HW_TIMEOUT_MULTIPLIER));
}
if (debuggerWasConnected > 0) {
debuggerWasConnected--;
}
// NOTE: We use uptimeMillis() here because we do not want to increment the time we
// wait while asleep. If the device is asleep then the thing that we are waiting
// to timeout on is asleep as well and won't have a chance to run, causing a false
// positive on when to kill things.
long start = SystemClock.uptimeMillis();
//Loop until the full check interval (30s by default) has elapsed before continuing
while (timeout > 0) {
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
//If interrupted, just log it and keep waiting
mLock.wait(timeout);
// Note: mHandlerCheckers and mMonitorChecker may have changed after waiting
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = checkIntervalMillis - (SystemClock.uptimeMillis() - start);
}
//Evaluate the state of all checkers [see Section 3.3]
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
// The monitors have returned; reset
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
// still waiting but within their configured intervals; back off and recheck
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
Slog.i(TAG, "WAITED_HALF");
//First time we have crossed the half-timeout mark
waitedHalf = true;
// We've waited half, but we'd need to do the stack trace dump w/o the lock.
blockedCheckers = getCheckersWithStateLocked(WAITED_HALF);
//[see Section 3.5]
subject = describeCheckersLocked(blockedCheckers);
pids = new ArrayList<>(mInterestingJavaPids);
doWaitedHalfDump = true;
} else {
continue;
}
} else {
// something is overdue!
blockedCheckers = getCheckersWithStateLocked(OVERDUE);
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;
pids = new ArrayList<>(mInterestingJavaPids);
}
} // END synchronized (mLock)
//If we got here, the system is most likely hung.
//First collect stack traces from all threads of the system process.
//Then, if we reached the full timeout, kill this process so the system restarts; if we only reached the half timeout, just log some information and continue.
logWatchog(doWaitedHalfDump, subject, pids);
if (doWaitedHalfDump) {
// We have waited for only half of the timeout, we continue to wait for the duration
// of the full timeout before killing the process.
continue;
}
IActivityController controller;
synchronized (mLock) {
controller = mController;
}
if (controller != null) {
Slog.i(TAG, "Reporting stuck state to activity controller");
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
// 1 = keep waiting, -1 = kill system
int res = controller.systemNotResponding(subject);
if (res >= 0) {
Slog.i(TAG, "Activity controller requested to coninue to wait");
waitedHalf = false;
continue;
}
} catch (RemoteException e) {
}
}
// Only kill the process if the debugger is not attached.
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
Slog.w(TAG, "*** GOODBYE!");
if (!Build.IS_USER && isCrashLoopFound()
&& !WatchdogProperties.should_ignore_fatal_count().orElse(false)) {
breakCrashLoop();
}
//Kill the system_server process [see Section 4.5]
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
The main steps of this method:
- Call scheduleCheckLocked() on every checker:
  - if a checker has no monitors (all checkers except the monitor-thread checker have zero) and its looper is idle (polling), mCompleted is set to true immediately;
  - if the previous check has not completed yet, it returns without scheduling another.
- After waiting 30s (half of the default timeout), evaluateCheckerCompletionLocked() is called to evaluate the checkers' state.
- Depending on waitState:
  - COMPLETED or WAITING: nothing to do;
  - WAITED_HALF (blocked for more than 30s), first time only: dump the traces of system_server and the interesting native processes;
  - OVERDUE: dump even more information.
Consequently, a single watchdog event always triggers two calls to dumpStackTraces (StackTracesDumpHelper.dumpStackTraces in Android 14), so the traces of system_server and the relevant native processes are written twice, roughly 30s apart.
After the information has been collected, system_server is killed. allowRestart defaults to true; when "am hang" is executed, restart is disallowed (allowRestart = false) and system_server is not killed.
3.2 scheduleCheckLocked
public void scheduleCheckLocked(long handlerCheckerTimeoutMillis) {
mWaitMaxMillis = handlerCheckerTimeoutMillis;
if (mCompleted) {
// Safe to update monitors in queue, Handler is not in the middle of work
mMonitors.addAll(mMonitorQueue);
mMonitorQueue.clear();
}
if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
|| (mPauseCount > 0)) {
// Don't schedule until after resume OR
// If the target looper has recently been polling, then
// there is no reason to enqueue our checker on it since that
// is as good as it not being deadlocked. This avoid having
// to do a context switch to check the thread. Note that we
// only do this if we have no monitors since those would need to
// be executed at this point.
mCompleted = true; // the target looper is idle (polling), so there is nothing to check
return;
}
if (!mCompleted) {
// we already have a check in flight, so no need
return; // a check is already in flight, no need to post another
}
mCompleted = false;
mCurrentMonitor = null;
// record when this check was scheduled
mStartTimeMillis = SystemClock.uptimeMillis();
//Post this Runnable at the front of the target message queue; see run() below
mHandler.postAtFrontOfQueue(this);
}
@Override
public void run() {
// Once we get here, we ensure that mMonitors does not change even if we call
// #addMonitorLocked because we first add the new monitors to mMonitorQueue and
// move them to mMonitors on the next schedule when mCompleted is true, at which
// point we have completed execution of this method.
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (mLock) {
mCurrentMonitor = mMonitors.get(i);
}
//Call back into each service's monitor() implementation
mCurrentMonitor.monitor();
}
synchronized (mLock) {
mCompleted = true;
mCurrentMonitor = null;
}
}
What this method does: it posts the HandlerChecker (a Runnable) to the front of the monitored thread's message queue. When that message is processed, run() invokes each registered monitor() and then sets mCompleted = true. If the thread is stuck on its current message and never gets to this marker, monitor() never runs, mCompleted stays false, and the watchdog eventually fires.
postAtFrontOfQueue(this) takes a Runnable; through the normal message mechanism it ends up calling HandlerChecker.run(), which iterates over all registered Monitor objects and calls the monitor() method that each service implements.
Possible corner cases: if other code keeps calling postAtFrontOfQueue(), the watchdog check may never get a chance to run; or each individual monitor() may take some time and their combined duration may exceed one minute. Both lead to atypical watchdog reports.
3.3 evaluateCheckerCompletionLocked
private int evaluateCheckerCompletionLocked() {
int state = COMPLETED;
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i).checker();
//[see Section 3.4]
state = Math.max(state, hc.getCompletionStateLocked());
}
return state;
}
Returns the largest (worst) completion state among all HandlerCheckers in mHandlerCheckers.
3.4 getCompletionStateLocked
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTimeMillis;
if (latency < mWaitMaxMillis / 2) {
return WAITING;
} else if (latency < mWaitMaxMillis) {
return WAITED_HALF;
}
}
return OVERDUE;
}
- COMPLETED = 0: the check has completed;
- WAITING = 1: waiting for less than half of the timeout, i.e. less than 30s by default;
- WAITED_HALF = 2: waiting for between 30s and 60s;
- OVERDUE = 3: waiting for 60s or more (a small stand-alone illustration follows).
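For illustration, the thresholds above can be captured in a small stand-alone snippet; the constant names mirror those in Watchdog.java and the 60s value is the default timeout.

public class CompletionState {
    static final int COMPLETED = 0, WAITING = 1, WAITED_HALF = 2, OVERDUE = 3;
    static final long WAIT_MAX_MILLIS = 60_000; // DEFAULT_TIMEOUT

    static int stateFor(boolean completed, long latencyMillis) {
        if (completed) return COMPLETED;
        if (latencyMillis < WAIT_MAX_MILLIS / 2) return WAITING;     // < 30s
        if (latencyMillis < WAIT_MAX_MILLIS) return WAITED_HALF;     // 30s..60s
        return OVERDUE;                                              // >= 60s
    }

    public static void main(String[] args) {
        System.out.println(stateFor(false, 10_000));  // 1 (WAITING)
        System.out.println(stateFor(false, 45_000));  // 2 (WAITED_HALF)
        System.out.println(stateFor(false, 61_000));  // 3 (OVERDUE)
    }
}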
3.5 describeCheckersLocked
describeCheckersLocked() concatenates a description of each blocked checker; each individual description comes from describeBlockedStateLocked():
String describeBlockedStateLocked() {
final String prefix;
if (mCurrentMonitor == null) {
//No monitor in progress: the thread is blocked processing a message in its handler
prefix = "Blocked in handler";
} else {
//Blocked inside a specific service's monitor() (checked on the monitor thread)
prefix = "Blocked in monitor " + mCurrentMonitor.getClass().getName();
}
long latencySeconds = (SystemClock.uptimeMillis() - mStartTimeMillis) / 1000;
return prefix + " on " + mName + " (" + getThread().getName() + ")"
+ " for " + latencySeconds + "s";
}
Every handler thread or monitor that has been blocked for longer than the timeout is recorded here:
- "Blocked in handler" means the corresponding thread has spent more than the timeout (one minute by default) processing its current message;
- "Blocked in monitor" means the same, or that the monitor() call cannot obtain the lock it needs; for example (illustrative values only), a subject of the form "Blocked in monitor com.android.server.power.PowerManagerService on monitor thread (watchdog.monitor) for 65s".
4. Watchdog Handling Flow
4.1 logWatchog
private void logWatchog(boolean halfWatchdog, String subject, ArrayList<Integer> pids) {
// Get critical event log before logging the half watchdog so that it doesn't
// occur in the log.
String criticalEvents =
CriticalEventLog.getInstance().logLinesForSystemServerTraceFile();
final UUID errorId = mTraceErrorLogger.generateErrorId();
if (mTraceErrorLogger.isAddErrorIdEnabled()) {
mTraceErrorLogger.addProcessInfoAndErrorIdToTrace("system_server", Process.myPid(),
errorId);
mTraceErrorLogger.addSubjectToTrace(subject, errorId);
}
final String dropboxTag;
if (halfWatchdog) {
dropboxTag = "pre_watchdog";
CriticalEventLog.getInstance().logHalfWatchdog(subject);
FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_PRE_WATCHDOG_OCCURRED);
} else {
dropboxTag = "watchdog";
CriticalEventLog.getInstance().logWatchdog(subject, errorId);
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
// Log the atom as early as possible since it is used as a mechanism to trigger
// Perfetto. Ideally, the Perfetto trace capture should happen as close to the
// point in time when the Watchdog happens as possible.
FrameworkStatsLog.write(FrameworkStatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);
}
long anrTime = SystemClock.uptimeMillis();
StringBuilder report = new StringBuilder();
report.append(ResourcePressureUtil.currentPsiState());
ProcessCpuTracker processCpuTracker = new ProcessCpuTracker(false);
StringWriter tracesFileException = new StringWriter();
//[see Section 4.2]
final File stack = StackTracesDumpHelper.dumpStackTraces(
pids, processCpuTracker, new SparseBooleanArray(),
CompletableFuture.completedFuture(getInterestingNativePids()), tracesFileException,
subject, criticalEvents, Runnable::run, /* latencyTracker= */null);
// Give some extra time to make sure the stack traces get written.
// The system's been hanging for a whlie, another second or two won't hurt much.
SystemClock.sleep(5000);
processCpuTracker.update();
report.append(processCpuTracker.printCurrentState(anrTime));
report.append(tracesFileException.getBuffer());
if (!halfWatchdog) {
// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the
// kernel log
doSysRq('w');
doSysRq('l');
}
// Try to add the error to the dropbox, but assuming that the ActivityManager
// itself may be deadlocked. (which has happened, causing this statement to
// deadlock and the watchdog as a whole to be ineffective)
Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
// If a watched thread hangs before init() is called, we don't have a
// valid mActivity. So we can't log the error to dropbox.
if (mActivity != null) {
mActivity.addErrorToDropBox(
dropboxTag, null, "system_server", null, null, null,
null, report.toString(), stack, null, null, null,
errorId);
}
}
};
dropboxThread.start();
try {
dropboxThread.join(2000); // wait up to 2 seconds for it to return.
} catch (InterruptedException ignored) { }
}
When the Watchdog detects a problem, it collects the following information:
- dumpStackTraces: stack traces of the relevant Java and native processes;
- doSysRq;
- dropBox.
4.2 StackTracesDumpHelper.dumpStackTraces
/* package */ static File dumpStackTraces(ArrayList<Integer> firstPids,
ProcessCpuTracker processCpuTracker, SparseBooleanArray lastPids,
Future<ArrayList<Integer>> nativePidsFuture, StringWriter logExceptionCreatingFile,
AtomicLong firstPidEndOffset, String subject, String criticalEventSection,
String memoryHeaders, @NonNull Executor auxiliaryTaskExecutor,
Future<File> firstPidFilePromise, AnrLatencyTracker latencyTracker) {
try {
if (latencyTracker != null) {
latencyTracker.dumpStackTracesStarted();
}
Slog.i(TAG, "dumpStackTraces pids=" + lastPids);
// Measure CPU usage as soon as we're called in order to get a realistic sampling
// of the top users at the time of the request.
Supplier<ArrayList<Integer>> extraPidsSupplier = processCpuTracker != null
? () -> getExtraPids(processCpuTracker, lastPids, latencyTracker) : null;
Future<ArrayList<Integer>> extraPidsFuture = null;
if (extraPidsSupplier != null) {
extraPidsFuture =
CompletableFuture.supplyAsync(extraPidsSupplier, auxiliaryTaskExecutor);
}
final File tracesDir = new File(ANR_TRACE_DIR);
// NOTE: We should consider creating the file in native code atomically once we've
// gotten rid of the old scheme of dumping and lot of the code that deals with paths
// can be removed.
File tracesFile;
try {
tracesFile = createAnrDumpFile(tracesDir);
} catch (IOException e) {
Slog.w(TAG, "Exception creating ANR dump file:", e);
if (logExceptionCreatingFile != null) {
logExceptionCreatingFile.append(
"----- Exception creating ANR dump file -----\n");
e.printStackTrace(new PrintWriter(logExceptionCreatingFile));
}
if (latencyTracker != null) {
latencyTracker.anrSkippedDumpStackTraces();
}
return null;
}
if (subject != null || criticalEventSection != null || memoryHeaders != null) {
appendtoANRFile(tracesFile.getAbsolutePath(),
(subject != null ? "Subject: " + subject + "\n" : "")
+ (memoryHeaders != null ? memoryHeaders + "\n\n" : "")
+ (criticalEventSection != null ? criticalEventSection : ""));
}
long firstPidEndPos = dumpStackTraces(
tracesFile.getAbsolutePath(), firstPids, nativePidsFuture,
extraPidsFuture, firstPidFilePromise, latencyTracker);
if (firstPidEndOffset != null) {
firstPidEndOffset.set(firstPidEndPos);
}
// Each set of ANR traces is written to a separate file and dumpstate will process
// all such files and add them to a captured bug report if they're recent enough.
maybePruneOldTraces(tracesDir);
return tracesFile;
} finally {
if (latencyTracker != null) {
latencyTracker.dumpStackTracesEnded();
}
}
}
This dumps the traces of system_server plus the interesting native processes (historically the three processes mediaserver, sdcard, and surfaceflinger; in current releases the list comes from getInterestingNativePids()).
4.3 doSysRq
private void doSysRq(char c) {
try {
FileWriter sysrq_trigger = new FileWriter("/proc/sysrq-trigger");
sysrq_trigger.write(c);
sysrq_trigger.close();
} catch (IOException e) {
Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
}
}
Writing a character to the /proc/sysrq-trigger node asks the kernel to dump all blocked tasks ('w') and the backtraces of all CPUs ('l') to the kernel log.
4.4 dropBox
The report is written to /data/system/dropbox. For a full watchdog the resulting dropbox entry is tagged system_server_watchdog (the half-timeout dump uses the pre_watchdog tag instead, per the code in Section 4.1) and contains the traces plus the corresponding blocked-state information.
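As a side note, a privileged component can read such an entry back through the public DropBoxManager API. The sketch below assumes the READ_LOGS permission is held; the tag string follows from the code in Section 4.1, and readWatchdogReport is a hypothetical helper, not an existing framework method.

import android.content.Context;
import android.os.DropBoxManager;

public class WatchdogDropBoxReader {
    /** Returns the text of the first watchdog report newer than 'sinceMillis', or null. */
    static String readWatchdogReport(Context context, long sinceMillis) {
        DropBoxManager dbm = context.getSystemService(DropBoxManager.class);
        DropBoxManager.Entry entry = dbm.getNextEntry("system_server_watchdog", sinceMillis);
        if (entry == null) return null;
        try {
            return entry.getText(64 * 1024); // read up to 64 KB of the report
        } finally {
            entry.close();
        }
    }
}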
4.5 killProcess
Process.killProcess completes the kill by sending signal 9 (SIGKILL) to the target process.
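A small illustration of that statement (Android framework classes assumed; the class and helper names are invented): both calls below deliver signal 9 to the target pid.

import android.os.Process;
import android.system.OsConstants;

public class KillIllustration {
    // Both helpers deliver signal 9 (SIGKILL) to 'pid'.
    static void killViaKillProcess(int pid) {
        Process.killProcess(pid);
    }

    static void killViaSignal(int pid) {
        Process.sendSignal(pid, OsConstants.SIGKILL);
    }
}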
Killing system_server causes the zygote process to exit, which in turn makes init restart zygote; this is what appears as a framework ("soft") reboot of the device.
5. Summary
Watchdog is a thread named "watchdog" running in the system_server process:
- Operation: when a monitored thread or monitor stays blocked beyond the timeout (one minute by default), the watchdog fires, kills system_server, and thereby triggers a framework restart;
- mHandlerCheckers holds the list of all HandlerChecker objects, covering the handlers of the foreground, main, ui, i/o, and display threads (plus the animation and surface-animation threads added in the constructor);
- mMonitorChecker.mMonitors holds all Monitor objects the Watchdog currently watches; these monitors are all checked on the dedicated "watchdog.monitor" thread (android.fg in older releases).
- There are two ways to put something under Watchdog supervision:
  - addThread(): monitors a Handler thread, with a default timeout of 60s; such a timeout usually means the corresponding handler thread is processing a message too slowly;
  - addMonitor(): monitors a service that implements the Watchdog.Monitor interface; such a timeout can mean the monitor thread itself is slow, or that monitor() is stuck waiting for a lock.
In the following cases system_server is not killed even if the watchdog fires:
- monkey: an IActivityController is registered and intercepts the systemNotResponding event, as monkey does (see the sketch after this list);
- hang: an "am hang" command was executed, so restart is disallowed;
- debugger: a debugger is attached, so the process is not killed.
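Below is a hedged sketch of the monkey-style interception mentioned in the first item. HangInterceptor is a made-up name, IActivityController is a platform-internal (hidden) interface, and registering the controller (via ActivityManager from a privileged context) is omitted; the point is that returning a value >= 0 from systemNotResponding() makes run() keep waiting instead of killing system_server.

import android.app.IActivityController;
import android.content.Intent;

public class HangInterceptor extends IActivityController.Stub {
    @Override
    public boolean activityStarting(Intent intent, String pkg) {
        return true;
    }

    @Override
    public boolean activityResuming(String pkg) {
        return true;
    }

    @Override
    public boolean appCrashed(String processName, int pid, String shortMsg, String longMsg,
            long timeMillis, String stackTrace) {
        return true;
    }

    @Override
    public int appEarlyNotResponding(String processName, int pid, String annotation) {
        return 0;
    }

    @Override
    public int appNotResponding(String processName, int pid, String processStats) {
        return 0;
    }

    @Override
    public int systemNotResponding(String msg) {
        // 1 = keep waiting, -1 = kill system (see the run() loop in Section 3.1)
        return 1;
    }
}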
5.1 Monitored Handler Threads
Threads monitored by the Watchdog. DEFAULT_TIMEOUT is 60s; on debug builds it is shortened to 10s to surface potential ANR problems earlier.
| Thread name | Handler | Description | Timeout |
| --- | --- | --- | --- |
| main | new Handler(Looper.getMainLooper()) | main thread of system_server | 1 min |
| android.fg | FgThread.getHandler | foreground thread | 1 min |
| android.ui | UiThread.getHandler | UI thread | 1 min |
| android.io | IoThread.getHandler | I/O thread | 1 min |
| android.display | DisplayThread.getHandler | display thread | 1 min |
| ActivityManager | AMS.MainHandler | AMS thread | 1 min |
| PowerManagerService | PMS.PowerManagerHandler | PMS thread | 1 min |
| PackageManager | PKMS.PackageHandler | PKMS thread | 10 min |
The watchdog thus monitors (at least) these eight threads in system_server, in addition to the animation and surface-animation threads added in the constructor above (a registration sketch follows this list):
- the first seven must not take more than 1 minute to process a Looper message;
- the PackageManager thread must not take more than 10 minutes.
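For reference, here is a hedged sketch of how a service places its own handler thread under supervision via addThread(). MyBackgroundWorker and the thread name are invented, the code only compiles inside the platform source tree, and the two-argument overload is the one PackageManagerService uses to obtain its 10-minute timeout.

import android.os.Handler;
import android.os.HandlerThread;
import com.android.server.Watchdog;

public class MyBackgroundWorker {
    private static final long WATCHDOG_TIMEOUT_MS = 10 * 60 * 1000; // custom 10-minute timeout
    private final Handler mHandler;

    public MyBackgroundWorker() {
        HandlerThread thread = new HandlerThread("my.worker");
        thread.start();
        mHandler = new Handler(thread.getLooper());

        // Default 60s timeout:
        // Watchdog.getInstance().addThread(mHandler);
        // Or with a custom timeout, as PackageManagerService does:
        Watchdog.getInstance().addThread(mHandler, WATCHDOG_TIMEOUT_MS);
    }
}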
5.2 Monitoring Synchronization Locks
System services that can be monitored by the Watchdog implement the Watchdog.Monitor interface and its monitor() method. These monitors are checked on the "watchdog.monitor" thread (android.fg in older releases). The main implementers in the system include (a minimal sketch of the pattern follows the list):
- ActivityManagerService
- WindowManagerService
- InputManagerService
- PowerManagerService
- NetworkManagementService
- MountService
- NativeDaemonConnector
- BinderThreadMonitor
- MediaProjectionManagerService
- MediaRouterService
- MediaSessionService
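A minimal sketch of the Monitor pattern shared by the services above (MyManagerService and mLock are invented; this only compiles inside the platform source tree). Real services such as PowerManagerService implement monitor() as an empty synchronized block, so a deadlock on the service lock blocks the watchdog check and eventually fires the watchdog.

import com.android.server.Watchdog;

public class MyManagerService implements Watchdog.Monitor {
    private final Object mLock = new Object();

    public void onStart() {
        // Register so that monitor() is called periodically on the "watchdog.monitor" thread.
        Watchdog.getInstance().addMonitor(this);
    }

    @Override
    public void monitor() {
        // Grab and release the service lock; if another thread holds it in a deadlock,
        // this call never returns and the watchdog eventually fires.
        synchronized (mLock) {
        }
    }
}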
5.3 Output Information
When a check stays blocked past the timeout (one minute by default), the watchdog outputs:
- StackTracesDumpHelper.dumpStackTraces: traces of system_server and the interesting native processes
  - this is done twice, first at the half-timeout mark (30s) and again at the full timeout (1 min);
- doSysRq: asks the kernel to dump all blocked tasks and per-CPU backtraces to the kernel log
  - via the /proc/sysrq-trigger node;
- dropBox: a report written to /data/system/dropbox containing the traces plus the blocked-state information;
- finally it kills system_server, which makes zygote exit and restarts the upper framework.
That concludes the analysis; corrections and feedback are welcome.