Android Upper-Layer (Framework) WatchDog Study Notes, Part 2

Published: 2023-09-27 14:31:44 | Author: Hello-World3

I. Overview

1. Understanding how WatchDog works makes it easier to understand how system services run.


II. WatchDog Implementation

1. Source location

//frameworks/base/services/core/java/com/android/server/Watchdog.java

public class Watchdog extends Thread {
    ...
}

As shown, Watchdog is a thread.

2. WatchDog is started in SystemServer.java

run() //SystemServer.java
    startBootstrapServices() //SystemServer.java
        traceBeginAndSlog("StartWatchdog");
        final Watchdog watchdog = Watchdog.getInstance();
        watchdog.start();
        traceEnd();
        ...
        traceBeginAndSlog("InitWatchdog");
        watchdog.init(mSystemContext, mActivityManagerService);
        traceEnd();

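So Watchdog is an auxiliary thread running inside the system_server process. Because it is a thread, calling start() is all that is needed to get it running.

3. The WatchDog constructor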

private Watchdog() {
    super("watchdog");
    // Not checking the background thread; the shared foreground thread is the main checker. Thread name: "android.fg"
    mMonitorChecker = new HandlerChecker(FgThread.getHandler(), "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);
    // Add checker for main thread. Only do a quick check since there can be UI running on the thread.
    mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()), "main thread", DEFAULT_TIMEOUT));
    // Add checker for shared UI thread. Thread name: "android.ui"
    mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(), "ui thread", DEFAULT_TIMEOUT));
    // And also check IO thread. Thread name: "android.io"
    mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(), "i/o thread", DEFAULT_TIMEOUT));
    // And the display thread. Thread name: "android.display"
    mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(), "display thread", DEFAULT_TIMEOUT));
    // And the animation thread. Thread name: "android.anim"
    mHandlerCheckers.add(new HandlerChecker(AnimationThread.getHandler(), "animation thread", DEFAULT_TIMEOUT));
    // And the surface animation thread. Thread name: "android.anim.lf"
    mHandlerCheckers.add(new HandlerChecker(SurfaceAnimationThread.getHandler(), "surface animation thread", DEFAULT_TIMEOUT));

    // Initialize monitor for Binder threads.
    addMonitor(new BinderThreadMonitor());
    mOpenFdMonitor = OpenFdMonitor.create();

    HandlerThread handlerThread = new HandlerThread("workThread"); // the "workThread" thread under system_server
    handlerThread.start();
    mWorkHandler = new Handler(handlerThread.getLooper()) {
        @Override
        public void handleMessage(Message msg) {
            switch (msg.what) {
                case MESSAGE_AFE_CHECK_ERROR:
                    checkAfeStatus(false);
                    break;

                case MESSAGE_AFE_CHECK_OVER:
                    Slog.i(TAG, "release observer");
                    mFileObserver.stopWatching();
                    mFileObserver = null;
                    checkAfeStatus(true);
                    getLooper().quitSafely();
                    mWorkHandler = null;
                    break;
            }
        }
    };

    // See the notes on DEFAULT_TIMEOUT.
    assert DB || DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
}

Focus on two objects: mMonitorChecker and mHandlerCheckers.

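The elements of the mHandlerCheckers list come from:

(1) The constructor: the checkers for FgThread, the main thread, UiThread, IoThread, DisplayThread, AnimationThread, and SurfaceAnimationThread are added there.

(2) External callers: Watchdog.getInstance().addThread(handler);

The monitors held by mMonitorChecker come from:

(1) External callers: Watchdog.getInstance().addMonitor(monitor);

(2) Special case: the constructor itself calls addMonitor(new BinderThreadMonitor());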
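The relationship between the two objects can be sketched in plain Java. This is a simplified model for illustration, not the AOSP classes: CheckerRegistrySketch, Checker, and the string-based addThread are hypothetical names, and only the wiring described above is reproduced (the foreground-thread checker is both the first element of the checker list and the single holder of all monitors).

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of how Watchdog keeps its checkers. */
final class CheckerRegistrySketch {
    /** Stand-in for Watchdog.Monitor. */
    interface Monitor {
        void monitor();
    }

    /** Stand-in for HandlerChecker: one watched thread plus the monitors run on it. */
    static final class Checker {
        final String name;
        final List<Monitor> monitors = new ArrayList<>();

        Checker(String name) {
            this.name = name;
        }
    }

    final List<Checker> handlerCheckers = new ArrayList<>();
    final Checker monitorChecker;

    CheckerRegistrySketch() {
        // The foreground-thread checker doubles as the monitor checker.
        monitorChecker = new Checker("foreground thread");
        handlerCheckers.add(monitorChecker);
        handlerCheckers.add(new Checker("main thread"));
        handlerCheckers.add(new Checker("ui thread"));
    }

    /** Like addThread(): every additional watched thread gets its own checker. */
    void addThread(String name) {
        handlerCheckers.add(new Checker(name));
    }

    /** Like addMonitor(): every monitor is attached to the one monitor checker. */
    void addMonitor(Monitor m) {
        monitorChecker.monitors.add(m);
    }

    public static void main(String[] args) {
        CheckerRegistrySketch registry = new CheckerRegistrySketch();
        registry.addThread("extra thread");
        registry.addMonitor(() -> { });
        System.out.println("checkers: " + registry.handlerCheckers.size()
                + ", monitors on fg checker: " + registry.monitorChecker.monitors.size());
    }
}
```

In this model only the monitor checker ever has a non-empty monitor list; the other checkers watch nothing but their thread's message queue.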


3. WatchDog's run() method

public void run() {
    while (true) {
        ...
        synchronized (this) {
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerChecker hc = mHandlerCheckers.get(i);
                hc.scheduleCheckLocked();
            }            
        }
        ...
    }
    ...
    // Trigger the kernel to dump all blocked threads, and backtraces
    // on all CPUs to the kernel log
    doSysRq('w');
    doSysRq('l');
    ...
    Thread dropboxThread = new Thread("watchdogWriteToDropbox") { ... };
    dropboxThread.start();
    ...
}

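The loop checks every element of mHandlerCheckers. If one of them is found to be stuck, it triggers two sysrq commands, show-blocked-tasks (w) and show-backtrace-all-active-cpus (l), to capture backtraces of the threads in D (uninterruptible sleep) state and of all active CPUs.


4. HandlerChecker's scheduleCheckLocked()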

public void scheduleCheckLocked() {
    if (mCompleted) {
        // Safe to update monitors in queue, Handler is not in the middle of work
        mMonitors.addAll(mMonitorQueue);
        mMonitorQueue.clear();
    }
    if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) || (mPauseCount > 0)) {
        mCompleted = true;
        return;
    }
    if (!mCompleted) {
        // we already have a check in flight, so no need
        return;
    }

    mCompleted = false;
    mCurrentMonitor = null;
    mStartTime = SystemClock.uptimeMillis();
    mHandler.postAtFrontOfQueue(this);
}

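The mMonitors.size() == 0 case covers the checkers in mHandlerCheckers that only watch a thread's message queue. The technique used is mHandler.getLooper().getQueue().isPolling(): if the queue is idle and polling for new messages, the thread cannot be stuck, so the check is marked completed immediately.

The monitor list of mMonitorChecker always has more than zero elements (the constructor adds a BinderThreadMonitor), so for it the point of interest is mHandler.postAtFrontOfQueue(this), which puts the checker itself at the front of the target thread's message queue so that its run() method executes on that thread.


5. HandlerChecker's run()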
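The "post yourself to the front of the queue" probe can be tried outside Android. The sketch below is a plain-Java stand-in under stated assumptions: Worker replaces the Looper thread with a thread draining a BlockingDeque, Probe replaces HandlerChecker, and addFirst() plays the role of postAtFrontOfQueue(). All class and method names are hypothetical.

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

final class FrontOfQueueProbeSketch {
    /** Waits for the latch; returns false on timeout or interruption. */
    static boolean await(CountDownLatch latch, long timeoutMs) {
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Stand-in for a Looper thread: runs queued tasks one at a time. */
    static final class Worker {
        final BlockingDeque<Runnable> queue = new LinkedBlockingDeque<>();

        Worker() {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        queue.takeFirst().run();
                    }
                } catch (InterruptedException e) {
                    // Interrupted: let the worker exit.
                }
            }, "sketch-worker");
            t.setDaemon(true);
            t.start();
        }
    }

    /** Stand-in for HandlerChecker: completes only when the worker gets to run it. */
    static final class Probe implements Runnable {
        private final Worker worker;
        private boolean completed = true;
        private CountDownLatch done = new CountDownLatch(0);

        Probe(Worker worker) {
            this.worker = worker;
        }

        synchronized void scheduleCheck() {
            if (!completed) {
                return; // a check is already in flight
            }
            completed = false;
            done = new CountDownLatch(1);
            worker.queue.addFirst(this); // jump ahead of pending work
        }

        @Override
        public synchronized void run() {
            completed = true;
            done.countDown();
        }

        synchronized boolean isCompleted() {
            return completed;
        }

        boolean awaitCompletion(long timeoutMs) {
            CountDownLatch latch;
            synchronized (this) {
                latch = done;
            }
            return await(latch, timeoutMs);
        }
    }

    public static void main(String[] args) {
        Probe probe = new Probe(new Worker());
        probe.scheduleCheck();
        System.out.println("idle worker answered: " + probe.awaitCompletion(1000));
    }
}
```

Posting at the front matters because the probe then measures only the task currently executing on the thread, not the backlog queued behind it.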

public final class HandlerChecker implements Runnable {
    ...
    @Override
    public void run() {
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            mCurrentMonitor.monitor();
        }

        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
    ...
}

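The technique used here is to call each registered monitor's monitor() method in turn.

(1) monitor() is called on the entries of mMonitors, and the only checker with a non-empty list is mMonitorChecker. For example, various services add themselves to the list through addMonitor: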

Watchdog.getInstance().addMonitor(this); //ActivityManagerService.java
Watchdog.getInstance().addMonitor(this); //InputManagerService.java
Watchdog.getInstance().addMonitor(this); //PowerManagerService.java
Watchdog.getInstance().addMonitor(this); //WindowManagerService.java

The monitor method that gets executed is very simple. For example, the one in ActivityManagerService:

public void monitor() {
    synchronized (this) { }
}

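This only checks whether the system service's lock has been held for a long time: monitor() cannot return until it has acquired the lock.

(2) Special case: the BinderThreadMonitor check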
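Why an empty synchronized block works as a health check can be seen in a small plain-Java experiment. This is a sketch with hypothetical names (LockProbeSketch, Service, longOperation, probe), not framework code: while another thread holds the service's lock, monitor() cannot get through, and the caller notices that as a timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LockProbeSketch {
    /** Waits for the latch; returns false on timeout or interruption. */
    static boolean await(CountDownLatch latch, long timeoutMs) {
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Hypothetical service whose operations synchronize on `this`. */
    static final class Service {
        /** Holds the service lock until `release` is counted down (or 10 s pass). */
        synchronized void longOperation(CountDownLatch entered, CountDownLatch release) {
            entered.countDown();
            await(release, 10_000);
        }

        /** Same shape as the monitor() above: returns once the lock can be taken. */
        void monitor() {
            synchronized (this) { }
        }
    }

    /** Runs monitor() on a helper thread; true if it got through within timeoutMs. */
    static boolean probe(Service service, long timeoutMs) {
        CountDownLatch passed = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            service.monitor();
            passed.countDown();
        }, "sketch-probe");
        t.setDaemon(true);
        t.start();
        return await(passed, timeoutMs);
    }

    public static void main(String[] args) {
        Service service = new Service();
        System.out.println("lock free, probe passed: " + probe(service, 1000));
    }
}
```

The probe tells the caller nothing about what the service is doing; it only shows that the lock was not available in time, which is exactly the signal Watchdog acts on.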

private static final class BinderThreadMonitor implements Watchdog.Monitor {
    @Override
    public void monitor() {
        Binder.blockUntilThreadAvailable();
    }
}

//frameworks/base/core/java/android/os/Binder.java
public static final native void blockUntilThreadAvailable();

//frameworks/native/libs/binder/IPCThreadState.cpp
void IPCThreadState::blockUntilThreadAvailable()
{
    pthread_mutex_lock(&mProcess->mThreadCountLock);
    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
        ALOGW("Waiting for thread to be free. mExecutingThreadsCount=%lu mMaxThreads=%lu\n",
                static_cast<unsigned long>(mProcess->mExecutingThreadsCount),
                static_cast<unsigned long>(mProcess->mMaxThreads));
        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
    }
    pthread_mutex_unlock(&mProcess->mThreadCountLock);
}

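This only checks that the number of Binder threads currently executing in the process is below mMaxThreads; once it has reached the maximum (31 for system_server), the call has to wait. By default each process has at most 15 binder threads, but system_server (SS) raises its own limit to 31: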

//frameworks/native/libs/binder/ProcessState.cpp
#define DEFAULT_MAX_BINDER_THREADS 15

//frameworks/base/services/java/com/android/server/SystemServer.java
public final class SystemServer {
    private static final int sMaxBinderThreads = 31;

    private void run() {
        BinderInternal.setMaxThreads(sMaxBinderThreads); // set before any service is started
        ...
        startBootstrapServices();
    }
}
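The wait in IPCThreadState::blockUntilThreadAvailable() is a standard condition-variable loop. A rough Java equivalent is sketched here for illustration only: ThreadSlotGateSketch, beginCall, and endCall are hypothetical names, and the timeout parameter is an addition of this sketch (the native function waits indefinitely).

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

final class ThreadSlotGateSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition countDecremented = lock.newCondition();
    private final int maxThreads;
    private int executing;

    ThreadSlotGateSketch(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    /** A pool thread starts handling a call. */
    void beginCall() {
        lock.lock();
        try {
            executing++;
        } finally {
            lock.unlock();
        }
    }

    /** A pool thread finished a call; wake up anyone waiting for a free thread. */
    void endCall() {
        lock.lock();
        try {
            executing--;
            countDecremented.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Waits while executing >= maxThreads, like the native loop.
     * Returns false if timeoutMs elapses first (assumption of this sketch).
     */
    boolean blockUntilThreadAvailable(long timeoutMs) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (executing >= maxThreads) {
                if (nanos <= 0) {
                    return false;
                }
                nanos = countDecremented.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        ThreadSlotGateSketch gate = new ThreadSlotGateSketch(31);
        gate.beginCall();
        System.out.println("thread available: " + gate.blockUntilThreadAvailable(100));
    }
}
```

When every pool thread is busy, the monitor call never returns, so the foreground checker fails to complete and Watchdog treats the exhausted Binder pool like any other hang.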


6. What WatchDog does after a timeout

private void checkAfeStatus(boolean success) {
    ...
    public void run() {
        ...
        Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
        WatchdogDiagnostics.diagnoseCheckers(blockedCheckers);
        Slog.w(TAG, "*** GOODBYE!");
        Process.killProcess(Process.myPid());
        System.exit(10);
    }
    ...
}

It kills its own process (system_server) and exits.


III. WatchDog Log Output

1. process stack traces

The save path is controlled by dalvik.vm.stack-trace-file or dalvik.vm.stack-trace-dir and is normally /data/anr. The traces are written by calling ActivityManagerService.dumpStackTraces().

public class Watchdog extends Thread { //Watchdog.java
    public void run() {
        while (true) {
            ...
            if (!fdLimitTriggered) {
                ...
                if (waitState == WAITED_HALF) {
                    if (!waitedHalf) {
                        Slog.i(TAG, "WAITED_HALF");
                        // We've waited half the deadlock-detection interval.  Pull a stack
                        // trace and wait another half.
                        ArrayList<Integer> pids = new ArrayList<Integer>();
                        pids.add(Process.myPid());
                        ActivityManagerService.dumpStackTraces(pids, null, null, getInterestingNativePids());
                    }
                    ...
                }
            }
            ...
            // Full timeout reached: dump the stacks again (a fresh pids list is built in the elided code).
            final File stack = ActivityManagerService.dumpStackTraces(pids, null, null, getInterestingNativePids());
            ...
        }
    }
}

Note that process stack traces are also dumped when half of the timeout has elapsed, i.e. in the WAITED_HALF state.
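How a checker's elapsed time maps to a wait state such as WAITED_HALF can be written as a small pure function. The sketch below is an illustration under assumptions: the state names other than WAITED_HALF, the half/full thresholds, and the "worst state wins" aggregation follow my reading of the AOSP Watchdog and are not shown in the excerpts above; WaitStateSketch, evaluate, and worst are hypothetical names.

```java
final class WaitStateSketch {
    /** Ordered from healthy to hung; the ordering is used by worst(). */
    enum State { COMPLETED, WAITING, WAITED_HALF, OVERDUE }

    /** Classifies one checker from its completion flag and time since it was scheduled. */
    static State evaluate(boolean completed, long elapsedMs, long timeoutMs) {
        if (completed) {
            return State.COMPLETED;
        }
        if (elapsedMs < timeoutMs / 2) {
            return State.WAITING;
        }
        if (elapsedMs < timeoutMs) {
            return State.WAITED_HALF; // first stack dump happens here
        }
        return State.OVERDUE; // second dump, then the process is killed
    }

    /** The watchdog acts on the worst state among all checkers. */
    static State worst(State... states) {
        State result = State.COMPLETED;
        for (State s : states) {
            if (s.ordinal() > result.ordinal()) {
                result = s;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long timeoutMs = 60_000; // assumed timeout, for illustration only
        System.out.println(evaluate(false, 35_000, timeoutMs));
    }
}
```

Dumping once at the halfway point and again at the full timeout gives two snapshots of the same hang, which helps to tell a thread that is truly stuck from one that is merely slow.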


2. slog

Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);

Slog.w(TAG, "*** GOODBYE!");


3. event log

EventLog.writeEvent(EventLogTags.WATCHDOG, subject);


4. kernel stack traces

// Trigger the kernel to dump all blocked threads, and backtraces on all CPUs to the kernel log
doSysRq('w');
doSysRq('l');

This triggers two sysrq commands, show-blocked-tasks (w) and show-backtrace-all-active-cpus (l), to capture backtraces of D-state threads and of all active CPUs; the output goes to the kernel log.
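The body of doSysRq() is not shown in these notes. On Linux a sysrq command is normally issued by writing its single command character to /proc/sysrq-trigger, so a helper of this kind plausibly looks like the sketch below. SysRqSketch is a hypothetical name, and the trigger path is passed as a parameter here only so that the code can be tried against an ordinary file.

```java
import java.io.FileWriter;
import java.io.IOException;

final class SysRqSketch {
    /** Writes one sysrq command character to the trigger file; returns false on failure. */
    static boolean doSysRq(String triggerPath, char c) {
        try (FileWriter writer = new FileWriter(triggerPath)) {
            writer.write(c);
            return true;
        } catch (IOException e) {
            System.err.println("Failed to write to " + triggerPath + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // On a real device the path would be /proc/sysrq-trigger and root access is required.
        String path = args.length > 0 ? args[0] : "/proc/sysrq-trigger";
        System.out.println("w: " + doSysRq(path, 'w'));
        System.out.println("l: " + doSysRq(path, 'l'));
    }
}
```

The resulting backtraces appear in the kernel log (dmesg), not in logcat, so they have to be collected separately when analysing a watchdog report.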


5. dropbox

Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
    public void run() {
        // If a watched thread hangs before init() is called, we don't have a
        // valid mActivity. So we can't log the error to dropbox.
        if (mActivity != null) {
            mActivity.addErrorToDropBox("watchdog", null, "system_server", null, null, null, subject, null, stack, null);
        }
        StatsLog.write(StatsLog.SYSTEM_SERVER_WATCHDOG_OCCURRED, subject);
    }
};
dropboxThread.start();

Note that dropbox entries are normally stored under /data/system/dropbox. The directory is specified here:

//frameworks/base/services/core/java/com/android/server/DropBoxManagerService.java

public DropBoxManagerService(final Context context) {
    this(context, new File("/data/system/dropbox"), FgThread.get().getLooper());
}

 

IV. Why UiThread, IoThread, DisplayThread, and FgThread Are Monitored

1. These four classes extend ServiceThread and are singletons. Take UiThread.java as an example:

//frameworks/base/services/core/java/com/android/server/UiThread.java

public final class UiThread extends ServiceThread {

    private UiThread() {
        super("android.ui", Process.THREAD_PRIORITY_FOREGROUND, false /*allowIo*/);
    }

    @Override
    public void run() {
        // Make sure UiThread is in the fg stune boost group
        Process.setThreadGroup(Process.myTid(), Process.THREAD_GROUP_TOP_APP);
        super.run();
    }

    private static void ensureThreadLocked() {
        if (sInstance == null) {
            sInstance = new UiThread();
            sInstance.start();
            final Looper looper = sInstance.getLooper();
            looper.setTraceTag(Trace.TRACE_TAG_SYSTEM_SERVER);
            looper.setSlowLogThresholdMs(SLOW_DISPATCH_THRESHOLD_MS, SLOW_DELIVERY_THRESHOLD_MS);
            sHandler = new Handler(sInstance.getLooper());
        }
    }

    public static UiThread get() {
        synchronized (UiThread.class) {
            ensureThreadLocked();
            return sInstance;
        }
    }

    public static Handler getHandler() {
        synchronized (UiThread.class) {
            ensureThreadLocked();
            return sHandler;
        }
    }
}

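(1) The instance is obtained through get().

(2) The Handler object belonging to each thread is obtained through getHandler().

(3) Note that ensureThreadLocked() calls start() right after creating the instance. In other words, the thread is already running by the time the object is first obtained.

Furthermore, all four classes extend ServiceThread, and ServiceThread extends HandlerThread. The key part is the Handler in each thread, because system services such as AMS, WMS, and PMS all make use of them.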
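The create-and-start-on-first-access pattern can be reproduced in plain Java. The sketch below is a stand-in under assumptions: it has no Looper or Handler, a BlockingQueue of Runnables takes their place, and SingletonThreadSketch, post, and postAndWait are hypothetical names.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class SingletonThreadSketch extends Thread {
    private static SingletonThreadSketch sInstance;
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

    private SingletonThreadSketch() {
        super("sketch.ui");
        setDaemon(true);
    }

    @Override
    public void run() {
        try {
            while (true) {
                queue.take().run();
            }
        } catch (InterruptedException e) {
            // Interrupted: let the thread exit.
        }
    }

    /** Caller must hold the class lock, mirroring ensureThreadLocked(). */
    private static void ensureThreadLocked() {
        if (sInstance == null) {
            sInstance = new SingletonThreadSketch();
            sInstance.start(); // started as soon as it is created
        }
    }

    static SingletonThreadSketch get() {
        synchronized (SingletonThreadSketch.class) {
            ensureThreadLocked();
            return sInstance;
        }
    }

    void post(Runnable task) {
        queue.add(task);
    }

    /** Posts a task and waits until it has run; false on timeout or interruption. */
    boolean postAndWait(Runnable task, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        post(() -> {
            task.run();
            done.countDown();
        });
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        get().postAndWait(
                () -> System.out.println("running on " + Thread.currentThread().getName()), 1000);
    }
}
```

Because every caller receives the same running thread, work posted by different services is serialized on it, so one slow task delays all the others.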

//frameworks/base/services/core/java/com/android/server/am/ActivityManagerService.java

final class UiHandler extends Handler {
    public UiHandler() {
        super(com.android.server.UiThread.get().getLooper(), null, true);
    }

    @Override
    public void handleMessage(Message msg) {
        switch (msg.what) {
            case SHOW_ERROR_UI_MSG: 
            case SHOW_NOT_RESPONDING_UI_MSG: 
            case SHOW_STRICT_MODE_VIOLATION_UI_MSG:
            case WAIT_FOR_DEBUGGER_UI_MSG:
            case DISPATCH_PROCESSES_CHANGED_UI_MSG:
            case DISPATCH_PROCESS_DIED_UI_MSG:
            case DISPATCH_UIDS_CHANGED_UI_MSG:
            case DISPATCH_OOM_ADJ_OBSERVER_MSG:
        }
    }
} 

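UiHandler takes its Looper directly from UiThread. A thread has exactly one Looper and one MessageQueue, but it can have multiple Handlers.

The handling in handleMessage shows that UI work is not necessarily done on the main thread. (Android's documentation says the UI must be updated on the main thread; more precisely, a view hierarchy may only be touched by the thread that created it, which in an app is normally the main thread.)


2. Differences in usage scenarios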

UiThread --> ActivityManagerService

DisplayThread --> WindowManagerService, InputManagerService, DisplayManagerService

IoThread --> PackageInstallerService, StorageManagerService, BluetoothManagerService

 

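V. Summary

1. The core objects of Watchdog are mHandlerCheckers and mMonitorChecker.

mHandlerCheckers: monitors whether message queues are blocked

mMonitorChecker: monitors whether core system services hold their locks for too long

For the checkers in mHandlerCheckers that have no monitors, a thread whose queue reports mHandler.getLooper().getQueue().isPolling() is idle and is treated as healthy at once; otherwise the checker posts itself to the front of the queue and counts as timed out if it does not get to run in time. mMonitorChecker detects a timeout by trying to take each service's lock with synchronized (this). Note in particular that BinderThreadMonitor instead blocks while the number of executing Binder threads has reached the process maximum.

2. After a timeout, the system prints a series of logs, and these outputs can be used for effective analysis.

3. After a timeout, Watchdog kills its own process, so the pid of the system_server process changes once it is restarted.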

 

 

 

Reference:
Android WatchDog principle analysis (Android internals blog): https://blog.csdn.net/weixin_28543661/article/details/117344345