
A Little Progress Every Day: Disk Failure Causes the container-sync Service to Exit (Swift Bug)

by admin on 2014-04-16 15:47:08



 Please credit the source when reposting: http://blog.csdn.net/cywosp/article/details/23848083


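    Some time ago I wrote a module for one of our projects that monitors the running state of every swift service. swift has 15 services in total: container-updater, account-auditor, object-replicator, proxy-server, container-replicator, object-auditor, object-expirer, container-auditor, container-server, account-server, account-reaper, container-sync, account-replicator, object-updater, and object-server. Of these, proxy-server, account-server, container-server, and object-server are the most important to monitor: if they stop working, the swift cluster can no longer serve external requests, so watching their status is essential when handling cluster faults.
    Recently, problems that surfaced while the monitoring module was running led me to discover a few small bugs in swift. One of them is that the container-sync service stops when a disk that has been added to swift fails. The log for this bug looks like this: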
Apr 15 10:07:24 0d7d51e8-024e-3a94-a310-46cf5426b3f9 container-sync UNCAUGHT EXCEPTION#012Traceback (most recent call last):#012 File "/usr/bin/swift-container-sync", line 23, in <module>#012 run_daemon(ContainerSync, conf_file, **options)#012 File "/usr/lib/python2.6/site-packages/swift/common/daemon.py", line 110, in run_daemon#012 klass(conf).run(once=once, **kwargs)#012 File "/usr/lib/python2.6/site-packages/swift/common/daemon.py", line 57, in run#012 self.run_forever(**kwargs)#012 File "/usr/lib/python2.6/site-packages/swift/container/sync.py", line 162, in run_forever#012 for path, device, partition in all_locs:#012 File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1521, in audit_location_generator#012 partitions = listdir(datadir_path)#012 File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1814, in listdir#012 return os.listdir(path)#012OSError: [Errno 5] Input/output error: '/srv/node/sdb1/containers'
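
The traceback above arrives as a single syslog line. Assuming the `#012` sequences are the syslog daemon's octal escape for a newline character (012 octal is LF), a minimal sketch for restoring the line breaks could look like this. The helper name `unescape_syslog` is hypothetical, and the sample string is an abbreviated copy of the log line above.

```python
def unescape_syslog(line):
    """Turn '#012' (assumed to be an escaped newline) back into a real newline."""
    return line.replace("#012", "\n")


# Abbreviated copy of the log line shown above.
sample = (
    "container-sync UNCAUGHT EXCEPTION"
    "#012Traceback (most recent call last):"
    "#012OSError: [Errno 5] Input/output error: '/srv/node/sdb1/containers'"
)

decoded = unescape_syslog(sample)
print(decoded)
```

Read this way, the last frame of the traceback is the `os.listdir` call, and the failing path is on sdb1.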


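From the log output we can see that an input/output error occurred on the sdb1 disk, which made the program raise an exception when it called the listdir function. listdir is implemented as follows: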
# swift/common/utils.py
import errno
import os

def listdir(path):
    try:
        return os.listdir(path)
    except OSError as err:
        # ENOENT: No such file or directory
        if err.errno != errno.ENOENT:
            # Any other error (for example EIO) is re-raised to the caller
            raise
    # The directory (path) does not exist: treat it as empty
    return []
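
To make the two code paths of listdir concrete, here is a small sketch that reproduces them without a broken disk. It copies the logic quoted above and uses `unittest.mock` to make `os.listdir` raise EIO; the stand-in `broken_listdir` and the demo paths are assumptions made for illustration only.

```python
import errno
import os
from unittest import mock


def listdir(path):
    # Same logic as the swift helper quoted above.
    try:
        return os.listdir(path)
    except OSError as err:
        if err.errno != errno.ENOENT:
            raise
    return []


# Path 1: a missing directory (ENOENT) is treated as an empty one.
missing = listdir("/no/such/dir/for/this/demo")


# Path 2: simulate a failing disk by making os.listdir raise EIO.
def broken_listdir(path):
    raise OSError(errno.EIO, "Input/output error", path)


with mock.patch("os.listdir", broken_listdir):
    try:
        listdir("/srv/node/sdb1/containers")
        outcome = "returned"
    except OSError as err:
        # EIO is not ENOENT, so listdir re-raises it.
        outcome = errno.errorcode[err.errno]

print(missing, outcome)
```

Only ENOENT is swallowed; every other OSError, including the EIO seen in the log, reaches the caller.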
listdir is called by the audit_location_generator function, which is implemented as follows:
# swift/common/utils.py
def audit_location_generator(devices, datadir, suffix='', mount_check=True, logger=None):
    device_dir = listdir(devices)
    # randomize devices in case of process restart before sweep completed
    shuffle(device_dir)
    for device in device_dir:
        ……
This function does not catch any exceptions, so whatever is raised inside it continues to propagate upward.
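
Assuming audit_location_generator is a generator function, as its name and the traceback suggest, its body does not run when it is called but only when the caller iterates over the result. That would explain why the traceback points at the `for` line of the caller. A minimal sketch of this behaviour, with `failing_generator` as a hypothetical stand-in:

```python
def failing_generator():
    # Stand-in for audit_location_generator: the body runs lazily.
    raise OSError(5, "Input/output error")
    yield  # unreachable, but it makes this function a generator


events = []

gen = failing_generator()  # nothing has run yet, so nothing is raised here
events.append("created")

try:
    for item in gen:  # the body runs on the first iteration and raises
        events.append("item")
except OSError:
    events.append("raised during iteration")

print(events)
```

A try/except placed only around the call that creates the generator would therefore not catch this error; the guard has to enclose the loop.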


audit_location_generator is called by the run_forever function, which is implemented as follows:
# swift/container/sync.py
def run_forever(self):
    sleep(random() * self.interval)
    while True:
        begin = time()
        all_locs = audit_location_generator(self.devices,
                                            container_server.DATADIR,
                                            '.db',
                                            mount_check=self.mount_check,
                                            logger=self.logger)
        for path, device, partition in all_locs:
            self.container_sync(path)
            if time() - self.reported >= 3600:  # once an hour
                self.report()
        elapsed = time() - begin
        if elapsed < self.interval:
            sleep(self.interval - elapsed)
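From these three functions and the way they call each other, we can see that run_forever does not catch exceptions either. If an unexpected exception is raised, run_forever exits abnormally and the corresponding process crashes.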
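
One possible way to keep such a daemon alive is to guard each pass of the loop and log the error instead of letting it escape. The following is only a sketch of that idea, not the fix adopted upstream; `run_sweeps`, `sweep`, `max_sweeps`, and `broken_sweep` are hypothetical names, and the loop is bounded so that the example terminates.

```python
import logging

logger = logging.getLogger("container-sync-sketch")


def run_sweeps(sweep, max_sweeps):
    """Call sweep() repeatedly; log unexpected errors instead of exiting.

    A real daemon would loop forever and sleep between passes; the loop is
    bounded here only so that the sketch finishes.
    """
    failures = 0
    for _ in range(max_sweeps):
        try:
            sweep()
        except Exception:
            # Record the traceback and carry on with the next pass.
            logger.exception("sweep failed; will retry on the next pass")
            failures += 1
    return failures


def broken_sweep():
    # Simulates one pass hitting the bad disk.
    raise OSError(5, "Input/output error", "/srv/node/sdb1/containers")


failed = run_sweeps(broken_sweep, max_sweeps=3)
print(failed)
```

A guard at this level only keeps the process running. The failing pass is still cut short, so devices that would have been visited after the bad one are skipped until the next pass.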


Entries recorded in /var/log/messages when the disk I/O error occurred:
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: scanning ...
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: end_request: I/O error, dev sdb, sector 976403386
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): metadata I/O error: block 0x3a32bb76 ("xlog_iodone") error 5 numblks 64
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1115 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa072c8b1
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): Please umount the filesystem and rectify the problem(s)
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_log_force: error 5 returned.
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: sd 0:2:1:0: [sdb] Synchronizing SCSI cache
Apr 15 10:06:54 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_log_force: error 5 returned.
Apr 15 10:07:24 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_log_force: error 5 returned.
Apr 15 10:07:54 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_log_force: error 5 returned.


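    Although this problem does not seriously affect the cluster as a whole, and disk failures are fairly rare these days, it does have a small impact on the overall health of the cluster and on the consistency of container data. I have therefore filed a bug on swift's official bug tracker; I do not know whether the maintainers will accept and fix it. For details, see: https://bugs.launchpad.net/swift/+bug/1307798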


