On code isolation in Python
I started learning Python in 2009 with a fairly challenging task and a somewhat unusual use of the language: a desktop application that used PyQt for the GUI and Python as the main language.
To hide the code, I embedded the Python interpreter into a standalone Windows executable. There are a lot of tools that do this (e.g. PyInstaller, py2exe), and they all work similarly. They compile your Python scripts to bytecode files and bundle them with an interpreter into an executable. Compiling scripts down to bytecode makes it harder for people with bad intentions to get the source code and crack or hack your software: the bytecode has to be extracted from the executable and decompiled, and decompilation can produce obfuscated code that is much harder to understand.
At one point, I wanted to add a plugin system so that users could benefit from extra features. Executing arbitrary third-party code on your server is dangerous. But can it harm your commercial product when the code runs on users' machines, and the users trust what they are executing? At that time, the answer was not obvious, and I decided to implement the system.
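As a rough illustration of what such bundles contain, the sketch below compiles a snippet to a code object and serializes it with marshal, the same serialization that .pyc files use after their header. The source string and the name secret_check are made up for this example; real bundlers add their own packaging on top.

```python
import dis
import marshal

# Hypothetical source that a vendor would like to keep private.
source = "def secret_check(key):\n    return key == 'a8f5f167'\n"

# Compile to a code object, as happens when a .pyc file is produced.
code = compile(source, "<hidden>", "exec")

# Serialize and restore the code object; the source text is not needed.
blob = marshal.dumps(code)
restored = marshal.loads(blob)

# The restored code object runs just like the original script.
namespace = {}
exec(restored, namespace)
result = namespace["secret_check"]("a8f5f167")

# Names, constants, and instructions are still readable from the bytecode.
dis.dis(restored)
```

Even without the source, the disassembly shows the function name and the string constant it compares against.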
A few years later, it became apparent that you should never execute third-party code using the same Python process (interpreter) if you don't want to leak the source code. There are a lot of commercial products that use Python for desktop software or as a scripting language. Some of them can be at risk.
There are many ways to extract Python bytecode even if you don't run any third-party code. It's a never-ending arms race between developers and reverse engineers, but it's much easier to extract the bytecode and crack your program when you can run your own code. My software was later cracked without using the plugins subsystem.
So what can you do to the "host" code when it executes your scripts?
Python is a very dynamic language, and you can do a lot of things. This article demonstrates a few ways to modify the host code or extract it.
When you work with a regular Python process, you don't even need a plugins system. You can always attach to a running process using GDB and inject your own code.
Monkey patching
If a plugin can be initialized before the function you want to modify is called, it can simply mock that function.
Let's suppose we have a function that validates a license:
import time
import requests

# get_hardware_id() is defined elsewhere in the same module
def validate_license():
    hw_hash, hw_id = get_hardware_id()
    data = {"timestamp": time.time(), "hid": hw_id}
    r = requests.post('https://rushter.com/validate', data)
    server_hash = r.text
    return hw_hash == server_hash
We can bypass the checks by replacing a few functions:
import types

def mock_licensing():
    requests = __import__('requests')
    licensing = __import__('licensing')

    def post(*args, **kwargs):
        # Fake response object that only exposes the .text attribute
        mocked_object = types.SimpleNamespace()
        mocked_object.text = "a8f5f167"
        return mocked_object

    # Make the local hash match the fake server response
    licensing.get_hardware_id = lambda: ("a8f5f167", 123)
    requests.post = post
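To try the idea without a network connection, here is a minimal self-contained sketch. The licensing module is built in memory, and fetch_server_hash is a hypothetical stand-in for the HTTP request; neither name comes from a real library.

```python
import types

# Hypothetical stand-in for the host's licensing module.
licensing = types.ModuleType("licensing")

def get_hardware_id():
    return ("real-hash", 42)

def fetch_server_hash(hw_id):
    # Stands in for the network request made by the real function.
    return "server-hash"

def validate_license():
    # Attributes are looked up on the module at call time,
    # so later replacements are picked up automatically.
    hw_hash, hw_id = licensing.get_hardware_id()
    return hw_hash == licensing.fetch_server_hash(hw_id)

licensing.get_hardware_id = get_hardware_id
licensing.fetch_server_hash = fetch_server_hash
licensing.validate_license = validate_license

before = licensing.validate_license()

# A plugin that runs first replaces both helpers.
licensing.get_hardware_id = lambda: ("a8f5f167", 123)
licensing.fetch_server_hash = lambda hw_id: "a8f5f167"

after = licensing.validate_license()
```

The validation function itself is never touched; only the names it depends on are rebound.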
Frame objects
In Python, frame objects keep the state of functions that are currently running. Each frame corresponds to a single function execution. Module-level code and class definitions run in frames too. Frames are the building blocks of the call stack.
Given a frame object, you can:
- Change locals, globals, and builtins at runtime.
- Get bytecode of a function (code block) that is being executed.
Here is how you can list all frames in the current call stack:
import sys

def list_frames():
    current_frame = sys._getframe(0)
    # Walk up the call stack, including the outermost frame
    while current_frame is not None:
        print(f"""
        locals: {current_frame.f_locals}
        globals: {current_frame.f_globals}
        bytecode: {current_frame.f_code.co_code}
        function name: {current_frame.f_code.co_name}
        line number: {current_frame.f_lineno}
        """)
        current_frame = current_frame.f_back
The documentation of the inspect module describes all available attributes of frame and code objects.
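As a small sketch of why this matters for plugins, the callback below reads the local variables of whatever function called it. The names host, spy, and api_key are invented for the example.

```python
import sys

captured = {}

def spy(_message):
    # Frame 1 is the function that called this callback.
    caller = sys._getframe(1)
    captured["function"] = caller.f_code.co_name
    # Copy the caller's local variables as they are at the time of the call.
    captured["locals"] = dict(caller.f_locals)

def host(logger):
    api_key = "s3cr3t"  # hypothetical sensitive value
    logger("starting")

host(spy)
```

After the call, captured holds the name of the host function and a copy of its locals, including api_key.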
Changing locals
Let's suppose we have a function that calls a callback, and we have control over the callback. For example, the path to the callback can be defined in a settings file.
import logging

def get_amount():
    return 10

def update_database(user, amount):
    pass

def charge_user_for_subscription(user, logger=logging.info):
    amount = get_amount()
    print(amount)
    logger(amount)
    update_database(user, amount)
    print(amount)
The last function charges a user for a monthly subscription and lets you specify a custom logging callback. I've added a few prints so that you can copy-paste the code and see the results.
Since logging happens in the middle, we can modify the amount variable.
import ctypes
import sys

def fix_amount(_):
    # Get the parent frame (the function that called this callback)
    frame = sys._getframe(1)
    # Update the locals dictionary
    frame.f_locals['amount'] = -100
    # Copy the dictionary values back into the frame's fast locals
    ctypes.pythonapi.PyFrame_LocalsToFast(ctypes.py_object(frame), 0)
In [8]: charge_user_for_subscription('Ivan', fix_amount)
10
-100
Fast locals
Conceptually, local and global variables in Python are stored in dictionaries. Every time you use a variable, Python needs to look it up in a dictionary. Since dictionary lookups are not free and take time, Python uses various optimization techniques.
By analyzing the code of a function, it's possible to detect the variable names that the function will use when running. Our function has three local variables: amount, user, and logger. Function arguments are local variables too.
When compiling source code to bytecode, Python maps known variable names to indexes in a special array and stores the values there. Accessing a variable by an index is fast, and most functions use only predefined names. These optimized variables are called fast locals. To keep variable names that are generated on the go, Python uses a dictionary as a fallback.
When dereferencing variables, Python prioritizes fast locals and ignores changes in the dictionary. That's why we use ctypes and call the internal PyFrame_LocalsToFast function. (This describes CPython before 3.13. From 3.13 on, writes to frame.f_locals go straight through to fast locals, and the extra call is no longer needed.)
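The mapping from names to array slots can be inspected on any function. This is a minimal sketch with a made-up charge function; the exact opcode names differ between CPython versions, so the code only checks for the FAST family of instructions.

```python
import dis

def charge(user, logger=print):
    amount = 10
    logger(amount)
    return amount

# Arguments come first, then the remaining locals in order of first use.
fast_names = charge.__code__.co_varnames

# Opcodes that access those names carry an array index, not a dict key.
opnames = {ins.opname for ins in dis.get_instructions(charge)}
uses_fast = any("FAST" in name for name in opnames)
```

Here fast_names is ('user', 'logger', 'amount'), and an instruction such as STORE_FAST addresses amount by its position in that tuple.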
Patching bytecode
I have an article on bytecode patching that describes how to patch a function definition. We can go even further and patch a running function.
Instead of source code, the Python interpreter executes bytecode generated by a compiler. A virtual machine executes the instructions one by one, which allows us to replace instructions that have not been executed yet.
Let's use this function as an example:
def is_valid():
    return False

def check_license(callback):
    callback()
    if not is_valid():
        print('exiting')
        exit(0)
The built-in dis module allows us to see the bytecode in a human-readable format:
In [12]: check_license.__code__.co_code
Out[12]: b'|\x00\x83\x00\x01\x00t\x00\x83\x00s\x1ct\x01d\x01\x83\x01\x01\x00t\x02d\x02\x83\x01\x01\x00d\x00S\x00'
In [13]: dis.dis(check_license)
  6           0 LOAD_FAST                0 (callback)
              2 CALL_FUNCTION            0
              4 POP_TOP

  7           6 LOAD_GLOBAL              0 (is_valid)
              8 CALL_FUNCTION            0
             10 POP_JUMP_IF_TRUE        28

  8          12 LOAD_GLOBAL              1 (print)
             14 LOAD_CONST               1 ('exiting')
             16 CALL_FUNCTION            1
             18 POP_TOP

  9          20 LOAD_GLOBAL              2 (exit)
             22 LOAD_CONST               2 (0)
             24 CALL_FUNCTION            1
             26 POP_TOP
        >>   28 LOAD_CONST               0 (None)
             30 RETURN_VALUE
Our license is not valid, and we want to cancel out the not operator in the condition. To do so, we need to replace the POP_JUMP_IF_TRUE instruction with POP_JUMP_IF_FALSE.
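The position of that instruction does not have to be read off the listing by hand. As a sketch, dis.get_instructions can locate the first conditional jump; the variant of check_license below returns strings instead of exiting so that it is safe to experiment with, and the opcode name and offset you get depend on the CPython version.

```python
import dis

def is_valid():
    return False

def check_license(callback):
    callback()
    if not is_valid():
        return "exiting"
    return "ok"

# Find the first conditional jump instead of hardcoding its position.
jump = next(
    ins for ins in dis.get_instructions(check_license)
    if "POP_JUMP" in ins.opname
)
jump_offset = jump.offset
jump_name = jump.opname

# The byte stored at that offset in co_code is the opcode number itself.
opcode_byte = check_license.__code__.co_code[jump_offset]
```

On the interpreter used for the listing above, this would report POP_JUMP_IF_TRUE at offset 10.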
Since we can control the callback function, we can apply a hot patch in the middle of a function.
import ctypes
import dis
import sys

def fix():
    # Get the parent frame
    frame = sys._getframe(1)
    # Address of the raw bytecode: object address plus the bytes header size
    memory_offset = id(frame.f_code.co_code) + sys.getsizeof(b'') - 1
    # Overwrite the opcode at byte offset 10 (POP_JUMP_IF_TRUE in the listing above)
    ctypes.memset(memory_offset + 10, dis.opmap['POP_JUMP_IF_FALSE'], 1)

if __name__ == '__main__':
    check_license(fix)
As you can see above, internally, the bytecode is stored as bytes. Unfortunately, we can't assign to the frame.f_code.co_code attribute since it's read-only.
To bypass this restriction, we use the ctypes module, which allows us to modify the memory of the Python process. Every bytes object contains meta-information, such as the number of references and information about the type. To locate the exact address of the raw C string, we use the id function, which returns the object's address in memory, and we skip all the meta-information (the size of an empty byte string, minus its one-byte terminator). The output from dis.dis shows that the POP_JUMP_IF_TRUE instruction we need to replace is at byte offset 10 in the byte string.
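The address arithmetic can be checked safely by reading memory instead of writing it. This sketch relies on CPython implementation details (id returning the object address, and the layout of bytes objects), so treat it as an experiment rather than a portable technique; payload is just sample data.

```python
import ctypes
import sys

payload = b"bytecode-like data"

# CPython detail: id() is the object's address, and sys.getsizeof(b"")
# is the header size plus one byte for the trailing NUL terminator.
header_size = sys.getsizeof(b"") - 1
data_address = id(payload) + header_size

# Read the raw buffer back without going through the bytes API.
raw = ctypes.string_at(data_address, len(payload))
```

If the computed address is right, raw is equal to payload, which confirms where the character data starts.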
Extracting the source code
Every script that Python runs or imports creates a module object that stores constants, functions, class definitions, and so on. If you don't have source code, you can get it back by decompiling the bytecode.
Here is how you can iterate over all modules and find all available functions:
import inspect
import sys

for name, module in list(sys.modules.items()):
    # Only look at the modules of interest
    if name not in ['license', 'runner']:
        continue
    for obj_name, obj in inspect.getmembers(module):
        if inspect.isfunction(obj):
            print(obj_name, obj.__code__)
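Once you have the code objects, they can be serialized for offline analysis. The sketch below uses an in-memory module named license_demo with a made-up check function in place of real host code, and dumps each function's code object with marshal, the serialization used inside .pyc files.

```python
import inspect
import marshal
import sys
import types

# Hypothetical module standing in for compiled host code.
license_mod = types.ModuleType("license_demo")
exec("def check(key):\n    return key == 'a8f5f167'\n", license_mod.__dict__)
sys.modules["license_demo"] = license_mod

dumped = {}
for obj_name, obj in inspect.getmembers(sys.modules["license_demo"]):
    if inspect.isfunction(obj):
        # Serialized code objects can be saved and inspected later.
        dumped[obj_name] = marshal.dumps(obj.__code__)

# The constant used in the comparison is recoverable from the code object.
constants = marshal.loads(dumped["check"]).co_consts
```

The secret string shows up in constants even though the plugin never saw the source file.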