Self modifying code is a programming technique where the program
modifies itself as it runs. The technique is generally frowned upon
except in extremely limited forms, and modern computer architectures
have made it largely impossible, undesirable, or useless.
Self modifying code was most useful on architectures with a
very limited number of registers and limited (less than 64k) ram.
- ways to self modify code:
- store loop index in instruction
- save memory & registers
- modify instruction as a flag
- replace NOP's with instructions or vice versa to add or remove operations
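These tricks are easier to see in a toy interpreter than in real machine code. Here is a minimal Python sketch (the VM, its opcodes, and the program are all invented for illustration) of the first trick: the loop counter lives inside an instruction operand instead of in a register or variable.

```python
def run(program):
    """Toy VM: each instruction is a mutable list [opcode, *operands]."""
    out = []
    pc = 0
    while pc < len(program):
        op = program[pc]
        if op[0] == "EMIT":            # append own operand to the output
            out.append(op[1])
            pc += 1
        elif op[0] == "DEC":           # self-modification: decrement the
            program[op[1]][1] -= 1     # operand of another instruction
            pc += 1
        elif op[0] == "JNZ":           # jump if the named instruction's
            if program[op[1]][1] != 0: # operand is nonzero
                pc = op[2]
            else:
                pc += 1
    return out

# The loop counter is stored inside instruction 0's operand; no
# separate register or variable holds it.
prog = [
    ["EMIT", 3],    # 0: operand doubles as the loop counter
    ["DEC", 0],     # 1: decrement instruction 0's operand in place
    ["JNZ", 0, 0],  # 2: loop while instruction 0's operand is nonzero
]
print(run(prog))    # → [3, 2, 1]
```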
- Problems with self modifying code when used fully
- Self modifying code can be difficult to read. Sometimes this was
  done intentionally, for job security or as part of copy protection
  to make cracking the software harder.
- self modifying code can be tricky to debug, since it may
do different things each time you run it
- self modifying code is tricky to reuse, since it is not reentrant; what one run does depends on what the last one did
current architecture obstacles
- cpu instruction cache
- Instructions that are modified in memory are not updated in the cpu's instruction cache, so the old instructions continue to execute until the cache line is invalidated or evicted. This could be exploited, of course, but then you have to fully understand how the instruction cache works.
- read only text segments
- Executable code in memory may be marked as read only by the operating system so it can be shared...
- shared text segments
- Executable pages may be shared between separate processes, and thus modifying one page would affect other users' processes.
This is generally not allowed in multiuser operating systems.
- compiled code vs. machine language
- The instructions generated by the compiler are not necessarily known when the code is written, making it difficult to modify code that isn't generated yet.
modern uses of self modifying code
- runtime linker
- The linker may patch unresolved jump instructions in a jump table or in the code itself at or immediately before runtime;
an unresolved symbol may be expressed as a jump to a routine
that would backpatch the original jump to the correct address,
thus allowing demand linking.
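The backpatching idea can be sketched at a high level: a jump-table entry starts out pointing at a resolver stub that looks the symbol up, patches the table, and completes the call. Everything here (the dict standing in for the jump table, the names) is illustrative, not how any particular linker is written.

```python
def make_resolver(table, name, loader):
    def resolver(*args):
        fn = loader(name)     # resolve the symbol on first use...
        table[name] = fn      # ...backpatch the jump table...
        return fn(*args)      # ...and complete the original call
    return resolver

library = {"square": lambda x: x * x}   # stands in for unlinked code
jumps = {}
jumps["square"] = make_resolver(jumps, "square", library.__getitem__)

print(jumps["square"](4))                    # first call resolves: 16
print(jumps["square"] is library["square"])  # entry now patched: True
```

After the first call the resolver is gone; later calls go straight to the real routine, which is exactly the demand-linking behavior described above.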
- patch kernel to match cpu features available (fpu, etc.)
- The Linux kernel does (or at one time did) include
  cpu instructions, such as math instructions, that
  were not available on all cpus. When such an instruction is
  encountered the first time, a trap is generated and code
  is called to patch the instruction into a more efficient subroutine
  call that emulates the instruction, so later executions call the
  emulation directly instead of generating the trap.
- trampoline
- On the fly generation of temporary code which may load
or switch banks to run another piece of code; this was
especially popular in bank switched machines, where the
addressable memory was smaller than the available memory,
and in systems that used overlays.
- overflow exploits
- Many security holes are exploited by using potential buffer overruns in buggy code and modifying either the stack or the running code, sometimes even by putting a trampoline on the stack.
- polymorphic viruses or stealth viruses
- So called "polymorphic viruses" work by modifying their
own code to attempt to prevent virus checkers from finding them.
- genetic algorithms
- Genetic algorithms are inherently self modifying; "code" fragments are mixed and matched and mutated using a search algorithm (random search is common) until an ideal combination is found
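A minimal sketch of the idea in Python, evolving a bit string toward all ones; the fitness function, population size, and rates are arbitrary choices for illustration, and the "code" being mixed and mutated is just data here.

```python
import random

def evolve(length=12, pop_size=30, generations=200, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    fitness = sum                       # fitness = number of 1 bits
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        if fitness(pop[0]) == length:   # ideal combination found
            return pop[0]
        parents = pop[: pop_size // 2]  # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)     # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(length)] ^= 1  # point mutation
            children.append(child)
        pop = parents + children
    pop.sort(key=fitness, reverse=True)
    return pop[0]

best = evolve()
print(sum(best), best)
```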
Structured languages have better methods that give the same
advantages of self modifying code without actually modifying
existing code:
- eval
- Many languages, especially interpreted languages, have
eval, which will take a pregenerated string and run it as
program code, thus generating new code rather than modifying
existing code.
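For example, in Python (eval plays exactly the role described: the new code is generated as a string, and no existing code is touched):

```python
op = "+"                            # operator chosen at runtime
source = f"lambda a, b: a {op} b"   # generate new code as a string
add = eval(source)                  # compile it into a callable
print(add(2, 3))                    # → 5
```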
- function pointers
- Rather than modifying code in place, the code is called through a function pointer (an indirect jump in assembly) which is given a value at runtime. This has the advantage that type checking can still be done, but may be less efficient on some architectures.
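A small Python sketch (functions are first-class, so a plain variable plays the role of the function pointer; both routines are invented for illustration):

```python
def slow_square(x):
    return x * x          # original routine

def fast_double(x):
    return x << 1         # hypothetical replacement routine

handler = slow_square     # the "function pointer": an indirect jump target
print(handler(4))         # → 16
handler = fast_double     # retargeted at runtime; no code is modified
print(handler(4))         # → 8
```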
- dynamic linking and using DLLs or ld.so to add functions
- Some operating systems have support for linking in additional code at runtime, either via the use of function pointers to activate the code once linked in, or via unresolved symbols that cause the additional code to be automatically linked.
(This uses the same mechanism as shared libraries.)
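A rough Python analogue of loading extra code at runtime (the import machinery stands in for dlopen()/ld.so here; the plug-in source and names are made up):

```python
import importlib.util
import os
import tempfile

# Source for a "plug-in" that does not exist until runtime.
src = "def greet():\n    return 'hello from the plug-in'\n"

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "plugin.py")
    with open(path, "w") as f:
        f.write(src)                  # write the extra code to disk
    spec = importlib.util.spec_from_file_location("plugin", path)
    plugin = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(plugin)   # link it in at runtime

print(plugin.greet())   # → hello from the plug-in
```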
- overloading
- Some object oriented languages allow functions to be overloaded (defined multiple times in different ways), and linking of overloaded functions may actually change at runtime depending on what modules are loaded or the current context.
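Python's functools.singledispatch gives a runtime flavor of this: registering a new overload later changes which body a call site runs, without touching the call site itself (the describe function is invented for illustration):

```python
from functools import singledispatch

@singledispatch
def describe(x):
    return "something"       # fallback implementation

@describe.register
def _(x: int):
    return "an int"

print(describe(3))       # → an int
print(describe("hi"))    # → something

@describe.register       # an overload added "later", as if by a
def _(x: str):           # module loaded at runtime
    return "a string"

print(describe("hi"))    # → a string
```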
- thunk or closure or lazy evaluation
- Some languages (java, lisp, perl, others) allow code to be stored
in or with a variable; the key is that the thunk may be created and passed to another piece of code (carrying along with it some of its execution environment) where it is later executed, similar to a trampoline.
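A Python sketch of a thunk as a closure (names invented): the thunk captures part of the environment where it was created and is executed later, elsewhere.

```python
def make_thunk(n):
    def thunk():
        return n * 10     # n is carried along from the creating scope
    return thunk

later = make_thunk(4)     # created here, capturing n...
print(later())            # ...executed later, elsewhere: → 40
```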
This was brought to you by the Save Our Archaic
Technical Terms Society.