Self modifying code is a programming technique where the program modifies itself as it runs. This technique is generally frowned on except when used in extremely limited ways, and has been largely made impossible, undesirable, or useless by modern computer architectures. Self modifying code was most useful on architectures with a very limited number of registers and limited (less than 64k) ram.

  • ways to self modify code:
    store loop index in instruction
    save memory & registers
    modify an instruction so it doubles as a flag
    replace NOPs with instructions, or vice versa, to add or remove operations
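    The classic tricks above can be sketched with a toy interpreter whose program mutates its own instruction list. This is a minimal illustration only; the opcodes and program are invented for the example. Note the loop counter lives inside the LOOP instruction itself, and a NOP is patched into a real instruction before it executes.

```python
# A toy interpreter whose program modifies itself while running,
# mimicking two classic tricks: keeping a loop counter inside an
# instruction's operand, and patching a NOP into a real instruction.
# (Illustrative sketch only; the opcodes are invented.)

def run(program):
    out = []
    pc = 0
    while pc < len(program):
        instr = program[pc]
        op = instr[0]
        if op == "PRINT":
            out.append(instr[1])
            pc += 1
        elif op == "LOOP":            # operand doubles as the loop counter
            instr[1] -= 1             # the instruction modifies itself
            pc = instr[2] if instr[1] > 0 else pc + 1
        elif op == "PATCH":           # overwrite another instruction in place
            program[instr[1]] = instr[2]
            pc += 1
        elif op == "NOP":
            pc += 1
        else:
            raise ValueError(op)
    return out

prog = [
    ["PRINT", "hi"],                      # 0
    ["LOOP", 3, 0],                       # 1: counter stored in the instruction
    ["PATCH", 3, ["PRINT", "patched"]],   # 2: replace the NOP below
    ["NOP"],                              # 3
]
print(run(prog))   # → ['hi', 'hi', 'hi', 'patched']
```

    Running `run(prog)` a second time on the same list gives a different result, because the program was left modified; that is exactly the reentrancy problem described below.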
  • Problems with self modifying code when used extensively
    • Self modifying code can be difficult to read. Sometimes this was intentional, whether as job security or as part of copy protection to make cracking the software harder.
    • self modifying code can be tricky to debug, since it may do different things each time you run it
    • self modifying code is tricky to reuse, since it is not reentrant; what one run does depends on what the last one did
  • current architecture obstacles
    cpu instruction cache
    Instructions that are modified in memory are not automatically updated in the cpu's instruction cache, so the stale cached copies keep executing until the cache line is evicted or explicitly invalidated. This could be exploited, of course, but then you have to totally understand how the instruction cache works.
    read only text segments
    Executable code in memory may be marked as read only by the operating system so it can be shared; attempts to write to it then cause a protection fault.
    shared text segments
    Executable pages may be shared between separate processes, and thus modifying one page would affect other users' processes. This is generally not allowed in multiuser operating systems.
    compiled code vs. machine language
    The instructions generated by the compiler are not necessarily known when the code is written, making it difficult to modify code that isn't generated yet.
  • modern uses of self modifying code
    runtime linker
    The linker may patch unresolved jump statements in a jump table or in the code itself at or immediately before runtime; an unresolved symbol may be expressed as a jump to a routine that would backpatch the original jump to the correct address, thus allowing demand linking.
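    The backpatching idea can be sketched in a few lines of Python (all names here are invented for the example): an unresolved jump-table entry starts out pointing at a resolver stub, and the first call through it overwrites the slot with the real routine.

```python
# A sketch of demand linking through backpatching: each unresolved entry
# in the jump table starts out pointing at a stub; the first call
# resolves the symbol and overwrites the table slot, so later calls jump
# straight to the real routine.

def real_sqrt(x):
    return x ** 0.5

LIBRARY = {"sqrt": real_sqrt}    # stands in for symbols found at link time
jump_table = {}

def make_stub(name):
    def stub(*args):
        target = LIBRARY[name]       # resolve the symbol on first use
        jump_table[name] = target    # backpatch: replace the stub
        return target(*args)
    return stub

jump_table["sqrt"] = make_stub("sqrt")

print(jump_table["sqrt"](9.0))           # → 3.0 (first call resolves, then patches)
print(jump_table["sqrt"] is real_sqrt)   # → True (later calls bypass the resolver)
```

    This is essentially how lazy binding works in real dynamic linkers, with a table slot patched instead of a dict entry.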
    patch kernel to match cpu features available (fpu, etc.)
    The Linux kernel does (or at one time did) include cpu instructions and features, such as math instructions, that were not available on all cpus. When such an instruction is encountered the first time, a trap is generated and code is called that patches the instruction into a subroutine call which emulates it, so that subsequent executions call the emulation directly instead of trapping.
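    The trap-and-patch pattern can be mimicked with a dispatch table (names invented for the example): the slot for a missing "instruction" initially routes through a handler that installs the emulation in its place.

```python
# A rough analogy to trap-and-patch: the slot for an unavailable
# "instruction" initially points at a trap handler; the handler installs
# the emulation routine in its place, so later uses call the emulation
# directly instead of trapping again.

def emulate_fmul(a, b):
    # software emulation of the unavailable floating point multiply
    return a * b

def trap_fmul(a, b):
    # the "trap handler": patch the dispatch slot, then emulate this once
    ops["fmul"] = emulate_fmul
    return emulate_fmul(a, b)

ops = {"fmul": trap_fmul}    # starts out routed through the trap path

print(ops["fmul"](2.0, 3.0))        # → 6.0 (traps once and patches the slot)
print(ops["fmul"] is emulate_fmul)  # → True (no trap on subsequent calls)
```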
    on the fly code generation
    Temporary code may be generated on the fly to load or switch banks and run another piece of code; this was especially popular in bank switched machines, where the addressable memory was smaller than the available memory, and in systems that used overlays.
    overflow exploits
    Many security holes are exploited through buffer overruns in buggy code, modifying either the stack or the running code, sometimes even by placing a trampoline on the stack.
    polymorphic viruses or stealth viruses
    So called "polymorphic viruses" work by modifying their own code to attempt to prevent virus checkers from finding them.
    genetic algorithms
    Genetic algorithms are inherently self modifying; "code" fragments are mixed, matched, and mutated using a search algorithm (random search is common) until a sufficiently good combination is found.
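    A minimal genetic algorithm can be sketched in a few lines; here the "code" is just a bit string, and the target, population size, and mutation rate are arbitrary choices for the example.

```python
# Minimal genetic-algorithm sketch: "code" is a bit string, fitness is
# the number of bits matching a target, and the population is recombined
# and mutated until a perfect match appears. Purely illustrative.
import random

random.seed(0)                       # deterministic for the example
TARGET = [1, 0, 1, 1, 0, 1, 0, 0]    # the "ideal" fragment we search for

def fitness(genome):
    return sum(g == t for g, t in zip(genome, TARGET))

def mutate(genome, rate=0.1):
    return [1 - g if random.random() < rate else g for g in genome]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

population = [[random.randint(0, 1) for _ in TARGET] for _ in range(20)]
for _ in range(500):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == len(TARGET):
        break                        # a perfect match was found
    parents = population[:10]        # keep the fittest half
    population = parents + [
        mutate(crossover(random.choice(parents), random.choice(parents)))
        for _ in range(10)
    ]

best = max(population, key=fitness)
print(best)
```

    Note that nothing here rewrites machine instructions; the "self modification" happens to the data that plays the role of code.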
  • Structured languages have better methods that give the same advantages of self modifying code without actually modifying existing code:
    eval
    Many languages, especially interpreted languages, have eval, which will take a pregenerated string and run it as program code, thus generating new code rather than modifying existing code.
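    For example, Python's exec (a statement-capable cousin of eval) can compile and run code built from a string at runtime:

```python
# Generating new code from a string instead of rewriting old code:
# a function is defined at runtime from generated source text.
source = "def square(x):\n    return x * x\n"
namespace = {}
exec(source, namespace)           # compile and run the generated code
print(namespace["square"](7))     # → 49
```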
    function pointers
    Rather than modifying code in place, the code is generated using a function pointer (an indirect jump in assembly) which is given a value at runtime. This has the advantage that type checking can still be done, but may be less efficient on some architectures.
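    Python has no raw function pointers, but a variable holding a function plays the same role: the call site is fixed, and only the target of the indirect call changes.

```python
# Python analogue of a function pointer: an indirect call through a
# variable assigned at runtime, instead of patching the call site.
def add(a, b):
    return a + b

def mul(a, b):
    return a * b

operation = add          # "function pointer" given a value at runtime
print(operation(3, 4))   # → 7
operation = mul          # retarget the indirect call; no code is modified
print(operation(3, 4))   # → 12
```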
    dynamic linking, using DLLs to add functions
    Some operating systems have support for linking in additional code at runtime, either via the use of function pointers to activate the code once linked in, or via unresolved symbols that cause the additional code to be automatically linked. (This uses the same mechanism as shared libraries.)
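    In Python the same idea appears as importlib, which loads a module chosen at runtime; the choice of the json module here is arbitrary, just to have something to link in.

```python
# Runtime linking, Python style: importlib loads a module selected at
# runtime, and its functions are reached through ordinary name lookup
# rather than by patching call sites.
import importlib

module_name = "json"                    # decided at runtime
mod = importlib.import_module(module_name)
encode = getattr(mod, "dumps")          # bind the symbol after loading
print(encode({"linked": True}))         # → {"linked": true}
```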
    Some object oriented languages allow functions to be overloaded (defined multiple times in different ways), and linking of overloaded functions may actually change at runtime depending on what modules are loaded or the current context.
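    The runtime choice among multiple definitions can be imitated in Python with functools.singledispatch, which picks an implementation by argument type when the call happens, and lets later-loaded code register new implementations after the fact:

```python
# Dispatch chosen at call time, in the spirit of overloading:
# implementations can be registered even after the generic function
# is defined, e.g. by a module loaded later.
from functools import singledispatch

@singledispatch
def describe(x):
    return "something"

@describe.register(int)
def _(x):
    return "an int"

@describe.register(list)
def _(x):
    return "a list"

print(describe(3))      # → an int
print(describe([]))     # → a list
print(describe(2.5))    # → something (falls back to the default)
```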
    thunk or closure or lazy evaluation
    Some languages (Java, Lisp, Perl, and others) allow code to be stored in or with a variable; the key is that the thunk may be created and passed to another piece of code (carrying with it some of its execution environment) where it is later executed, similar to a trampoline.
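    A Python closure shows the idea: the deferred code carries part of the environment it was created in, and runs later, possibly far from where it was made.

```python
# A closure as a thunk: the deferred code captures part of its creation
# environment and is executed later by whoever receives it.
def make_greeter(name):
    def thunk():
        return "hello, " + name   # 'name' travels with the thunk
    return thunk

later = make_greeter("world")     # created here...
print(later())                    # ...executed later → hello, world
```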

    This was brought to you by the Save Our Archaic Technical Terms Society.