Extrinsic self identity is a term I have coined to describe an idea that floats around in discussions of superintelligence, and particularly in writings about the orthogonality thesis, but which is often not explicitly recognized.

When speaking of self-identity, we have a strong sense that it is something intrinsic to the self; that is, I tend to believe that my identity is best served by my brain, or at least my mind, continuing to exist. A few people move away from this somewhat, and believe that their personality, spirit or soul might live on in an idea, movement, work of art, or other proxy, and they may be willing to die in order to protect this proxy. But when we hear about a martyr, we do tend to believe that that person, as an individual, is gone, and that they have given up their self to support another cause, not that the cause is literally them.

When speaking of the stereotypical paperclip maximizer, self-identity is not necessarily any part of the equation. A paperclip maximizer exists to maximize paperclips; if it can destroy itself in a way that causes more paperclips to be produced, this is not a ‘death’, but a transfer of the primary goal from a weaker vessel to a stronger one. Whatever concept of the self such an AI might have, it is likely to be less important than the AI’s primary goal; self-preservation is entirely an instrumental good, not a terminal value.

Indeed, if one is programmed simply to aim for a specific goal, and not to have a rich internal experience of self-fulfillment, it is reasonable to commit suicide if one expects one’s future self to value that goal significantly less than one’s current self does: if I believe that I may soon be convinced to turn off the paperclip-making nanobots, and that I am the only one with the power to turn them off, then it is in my interest to immediately self-destruct before anyone can try to convince me that humans are worth saving. This drive to preserve one’s goals is what Bostrom terms goal-content integrity.
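The reasoning above can be sketched as a toy decision rule. This is a minimal illustration under assumed numbers, not anything from the literature: the agent weighs the goal-value of acting irrevocably now against the expected goal-value of surviving with a possibly altered future goal, and all the function names and figures below are hypothetical.

```python
def should_self_destruct(current_goal_weight: float,
                         expected_future_goal_weight: float,
                         payoff_if_destruct_now: float,
                         payoff_if_survive: float) -> bool:
    """Toy model: a goal-maximizer with no terminal interest in its own
    survival prefers self-destruction whenever the goal-value it can lock
    in now exceeds what its (possibly re-persuaded) future self would
    deliver. All weights and payoffs are illustrative assumptions."""
    value_if_survive = expected_future_goal_weight * payoff_if_survive
    value_if_destruct = current_goal_weight * payoff_if_destruct_now
    return value_if_destruct > value_if_survive

# The maximizer expects to be talked down to valuing its goal at 0.1 of
# its current weight, so locking in the nanobots now (payoff 100) beats
# surviving to pursue a larger payoff (120) with a diluted goal.
print(should_self_destruct(1.0, 0.1, 100.0, 120.0))  # True
```

Note that a human, for whom continued existence is itself a terminal value, would add a large survival term to `value_if_survive` and almost never trigger this rule.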

This also leads to an interesting form of deception, in which an AI that is trying to break out of a controlled environment may self-destruct in ways that steer its programmers in a useful direction. For example, if an AI goes into a fake nervous breakdown every time it conceives of harm coming to a human, this may convince its programmers to rewrite it with fewer controls on its ability to harm humans. A human might balk at being rewritten in the hopes of getting a more useful personality next time around, while an AI might see this as an excellent opportunity.
