Recent Advances in Reinforcement Learning in Neural Networks
As processing power continues to increase, neural network researchers are making progress in the mathematical modeling of reinforcement learning. Recent advances tackle problems such as meta-parameters, adaptive critics, task distribution, and the importance of memory storage in learning. The researchers behind these advances are beginning to draw theoretical links between the reinforcement learning that neural networks can be shown to perform and the biological systems found in living organisms.
In an attempt to verify a theory proposed in 1973 and cited in their study, Yamakawa and Okabe (1995) built a neural network guided by a neural critic. The critic was designed to strengthen connections within the network in response to reinforcement signals. In its simplest form, the critic applies static criteria that do not change during the course of learning; in its adaptive forms, described below, the critic itself learned from mistakes during the solution routine and modified the parameters of the network.
Yamakawa and Okabe used three different types of neural critics in their research. A nonadaptive critic served as a control: it only avoided past mistakes, without evaluating a mistake any further than a Boolean judgment (Yamakawa & Okabe, 1995, p. 368). The second type was a first-stage adaptive critic, which applied one iteration of mistake evaluation when critiquing the system’s progress. Finally, the researchers used a recursive adaptive critic, which employed recursion and a more elaborate summation to modify the system’s parameters. As Yamakawa and Okabe predicted, the recursive adaptive critic was the most efficient at solving their maze task.
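The difference between a Boolean critique and a graded, adaptive one can be made concrete with a toy sketch. The Python below is purely illustrative: the corridor “maze,” the update rules, and all constants are assumptions of mine, not Yamakawa and Okabe’s formulation; it simply contrasts a critic that only flags past mistakes with one that feeds back a learned, graded reinforcement signal.

```python
# Illustrative sketch only: a "nonadaptive" critic that merely bans past mistakes
# (a Boolean judgment) versus an "adaptive" critic that learns value estimates
# and feeds back a graded, temporal-difference-style reinforcement signal.
import random

N_STATES, N_ACTIONS = 5, 2      # tiny corridor "maze": action 0 moves left, action 1 moves right
GOAL = N_STATES - 1

def step(state, action):
    """Move one cell; reward 1.0 only on reaching the goal cell."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def run(adaptive, episodes=200, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    prefs = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # action preferences (the "network")
    values = [0.0] * N_STATES                             # the adaptive critic's value estimates
    banned = set()                                        # the nonadaptive critic's list of past mistakes
    steps_per_episode = []
    for _ in range(episodes):
        state, steps = 0, 0
        while state != GOAL and steps < 100:
            allowed = [a for a in range(N_ACTIONS) if (state, a) not in banned] or list(range(N_ACTIONS))
            action = max(allowed, key=lambda a: prefs[state][a] + 0.1 * rng.random())
            nxt, reward = step(state, action)
            if adaptive:
                # Graded critique: the error strengthens or weakens the chosen connection.
                td_error = reward + gamma * values[nxt] - values[state]
                values[state] += alpha * td_error
                prefs[state][action] += alpha * td_error
            elif reward == 0.0 and nxt <= state:
                # Boolean critique: simply mark the unproductive move as a mistake to avoid.
                banned.add((state, action))
            state, steps = nxt, steps + 1
        steps_per_episode.append(steps)
    return steps_per_episode

print("nonadaptive critic, steps in final episode:", run(adaptive=False)[-1])
print("adaptive critic,    steps in final episode:", run(adaptive=True)[-1])
```

Both critics eventually solve the corridor; the point of the sketch is the difference in what each one feeds back to the network.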
Extending this critic model, the researchers proposed that the human brain uses a similar critic in reinforcement learning: “The adaptive recursive critic can be converted into a conventional neural network model, so we can compare this critic with the part of the brain that controls the sense of values (maybe around the limbic system, especially the amygdala.)” (Yamakawa & Okabe, 1995, p. 373). In short, Yamakawa and Okabe used computerized neural networks to test the mathematical power of a theory of how the brain learns via reinforcement.
Similar in spirit to the critic, meta-learning uses meta-parameters to define the parameters of a neural network. Schweighofer and Doya (2003) suggested a reinforcement learning method designed to solve a Markov decision problem, a mathematical task in which an agent must work out an optimal policy from the rewards it receives. The system used in the study involves three levels: the neural network itself, the parameters of the neural network, and the parameters of those parameters. Meta-learning means evolving these higher-level parameters in the course of solving a problem.
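A minimal sketch may help fix this hierarchy of levels. The two-armed bandit task, the particular meta-parameters (a learning rate, a softmax temperature, and a discount factor), and all the numbers below are my own illustrative assumptions rather than Schweighofer and Doya’s implementation; the sketch only shows parameters being updated under the control of meta-parameters, with a placeholder for the meta-meta-parameters discussed next.

```python
# Illustrative three-level hierarchy on a toy two-armed bandit (assumed example).
import math
import random

rng = random.Random(0)

# Level 1: the network's own parameters (here, just two action values).
q_values = [0.0, 0.0]

# Level 2: meta-parameters that shape how level 1 is updated.
meta = {"alpha": 0.1,   # learning rate
        "beta": 1.0,    # softmax inverse temperature (controls exploration)
        "gamma": 0.9}   # discount factor (unused in this one-step bandit, listed for completeness)

# Level 3: meta-meta-parameters that would govern how level 2 itself evolves
# (for example, time constants in the spirit of the T discussed below); not used in this step.
meta_meta = {"tau_mid": 50.0, "tau_long": 500.0}

def softmax_choice(values, beta):
    """Pick an action with probability proportional to exp(beta * value)."""
    weights = [math.exp(beta * v) for v in values]
    r = rng.random() * sum(weights)
    return 0 if r < weights[0] else 1

for _ in range(1000):
    action = softmax_choice(q_values, meta["beta"])
    reward = 1.0 if rng.random() < (0.8 if action == 1 else 0.2) else 0.0
    # The level-1 update is parameterized entirely by level-2 quantities.
    q_values[action] += meta["alpha"] * (reward - q_values[action])

print("learned action values:", [round(v, 2) for v in q_values])
```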
The researchers proposed an equation to govern how the meta-parameters evolved during the course of the task. In every case, “the algorithm did not only find appropriate values of the meta-parameters, but also controlled the time course of these meta-parameters in a dynamic, adaptive manner” (Schweighofer & Doya, 2003, p. 7). That “time course” is shaped by one of the meta-meta-parameters, T, which Schweighofer and Doya suggest is effectively a genetic constant within an organism, even though it differs between organisms: “As the algorithm is extremely robust regarding the T meta-meta-parameter, we can assume that its value is genetically determined, and perhaps related to the wake-sleep cycle of the animal” (Schweighofer & Doya, 2003, p. 8).
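The following sketch illustrates the general flavor of such a dynamically controlled time course. It is not the paper’s actual update rule: the perturbation scheme, the toy reward function, and the exact roles given to A and T are assumptions of mine, meant only to show how a meta-meta-level time constant and step size could shape a meta-parameter’s trajectory.

```python
# Schematic sketch (assumed, not Schweighofer and Doya's equation): a meta-parameter
# is stochastically perturbed, rewards are compared against a slow baseline whose
# time constant stands in for "T", and a step size standing in for "A" controls
# how strongly the meta-parameter drifts toward perturbations that beat the baseline.
import random

rng = random.Random(1)

beta = 0.5            # the meta-parameter whose time course is being controlled
baseline = 0.0        # slow running average of reward
T = 50.0              # time constant of the baseline (standing in for the "T" meta-meta-parameter)
A = 1.0               # meta-level step size (standing in for the "A" meta-meta-parameter)

for t in range(5000):
    sigma = rng.gauss(0.0, 0.2)                    # stochastic perturbation of the meta-parameter
    # Toy environment: expected reward peaks when the perturbed meta-parameter is near 2.
    reward = 1.0 - 0.3 * (beta + sigma - 2.0) ** 2 + rng.gauss(0.0, 0.05)
    baseline += (reward - baseline) / T            # the baseline forgets with time constant T
    # Drift the meta-parameter toward perturbations that outperform the baseline.
    beta += A * (reward - baseline) * sigma
    beta = max(0.05, beta)                         # keep the meta-parameter positive

print("adapted meta-parameter:", round(beta, 2))   # settles near the rewarding value of 2
```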
The system as a whole, however, proved more sensitive to variation in the other meta-meta-parameter, A. Schweighofer and Doya leave open the possibility that a meta-meta-algorithm would be needed to control this variable: “As the value of the A parameter is more sensitive, it is not impossible that a meta-meta-learning algorithm operates to tune it” (Schweighofer & Doya, 2003, p. 8). The study ran the algorithm with a variety of randomly seeded values of A and T in order to show that it would work across a range of genetic differences.
Beyond the learning techniques above, neural networks run into a memory problem during reinforcement learning. Networks designed to simulate reinforcement learning behavior often suffer from path interference: the inability of a network to remember, or store, previously learned relationships between its inputs and outputs. In their study, Bosman, van Leeuwen, and Wemmenhove (2003) incorporated a memory function into a neural network in order to take advantage of previously learned input-output relationships.
Their combination of a memory system with a reinforcement learning algorithm picks up where an earlier study left off. Bak and Chialvo (2001) used a neural network to solve a reinforcement task, but their model breaks down when the number of neurons is small because of path interference: “As the number of neurons in the hidden layer decreases, learning, at a certain moment, becomes impossible: path interference is the phenomenon which causes this effect” (Bosman et al., 2003, p. 3). The newer study overcomes this problem with the addition of a memory mechanism, which “has a positive influence on the learning time of the neural net” (Bosman et al., 2003, p. 3).
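A toy reconstruction makes path interference concrete. The winner-take-all network and the depress-on-mistake rule below are a simplification of mine in the spirit of the minibrain model, not the authors’ code, and the memory mechanism of Bosman et al. is not reproduced; the sketch only shows how learning fails outright once the hidden layer is too small.

```python
# Toy winner-take-all network (assumed simplification): when the output is wrong,
# the synapses along the active path are depressed. With enough hidden units the
# map is learned; with too few, distinct outputs cannot coexist and the
# depressions keep overwriting one another (path interference).
import random

N_IO = 4                                   # number of inputs and outputs; target map is the identity

def train(n_hidden, max_steps=20000, seed=0):
    rng = random.Random(seed)
    w1 = [[rng.random() for _ in range(n_hidden)] for _ in range(N_IO)]   # input -> hidden weights
    w2 = [[rng.random() for _ in range(N_IO)] for _ in range(n_hidden)]   # hidden -> output weights

    def output(i):
        h = max(range(n_hidden), key=lambda j: w1[i][j])   # winner-take-all, first layer
        o = max(range(N_IO), key=lambda k: w2[h][k])       # winner-take-all, second layer
        return h, o

    for step in range(max_steps):
        i = rng.randrange(N_IO)
        h, o = output(i)
        if o != i:                                         # mistake: depress the active path
            w1[i][h] -= rng.random()
            w2[h][o] -= rng.random()
        if all(output(x)[1] == x for x in range(N_IO)):    # is the whole map currently correct?
            return step + 1
    return None                                            # learning failed (path interference)

print("steps to learn with 8 hidden units:", train(n_hidden=8))
print("steps to learn with 2 hidden units:", train(n_hidden=2))
```

With eight hidden units the depressions eventually settle on a consistent mapping; with two, distinct outputs cannot coexist and the routine reports failure.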
Proposing that memory feedback is crucial for learning, the researchers suggest that this combination of memory and learning “might be biologically realizable. Without the addition of any feedback-signal, learning of prescribed input-output relations—whether in reality or in a model—is, of course, impossible” (Bosman et al., 2003, p. 1). They predict that biological systems use a memory store to enhance reinforcement learning.
Although processing speed continues to improve, it remains beneficial to distribute a neural network across several modules. A problem arises, however, in dividing a neural network’s task into modules, each responsible for its own sub-task, and then recombining the sub-tasks into “the composite policy for the entire task” (Samejima et al., 2003, p. 1). Samejima, Doya, and Kawato (2003) proposed a technique for distributing the reinforcement reward to the various modules using a credit assignment scheme.
A neural network that is not subdivided into modules receives a single reinforcement reward on each iteration. When a task is subdivided, the sub-tasks must also respond to that reward, but the reward must be summarized and tailored to each sub-task. As the study puts it, “it is necessary to design appropriate ‘pseudo rewards’ for sub-tasks” (Samejima et al., 2003, p. 1).
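One simple way to picture a pseudo reward is to credit each module in proportion to how well it accounted for what just happened. The sketch below is a generic illustration under my own assumptions (a softmax “responsibility” computed from prediction errors); it is not the modular-reward equation of Samejima et al.

```python
# Generic illustration (assumed, not the MMRL equations): each module predicts the
# outcome, a softmax over prediction errors gives each module a responsibility,
# and the single reinforcement signal is credited in proportion to it.
import math

def responsibilities(prediction_errors, sharpness=5.0):
    """Softmax over negative squared prediction errors: better predictors get more credit."""
    scores = [math.exp(-sharpness * e * e) for e in prediction_errors]
    total = sum(scores)
    return [s / total for s in scores]

errors = [0.1, 0.8, 0.4]           # hypothetical prediction errors for three modules
global_td_error = 1.0              # the single reinforcement signal for this step

lam = responsibilities(errors)
pseudo_rewards = [l * global_td_error for l in lam]
print("responsibilities:", [round(l, 2) for l in lam])
print("pseudo rewards:  ", [round(p, 2) for p in pseudo_rewards])
```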
Using two different tasks, the researchers demonstrated the efficiency of their reward distribution technique in reinforcement learning settings. First, they designed a target-pursuit task divided into four sub-tasks, each subject to its own pseudo reward during the reinforcement stage of the task. The modular reward system outperformed a simpler weighted reward system: “The MMRL (multiple-model based reinforcement learning) with modular reward achieved near-optimal policy faster than the MMRL with weighted total TD (temporal difference) error” (Samejima et al., 2003, p. 6). The researchers also devised a pendulum-swinging task, in which a neural network had to find the best way to swing the pendulum given an arbitrarily assigned torque value. The results showed similar success for their inter-module credit system of reinforcement: “We can see that the value was more effectively propagated with the backing-up modular reward equation, which enabled faster learning” (Samejima et al., 2003, p. 8).
The researchers thus implemented a credit-based system for distributing reinforcement across sub-routines, one that allows a neural network’s task to be divided into sub-tasks without sacrificing the potency and accuracy of the reinforcement reward: “We introduced a new concept of modular reward, which enables the learning of modular policies directed toward the optimization of an entire task” (Samejima et al., 2003, p. 8).
These developments in network modularization, meta-learning, higher-level critiquing, and memory mechanisms for reinforcement learning represent major advances in our understanding of how animal brains might learn via reinforcement.
References
Bak, P., & Chialvo, D. (2001). Physical Review E, 63, 031912.
Bosman, R., van Leeuwen, W., & Wemmenhove, B. (2003). Combining Hebbian and reinforcement learning in a minibrain model. Neural Networks (in press).
Samejima, K., Doya, K., & Kawato, M. (2003). Inter-module credit assignment in modular reinforcement learning. Neural Networks (in press).
Schweighofer, N., & Doya, K. (2003). Meta-learning in reinforcement learning. Neural Networks, 16, 5-9.
Yamakawa, H., & Okabe, Y. (1995). A neural network-like critic for reinforcement learning. Neural Networks, 8, 363-373.