Learning from interaction is a fundamental idea underlying nearly all theories of learning and intelligence. This book explores a computational approach to learning from interaction: we examine designs for solving learning problems and evaluate those designs through mathematical analysis or computational experiments.
RL is a problem, a class of solutions, and a field
Reinforcement learning, like many topics whose names end with "ing," such as machine learning and mountaineering, is simultaneously a problem, a class of solution methods that work well on the problem, and the field that studies this problem and its solution methods. It's essential to keep the three conceptually separate.
The detailed formalization of the *problem* will be given in Chapter 3. The idea is simply to capture the most important aspects of the real problem facing a learning agent interacting over time with its environment: the agent can sense its environment, can take actions, and has a goal. Any method that is well suited to solving this problem is considered a reinforcement learning method.
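To make this sensing-acting-goal loop concrete, here is a minimal sketch of the interaction cycle. The `Environment` and `Agent` classes below are illustrative placeholders of my own, not code from the book.

```python
# Minimal sketch of the agent-environment interaction loop.
# Environment and Agent are hypothetical placeholders, not from the book.

class Environment:
    def reset(self):
        """Return an initial state observation."""
        return 0

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        return 0, 1.0, False


class Agent:
    def act(self, state):
        """Sense the current state and choose an action (the policy)."""
        return 0

    def learn(self, state, action, reward, next_state):
        """Update internal estimates from this piece of experience."""
        pass


env, agent = Environment(), Agent()
state = env.reset()
total_reward = 0.0
for t in range(100):                                # interact over time
    action = agent.act(state)                       # sense and act
    next_state, reward, done = env.step(action)
    agent.learn(state, action, reward, next_state)
    total_reward += reward                          # the goal: maximize cumulative reward
    state = next_state
    if done:
        break
```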
Differences from supervised and unsupervised learning
Supervised learning is learning from a training set of labeled examples provided by a knowledgeable external supervisor. The object of this kind of learning is for the system to extrapolate, or generalize, its responses so that it acts correctly in situations not present in the training set. However, this alone is not adequate for learning from interaction. In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act. In uncharted territory—where one would expect learning to be most beneficial—an agent must be able to learn from its own experience.
Unsupervised learning is about finding structure hidden in collections of unlabeled data. Reinforcement learning, in contrast, tries to maximize a reward signal rather than to find hidden structure.
Exploration and exploitation trade-off
To maximize its reward, the agent prefers the action that has yielded the highest reward among those it has already tried: $a_t = \arg\max_{a \in \{a_i\}_{i=1}^{t-1}} \hat r(a)$, where $\hat r(a)$ is the reward previously observed for action $a$. But to discover such actions, it has to try actions that it has not selected before. On the one hand, the agent has to exploit what it has already experienced in order to obtain reward; on the other hand, it has to explore in order to make better action selections in the future.
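A standard way to balance the two is epsilon-greedy action selection. The sketch below is illustrative only: the 3-armed Gaussian bandit and epsilon = 0.1 are assumptions of mine, not part of the text.

```python
import random

# Epsilon-greedy bandit: a small sketch of the exploration/exploitation trade-off.
# The 3-armed Gaussian bandit and epsilon = 0.1 are illustrative assumptions.

true_means = [0.2, 0.5, 0.8]                 # unknown to the agent
estimates = [0.0, 0.0, 0.0]                  # running estimate of each arm's reward
counts = [0, 0, 0]
epsilon = 0.1

for t in range(1000):
    if random.random() < epsilon:            # explore: try an action at random
        a = random.randrange(len(estimates))
    else:                                     # exploit: be greedy w.r.t. current estimates
        a = max(range(len(estimates)), key=lambda i: estimates[i])
    reward = random.gauss(true_means[a], 1.0)
    counts[a] += 1
    estimates[a] += (reward - estimates[a]) / counts[a]   # incremental sample average

print(estimates)   # the greedy arm's estimate should approach the best true mean
```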
RL explicitly considers the whole problem
Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment. All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments.
Elements of RL
Beyond the agent and the environment, a reinforcement learning system has four main elements:
- policy: defines the learning agent's way of behaving at a given time. It can be roughly thought of as a mapping from perceived states of the environment to actions to be taken.
- reward signal: defines the goal of an RL problem. On each time step, the environment sends the RL agent a single number called the reward. The agent's sole objective is to maximize the total reward it receives over the long run.
- value function: whereas the reward determines the immediate, intrinsic desirability of an environmental state, the value of a state indicates its long-term desirability, taking into account the states that are likely to follow and the rewards available in those states.
The relationship between reward and value: rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. Nevertheless, it is values with which we are most concerned when making and evaluating decisions. Action choices are made based on value judgments. We seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run (see the short sketch after this list).
- model of the environment: something that allows inferences to be made about how the environment will behave. Models are used for planning, which decides on a course of action by considering possible future situations before they are actually experienced. Methods that use models and planning are called model-based methods, as opposed to model-free methods, which are explicitly trial-and-error learners.
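As a small illustration of the reward/value distinction (my own sketch, with made-up numbers and a discount factor of 0.9 that the book introduces formally later): a value is a prediction of cumulative future reward, so a state with modest immediate reward can still be more valuable than one with a large immediate reward.

```python
# Illustrative sketch: rewards are primary, values are predictions of cumulative reward.
# The reward sequences and gamma = 0.9 are made-up example numbers.

def discounted_return(rewards, gamma=0.9):
    """Sum of future rewards, each discounted by how far in the future it arrives."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# A state with a small immediate reward but highly rewarding successor states
# can be worth more than a state with a large immediate reward and nothing after.
print(discounted_return([0.0, 1.0, 1.0, 1.0]))   # 2.439: low reward now, good states follow
print(discounted_return([2.0, 0.0, 0.0, 0.0]))   # 2.0:   high reward now, nothing after

# A policy is roughly a mapping from perceived states to actions:
policy = {"state_A": "go_left", "state_B": "go_right"}
```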