TY - GEN
T1 - Building relational world models for reinforcement learning
AU - Walker, Trevor
AU - Torrey, Lisa
AU - Shavlik, Jude
AU - Maclin, Richard
PY - 2008
Y1 - 2008
N2 - Many reinforcement learning domains are highly relational. While traditional temporal-difference methods can be applied to these domains, they are limited in their capacity to exploit the relational nature of the domain. Our algorithm, AMBIL, constructs relational world models in the form of relational Markov decision processes (MDPs). AMBIL works backwards from collections of high-reward states, utilizing inductive logic programming to learn their preimages: logical definitions of the regions of state space that lead to the high-reward states via some action. These learned preimages are chained together to form an MDP that abstractly represents the domain. AMBIL estimates the reward and transition probabilities of this MDP from past experience. Since our MDPs are small, AMBIL uses value iteration to quickly estimate the Q-values of each action in the induced states and determine a policy. AMBIL is able to employ complex background knowledge and supports relational representations. Empirical evaluation on both synthetic domains and a sub-task of the RoboCup soccer domain shows significant performance gains compared to standard Q-learning.
AB - Many reinforcement learning domains are highly relational. While traditional temporal-difference methods can be applied to these domains, they are limited in their capacity to exploit the relational nature of the domain. Our algorithm, AMBIL, constructs relational world models in the form of relational Markov decision processes (MDPs). AMBIL works backwards from collections of high-reward states, utilizing inductive logic programming to learn their preimages: logical definitions of the regions of state space that lead to the high-reward states via some action. These learned preimages are chained together to form an MDP that abstractly represents the domain. AMBIL estimates the reward and transition probabilities of this MDP from past experience. Since our MDPs are small, AMBIL uses value iteration to quickly estimate the Q-values of each action in the induced states and determine a policy. AMBIL is able to employ complex background knowledge and supports relational representations. Empirical evaluation on both synthetic domains and a sub-task of the RoboCup soccer domain shows significant performance gains compared to standard Q-learning.
UR - http://www.scopus.com/inward/record.url?scp=40249113257&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=40249113257&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-78469-2_27
DO - 10.1007/978-3-540-78469-2_27
M3 - Conference contribution
AN - SCOPUS:40249113257
SN - 3540784683
SN - 9783540784685
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 280
EP - 291
BT - Inductive Logic Programming - 17th International Conference, ILP 2007, Revised Selected Papers
T2 - 17th International Conference on Inductive Logic Programming, ILP 2007
Y2 - 19 June 2007 through 21 June 2007
ER -