What If the World Were Different? Gradient-based Exploration for New Optimal Policies


Planning under uncertainty assumes a model of the world that specifies the probabilistic effects of the actions of an agent in terms of changes of the state. Given such model, planning proceeds to determine a policy that defines for each state the choice of action that the agent should follow in order to maximize a reward function. In this work, we realize that the world can be changed in more ways than those possible by the execution of the agent’s repertoire of actions. These additional configurations of the world may allow new policies that let the agent accumulate even more reward than that possible by following the optimal policy of the original world. We introduce and formalize the problem of planning while considering these additional possible worlds. We then present an approach that models feasible changes to the world as modifications to the probability transition function, and show that the problem of computing the configuration of the world that allows the most rewarding optimal policy can be formulated as a constrained optimization problem. Finally, we contribute a gradient-based algorithm for solving this optimization problem. Experimental evaluation shows the effectiveness of our approach in multiple problems of practical interest.

Global Conference on Artificial Intelligence (GCAI 2018)