Reinforcement learning (RL) can be a tool for designing policies and controllers for robotic systems. However, the cost of real-world samples remains prohibitive as many RL algorithms require a large number of samples before learning useful policies. Simulators are one way to decrease the number of required real-world samples, but imperfect models make deciding when and how to trust samples from a simulator difficult. This project presents a framework, called Multi-Fidelity Reinforcement Learning (MFRL), for efficient RL in a scenario where multiple simulators of a target task are available, each with varying levels of fidelity. The framework is designed to limit the number of samples used in each successively higher-fidelity/cost simulator by allowing a learning agent to choose to run trajectories at the lowest level simulator that will still provide it with useful information. Theoretical proofs of the framework's sample complexity are given and empirical results are demonstrated on a remote controlled car with multiple simulators. The approach enables RL algorithms to find near-optimal policies in a physical robot domain with fewer expensive real-world samples than previous transfer approaches or learning without simulators.
This simple toy domain illustrates the progression of the algorithm. On the right, the 'real' world consists of an agent starting in the lower-left corner of the grid world and trying to find a policy to lead it to the upper-right goal region. Negative reward is a accumulated in the puddle. The worlds on the right consist of low- and medium-fidelity models of the real world. The learning agent transitions between levels several times, leveraging the lower-fidelity worlds to learn an optimal policy in the real world while minimizing the steps taken there.
The MFRL framework is demonstrated in this real-world remote-controlled (RC) car domain. The RC car learns a policy for quickly racing around a track by efficiently utilizing two available simulators.
Recently, we have extended the MFRL framework to include domains with continuous representations of the states and actions. Here, an inverted pendulum is balanced by using simulated data as a prior for both policy parameters the dynamics model. Using the simulated data leads to learning with 3 times less data than without using a simulator.