Paper at ICLR 2024 Poster #17719
We introduce Skill Machines: finite-state machines that encode a sequence of skill compositions to satisfy temporal-logic tasks with guarantees. Starting from a task specified in linear temporal logic (LTL), we plan over its reward machine and then act greedily with respect to composed value functions, obtaining zero-shot policies that are near-optimal and satisficing. We further show how these policies seed few-shot learning of optimal policies with any off-policy RL algorithm.
Imagine a service robot in an office world: Deliver mail until none is left, then deliver coffee while people are present, then patrol rooms A-B-C-D-A, and never break a decoration. We want tasks to be human-understandable and solvable without relearning each time a new specification is presented.
We use a formal language such as LTL with familiar operators: $\neg$ (NOT), $\land$ (AND), $\mathsf{X}$ (Next), $\mathsf{F}$ (Eventually), $\mathsf{G}$ (Always), etc.
Example: Navigate to a button and then to a cylinder while never entering blue regions. In LTL over the high-level events $\mathcal{P}=\{\text{button},\text{cylinder},\text{blue}\}$ in a Safety-Gym-like world: $\mathsf{F}(\text{button}\;\land\;\mathsf{X}\,(\mathsf{F}\;\text{cylinder}))\;\land\;\mathsf{G}\,\neg\text{blue}$.
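As a concrete illustration, the formula above can be written as a small expression tree. The sketch below uses our own minimal Python classes (not an API from the paper) to encode it.

```python
from dataclasses import dataclass

# Minimal LTL syntax tree, for illustration only; the class names are ours.
@dataclass(frozen=True)
class Prop:
    name: str

@dataclass(frozen=True)
class Not:
    arg: object

@dataclass(frozen=True)
class And:
    left: object
    right: object

@dataclass(frozen=True)
class Next:        # X
    arg: object

@dataclass(frozen=True)
class Eventually:  # F
    arg: object

@dataclass(frozen=True)
class Always:      # G
    arg: object

# F(button AND X(F cylinder)) AND G(NOT blue)
task = And(
    Eventually(And(Prop("button"), Next(Eventually(Prop("cylinder"))))),
    Always(Not(Prop("blue"))),
)
```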
LTL specifications can be compiled into reward machines (finite-state automata whose transitions carry rewards). Planning (e.g., value iteration) over the reward machine reveals which skill to execute at each automaton node.
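A minimal sketch of this planning step is shown below, assuming a hand-written toy reward machine for the button-then-cylinder task; in the actual pipeline the machine would be compiled from the LTL formula, and the data structures here are our own.

```python
# Value iteration over a toy reward machine (RM).
# delta[u][p] = (next RM state, reward) when event p is achieved in RM state u.
delta = {
    "u0": {"button":   ("u1", 0.0),  "blue": ("fail", -1.0)},
    "u1": {"cylinder": ("goal", 1.0), "blue": ("fail", -1.0)},
}
terminal = {"goal", "fail"}
gamma = 0.9

V = {u: 0.0 for u in list(delta) + list(terminal)}
for _ in range(100):                       # iterate to (approximate) convergence
    for u, edges in delta.items():
        # Value of choosing to achieve event p next from RM state u.
        V[u] = max(r + gamma * V[u2] for p, (u2, r) in edges.items())

# The greedy choice at each RM node tells us which high-level skill to run there.
best_skill = {u: max(edges, key=lambda p: edges[p][1] + gamma * V[edges[p][0]])
              for u, edges in delta.items()}
print(best_skill)   # e.g. {'u0': 'button', 'u1': 'cylinder'}
```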
If an agent only learns atomic skills (e.g., go to button, go to cylinder), it cannot immediately solve tasks such as go to button while avoiding blue. In general, spatial goal-reaching skills alone cannot express constraints that must hold throughout a task.
We augment the state of task primitives with constraints $\mathcal{C}$: propositions that must remain True or False over an entire episode. This lets us reuse the same spatial skills under always-avoid or always-maintain requirements.
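A rough sketch of this augmentation, with hypothetical helper names of our own: the raw observation is paired with flags recording whether each constraint has held so far in the episode.

```python
# Illustration only; names and interfaces here are ours, not the paper's.
CONSTRAINTS = ("blue",)   # propositions that must remain False for the whole episode

def augment(obs, true_props, ok_so_far):
    """Pair the raw observation with flags saying whether each constraint still holds.

    true_props: set of propositions true in the current low-level state.
    ok_so_far:  tuple of booleans, one per constraint, carried over from the previous step.
    """
    ok_now = tuple(ok and (c not in true_props)
                   for ok, c in zip(ok_so_far, CONSTRAINTS))
    return (obs, ok_now), ok_now

# Inside an episode: start with ok = (True,) * len(CONSTRAINTS), then every step:
# state, ok = augment(obs, labelling_fn(obs), ok)
```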
Unlike regular value functions, world value functions (WVFs) are composable: we can compute the value function of any Boolean expression over task primitives without further learning.
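The composition amounts to pointwise operations on the learned value functions. The sketch below follows the standard min/max/negation rules for Boolean skill composition, with random placeholder arrays standing in for learned WVFs; shapes and names are illustrative assumptions.

```python
import numpy as np

# Placeholders for learned WVFs, indexed by (state, goal, action).
Q_button   = np.random.rand(10, 5, 4)   # "reach button"
Q_cylinder = np.random.rand(10, 5, 4)   # "reach cylinder"
Q_MAX = np.ones_like(Q_button)          # WVF of the maximally rewarding task (placeholder)
Q_MIN = np.zeros_like(Q_button)         # WVF of the minimally rewarding task (placeholder)

def AND(q1, q2):      # conjunction: pointwise minimum
    return np.minimum(q1, q2)

def OR(q1, q2):       # disjunction: pointwise maximum
    return np.maximum(q1, q2)

def NOT(q):           # negation: reflect between the max- and min-task WVFs
    return (Q_MAX + Q_MIN) - q

# "button AND NOT cylinder", without any further learning:
Q_composed = AND(Q_button, NOT(Q_cylinder))

# Acting greedily in state s: maximise over goals and actions.
s = 3
goal_idx, action_idx = np.unravel_index(np.argmax(Q_composed[s]), Q_composed[s].shape)
```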
A Skill Machine (SM) is a finite-state controller that encodes the sequence of composed skill primitives needed to satisfy a temporal logic task. We obtain SMs by compiling the LTL specification into its reward machine, planning over that machine, and labelling each automaton node with the Boolean composition of skill primitives to execute there.
The resulting policy executes skills greedily with respect to the composed WVFs at each node. This yields zero-shot satisfaction with near-optimal performance under mild reachability assumptions.
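Schematically, zero-shot execution looks like the loop below. The `sm`, `composed_wvf`, and `env` interfaces are assumptions made for illustration, not the paper's API.

```python
# Zero-shot execution with a skill machine (helper names are ours): at each
# automaton node we act greedily w.r.t. that node's composed WVF, and we move
# between nodes when the labelled events fire.
def run_skill_machine(env, sm, composed_wvf, max_steps=1000):
    obs = env.reset()
    u = sm.initial_state                       # current SM / reward-machine node
    for _ in range(max_steps):
        q = composed_wvf[u]                    # WVF composition attached to node u
        action = q.greedy_action(obs)          # argmax over goals and actions
        obs, _, done, info = env.step(action)
        u = sm.transition(u, info["events"])   # advance on the observed propositions
        if done or u in sm.accepting_states:
            break
    return u in sm.accepting_states
```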
Once the SM is built, the agent can immediately solve new tasks such as: Go to a hazard, then to a cylinder, then again to a cylinder, while avoiding hazards. No additional learning is required.
Zero-shot policies from SMs provide strong initial performance. To reach task-specific optimality when those assumptions are violated, we continue learning few-shot with any off-policy method (e.g., Q-learning), using the SM as guidance.
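One simple realisation of this idea (not necessarily the paper's exact scheme) is epsilon-greedy tabular Q-learning in which the exploratory action comes from the skill machine rather than a uniform draw; the `env` and `sm_action` interfaces below are assumed.

```python
import random
from collections import defaultdict

# Few-shot improvement sketch: ordinary off-policy Q-learning whose exploratory
# action is supplied by the zero-shot SM policy.
def ql_sm(env, sm_action, episodes=500, alpha=0.1, gamma=0.99, eps=0.3):
    Q = defaultdict(float)                       # Q[(state, action)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:
                a = sm_action(s)                 # guidance from the SM instead of random exploration
            else:
                a = max(range(env.n_actions), key=lambda b: Q[(s, b)])
            s2, r, done, _ = env.step(a)
            target = r if done else r + gamma * max(Q[(s2, b)] for b in range(env.n_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```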
In office world tasks (e.g., deliver coffee and never break a decoration), QL-SM improves upon the zero-shot SM and reaches optimal performance; SM and QL-SM outperform representative baselines.