CaT: Coaching a Teachable Student

Abstract

We propose a novel knowledge distillation framework for effectively teaching a sensorimotor student agent to drive from the supervision of a privileged teacher agent. Current distillation for sensorimotor agents methods tend to result in suboptimal learned driving behavior by the student, which we hypothesize is due to inherent differences between the input, modeling capacity, and optimization processes of the two agents. We develop a novel distillation scheme that can address these limitations and close the gap between the sensorimotor agent and its privileged teacher. Our key insight is to design a student which learns to align their input features with the teacher’s privileged Bird’s Eye View (BEV) space. The student then can benefit from direct supervision by the teacher over the internal representation learning. To scaffold the difficult sensorimotor learning task, the student model is optimized via a student-paced coaching mechanism with various auxiliary supervision. We further propose a high-capacity imitation learned privileged agent that surpasses prior privileged agents in CARLA and ensures the student learns safe driving behavior. Our proposed sensorimotor agent results in a robust image-based behavior cloning agent in CARLA, improving over current models by over 20.6% in driving score without requiring LiDAR, historical observations, ensemble of models, on-policy data aggregation or reinforcement learning.

Video

Motivation

To ease the challenging sensorimotor agent training task, recent approaches decompose the task into stages, by first training a privileged network with complete knowledge of the world and distilling its knowledge into a less capable student network. However, current distillation methods for sensorimotor agents result in suboptimal driving behavior due to inherent differences between the inputs, modeling capacity, and optimization processes of the two agents.

Chen, et al. CoRL 2020

Method

We develop a novel deep distillation scheme that can address these limitations and close the gap between the sensorimotor agent and its privileged teacher. We achieve this by

Effective Teacher: We propose to incorporate explicit safety-aware cues into the BEV space that facilitate a surprisingly effective teacher agent design. We demonstrate our learned agent to match expert- level decisionmaking.
Teachable Student via Alignment: An IPM-based transformer alignment module can facilitate direct distillation of most of the teacher’s features and better guide the student learning process.
Student-paced Coaching: A coaching mechanism for managing difficult samples can scaffold knowledge and lead to improved model optimization by better considering the ability of the student.

Effective Teacher

Given the low performance of prior privileged agents, the teacher could benefit from more explicit safety-driven cues in the BEV. We propose to add two types of channels of predicted agents’ future and entity attention. We first utilize a kinematics bicycle model to predict future trajectories of dynamic objects, which enables us to iteratively predict and represent short-term future position, orientation, and speed of agent. Then, we encode an explicit attention channel for highlighting potential future infractions. Our boosted teacher agent even improves over the rule-based expert.

Teachable student via Alignment

Due to the difference between inputs and modeling capacities, it can be difficult to align the image-based student features and output with the BEV-based privileged teacher. Therefore, we proposed an IPM-based transformer alignment module that can facilitate direct distillation of most of the teacher’s features and better guide the student learning process.

Student-paced Coaching

To better consider the learning ability of the student, we gradually coach the student with a student-paced coaching mechanism. Specifically, we interpolate between teacher features and student features on samples that the student performed poorly (i.e. higher loss). This effectively adjusts the learning rate in a sample-selective manner, which aims to stabilize training by reducing the difficulty when the student is unable to perform the optimal action.

Model Architecture of CaT

Our proposed CaT framework enables highly effective knowledge transfer between a privileged teacher and a sensorimotor (i.e., image-based) student. Specifically, we first sample queries using a spatial parameterization of the BEV space and process them with a self-attention module. Then, a deformable cross-attention module is applied to populate the student’s BEV features. The residual blocks following the alignment module can consequently facilitate knowledge transfer via direct distillation of most of the teacher’s features. Our optimization objective for guiding the distillation process is a weighted sum over both distillation and auxiliary tasks, including output distillation loss, feature distillation loss, segmentation loss and command prediction loss.

Result

We present our closed-loop evaluation results on the Longest-6 Benchmark in CARLA. As illustrated in the table below, CaT is able to achieve state of the art performance among all prior agents, including lidar based approaches. Moreover, we note that our privileged BEV agent learned by imitation with history and desired path, agent forecast, and entity attention even outperforms the rule-based expert.

Qualitative Examples

BibTeX

@inproceedings{zhang2023coaching,
        title={Coaching a Teachable Student},
        author={Zhang, Jimuyang and Huang, Zanming and Ohn-Bar, Eshed},
        booktitle={CVPR},
        year={2023}
}

Acknowledgments

We thank the Red Hat Collaboratory for supporting this research.