In this article, we explore how to evaluate the performance of embodied AI systems, which are AI models that interact with the physical world through sensors and actuators. We introduce three key metrics: task success rate, average episode length, and policy improvement.
Task success rate measures the fraction of evaluation episodes in which the task is completed successfully within a fixed step budget. It tells us how reliably the model can complete a specific task, such as picking up an object or following a command.
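As a rough illustration, the sketch below computes success rate from a list of per-episode records. The record fields and the 200-step budget are assumptions for the example, not part of any particular framework.

```python
def task_success_rate(episodes, max_steps=200):
    """Fraction of episodes where the task succeeded within the step budget.

    `episodes` is assumed to be a list of dicts like
    {"success": bool, "steps": int}; adapt to your own logging format.
    """
    if not episodes:
        return 0.0
    successes = sum(
        1 for ep in episodes if ep["success"] and ep["steps"] <= max_steps
    )
    return successes / len(episodes)
```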
Average episode length measures the mean number of steps the agent takes per episode. Lower values generally indicate higher efficiency, as the model completes tasks more quickly; the metric is most informative when read alongside the success rate, since episodes that fail and terminate early can also shorten it.
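The same episode records can be reused for this metric. Whether to average over all episodes or only successful ones is a judgment call assumed here, not something the article prescribes.

```python
def average_episode_length(episodes, successful_only=False):
    """Mean number of steps per episode.

    Set `successful_only=True` to measure efficiency only on episodes
    where the task was actually completed.
    """
    if successful_only:
        episodes = [ep for ep in episodes if ep["success"]]
    if not episodes:
        return float("nan")
    return sum(ep["steps"] for ep in episodes) / len(episodes)
```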
Policy improvement compares the current policy to a reference policy, such as an earlier checkpoint or a scripted baseline, measuring how much better the current policy performs on the same evaluation tasks. This metric helps us track the progress of our model over time and identify areas for improvement.
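One simple way to express this (a sketch, since the article does not pin down an exact formula) is the relative gain in success rate over the reference policy, reusing the `task_success_rate` helper from above; other statistics, such as average return, could serve equally well.

```python
def policy_improvement(current_episodes, reference_episodes, max_steps=200):
    """Relative improvement of the current policy's success rate
    over a reference policy evaluated on the same tasks.
    """
    current = task_success_rate(current_episodes, max_steps)
    reference = task_success_rate(reference_episodes, max_steps)
    if reference == 0:
        return float("inf") if current > 0 else 0.0
    return (current - reference) / reference
```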
To train our models, we use a two-stream architecture that processes different modalities, such as RGB images and depth channels, and incorporates late fusion to combine features from both streams. We also introduce skip connections to allow multi-scale information flow and improve the overall performance of our model.
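Below is a minimal PyTorch-style sketch of such a two-stream encoder with late fusion and a pooled skip connection per stream. The layer sizes, the pooling choices, and the exact wiring of the skips are illustrative assumptions, not the precise architecture described here.

```python
import torch
import torch.nn as nn


class TwoStreamEncoder(nn.Module):
    """Two-stream encoder: one CNN per modality (RGB and depth), late fusion
    of the final features, with skip connections carrying mid-level features
    from each stream into the fused representation."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.rgb_stream = self._make_stream(in_channels=3, feat_dim=feat_dim)
        self.depth_stream = self._make_stream(in_channels=1, feat_dim=feat_dim)
        # Late fusion: concatenate final and skip features from both streams.
        self.fuse = nn.Linear(4 * feat_dim, feat_dim)

    @staticmethod
    def _make_stream(in_channels, feat_dim):
        return nn.ModuleDict({
            "early": nn.Sequential(
                nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            ),
            "late": nn.Sequential(
                nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            ),
            "skip": nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
            ),
        })

    def _run_stream(self, stream, x):
        mid = stream["early"](x)       # mid-level (multi-scale) features
        final = stream["late"](mid)    # stream-level feature vector
        skip = stream["skip"](mid)     # skip connection from the mid level
        return final, skip

    def forward(self, rgb, depth):
        rgb_final, rgb_skip = self._run_stream(self.rgb_stream, rgb)
        depth_final, depth_skip = self._run_stream(self.depth_stream, depth)
        fused = torch.cat([rgb_final, rgb_skip, depth_final, depth_skip], dim=1)
        return self.fuse(fused)


# Example usage on a small batch of RGB and depth observations.
encoder = TwoStreamEncoder()
features = encoder(torch.randn(4, 3, 96, 96), torch.randn(4, 1, 96, 96))
```

Keeping the streams separate until the fusion layer lets each modality learn its own low-level filters, while the pooled skips give the fused representation access to earlier, higher-resolution features.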
Together, these metrics give a more complete picture of an embodied AI model's performance than any single score, helping us pinpoint weaknesses and guide further training.