This paper proposes a method for predicting human motion based on social affordances, the possibilities for action that a given situation offers. It introduces the concept of a "social affordance carrier," an object or humanoid that represents the potential contacts between a human actor and their environment. The authors use a graph neural network to model the relative motion between joints and a 4D Transformer autoencoder to generate new motion sequences autoregressively.
The paper first defines the social affordance representation (SAR) as a key component of human motion forecasting. SAR encodes information about the possible contacts between a human actor and their environment, which is essential for generating realistic and diverse motions. To represent these potential contacts, the authors propose a carrier object or humanoid, which they call the social affordance carrier (SAC).
The SAC enables the model to capture the relationships among the human actor, the environment, and the potential contacts. The authors explain that when a human actor interacts with an object or another person, the actor comes into contact with that object or person either directly or indirectly. By choosing an appropriate carrier, the model can represent this direct or potential contact information in the interaction.
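One way to picture carrier-centered contact information is the minimal sketch below. It is not the authors' formulation: the function name `affordance_features`, the nearest-joint rule, and the `contact_radius` threshold are all assumptions made for illustration. For each point on the carrier, it records the offset to the closest actor joint, the distance, and whether that distance counts as direct contact.

```python
import math

def affordance_features(carrier_points, actor_joints, contact_radius=0.05):
    """Hypothetical per-carrier-point features: (offset, distance, in_contact)."""
    features = []
    for point in carrier_points:
        # Find the actor joint closest to this carrier point.
        nearest = min(actor_joints, key=lambda joint: math.dist(point, joint))
        # Offset points from the carrier point toward that joint.
        offset = tuple(j - p for j, p in zip(nearest, point))
        distance = math.dist(point, nearest)
        # Assumed rule: "direct contact" means within contact_radius.
        features.append((offset, distance, distance <= contact_radius))
    return features

# Two carrier points and two actor joints (made-up coordinates, in meters).
carrier = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
actor = [(0.0, 0.0, 0.03), (2.0, 0.0, 0.0)]
feats = affordance_features(carrier, actor)
```

In this toy case the first carrier point is in direct contact with the actor, while the second only carries a potential contact one meter away.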
Next, the paper turns to the technical details of the method. The authors represent the human actor's skeleton as a graph and apply a graph neural network (GNN) to it, which efficiently models the relative motion between different joints. They then propose a 4D Transformer autoencoder, composed of a motion encoder and a motion decoder, to generate new motion sequences autoregressively.
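The idea of a skeleton graph and relative joint motion can be illustrated with a small sketch. The five-joint skeleton, the `BONES` list, and both function names are hypothetical, and the code computes a simple hand-written feature rather than anything learned by a GNN: for each bone, it measures how the child-minus-parent offset changes between two frames, which removes global translation.

```python
# Hypothetical 5-joint skeleton: each bone is a (parent, child) pair of joint indices.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def build_adjacency(num_joints, bones):
    """Undirected adjacency list of the skeleton graph."""
    adjacency = {joint: [] for joint in range(num_joints)}
    for parent, child in bones:
        adjacency[parent].append(child)
        adjacency[child].append(parent)
    return adjacency

def relative_motion(prev_pose, next_pose, bones):
    """Per-bone change of the child-minus-parent offset between two frames."""
    changes = []
    for parent, child in bones:
        before = [c - p for p, c in zip(prev_pose[parent], prev_pose[child])]
        after = [c - p for p, c in zip(next_pose[parent], next_pose[child])]
        changes.append(tuple(a - b for a, b in zip(after, before)))
    return changes

adjacency = build_adjacency(5, BONES)

# The whole body shifts by +1 in x; joint 2 additionally moves +1 in y.
prev_pose = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0), (-1.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
next_pose = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 3.0, 0.0), (0.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
changes = relative_motion(prev_pose, next_pose, BONES)
```

Only the bone ending at joint 2 reports a nonzero change, because the shared translation cancels out in the relative offsets.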
The motion encoder takes the canonical complete social affordance as input and produces a latent embedding of it. The motion decoder is conditioned on this latent embedding and takes the past motions of the reactor (the agent responding to the actor) as input to generate new reaction motions autoregressively. In other words, the model generates new motion sequences based on the current state of the environment and the possible contacts between the human actor and their surroundings.
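The autoregressive loop described here can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's decoder: `rollout` and `toy_decoder_step` are hypothetical names, and the toy step function stands in for a learned Transformer decoder by simply moving the last pose halfway toward the condition vector.

```python
def rollout(decoder_step, condition, history, num_steps):
    """Generate num_steps new frames, feeding each output back in as input."""
    frames = list(history)
    for _ in range(num_steps):
        # The decoder sees the fixed condition plus every frame produced so far.
        next_frame = decoder_step(condition, frames)
        frames.append(next_frame)
    # Return only the newly generated frames.
    return frames[len(history):]

def toy_decoder_step(condition, frames):
    # Stand-in for a learned decoder: move the last pose halfway toward the condition.
    last = frames[-1]
    return [x + 0.5 * (c - x) for x, c in zip(last, condition)]

generated = rollout(
    toy_decoder_step,
    condition=[1.0, 0.0],   # plays the role of the encoder's latent embedding
    history=[[0.0, 0.0]],   # the reactor's past motion (one frame here)
    num_steps=3,
)
print(generated)  # [[0.5, 0.0], [0.75, 0.0], [0.875, 0.0]]
```

The structural point is that each generated frame is appended to the history before the next step, so later predictions depend on earlier ones while the condition stays fixed.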
The authors evaluate the method experimentally on several datasets. They report that their approach outperforms existing methods in both the quality and the diversity of the generated motions. The paper also analyzes the contributions of the method's individual components, giving insight into how each improves motion forecasting accuracy.
In conclusion, this paper advances human motion forecasting by incorporating the social affordance representation into the model. By using a carrier object or humanoid to represent potential contacts between a human actor and their environment, the proposed method can generate more realistic and diverse motions than previous approaches, as supported by the experimental results and the component analysis. This work has implications for applications such as robotics, computer animation, and virtual reality, where realistic human motion is critical.
Computer Science, Computer Vision and Pattern Recognition