In this article, we propose a new method called Link Block to improve the association of objects in ultra-long videos. Our approach is based on a graph neural network (GNN) that learns to encode the semantic context of nodes and edges in the video frames. The edge embeddings produced by the GNN are fed to a multi-layer perceptron (MLP), which outputs a score measuring the similarity between the two nodes an edge connects. We use a threshold to limit the number of associations: if the similarity score falls below the threshold, we start a new trajectory instead of linking the object to an existing one.
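As a rough illustration of the scoring and thresholding step, the sketch below is a minimal pure-Python version and not the implementation from the paper. The two-layer MLP shape, the sigmoid output, the greedy one-to-one matching, and the names mlp_score and associate are all assumptions made for the example. The text above only states that edge embeddings pass through an MLP and that a score below the threshold starts a new trajectory.

```python
import math


def mlp_score(edge_embedding, w1, b1, w2, b2):
    """Hypothetical two-layer MLP: ReLU hidden layer, sigmoid output in (0, 1)."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, edge_embedding)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def associate(detections, trajectories, scores, threshold=0.5):
    """Assign each detection to a trajectory id.

    scores maps (detection, trajectory_id) -> similarity score.
    Greedy matching (an assumption): take pairs in descending score order,
    keep a pair only if its score reaches the threshold and neither side
    is already matched. Unmatched detections start new trajectories.
    """
    assignment = {}
    used = set()
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (det, traj), score in ranked:
        if score < threshold:
            break  # every remaining pair is below the threshold
        if det in assignment or traj in used:
            continue
        assignment[det] = traj
        used.add(traj)

    next_id = max(trajectories, default=-1) + 1
    for det in detections:
        if det not in assignment:
            assignment[det] = next_id  # start a new trajectory
            next_id += 1
    return assignment


# Toy weights and scores, chosen by hand for illustration only.
example_score = mlp_score([1.0, 0.0], [[1.0, -1.0]], [0.0], [2.0], -1.0)
example_assignment = associate(
    ["a", "b"],
    [0, 1],
    {("a", 0): 0.9, ("a", 1): 0.2, ("b", 0): 0.6, ("b", 1): 0.3},
)
```

In this toy run, detection "a" joins trajectory 0, while "b" has no free trajectory scoring at or above the threshold and therefore opens a new one. A real tracker would likely replace the greedy loop with an optimal assignment step.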
To build the graph, we treat object trajectories as nodes for information interaction, which improves the GNN’s feature representation capability. We formulate graph building as a top-k selection task over reliable objects or trajectories, and we learn better predictions on longer time scales by adding composite nodes. Our method outperforms state-of-the-art methods on several commonly used datasets.
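The top-k formulation of graph building can be pictured with a small sketch. It assumes that each object already has a reliability score for every candidate trajectory, and that an edge is kept only for the k highest-scoring candidates. How those scores are computed, the per-object selection, and the name build_edges are hypothetical and not taken from the paper.

```python
import heapq


def build_edges(object_ids, reliability, k=2):
    """Connect each object to its k most reliable candidate trajectories.

    reliability maps object_id -> {trajectory_id: score}.
    Returns a list of (object_id, trajectory_id) edges, best candidates first.
    """
    edges = []
    for obj in object_ids:
        candidates = reliability.get(obj, {})
        top = heapq.nlargest(k, candidates.items(), key=lambda kv: kv[1])
        edges.extend((obj, traj) for traj, _ in top)
    return edges


# Toy scores, chosen by hand for illustration only.
example_edges = build_edges(
    ["d1", "d2"],
    {"d1": {"t1": 0.2, "t2": 0.9, "t3": 0.5}, "d2": {"t1": 0.7}},
    k=2,
)
```

Limiting each object to k candidates keeps the graph sparse, which matters when the number of trajectories grows with video length.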
To understand how Link Block works, imagine you are at a party where people are moving around and interacting with each other. The partygoers are like the objects in the video frames, and their interactions are like the edges between them. Our method is like a magic ring that can identify which partygoers are most likely to be friends or associated with each other based on their interactions over time.
The magic ring works by looking at the embeddings of the edges (i.e., the connections between partygoers) and the node embeddings (i.e., the characteristics of each partygoer). It then feeds these embeddings to an MLP to generate a score that tells us how similar two partygoers are. If the score is above a certain threshold, we consider them associated; if not, we start a new trajectory.
Building the graph involves treating object trajectories as nodes for information interaction. This allows our method to better capture the relationships between objects over time. Think of it like a web of connections between the partygoers, where each connection represents an association between two people. By looking at these connections, we can identify which partygoers are most likely to be friends or associated with each other.
Our method also learns better predictions on longer time scales by adding composite nodes. Imagine you are at a party that lasts for several days; our method can help identify the people who are most likely to be friends over the entire duration of the party, even if they don’t interact with each other directly.
In summary, Link Block is a tool for object association in ultra-long videos. By combining a GNN with an MLP, it identifies the objects or trajectories that are most likely to be associated with each other based on their interactions over time. Our method outperforms state-of-the-art methods on several commonly used datasets, which makes it a good fit for applications that depend on tracking objects through long videos.
Computer Science, Computer Vision and Pattern Recognition