In this article, we propose Dolphins, a new approach to autonomous driving built on a large vision-language model (VLM) that emulates the strengths of human drivers. Dolphins is designed to excel in holistic understanding and reasoning as well as in interpreting diverse visual features. Unlike existing autonomous driving systems (ADS), which lack this kind of holistic reasoning, and unlike language models that rely solely on linguistic input and lack rich visual capabilities, Dolphins can comprehend dynamic and complex scenarios, even in long-tail, open-world driving environments.
To achieve this, Dolphins fuses the reasoning and planning abilities of large language models (LLMs) with the capacity to interpret diverse visual features. The fused model can understand a scene holistically, much as a human driver quickly deduces potential dangers and acts on them.
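To make this fusion concrete, below is a minimal, hypothetical PyTorch sketch of the common VLM pattern such a description implies: features from an image encoder are projected into the language model's embedding space and concatenated with text-token embeddings, so the LLM attends over both modalities. The `VisualProjector` class, the dimensions, and the random stand-in tensors are illustrative assumptions, not Dolphins' released architecture.

```python
# Hypothetical sketch of visual-feature / LLM fusion (not the authors' code):
# project image-encoder features into the LLM's embedding space, then
# concatenate them with text embeddings into one sequence.
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps image-encoder features into the LLM's token-embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        return self.proj(vision_feats)  # (batch, num_patches, llm_dim)

# Toy dimensions; a real system would use a pretrained ViT and a large LLM.
vision_dim, llm_dim, vocab = 768, 4096, 32000
projector = VisualProjector(vision_dim, llm_dim)
text_embed = nn.Embedding(vocab, llm_dim)

image_feats = torch.randn(1, 256, vision_dim)  # stand-in for ViT patch features
text_ids = torch.randint(0, vocab, (1, 16))    # stand-in for a tokenized prompt

# The fused sequence is what the language model reasons over when producing
# driving commentary or action plans.
fused = torch.cat([projector(image_feats), text_embed(text_ids)], dim=1)
print(fused.shape)  # torch.Size([1, 272, 4096])
```

A design note on this pattern: projecting visual features into the LLM's embedding space lets the language model stay largely frozen while a small adapter learns the cross-modal mapping, which is one common way such systems keep training costs manageable.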
For instance, imagine a ball bouncing onto the road, followed by a child running after it. A human driver would immediately recognize the danger and react; an existing ADS, without prior exposure to similar data, might misinterpret the scene. Dolphins, by contrast, can understand the situation holistically and respond appropriately.
In addition, Dolphins learns and adapts quickly, improving its performance over time. This matters for autonomous driving, where the ability to respond rapidly to changing situations is crucial.
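One plausible reading of this rapid adaptation is few-shot, in-context prompting: rather than being retrained, the model is shown a handful of demonstrations at inference time and then queried on a new scene. The sketch below illustrates that idea; the `build_fewshot_prompt` helper, the `<image>` placeholder token, and the prompt template are assumptions for illustration, not Dolphins' actual format.

```python
# Illustrative sketch of few-shot, in-context adaptation: interleave a few
# (scene, response) demonstrations with a new query in a single prompt.
from typing import List, Tuple

def build_fewshot_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Interleave demonstrations with the new query; `<image>` marks where
    a frame's visual tokens would be spliced in by the model."""
    parts = []
    for scene, response in demos:
        parts.append(f"<image> Scene: {scene}\nAction: {response}")
    parts.append(f"<image> Scene: {query}\nAction:")
    return "\n\n".join(parts)

demos = [
    ("a ball rolls into the street from between parked cars",
     "brake immediately; a child may follow the ball"),
    ("a cyclist signals a left turn ahead",
     "slow down and yield until the cyclist completes the turn"),
]
print(build_fewshot_prompt(demos, "a deer stands at the edge of a dark rural road"))
```

Under this reading, adapting to a new scenario costs only a few demonstrations in the prompt rather than a retraining run, which is what makes the adaptation fast.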
Overall, Dolphins offers a promising approach to autonomous driving. By combining the reasoning strengths of LLMs with human-like holistic understanding, it provides capabilities that existing ADS lack, and its rapid adaptation and broad visual competence make it a strong candidate for handling the open-world conditions that real driving demands.