Healthcare data is crucial for making accurate diagnoses and developing effective treatments, yet it is often difficult to obtain and process because of privacy and ethical constraints. In this survey, we explore "data-centric foundation models" (FMs) in computational healthcare, which can help address these challenges by leveraging models such as large language models (LLMs) pre-trained on general-domain data to improve downstream tasks.
Data Efficiency
One of the main goals of FMs is to improve data efficiency, i.e., to reduce the amount of data required for downstream tasks. By adapting pre-trained FMs to healthcare datasets of limited size, satisfactory results can be obtained even when labeled data are scarce. For example, CITE explores adapting general-domain vision FMs to comprehend pathological images.
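To make this adaptation pattern concrete, below is a minimal sketch of one common data-efficient strategy: freezing a backbone pre-trained on general-domain images and training only a small classification head on a limited set of labeled pathology patches. The backbone choice, class count, and hyperparameters are illustrative assumptions, not CITE's actual pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models

# Backbone pre-trained on general-domain images (ImageNet weights here,
# purely for illustration; CITE itself builds on other vision FMs).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()            # drop the original classification head
for p in backbone.parameters():
    p.requires_grad = False            # freeze the pre-trained weights

# Train only a lightweight head on the small labeled pathology dataset.
num_classes = 3                        # e.g., tumor subtypes (hypothetical)
head = nn.Linear(2048, num_classes)    # 2048 = ResNet-50 feature dimension
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step; only the head's parameters are updated."""
    with torch.no_grad():
        features = backbone(images)    # reuse general-domain representations
    logits = head(features)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```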
Data Augmentation
Another strategy for addressing limited data samples is data augmentation. Conventional techniques include resizing, cropping, and flipping for images, or synonym replacement, random insertion, and back translation for text. However, these methods only manipulate existing samples and add little new information, since they do not introduce knowledge beyond the existing data distribution. In contrast, FMs can bring a remarkable shift to healthcare data augmentation by applying knowledge acquired during pre-training on general-domain data.
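As a point of reference, the conventional techniques above can be implemented in a few lines; the transform sizes and the tiny synonym lexicon below are illustrative placeholders, not recommendations from the survey.

```python
import random
from torchvision import transforms

# Image side: geometric edits of existing samples; nothing outside the
# original data distribution is introduced.
conventional_image_aug = transforms.Compose([
    transforms.Resize(256),                   # resizing
    transforms.RandomCrop(224),               # cropping
    transforms.RandomHorizontalFlip(p=0.5),   # flipping
    transforms.ToTensor(),
])

# Text side: naive synonym replacement with a hand-written lexicon
# (hypothetical entries, for illustration only).
SYNONYMS = {"pain": ["discomfort", "ache"], "severe": ["acute", "intense"]}

def synonym_replace(sentence: str) -> str:
    """Swap listed words for a synonym; the result stays within the
    information already present in the original sample."""
    return " ".join(random.choice(SYNONYMS[w]) if w in SYNONYMS else w
                    for w in sentence.split())
```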
Information-rich Generative FMs
The key to FMs’ success lies in their ability to transfer general knowledge acquired during pre-training to the healthcare domain. Generative FMs are trained on vast amounts of text or image data, allowing them to synthesize new samples that are rich in knowledge and context. By leveraging this generative capacity, FMs can introduce external information beyond the existing distribution, leading to more accurate and data-efficient downstream models.
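As an illustration of generative augmentation, the sketch below prompts a language model to rephrase a clinical note. The model name is a stand-in for any capable instruction-following LLM, and the prompt and note are hypothetical; synthetic outputs would still need clinical review before being used to train downstream models.

```python
from transformers import pipeline

# The model name is a placeholder; any capable generative LLM could stand in.
generator = pipeline("text-generation", model="gpt2", max_new_tokens=60)

def augment_note(note: str, n: int = 3) -> list[str]:
    """Ask a generative FM to rephrase a clinical note. Unlike cropping or
    synonym swaps, the model can draw on knowledge acquired during
    pre-training, introducing information beyond the original sample."""
    prompt = ("Rewrite the following clinical note in different words:\n"
              f"{note}\nRewritten note:")
    outputs = generator(prompt, num_return_sequences=n, do_sample=True)
    return [o["generated_text"][len(prompt):].strip() for o in outputs]

# Hypothetical usage on a synthetic note.
variants = augment_note("Patient reports severe chest pain radiating to the left arm.")
```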
Applications
FMs have numerous applications in computational healthcare, including image classification, text classification, and drug discovery. For instance, Segment Anything, a promptable segmentation model released by Meta AI, has been adapted to segment medical images with strong accuracy. Similarly, Large Language Models are Zero-Shot Reasoners, a study presented at NeurIPS (Advances in Neural Information Processing Systems), demonstrates that LLMs can perform reasoning tasks without any additional training or fine-tuning, using only a simple prompt.
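For a concrete sense of the segmentation example, the sketch below follows the publicly documented Segment Anything predictor interface; the checkpoint path, image array, and prompt point are placeholders, and real medical use would typically involve further domain adaptation.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (the path is a placeholder; weights are distributed
# separately by the Segment Anything authors).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# `slice_rgb` stands in for one medical image slice converted to an
# H x W x 3 uint8 RGB array (hypothetical data).
slice_rgb = np.zeros((512, 512, 3), dtype=np.uint8)
predictor.set_image(slice_rgb)

# Prompt with a single foreground point on the structure of interest;
# SAM returns candidate masks with confidence scores.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),       # 1 = foreground point
    multimask_output=True,
)
best_mask = masks[int(np.argmax(scores))]
```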
Challenges and Future Directions
Despite these promising applications, several challenges remain for future research. One of the main concerns is the risk of using pre-trained models for sensitive tasks such as healthcare, where patient privacy must be protected and the cost of errors is high. Another challenge is the need for better interpretability and explainability mechanisms to understand how FMs arrive at their conclusions.
Conclusion
In conclusion, data-centric foundation models have the potential to revolutionize computational healthcare by improving data efficiency and augmentation strategies. By leveraging pre-trained FMs, external information beyond the existing distribution can be introduced, improving the accuracy and efficiency of downstream models. While challenges remain, the promising applications of FMs make them an exciting area of research with significant implications for improving healthcare outcomes.