BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature
The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally. On average, our models achieve state-of-the-art performance across 40 tasks - spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology - excelling in zero-shot classification with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.
Discussion
Host: Hey everyone, and welcome back to the podcast! I'm your host, Leo, and I'm super excited about today's episode. We're diving into something really fascinating – the world of biomedical image analysis and how AI is changing the game. We're not just talking about your typical cat pictures here; we’re talking about groundbreaking research that has the potential to revolutionize how we understand and treat diseases. It’s a bit technical, I won't lie, but I promise to break it down and make it engaging for everyone. So buckle up, and let's get started!
Host: So, you might be wondering, what exactly are we discussing today? Well, it's all about a new dataset called BIOMEDICA. I know, the name might not roll off the tongue, but this dataset is a big deal. It's essentially a massive collection of biomedical images and their corresponding text descriptions pulled straight from scientific literature. Think of it as a treasure trove of information, all carefully organized and ready to be used for training AI models. The idea is to enable the development of advanced vision-language models specifically for the biomedical field. This is going to be key for stuff like helping doctors diagnose diseases faster and more accurately, or even for accelerating drug discovery, you know, the kind of stuff that has direct impact on people’s lives. It’s pretty revolutionary.
Host: And what's really cool, and I think it's worth mentioning, is the open-source nature of this whole project. This isn't some closely guarded secret; it's a resource made available to everyone. We are talking about a framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset, which is a huge feat in itself! This accessibility, I believe, is going to be critical for fostering collaboration and innovation in the biomedical AI community and will ultimately speed up progress. It levels the playing field for research teams and allows for diverse approaches, leading to better, more robust models and solutions. I mean, that's what it's all about, right?
Host: Right, so before we delve deep into the specifics, let’s lay down the basics. The entire paper is structured in a way that clearly outlines their methodology and findings. They begin with an introduction that highlights the importance of large-scale datasets in driving progress in vision-language models and then they clearly emphasize why the biomedical field specifically needs such resources. They address the limitations of existing datasets, which are often too narrow or not publicly available. And it's not just about throwing tons of data at the problem; they’ve created a complete curation process. Then, they move on to discussing what others in the field have done, that 'Related Work' section. It gives us a bit of a background of how this project fits into the overall landscape of research. So, we're not just talking about things in a vacuum.
Host: Exactly. The 'Related Work' section is crucial because it shows how this project builds upon previous efforts while also addressing their limitations. They talk about datasets like ROCO, MEDICAT, and PMC-OA, all of which used scientific literature for biomedical image-caption pretraining. But, and here’s the key part, these datasets were often focused on specific areas, like radiology or pathology. This new project, BIOMEDICA, is different: it takes a more domain-agnostic approach. This means they aim to be more comprehensive and cater to all sorts of biomedical fields, and, in my humble opinion, that's why they stand out from the crowd. It is not just about more data; it’s about more diversity.
Host: And that domain-agnostic approach is why the dataset has far more metadata and annotations compared to previous datasets – which is always welcome. They've also taken a data-driven approach to annotations, which is really interesting. It's not just about using pre-trained models to categorize images. Instead, they've had actual clinicians and scientists develop a detailed taxonomy, which is this classification system, and then annotate the images. This is way more reliable than letting the computer decide based on some pre-existing criteria. This makes the final data so much more robust and relevant to the needs of biomedical researchers. The emphasis on data quality over sheer quantity is what sets this project apart, I think.
Host: Absolutely. The curation process itself, described in the paper, is quite meticulous. They talk about it as having three main stages: extraction, concept labeling, and serialization. In the extraction phase, they're pulling the data from the PubMed Central Open Access subset: stuff like article metadata, text (including captions and full text), and the images themselves. Then, they use DINOv2 features and clustering to help make the whole labeling process more manageable, almost like they are preprocessing the dataset before they present it to the clinicians for labeling. They use a combination of PCA and K-means clustering to organize the images. So you aren't just dumping 24 million images onto someone to label; that's how you burn people out!
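For readers who want a concrete picture of that clustering step, here is a minimal sketch assuming the DINOv2 image embeddings have already been computed; the array sizes, component count, and cluster count are illustrative, not the paper's exact settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Stand-in for precomputed DINOv2 image features, one row per figure.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 768)).astype(np.float32)

# Reduce dimensionality with PCA so K-means is faster and less noisy.
reduced = PCA(n_components=50, random_state=0).fit_transform(embeddings)

# Group visually similar figures into clusters for expert review.
kmeans = KMeans(n_clusters=200, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(reduced)

# Experts can then inspect representative images per cluster instead of
# labeling millions of figures one by one.
print(np.bincount(cluster_ids)[:10])  # sizes of the first ten clusters
```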
Host: Right. So, the clinicians and scientists then use this organized structure to come up with a detailed hierarchical taxonomy of concepts. This is crucial because it ensures the dataset is not only large but also well-organized and meaningful. It's not some random pile of data; it’s categorized with all this specific, expert-driven context. Once the taxonomy is built, the clinicians and scientists annotate clusters of images based on this taxonomy and then propagate those annotations to individual images. So they are not annotating one image at a time; they're labeling clusters of similar images, which speeds up the annotation process quite a lot. This systematic approach is really what makes the dataset so valuable.
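And here is a small, hypothetical illustration of that propagation step: cluster-level labels from the expert taxonomy are simply copied down to every image in the cluster. The concept names and IDs below are made up for the example, not the paper's actual vocabulary.

```python
# Toy example: three images assigned to two expert-annotated clusters.
image_cluster = {"img_0": 0, "img_1": 0, "img_2": 1}

# Hypothetical cluster-level labels from the expert-built taxonomy.
cluster_annotations = {
    0: {"global": "Microscopy", "local": "Light microscopy"},
    1: {"global": "Plots and charts", "local": "Bar plot"},
}

def propagate_labels(image_cluster, cluster_annotations):
    """Copy each cluster's labels down to every image in that cluster."""
    return {
        image_id: cluster_annotations[cluster_id]
        for image_id, cluster_id in image_cluster.items()
        if cluster_id in cluster_annotations
    }

print(propagate_labels(image_cluster, cluster_annotations))
```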
Host: And then comes the serialization phase. This is where they prepare the dataset for easy access by researchers. They convert everything to WebDataset format. This lets users efficiently stream the data rather than having to download 27 TB of data locally! It is super practical for anyone who's trying to train large models. They also make the dataset accessible on Hugging Face. I’d say it's very important for researchers to have that low-barrier access. It’s not enough to build a dataset; it also needs to be accessible. If it isn't, then what's the point?
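As a rough sketch of what that streaming access might look like with the Hugging Face datasets library: the repository id and the field names printed below are placeholders, not the official ones.

```python
from datasets import load_dataset

# Placeholder repository id; substitute the actual BIOMEDICA dataset name.
REPO_ID = "your-org/biomedica-webdataset"

# streaming=True iterates over shards on the fly, so nothing close to
# 27 TB ever needs to sit on local disk.
stream = load_dataset(REPO_ID, split="train", streaming=True)

for i, sample in enumerate(stream):
    # Each sample bundles an image with its caption and metadata fields.
    print(sorted(sample.keys()))
    if i >= 2:
        break
```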
Host: Right, and the dataset description section really digs into the numbers. They downloaded over 6 million articles, with 5 million containing images, which resulted in over 24 million image-caption pairs. Each image has a bunch of metadata associated with it too; around 27 different fields according to the paper. The text data in the dataset is massive as well, with captions ranging from single-token words all the way up to thousands of tokens. This variety is super important because it means the dataset captures the diverse ways in which information is represented in scientific literature. The images themselves also have a wide range of dimensions, from thumbnails to high-resolution images and figures. So they are not just dealing with one type of image, which makes it very robust to real-world scenarios.
Host: Yeah, and the concept taxonomy, that classification system we talked about earlier, includes 12 global concepts and 170 local concepts, and they used it to annotate over 23 million images. Biomedical images account for about 17% of the dataset, and the vast majority of the remaining images are plots and charts. This shows that there is a real variety of content; it’s not just one kind of scientific image. They break down the metadata in the supplementary section, where they include metadata for image data, annotations, and even article metadata. They made sure that users have a good overview of what's in the dataset by showing the provenance, or origin, of each piece of information. It's not just about the what, but also the where and the how.
Host: So, the paper also includes a section that describes their evaluation benchmark. They didn’t just build a dataset and call it a day. They also provided a structured framework to measure how well AI models can learn from it, right? To do this, they repurposed 39 existing biomedical classification tasks and added a new retrieval dataset, using high-quality biomedical image-caption pairs from Flickr, for a total of 40 datasets. This comprehensive approach means they can assess how well the dataset can be used for different types of tasks and not just a narrow domain.
Host: And for each classification task, they convert the classes into captions. They provide two caption variations per class, which makes the evaluation more robust. Their benchmark covers multiple fields: pathology, radiology, ophthalmology, dermatology, surgery, biology, and general microscopy. They also divided the image classification benchmark into two splits: general bioimaging and microscopy composition. The microscopy part involves tasks to do with identifying properties of micrographs, things like the image domain, the modality and submodality used for acquisition, and the staining technique, which is kind of neat. They’re really trying to push the boundaries of what the models can learn.
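To make the caption-per-class idea concrete, here is a minimal zero-shot classification sketch for a CLIP-style model. The `model` and `tokenizer` objects are assumed to expose the usual encode_image/encode_text interface (with the tokenizer returning a tensor of token ids, as in open_clip), and the caption templates are illustrative rather than the benchmark's exact wording.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(model, tokenizer, image_tensor, class_names):
    """Score an image batch against caption-ified class names with a CLIP-style model."""
    # Two caption variations per class, mirroring the benchmark setup.
    templates = ["a photo of {}.", "an image showing {}."]
    captions = [t.format(name) for name in class_names for t in templates]

    with torch.no_grad():
        text_feat = F.normalize(model.encode_text(tokenizer(captions)), dim=-1)
        img_feat = F.normalize(model.encode_image(image_tensor), dim=-1)

    # Cosine similarity between each image and every caption, then average
    # the two template scores that belong to the same class.
    sims = img_feat @ text_feat.T
    class_scores = sims.view(-1, len(class_names), len(templates)).mean(dim=-1)
    return class_scores.argmax(dim=-1)  # predicted class index per image
```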
Host: And when it comes to retrieval tasks, they use a dataset of 7,000 high-quality biomedical image-caption pairs from Flickr. This dataset spans concepts across multiple fields, allowing for comprehensive assessment of the image and text understanding of the AI models. For the metrics, they use average accuracy for the classification tasks and recall at 1, 10, and 100 for the retrieval tasks. They also report unweighted averages to ensure each task is given equal importance, which is very important to ensure a fair and balanced evaluation. It’s not just about one specific task, it’s about general performance.
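And recall@k itself is easy to state in code: given an image-to-caption similarity matrix where the matching caption shares the image's index, count how often the true caption lands in the top k. A small self-contained version, with random scores standing in for real CLIP similarities:

```python
import numpy as np

def recall_at_k(similarity, ks=(1, 10, 100)):
    """Image-to-text recall@k; caption i is the true match for image i."""
    n = similarity.shape[0]
    ranking = np.argsort(-similarity, axis=1)  # best caption first
    rank_of_truth = np.array(
        [np.where(ranking[i] == i)[0][0] for i in range(n)]
    )
    return {f"recall@{k}": float((rank_of_truth < k).mean()) for k in ks}

# Toy usage with random scores in place of real CLIP similarities.
rng = np.random.default_rng(0)
print(recall_at_k(rng.normal(size=(500, 500))))
```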
Host: Right. And so, with all that setup in place, they then run a series of experiments to explore how to best leverage the dataset to train a model. They do experiments on continual pretraining using different kinds of training strategies, including topic balancing and filtering. They also explore robust fine-tuning methods. They also train their own CLIP model from scratch, so they can compare it against these more involved strategies. And it’s really nice that everything is trained through streaming, meaning they didn't have to download 27 TB of data. It makes the training more efficient and reproducible for other research groups.
Host: So, in the experiments, the continual pretraining involves taking a base model, OpenCLIP, and continuing to train it on their new dataset. They do this with the full dataset as a base case, then with a concept-balanced dataset (where all the different concept categories get equal representation), and then again with a filtered dataset (where they specifically select concepts related to clinical and scientific imaging). The concept-balanced approach is meant to keep over-represented data from introducing biases, while the filtering approach focuses on what they consider the more directly clinically relevant content. They then test all these different models to see which one performs better.
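A schematic version of those two data strategies, assuming each streamed sample carries its expert-derived concept label in its metadata; the concept names and the cap value are invented for illustration.

```python
from collections import defaultdict

# Hypothetical set of concepts considered clinically/scientifically relevant.
KEEP_CONCEPTS = {"Clinical imaging", "Microscopy"}

def concept_filter(samples, keep=KEEP_CONCEPTS):
    """Concept filtering: keep only samples whose global concept is relevant."""
    for sample in samples:
        if sample["global_concept"] in keep:
            yield sample

def concept_balance(samples, per_concept_cap=100_000):
    """Concept balancing: cap each concept's contribution so dominant
    categories (e.g. plots and charts) don't swamp the training mix."""
    seen = defaultdict(int)
    for sample in samples:
        concept = sample["global_concept"]
        if seen[concept] < per_concept_cap:
            seen[concept] += 1
            yield sample
```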
Host: Exactly. And in the robust fine-tuning experiment, they use a technique called model merging to further improve the performance of the models. This is where they take a base model and combine its weights with the weights of its adapted counterpart. All this training was done with a batch size of 1024 per GPU across four GPUs, for an effective batch size of 8192, a learning rate of 1e-6, and 1000 warmup steps. And it's really worth highlighting that they report all these important details in the main paper text, but also in the supplementary materials. They really went the extra mile to make their research as reproducible as possible, and that's always something to admire in scientific research.
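Model merging in the WiSE-FT style boils down to a linear interpolation of the two checkpoints' weights. Here is a minimal sketch; the mixing coefficient of 0.5 is just an example, not necessarily the value the authors used.

```python
def wise_ft_merge(base_state, finetuned_state, alpha=0.5):
    """Interpolate a base (zero-shot) model's weight tensors with its
    fine-tuned counterpart: alpha=0 keeps the base model, alpha=1 the
    fine-tuned one."""
    return {
        name: (1 - alpha) * param + alpha * finetuned_state[name]
        for name, param in base_state.items()
    }

# Usage sketch (models omitted; any two checkpoints with identical
# architectures work):
# merged = wise_ft_merge(base_model.state_dict(), tuned_model.state_dict())
# merged_model.load_state_dict(merged)
```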
Host: And now we get to the results, which are pretty exciting. They found that concept filtering leads to the best performance across the classification and retrieval tasks, compared to both full-dataset pretraining and the concept-balanced dataset. Concept filtering essentially means focusing on specific kinds of image content, like clinical and scientific images, and it seems that focusing on this relevant data helped the model learn better; in other words, it is more effective than just throwing everything at the model. Their models trained on BIOMEDICA also achieve state-of-the-art zero-shot performance compared to prior work! They beat PMC-CLIP in all tasks, with a minimum improvement of 5% and a maximum improvement of 53%. They also beat BiomedCLIP in a majority of the cases while using 10x less compute, which is remarkable.
Host: Yeah, and on top of that, the robust fine-tuning technique, model merging, further improved model performance, particularly in microscopy tasks. WiSE-FT, the model merging technique, improved the performance of their best model by 8% on those microscopy tasks! This, they report, helps compensate for the original model's weaknesses there. But those gains come at the cost of lower performance on some other subtasks. It's a trade-off: certain tasks improve while others take a hit. And that's okay, because it lets you explore those trade-offs and make better design choices. I think it's super important to highlight these limitations as well, rather than just focusing on the positives.
Host: And they do just that. In the 'Limitations' section, they highlight that CLIP models have short context lengths. CLIP, the vision-language architecture they are building on, has a context limit of 77 tokens, which means it can't utilize full captions that exceed that limit, and this can hurt performance on tasks that require comprehensive textual understanding. They also point out that the images in BIOMEDICA have varied sizes and resolutions, meaning the model has to resize them all to the same dimensions, and this can lead to loss of vital visual information. So even though they report state-of-the-art results, they aren't trying to oversell the models; they're highlighting what could be done next.
Host: And that brings us to the conclusion. They have successfully shown that the BIOMEDICA framework is a good way to build large datasets that can be used for training AI models. The dataset contains over 24 million image-caption pairs, comes with expert-driven annotations, and includes all the bells and whistles needed to train state-of-the-art models. The models achieve great performance compared to prior work, with much less computation. And, crucially, they made everything publicly available to facilitate progress in the field, from the dataset to the code and models. And that is, in my opinion, the true power of this entire project.