CAREER: Aligning Image Retrieval Systems with Human Notions of Similarity
open, NSF
Fine-grained visual categorization involves identifying subtle differences between highly similar visual categories, such as distinguishing between two closely related bird species or recognizing different models of a car. While these capabilities are critical in a variety of fields, including biodiversity research, forensic investigations, and e-commerce, the task is challenging because differences between categories can be small, while variations within a category can be large. For example, two different bird species may look very similar, while male and female birds of the same species may look very different. Visual categorization in fine-grained domains is often treated as an image retrieval problem, where the label for a query image is determined based on the labels of the most visually similar images. An image retrieval approach typically performs better than standard classification approaches on fine-grained visual categorization tasks, especially in domains with very large numbers of classes. However, image retrieval approaches often fail in two ways: a retrieved image that is visually similar is not necessarily from the same class, and images that are from the same class may not be retrieved because their visual similarity to the query is low. Moreover, the features learned by standard image retrieval models are often biased towards overall visual similarity rather than task-specific or domain-specific notions of importance. This limitation can hinder analysts and domain experts who may want to prioritize specific visual features based on their own expertise or intuitions. This project aims to improve image retrieval systems for fine-grained domains by aligning them more closely with how human experts perceive and prioritize differences, and by enabling users to focus on the features most important to their task.
While these innovations will be evaluated across a variety of fine-grained domains, they will be studied through their integration into an existing real-world image retrieval system developed by the investigator that is used by analysts at the National Center for Missing and Exploited Children to recognize the hotels where victims of child sexual abuse and human trafficking are photographed.
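The retrieval-based categorization described above, where a query image takes its label from the labels of its most visually similar images, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two-dimensional embeddings, the `GALLERY` contents, the function names, and the choice of cosine similarity with a majority vote are hypothetical and are not taken from the project's actual system.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, gallery, k=3):
    """Return the k gallery entries most similar to the query as (similarity, label) pairs."""
    scored = [(cosine(query, emb), label) for emb, label in gallery]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def predict_label(query, gallery, k=3):
    """Label the query by majority vote over the labels of its k nearest gallery images."""
    votes = Counter(label for _, label in retrieve(query, gallery, k))
    return votes.most_common(1)[0][0]


# Hypothetical toy gallery: (embedding, class label) pairs.
GALLERY = [
    ([1.0, 0.0], "species_a"),
    ([0.9, 0.1], "species_a"),
    ([0.0, 1.0], "species_b"),
    ([0.1, 0.9], "species_b"),
]

if __name__ == "__main__":
    print(predict_label([0.8, 0.2], GALLERY, k=3))
```

The sketch also makes the failure modes visible: the vote is only as good as the similarity function, so a gallery image from another class that happens to lie close to the query in embedding space will be retrieved and counted.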
This project will address the limitations of traditional image retrieval in fine-grained domains by developing systems that better align with human notions of visual similarity while empowering users to dynamically guide retrieval processes. Ensembles of specialized models, each focused on a single visual notion, offer improved alignment with human judgments but are computationally impractical for real-time use, particularly in resource-constrained settings. To address this, the project will explore the use of knowledge distillation to integrate the varied knowledge of the ensemble models into a single model, while preserving flexibility for users to prioritize specific visual notions learned by individual models in the ensemble at query time. The project will also develop additional mechanisms for a user to dynamically refine search results. This line of work will have two directions: one where users specify their refinement by identifying visual features to prioritize, and one where the refinement is expressed in natural language. The project will also investigate the utility of subspace projection techniques as a pre-processing step for building task-specific indices from the embedding space of pre-trained vision language models. These innovations will enable both image-based and text-based retrieval systems where users can articulate preferences through natural language or visual cues, creating intuitive and scalable tools for specific fine-grained domains. Complementing these technical advancements, the project’s educational and outreach initiatives will focus on integrating machine learning competitions into undergraduate and graduate curricula to foster hands-on learning and critical thinking about real-world applications. Workshops will be organized to share best practices for using machine learning competitions as an educational tool, engaging educators, researchers, and students from a variety of backgrounds.
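One way to picture the query-time prioritization of visual notions described above is a single embedding whose dimensions are partitioned into blocks, one block per notion, with user-supplied weights applied when similarities are computed. This is only a sketch under stated assumptions: the abstract does not specify the mechanism, and the block layout (`NOTION_SLICES`), the notion names, the weighting scheme, and the toy data are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Assumption: a single model emits one vector in which each block of
# dimensions encodes one visual notion. Names and sizes are hypothetical.
NOTION_SLICES = {"color": slice(0, 2), "shape": slice(2, 4)}


def weighted_similarity(query, candidate, weights):
    """Weighted average of per-notion cosine similarities."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one notion weight must be positive")
    score = 0.0
    for notion, weight in weights.items():
        block = NOTION_SLICES[notion]
        score += weight * cosine(query[block], candidate[block])
    return score / total


def rank(query, gallery, weights):
    """Return gallery image names ordered from most to least similar."""
    return sorted(
        gallery,
        key=lambda name: weighted_similarity(query, gallery[name], weights),
        reverse=True,
    )


# Hypothetical toy data: img_x matches the query on color, img_y on shape.
QUERY = [1.0, 0.0, 1.0, 0.0]
GALLERY = {
    "img_x": [1.0, 0.0, 0.0, 1.0],
    "img_y": [0.0, 1.0, 1.0, 0.0],
}

if __name__ == "__main__":
    print(rank(QUERY, GALLERY, {"color": 1.0, "shape": 0.0}))
    print(rank(QUERY, GALLERY, {"color": 0.0, "shape": 1.0}))
```

In this toy setup, changing the weights reorders the results without recomputing any embeddings, which is the property that would make such a scheme cheaper at query time than running a full ensemble.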
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Up to $345K
machine learning, Education