Academics | The Hong Kong University of Science and Technology

Breast Cancer LLM Evaluation Benchmark
We have developed a breast cancer benchmark to evaluate the performance of artificial intelligence models in pathological image analysis. We obtained 10 whole-slide images (WSIs) from collaborating hospitals, which professional pathologists annotated for six key breast cancer features: necrosis, perineural invasion, calcification, lymphovascular invasion, ductal carcinoma in situ (DCIS), and invasive ductal carcinoma (IDC). This yielded approximately 1,000 high-quality images. Using these data, we evaluated multimodal large language models (MLLMs) and CLIP-like models on the six breast cancer diagnostic tasks. For CLIP-like models, we used predefined templates and synonym lists, assessing performance based on similarity scores between input images and preset text descriptions. For MLLMs, we combined templates and prompts to guide the models to respond with "yes" or "no" when answering pathology-related questions. Beyond assessing model performance, the benchmark is intended to provide a unified platform for researchers and to promote the advancement of artificial intelligence in breast cancer diagnosis.
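
The two evaluation modes described above can be illustrated with a minimal sketch. It is not the benchmark's actual code: the templates, the synonym lists, the choice to average similarities across prompts, and the helper names (build_prompts, feature_score, parse_yes_no) are all assumptions made for illustration, and the embeddings are expected to come from whichever CLIP-like encoder is under test.

```python
import math
import re

# Hypothetical templates and synonyms; the benchmark's real wording is not published here.
TEMPLATES = ["a histopathology image showing {}", "an H&E image of {}"]
SYNONYMS = {
    "necrosis": ["necrosis", "necrotic tissue"],
    "calcification": ["calcification", "microcalcifications"],
}


def build_prompts(feature):
    """Expand every template with every synonym listed for one feature."""
    return [t.format(s) for t in TEMPLATES for s in SYNONYMS[feature]]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def feature_score(image_emb, text_embs):
    """Mean similarity between an image embedding and a feature's prompt embeddings.

    Averaging over prompts is an assumption; taking the maximum is another option.
    """
    return sum(cosine(image_emb, t) for t in text_embs) / len(text_embs)


def parse_yes_no(reply):
    """Map an MLLM reply to True ("yes"), False ("no"), or None if it starts with neither."""
    match = re.match(r"\W*(yes|no)\b", reply.strip().lower())
    if match is None:
        return None
    return match.group(1) == "yes"
```

In use, each prompt from build_prompts would be embedded by the text encoder, feature_score would be compared across features or against a threshold, and parse_yes_no would turn free-text MLLM answers into labels that can be scored against the pathologists' annotations.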