Fig. 4: The accuracy (%) and F1 scores (%) of our model and LLMs on the ScholarChemQA dataset.
From: Unveiling the power of language models in chemical research question answering

Our ChemMatch outperforms the other baselines, significantly surpassing Llama2 and GPT-3.5.