TY - GEN
T1 - UCCIX
T2 - 27th European Conference on Artificial Intelligence, ECAI 2024
AU - Tran, Khanh Tung
AU - O’Sullivan, Barry
AU - Nguyen, Hoang D.
N1 - Publisher Copyright:
© 2024 The Authors.
PY - 2024/10/16
Y1 - 2024/10/16
AB - The development of Large Language Models (LLMs) has predominantly focused on high-resource languages, leaving extremely low-resource languages like Irish with limited representation. This work presents UCCIX, a pioneering effort in the development of an open-source Irish-based LLM. We propose a novel framework for continued pre-training of LLMs specifically adapted for extremely low-resource languages, requiring only a fraction of the textual data typically needed for training LLMs according to scaling laws. Our model, based on Llama 2-13B [23], outperforms much larger models on Irish language tasks by up to 12%, showcasing the effectiveness and efficiency of our approach. We also contribute comprehensive Irish benchmarking datasets, including IrishQA, a question-answering dataset, and an Irish version of MT-bench [28]. These datasets enable rigorous evaluation and facilitate future research in Irish LLM systems. Our work aims to preserve and promote the Irish language, knowledge, and culture of Ireland in the digital era while providing a framework for adapting LLMs to other indigenous languages.
UR - https://www.scopus.com/pages/publications/85216622316
U2 - 10.3233/FAIA241040
DO - 10.3233/FAIA241040
M3 - Conference proceeding
AN - SCOPUS:85216622316
T3 - Frontiers in Artificial Intelligence and Applications
SP - 4503
EP - 4506
BT - ECAI 2024 - 27th European Conference on Artificial Intelligence, Including 13th Conference on Prestigious Applications of Intelligent Systems, PAIS 2024, Proceedings
A2 - Endriss, Ulle
A2 - Melo, Francisco S.
A2 - Bach, Kerstin
A2 - Bugarin-Diz, Alberto
A2 - Alonso-Moral, Jose M.
A2 - Barro, Senen
A2 - Heintz, Fredrik
PB - IOS Press BV
Y2 - 19 October 2024 through 24 October 2024
ER -