Addressing the gap between current language models and key-term-based clustering

  • Eric M. Cabral
  • Sima Rezaeipourfarsangi
  • Maria Cristina F. Oliveira
  • Evangelos E. Milios
  • Rosane Minghim

Research output: Chapter in Book/Report/Conference proceedings, Chapter, peer-reviewed

Abstract

This paper presents MOD-kt, a modular framework designed to bridge the gap between modern language models and key-term-based document clustering. One of the main challenges of using neural language models for key-term-based clustering is the mismatch between the interpretability of the underlying document representation (i.e., document embeddings) and the more intuitive semantic elements that allow the user to guide the clustering process (i.e., key terms). Our framework acts as a communication layer between word and document models, enabling key-term-based clustering in the context of document and word models with a flexible and adaptable architecture. We report a comparison of the performance of multiple neural language models on clustering, considering a selected range of relevance metrics. Additionally, we conducted a qualitative user study to illustrate the framework's potential for intuitive, user-guided, quality clustering of document collections.

Original language: English
Title of host publication: DocEng 2023 - Proceedings of the 2023 ACM Symposium on Document Engineering
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9798400700279
Publication status: Published - 22 Aug 2023
Event: 2023 ACM Symposium on Document Engineering, DocEng 2023 - Limerick, Ireland
Duration: 22 Aug 2023 to 25 Aug 2023

Publication series

Name: DocEng 2023 - Proceedings of the 2023 ACM Symposium on Document Engineering

Conference

Conference: 2023 ACM Symposium on Document Engineering, DocEng 2023
Country/Territory: Ireland
City: Limerick
Period: 22/08/23 to 25/08/23

Keywords

  • clustering analysis
  • document clustering
  • document embeddings
  • interactive clustering
  • user-centred clustering
