TY - CHAP
T1 - Addressing the gap between current language models and key-term-based clustering
AU - Cabral, Eric M.
AU - Rezaeipourfarsangi, Sima
AU - Oliveira, Maria Cristina F.
AU - Milios, Evangelos E.
AU - Minghim, Rosane
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023/8/22
Y1 - 2023/8/22
AB - This paper presents MOD-kt, a modular framework designed to bridge the gap between modern language models and key-term-based document clustering. One of the main challenges of using neural language models for key-term-based clustering is the mismatch between the interpretability of the underlying document representation (i.e., document embeddings) and the more intuitive semantic elements that allow the user to guide the clustering process (i.e., key-terms). Our framework acts as a communication layer between word and document models, enabling key-term-based clustering in the context of document and word models with a flexible and adaptable architecture. We report a comparison of the performance of multiple neural language models on clustering, considering a selected range of relevance metrics. Additionally, a qualitative user study was conducted to illustrate the framework's potential for intuitive, user-guided, high-quality clustering of document collections.
KW - clustering analysis
KW - document clustering
KW - document embeddings
KW - interactive clustering
KW - user-centred clustering
UR - https://www.scopus.com/pages/publications/85173560882
U2 - 10.1145/3573128.3604900
DO - 10.1145/3573128.3604900
M3 - Chapter
AN - SCOPUS:85173560882
T3 - DocEng 2023 - Proceedings of the 2023 ACM Symposium on Document Engineering
BT - DocEng 2023 - Proceedings of the 2023 ACM Symposium on Document Engineering
PB - Association for Computing Machinery, Inc
T2 - 2023 ACM Symposium on Document Engineering, DocEng 2023
Y2 - 22 August 2023 through 25 August 2023
ER -