Reddit temporal N-gram corpus and its applications on paraphrase and semantic similarity in social media using a topic-based latent semantic analysis

  • Anh Dang
  • Abidalrahman Moh'd
  • Aminul Islam
  • Rosane Minghim
  • Michael Smit
  • Evangelos Milios

Research output: Chapter in Book/Report/Conference proceedings (chapter, peer-reviewed)

Abstract

This paper introduces a new large-scale n-gram corpus created specifically from social media text. Two characteristics distinguish this corpus: its monthly temporal attribute, and its source, 1.65 billion comments of user-generated text on Reddit. The usefulness of the corpus is exemplified and evaluated by a novel Topic-based Latent Semantic Analysis (TLSA) algorithm. The experimental results show that unsupervised TLSA outperforms all the state-of-the-art unsupervised and semi-supervised methods in the SemEval 2015 task on paraphrase and semantic similarity in Twitter.

Original language: English
Title of host publication: COLING 2016 - 26th International Conference on Computational Linguistics, Proceedings of COLING 2016
Subtitle of host publication: Technical Papers
Publisher: Association for Computational Linguistics, ACL Anthology
Pages: 3553-3564
Number of pages: 12
ISBN (Print): 9784879747020
Publication status: Published - 2016
Externally published: Yes
Event: 26th International Conference on Computational Linguistics, COLING 2016 - Osaka, Japan
Duration: 11 Dec 2016 to 16 Dec 2016

Publication series

Name: COLING 2016 - 26th International Conference on Computational Linguistics, Proceedings of COLING 2016: Technical Papers

Conference

Conference: 26th International Conference on Computational Linguistics, COLING 2016
Country/Territory: Japan
City: Osaka
Period: 11/12/16 to 16/12/16
