Hybrid Method for Short Text Topic Modeling

File version

Accepted Manuscript (AM)

Author(s)
Chen, J
Stantic, B
Date
2023
Location

Phuket, Thailand

Abstract

The rising popularity of social media has led to a significant increase in user-generated content across a wide range of topics. Extracting information from these data can be valuable in many domains; however, the sheer volume makes manual extraction infeasible. Several aspects of information extraction have been studied in the literature, including identifying the topic discussed in a text. The challenge is greater when the text is short, as is typical of social media posts. The topic modeling methods proposed in the literature can be broadly categorized as unsupervised or supervised. Unsupervised topic modeling methods have shortcomings, such as semantic loss and poor explainability, and are sensitive to the choice of parameters, such as the number of topics. Supervised methods based on deep learning can achieve high accuracy, but they need data annotated by humans, which is time-consuming and costly. To overcome these disadvantages, this work proposes a hybrid topic modeling method that combines the advantages of both unsupervised and supervised methods. We built a hybrid model by combining Latent Dirichlet Allocation (LDA) with a deep learning model built on top of Bidirectional Encoder Representations from Transformers (BERT). LDA is used to identify the optimal number of topics and the topic-relevant keywords; the only human input needed, with the aid of ChatGPT, is to identify the associated topics based on the topic-specific keywords. This annotation is then used to train and fine-tune the BERT model. In an experimental evaluation on posts related to climate change, we show that the proposed concept is applicable for predicting topics from short text without the need for lengthy and costly annotation.
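
The labeling workflow described in the abstract can be illustrated with a minimal sketch that uses only the Python standard library. It makes several assumptions that are not taken from the paper: a small collapsed Gibbs sampler stands in for a production LDA implementation, the number of topics is fixed by hand rather than selected, placeholder topic names stand in for the naming step done by a human with the aid of ChatGPT, and each post is labeled with its dominant topic. The function names, example posts, and hyperparameters are all hypothetical. The BERT fine-tuning step is not shown.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts. Hyperparameters are illustrative defaults.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_keywords(topic_word, n=5):
    """Most frequent words per topic, shown to the annotator for naming."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]


def pseudo_label(doc_topic, topic_names):
    """Assumed labeling rule: each post gets the name of its dominant topic."""
    labels = []
    for counts in doc_topic:
        dominant = max(range(len(counts)), key=counts.__getitem__)
        labels.append(topic_names[dominant])
    return labels


# Hypothetical short posts; not data from the paper.
posts = [
    "rising sea levels threaten coastal cities",
    "coastal flooding and sea levels keep rising",
    "solar and wind power cut carbon emissions",
    "wind power and solar energy reduce emissions",
]
docs = [p.lower().split() for p in posts]
doc_topic, topic_word = fit_lda(docs, num_topics=2)
keywords = top_keywords(topic_word)

# Placeholder names; in the described workflow a person names each topic
# from its keywords with the aid of ChatGPT.
topic_names = [f"topic_{k}" for k in range(len(keywords))]
labels = pseudo_label(doc_topic, topic_names)
training_pairs = list(zip(posts, labels))
```

Under these assumptions, `training_pairs` holds (post, label) pairs of the kind that could then serve as weakly annotated training data for fine-tuning a BERT classifier.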

Conference Title

Recent Challenges in Intelligent Information and Database Systems: 15th Asian Conference, ACIIDS 2023, Phuket, Thailand, July 24–26, 2023, Proceedings

Volume

1863

Rights Statement

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. This is the author-manuscript version of this paper. Reproduced in accordance with the copyright policy of the publisher. The original publication is available at https://doi.org/10.1007/978-3-031-42430-4_13

Subject

Information and computing sciences

Citation

Chen, J; Stantic, B, Hybrid Method for Short Text Topic Modeling, Recent Challenges in Intelligent Information and Database Systems: 15th Asian Conference, ACIIDS 2023, Phuket, Thailand, July 24–26, 2023, Proceedings, 2023, 1863, pp. 157-168