Improving binary classification of web pages using an ensemble of feature selection algorithms
Author(s)
Lombardi, M
Marani, A
Editor(s)
Minh Ngoc Dinh
Location
Brisbane, Australia
Abstract
A well-known way to produce accurate predictive models is to apply feature-selection and feature-reduction algorithms. These algorithms describe an item with a subset of its attributes, ideally the smallest subset that does not compromise the representation of the object and, consequently, the classification as a whole. However, different feature-selection algorithms have different, potentially complementary properties, each capturing only some aspects of the feature set. Hence the resulting subset of attributes may vary significantly from one feature-selection approach to another, and each method affects the accuracy of the classification differently. In this contribution, we combine feature-selection algorithms with the intention of recognising the purpose of a web page. That is, we propose a framework for building an ensemble of feature-selection algorithms that merges their outcomes into a single score and thus achieves a comprehensive analysis of the feature set. We evaluated our proposal against traditional feature-selection and feature-reduction algorithms in a binary classification task on web pages. Our dataset consists of more than 400 pages labelled by educators as either relevant or not relevant for teaching purposes. We also evaluated the impact of the combination across several classifiers. Our results show that our framework outperforms current algorithms, allowing for a much faster yet still reliable classification of web pages in all the scenarios tested. We expect that our findings will contribute to improving the performance of web classifiers, particularly when they run on the fly and in real time.
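The idea of merging several feature-selection outcomes into a single score can be illustrated with a minimal sketch. The abstract does not say which selectors are combined or how their outputs are merged, so everything below is an assumption made for illustration: the two scoring functions (`abs_correlation`, `class_mean_gap`), the min-max normalisation, the unweighted average, and the toy data are hypothetical and are not the authors' method.

```python
from statistics import mean, pstdev


def abs_correlation(column, labels):
    """Hypothetical selector 1: |Pearson correlation| between feature and label."""
    sx, sy = pstdev(column), pstdev(labels)
    if sx == 0 or sy == 0:
        return 0.0  # a constant feature carries no information
    mx, my = mean(column), mean(labels)
    cov = mean([(x - mx) * (y - my) for x, y in zip(column, labels)])
    return abs(cov / (sx * sy))


def class_mean_gap(column, labels):
    """Hypothetical selector 2: gap between class means, scaled by the feature range."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    spread = max(column) - min(column)
    if not pos or not neg or spread == 0:
        return 0.0
    return abs(mean(pos) - mean(neg)) / spread


def min_max(scores):
    """Rescale one selector's scores to [0, 1] so selectors are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_scores(rows, labels, scorers):
    """Assumed merge rule: average the normalised scores of all selectors."""
    columns = list(zip(*rows))  # one tuple per feature
    per_scorer = [min_max([scorer(col, labels) for col in columns])
                  for scorer in scorers]
    return [mean(feature_scores) for feature_scores in zip(*per_scorer)]


def top_k(scores, k):
    """Return the indices of the k highest-scoring features, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy data: feature 0 tracks the label, feature 1 is noise, feature 2 is constant.
    rows = [(1.0, 0.3, 5.0), (0.9, 0.7, 5.0), (0.1, 0.4, 5.0), (0.0, 0.6, 5.0)]
    labels = [1, 1, 0, 0]  # 1 = relevant for teaching, 0 = not relevant
    scores = ensemble_scores(rows, labels, [abs_correlation, class_mean_gap])
    print(scores, top_k(scores, 1))
```

Averaging normalised scores is only one possible merge rule; rank aggregation or a weighted vote would fit the same skeleton by replacing `ensemble_scores`.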
Conference Title
ACM International Conference Proceeding Series
Subject
Information retrieval and web search