dc.description.abstract | The increasing trend of sharing educational resources on the World Wide Web has attracted
several contributions from the research community. Since most Technology Enhanced
Learning users retrieve resources from the Web for teaching or learning, it is clear that the
Web is a source of educational material. Therefore, it should be possible to use the Web as a
repository for teaching resources.
Regarding the retrieval of online resources, a big issue is that the Web is a huge and
mostly unorganised space. Hence, there is no guarantee that items retrieved by current
search engines are appropriate for educational uses. Automatically identifying Web-content
suitable and usable for education is one of the most challenging objectives because it requires
extraordinary attention. Indeed, an inappropriate recommendation in such a field may result in
reduced learning outcomes by students in assignments and exams or, even worse, in teachers
building their courses on incorrect or incomplete foundations.
Studies in Information Retrieval and Technology Enhanced Learning have proposed several
solutions to support the teaching and learning needs of instructors and pupils within an
enclosed platform. Other studies offer different techniques for collecting Web resources that
have specific characteristics. However, to the best of our knowledge, none of the current
proposals in the state-of-the-art has paid attention to gathering Web resources that can be
used for learning or teaching, without any restriction on topic or terminology. Personalisation
also improved Web-search by identifying what topics users prefer, and some progress has
been achieved in deducing the purpose of the search (e.g., the user is about to book a trip)
for tailored advertising; however, this is a very different use of recommendation.
Instead, we focus here on identifying documents with a purpose in the sense of being of
value for a learning objective. This contribution is built on the rationale that the classification
of textual materials and natural language processing are strictly related. Thus, we propose
to involve natural language processing methods to analyse the content of Web-pages suitable
for inclusion in teaching and learning environments. In the field of the Semantic Web, it is
common to apply Information Retrieval from classified online pages. The rapid expansion of
the Web creates an ever-increasing demand for faster and yet reliable filtering of Web-pages,
according to the information needs of users and aiming to eliminate displaying irrelevant and
harmful content. The accuracy of the classification is not the only difficulty when applying
Information Retrieval techniques on the sheer volume of documents hosted on the World
Wide Web. Accessing the most valuable data as quickly as possible raises further research
questions about the trade-off in accuracy versus the computational time required by a Web-page
classifier. Another characteristic of Web-pages is the multitude of traits (features to
be used as independent variables) that may be used for their description. The number of
attributes has a significant impact on the speed of the classifier. Therefore, managing a
broad set of features is not desirable, because it brings up the issues associated with the curse
of dimensionality.
Well-cited studies from researchers in Information Retrieval and Knowledge Management
focus on handling the typically large number of features of items and examine the balance
between reliability and speed. There are a variety of methods that can be applied to most
of the existing classification problems for reducing the feature space, namely feature-selection
and feature-reduction algorithms. However, an improper feature selection may further degrade
performance in real-time classification, now an essential aspect in many Web-based
applications. For crawling Web-pages tailored to pedagogical purposes, we firmly believe
it is fundamental to identify which online resources could be potentially useful for teaching
and learning. Our primary motivation is to improve the support offered by Technology Enhanced
Learning systems to learners and educators during their educational tasks, providing
straightforward access to a huge dataset of potential educational resources extracted from the
Web.
We propose a technique for deducing educational semantic information about potential
educational resources on the Web by analysing their content and structure, e.g., page title,
body, links, and highlights. Then, the Dandelion API, a tool for extracting semantic entities
from a text, is used for analysing the textual content of each section. We propose to use a framework
introduced in a previous contribution for performing Feature Selection, where several
state-of-the-art algorithms are grouped in an ensemble. Such an ensemble of algorithms
has the purpose of combining the many different aspects analysed by each of the methods.
The outcomes of the algorithms are combined into a score that represents the importance of
every single feature. This scoring process yields a feature ranking. As a result, the
framework enables the reduction of the features set to only a few comprehensive attributes.
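The score combination described above can be illustrated with a minimal sketch. It assumes that each feature-selection method returns one numeric score per feature, and that the scores are combined by min-max normalisation followed by a plain average; this combination rule and the names `min_max`, `ensemble_ranking`, and `top_k` are hypothetical and are not taken from the framework itself.

```python
from statistics import mean


def min_max(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_ranking(feature_names, per_method_scores):
    """Combine per-method scores into one importance score per feature.

    per_method_scores holds one list per method, aligned with feature_names.
    Averaging normalised scores is an assumption made for illustration only.
    """
    normalised = [min_max(scores) for scores in per_method_scores]
    # zip(*normalised) groups the scores that every method gave to one feature.
    combined = [mean(column) for column in zip(*normalised)]
    # Sort from most to least important to obtain the feature ranking.
    return sorted(zip(feature_names, combined), key=lambda p: p[1], reverse=True)


def top_k(feature_names, per_method_scores, k):
    """Keep only the names of the k best-ranked features."""
    ranking = ensemble_ranking(feature_names, per_method_scores)
    return [name for name, _ in ranking[:k]]
```

With such a ranking, reducing the feature set amounts to choosing a cut-off `k`; how that cut-off is selected is outside the scope of this sketch.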
We incorporate semantic technologies when processing natural language to elicit more than
100 features computed directly from the text of Web-resources. After that, we analyse our
features to discover which of these become attributes that permit a clear distinction between
resources suitable for education and those not suitable. The resulting feature set is evaluated
by performing a binary classification of items in our dataset of more than 2,300 Web-pages obtained
from the SeminarsOnly website (http://www.seminarsonly.com), and other sources
identified as relevant for teaching by surveying human instructors. We built this dataset
by labelling the aforementioned educational Web-pages as "relevant for education". Then, we
labelled as "non-relevant for education" pages crawled from the former DMOZ Web directory,
currently known as Curlie (https://curlie.org), for a total of more than 5,600 labelled
Web-pages.
Our evaluation covers learning with several representatives of the state-of-the-art of classification
algorithms. We then apply Student's t-test to strengthen the validity of the features
set deduced in this study. The t-test confirms that all the features are essential for achieving
the best accuracy in our filtering task when using any of the classifiers. Then, the framework
is evaluated in a filtering task performed on the same dataset, comparing our proposal
on both accuracy and speed against popular algorithms for feature selection and feature reduction.
In both aspects, our framework outperforms current feature reduction algorithms,
achieving more accurate and faster classification of Web-pages in several scenarios. So, we can
declare our framework suitable to be used in a purpose-driven crawling task. Smart systems
in Technology Enhanced Learning can use our proposal for retrieving an enormous amount
of resources and information ready to be used for educational purposes. For example, recommender
systems in Technology Enhanced Learning would benefit from the result of this study
for suggesting educational resources for both building and improving courses, significantly
enhancing the support provided to teachers and students. | |