VQ-BASED WRITTEN LANGUAGE IDENTIFICATION
Author(s)
Tran, D
Editor(s)
B. Boashash
Location
PARIS, FRANCE
Abstract
Humans can recognize different written languages by their grammars and vocabularies, whereas computers see everything as numbers. We present a computational algorithm for machine classification of written languages using vector quantization (VQ). For a language document, each word is converted into a vector of numerical values according to its characters. The resulting collection of vectors is then represented by a codebook containing a number of template vectors, which is used for classification. The proposed method is more effective for machine learning than the n-gram-based method, which has been widely used for written language identification. Experimental results on classifying a set of five closely related languages written in Roman script show that the proposed method is promising.
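The pipeline described in the abstract (words to vectors, vectors to a per-language codebook, classification by codebook) can be illustrated with a minimal Python sketch. The abstract does not give the details, so everything specific here is an assumption made for illustration: the character encoding (a=1 to z=26, zero-padded), the fixed vector length `WORD_LEN`, plain k-means as the codebook trainer, and minimum average squared-error distortion as the decision rule. The function names are hypothetical and are not taken from the paper.

```python
import random

WORD_LEN = 8  # assumed fixed vector length; the abstract does not specify one


def word_to_vector(word, length=WORD_LEN):
    """Encode a word as a fixed-length vector (assumed: a=1..z=26, 0 = padding)."""
    codes = [ord(c) - ord("a") + 1 for c in word.lower() if "a" <= c <= "z"]
    codes = codes[:length]
    return codes + [0] * (length - len(codes))


def text_to_vectors(text):
    """Convert every word of a document to a vector, skipping empty encodings."""
    vectors = [word_to_vector(w) for w in text.split()]
    return [v for v in vectors if any(v)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebook(vectors, size, iters=20, seed=0):
    """Build a codebook of `size` template vectors with plain k-means.

    Requires size <= len(vectors). This is a stand-in trainer, not
    necessarily the one used in the paper.
    """
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(iters):
        cells = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(size), key=lambda i: sq_dist(v, codebook[i]))
            cells[nearest].append(v)
        for i, cell in enumerate(cells):
            if cell:  # keep the old codeword if its cell is empty
                codebook[i] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook


def distortion(vectors, codebook):
    """Average squared distance from each vector to its nearest codeword."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)


def identify(text, codebooks):
    """Return the language whose codebook gives the lowest average distortion.

    `codebooks` maps a language name to its trained codebook; `text` must
    contain at least one encodable word.
    """
    vectors = text_to_vectors(text)
    return min(codebooks, key=lambda lang: distortion(vectors, codebooks[lang]))
```

In use, one codebook would be trained per language from sample text, for example `codebooks = {"english": train_codebook(text_to_vectors(english_text), 64)}` with further entries for the other languages, and `identify(document, codebooks)` would then label an unseen document. The codebook size of 64 is an arbitrary illustrative value.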
Conference Title
SEVENTH INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND ITS APPLICATIONS, VOL 1, PROCEEDINGS