Adaptable N-gram Classification Model for Data Leakage Prevention
Author(s)
Sithirasenan, Elankayer
Muthukkumarasamy, Vallipuram
Editor(s)
Wysocki, TA
Wysocki, BJ
Location
Gold Coast, AUSTRALIA
Abstract
Data confidentiality, integrity and availability are the ultimate goals of all information security mechanisms. However, most of these mechanisms do not proactively protect sensitive data; rather, they work under predefined policies and conditions to protect data in general. A few systems, such as anomaly-based intrusion detection systems (IDS), can work independently without much administrative interference, but they are not designed around the sensitivity of data. New mechanisms called data leakage prevention (DLP) systems have been developed to mitigate the risk of sensitive data leakage. Current DLPs mostly use data fingerprinting and exact and partial document matching to classify sensitive data. These approaches can have a serious limitation: they are susceptible to data misidentification. In this paper, we investigate the use of N-gram statistical analysis for data classification. Our method uses N-gram frequencies to classify documents under distinct categories. We use simple taxicab geometry to compute the similarity between documents and existing categories. Moreover, we examine the effect of removing the most common words and connecting phrases on the overall classification. We aim to compensate for the limitations of current data classification approaches used in the field of data leakage prevention. We show that our method is capable of correctly classifying up to 90.5% of the tested documents.
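The approach summarized in the abstract (N-gram frequency profiles compared with taxicab distance, with optional removal of common words) can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of word-level N-grams, the use of relative frequencies, the tiny `STOP_WORDS` list, the sample category texts, and the function names `ngram_profile`, `taxicab_distance`, and `classify` are all assumptions made for illustration.

```python
from collections import Counter

# Assumed stop-word list for illustration; the paper's own list is not given here.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}


def ngram_profile(text, n=2, remove_stop_words=False):
    """Return the relative frequency of each word n-gram in the text."""
    words = text.lower().split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def taxicab_distance(p, q):
    """Sum of absolute frequency differences over n-grams seen in either profile."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def classify(document, category_profiles, n=2, remove_stop_words=False):
    """Return the name of the category whose profile is closest to the document."""
    doc_profile = ngram_profile(document, n, remove_stop_words)
    return min(
        category_profiles,
        key=lambda name: taxicab_distance(doc_profile, category_profiles[name]),
    )


# Hypothetical category profiles built from one short sample text each.
categories = {
    "finance": ngram_profile(
        "quarterly revenue report shows revenue growth and quarterly revenue targets",
        n=1,
    ),
    "medical": ngram_profile(
        "patient record shows diagnosis and treatment plan for the patient",
        n=1,
    ),
}

label = classify("the revenue report for this quarter", categories, n=1)
print(label)
```

A smaller taxicab distance means the document's N-gram frequencies are closer to the category's, so the document is assigned to the category with the minimum distance. Setting `remove_stop_words=True` when building both the category profiles and the document profile gives a simple way to experiment with the effect of dropping common words.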
Conference Title
2013 7th International Conference on Signal Processing and Communication Systems (ICSPCS)
Subject
Other information and computing sciences not elsewhere classified