Adaptable N-gram Classification Model for Data Leakage Prevention
Data confidentiality, integrity and availability are the ultimate goals of all information security mechanisms. However, most of these mechanisms do not proactively protect sensitive data; rather, they work under predefined policies and conditions to protect data in general. A few systems, such as anomaly-based intrusion detection systems (IDS), might work independently without much administrative interference, but they are not dedicated to the sensitivity of data. New mechanisms called data leakage prevention systems (DLPs) have been developed to mitigate the risk of sensitive data leakage. Current DLPs mostly use data fingerprinting and exact and partial document matching to classify sensitive data. These approaches can have a serious limitation because they are susceptible to data misidentification. In this paper, we investigate the use of N-gram statistical analysis for data classification. Our method uses N-gram frequencies to classify documents under distinct categories. We use simple taxicab geometry to compute the similarity between documents and existing categories. Moreover, we examine the effect of removing the most common words and connecting phrases on the overall classification. We aim to compensate for the limitations of current data classification approaches used in the field of data leakage prevention. We show that our method is capable of correctly classifying up to 90.5% of the tested documents.
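The general idea described in the abstract (N-gram frequency profiles compared with a taxicab, or L1, distance) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the N-gram unit, the value of N, the normalisation, or the stop-word list, so the sketch assumes character-level trigrams, relative frequencies, one pooled reference text per category, and a small hypothetical stop-word set. The function names `ngram_profile`, `taxicab_distance` and `classify` are illustrative only.

```python
from collections import Counter

# Hypothetical stop-word list (assumption); the list used in the paper is not given.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}


def ngram_profile(text, n=3, remove_stop_words=False):
    """Return the relative frequency of each character N-gram in `text`.

    Character-level N-grams and n=3 are assumptions made for this sketch.
    """
    words = text.lower().split()
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    cleaned = " ".join(words)
    counts = Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def taxicab_distance(p, q):
    """Taxicab (L1) distance between two N-gram frequency profiles."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in p.keys() | q.keys())


def classify(document, category_texts, n=3, remove_stop_words=False):
    """Assign `document` to the category whose profile is nearest in L1 distance.

    `category_texts` maps a category name to one pooled reference text
    (an assumption; a real system would build profiles from many documents).
    """
    doc_profile = ngram_profile(document, n, remove_stop_words)
    distances = {
        name: taxicab_distance(doc_profile, ngram_profile(text, n, remove_stop_words))
        for name, text in category_texts.items()
    }
    best = min(distances, key=distances.get)
    return best, distances


if __name__ == "__main__":
    # Toy reference texts, invented purely for illustration.
    categories = {
        "finance": "quarterly revenue profit earnings revenue forecast budget",
        "medical": "patient diagnosis treatment clinical patient hospital",
    }
    label, distances = classify("revenue forecast and budget", categories)
    print(label, distances)
```

A smaller distance means the document's N-gram distribution is closer to that category's profile. Passing `remove_stop_words=True` gives a rough way to explore the effect of removing common words that the abstract mentions, again only under the assumptions stated above.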
Signal Processing and Communication Systems (ICSPCS), 2013 7th International Conference on
Information and Computing Sciences not elsewhere classified