Information Protection

Thursday, January 01, 2009

Considerations when building queries for DLP products

Term weight is normally reduced the longer the document is. This can be counterintuitive when scanning documents for compliance issues such as PCI, because a document in which sensitive terms recur may carry a higher risk than a document with fewer occurrences. So when searching an inverted index, it is important not to dampen the scale, either by taking one plus the log of the term frequency or by using cosine normalization in a vector-based search.
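To make the difference concrete, here is a minimal sketch comparing three ways of scoring a document against a query: raw term counts, one plus the log of the count, and cosine normalization. The function names, the whitespace tokenizer, and the sample documents are assumptions made for illustration; they do not come from any particular DLP product.

```python
import math
from collections import Counter


def term_counts(text):
    """Count whitespace-separated, lower-cased tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def raw_score(query_terms, text):
    """Sum of raw term counts: every repeated occurrence adds to the score."""
    counts = term_counts(text)
    return sum(counts[t] for t in query_terms)


def log_score(query_terms, text):
    """Sum of 1 + log(count): repeated occurrences are dampened."""
    counts = term_counts(text)
    return sum(1 + math.log(counts[t]) for t in query_terms if counts[t] > 0)


def cosine_score(query_terms, text):
    """Cosine similarity between a binary query vector and the count vector.

    Assumes query_terms contains no duplicates.
    """
    counts = term_counts(text)
    doc_norm = math.sqrt(sum(c * c for c in counts.values()))
    if doc_norm == 0 or not query_terms:
        return 0.0
    query_norm = math.sqrt(len(query_terms))
    return sum(counts[t] for t in query_terms) / (doc_norm * query_norm)


query = ["card", "visa"]
short_doc = "card visa invoice"
# A long document in which the sensitive terms recur 50 times each.
long_doc = " ".join(["card visa"] * 50 + ["filler"] * 200)

for name, scorer in [("raw", raw_score), ("log", log_score), ("cosine", cosine_score)]:
    print(name, scorer(query, short_doc), scorer(query, long_doc))
```

With these sample documents the raw count separates the two documents by a wide margin, the log variant compresses that gap, and cosine normalization ranks the long document below the short one, which is the opposite of what a compliance scan wants.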

However, by doing this, the terms in the query become more important. A term that occurs frequently both in a set of sensitive documents and in the corresponding set of non-sensitive documents will lead to many false positives. Because of this, the effectiveness of each term must be calculated and stored over time. A term with low effectiveness should either be eliminated from the query or given a lower weight.
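One possible way to put a number on term effectiveness is sketched below. The definition used here (the share of a term's document hit rate that falls in the sensitive set) is an assumption chosen for illustration, as are the function names, the 0.6 cutoff, and the sample documents; other measures would fit the idea equally well.

```python
def document_rate(term, docs):
    """Fraction of documents in docs that contain the term at least once."""
    if not docs:
        return 0.0
    hits = sum(1 for d in docs if term in d.lower().split())
    return hits / len(docs)


def effectiveness(term, sensitive_docs, other_docs):
    """Hypothetical measure: 1.0 means the term appears only in sensitive
    documents, 0.5 means it is equally common in both sets."""
    s = document_rate(term, sensitive_docs)
    o = document_rate(term, other_docs)
    if s + o == 0:
        return 0.0
    return s / (s + o)


def weight_query(terms, sensitive_docs, other_docs, min_effectiveness=0.6):
    """Drop terms below the cutoff and weight the rest by their effectiveness."""
    weights = {}
    for t in terms:
        e = effectiveness(t, sensitive_docs, other_docs)
        if e >= min_effectiveness:
            weights[t] = e
    return weights


sensitive = ["cardholder pan expiry account", "pan cvv account"]
other = ["account meeting notes", "lunch menu account"]

weights = weight_query(["pan", "cvv", "account"], sensitive, other)
print(weights)
```

In this sample, "account" appears in every document of both sets, so its effectiveness is 0.5 and it is dropped from the query. Recomputing and storing these values as newly labelled documents arrive would give the history over time that the paragraph above calls for.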

Several solutions may be available here. One is to combine highly effective terms with less effective terms in a larger pattern. The question, though, is whether the distribution of terms in sensitive documents is Gaussian, with a bell curve, or whether the terms follow a power law distribution. I don’t know the answer to this question yet, but I have noticed in practice that the distribution of documents follows a power law. This can be used in a query strategy: a narrow initial query, one that tolerates a high false negative rate in exchange for few false positives, is used first to ferret out areas with a high probability of containing sensitive documents. A broader query can then be used within those areas.
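The two-stage strategy might be sketched as follows. Everything here is a simplified assumption: spaces are modelled as a dictionary from a name to a list of document strings, the narrow query requires all of its terms to be present, the broad query requires any one of them, and the names `two_stage_scan`, `narrow_terms`, `broad_terms`, and `min_hits` are made up for the example.

```python
def matches_all(doc, terms):
    """Narrow match: every term must be present in the document."""
    words = set(doc.lower().split())
    return all(t in words for t in terms)


def matches_any(doc, terms):
    """Broad match: a single term is enough."""
    words = set(doc.lower().split())
    return any(t in words for t in terms)


def two_stage_scan(spaces, narrow_terms, broad_terms, min_hits=1):
    """Stage 1 finds spaces where the narrow query hits at least min_hits times.
    Stage 2 runs the broad query only inside those spaces."""
    hot_spaces = [
        name for name, docs in spaces.items()
        if sum(matches_all(d, narrow_terms) for d in docs) >= min_hits
    ]
    return {
        name: [d for d in spaces[name] if matches_any(d, broad_terms)]
        for name in hot_spaces
    }


spaces = {
    "hr_share": ["pan cvv expiry", "salary account list"],
    "marketing": ["account brochure", "event plan"],
}

results = two_stage_scan(spaces, narrow_terms=["pan", "cvv"], broad_terms=["account", "pan"])
print(results)
```

Only "hr_share" passes the narrow first stage, so the broad query never runs against "marketing", even though one of its documents contains "account". That is how the broad query's false positives are kept confined to the spaces already judged likely to hold sensitive documents.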

A space can be geographical, such as a site; it can be logical, such as a file server supporting the HR department; or it can be a span of time. Most likely it is a combination of these, and it may have further dimensions such as user identity, frequency, etc. So far, this is a trial-and-error approach. To improve on it, large data sets would need to be collected and analyzed.
