Thursday, January 01, 2009

Considerations when building queries for DLP products

Term weight is normally reduced as documents get longer. This can run counter to the goal of scanning documents for compliance issues such as PCI, because a document in which sensitive terms recur many times may represent a higher risk than a document with fewer occurrences. So when searching an inverted index, it is important not to dampen the term frequency, either by using one plus the log of the frequency, or by applying cosine normalization in a vector-based search.
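
As a rough sketch of the difference, the three scoring styles can be compared on a short document and a long one. The function names and the toy documents are made up for illustration, and the cosine variant leaves out the query norm, since it is the same for every document.

```python
import math
from collections import Counter


def raw_tf_score(query_terms, doc_tokens):
    """Sum of raw term counts: every extra occurrence adds to the score."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms)


def log_tf_score(query_terms, doc_tokens):
    """Sum of 1 + log(tf): repeated occurrences are dampened."""
    counts = Counter(doc_tokens)
    return sum(1 + math.log(counts[t]) for t in query_terms if counts[t] > 0)


def cosine_score(query_terms, doc_tokens):
    """Raw counts divided by the document vector length (length normalization)."""
    counts = Counter(doc_tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return 0.0
    return sum(counts[t] for t in query_terms) / norm


query = ["card"]
short_doc = ["card", "number"]                 # one occurrence
long_doc = ["card"] * 50 + ["filler"] * 950    # fifty occurrences in a long file

for name, score in [("raw", raw_tf_score), ("log", log_tf_score), ("cosine", cosine_score)]:
    print(name, score(query, short_doc), score(query, long_doc))
```

With raw counts the long document scores fifty times higher than the short one. The log variant shrinks that gap considerably, and the cosine variant ranks the short document above the long one, which is the opposite of what a compliance scan wants.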

However, by doing this, the terms in the query become more important. A term that occurs frequently both in a set of sensitive documents and in the corresponding set of non-sensitive documents will lead to many false positives. Because of this, the effectiveness of each term must be calculated and tracked over time. A term with low effectiveness should either be eliminated from the query or given a lower weight.
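
One possible way to put a number on effectiveness is sketched below. The measure itself is an assumption on my part: the share of sensitive documents containing the term, divided by the sum of that share and the share of non-sensitive documents containing it. A value near 1.0 means the term shows up almost only in sensitive documents, and a value near 0.5 means it does not discriminate at all. The function names, the threshold and the sample documents are hypothetical.

```python
def doc_rate(term, docs):
    """Fraction of documents that contain the term at least once."""
    if not docs:
        return 0.0
    return sum(1 for d in docs if term in d) / len(docs)


def term_effectiveness(term, sensitive_docs, non_sensitive_docs):
    """Assumed measure: sensitive rate / (sensitive rate + non-sensitive rate)."""
    s_rate = doc_rate(term, sensitive_docs)
    n_rate = doc_rate(term, non_sensitive_docs)
    if s_rate + n_rate == 0:
        return 0.0
    return s_rate / (s_rate + n_rate)


def weight_query(terms, sensitive_docs, non_sensitive_docs, drop_below=0.7):
    """Drop terms under the threshold; use effectiveness as the weight for the rest."""
    weights = {}
    for term in terms:
        eff = term_effectiveness(term, sensitive_docs, non_sensitive_docs)
        if eff >= drop_below:
            weights[term] = eff
    return weights


sensitive = [["cardholder", "account", "number"], ["cardholder", "expiry", "number"]]
non_sensitive = [["phone", "number"], ["invoice", "number"], ["meeting", "notes"]]

print(weight_query(["cardholder", "number"], sensitive, non_sensitive))
```

In this toy data "number" appears in most documents on both sides, so it falls below the threshold and is dropped, while "cardholder" is kept. Re-running the calculation as new labeled documents come in would give the tracking over time.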

Several solutions may be available here. One is to combine highly effective terms with less effective terms in a larger pattern. The question, though, is whether the distribution of terms in sensitive documents is Gaussian, with a bell curve, or whether it follows a power law. I don’t know the answer to this yet, but I have noticed in practice that the distribution of documents follows a power law. This can be used in a query strategy: an initial query with a high false negative rate is used to ferret out areas with a high probability of containing sensitive documents. Once such an area has been found, a broader query can be used within that space.
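
The two-stage strategy could look something like the sketch below. It assumes each document is tagged with the location it was found in, that the strict query requires all of its terms to be present, and that the broad query matches on any one term. The function name, the locations and the terms are invented for the example.

```python
from collections import defaultdict


def two_stage_scan(docs, strict_terms, broad_terms, min_hits=1):
    """docs is a list of (location, tokens) pairs.

    Stage 1: find locations where the strict query (all terms present) hits.
    Stage 2: run the broad query (any term present) only inside those locations.
    """
    strict = set(strict_terms)
    broad = set(broad_terms)

    hits_per_location = defaultdict(int)
    for location, tokens in docs:
        if strict <= set(tokens):
            hits_per_location[location] += 1
    hot_locations = {loc for loc, n in hits_per_location.items() if n >= min_hits}

    flagged = [
        (location, tokens)
        for location, tokens in docs
        if location in hot_locations and broad & set(tokens)
    ]
    return hot_locations, flagged


docs = [
    ("hr-server", ["cardholder", "pan", "expiry"]),
    ("hr-server", ["card", "receipt"]),
    ("wiki", ["card", "game", "rules"]),
]

hot, flagged = two_stage_scan(docs, ["cardholder", "pan"], ["card", "cardholder"])
print(hot)
print(flagged)
```

The strict query misses the receipt on the HR server, but it marks the server as a hot area, so the broad query picks the receipt up in the second pass. The wiki page also contains "card", yet it is never scanned with the broad query, which keeps the false positives down.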

A space can be geographical, such as a site; it can be logical, such as a file server supporting the HR department; or it can be a span of time. Most likely it is a combination of these, and it may have further dimensions such as user identity, frequency, etc. So far, this is a trial-and-error approach. To improve on it, large data sets would need to be collected and analyzed.
