Monday, December 31, 2007

Pattern Matching

How do you discover patterns in sensitive data that let you not only find what you already know about, but also uncover sensitive information you didn’t know you had?

First off, you have to start with a corpus of known sensitive information. There are many algorithms to choose from. The simplest is, of course, keyword search. Then there is regular expression matching, which is typically implemented with an NFA or a DFA. You can also use exact string matching, or hash parts of the content you are looking for and see whether those hashes occur elsewhere. A new and exciting source of ideas is genetics. Genetic algorithms can show how information mutates (e.g. how information transforms when it goes from a database into email or documents).
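To make the first few options concrete, here is a minimal Python sketch of keyword search, regular expression matching, and hashing parts of known content. The keyword list, the SSN-like pattern, the four-word window, and all function names are assumptions picked for illustration, not a recommended rule set.

```python
import hashlib
import re

# Hypothetical keyword list and pattern, chosen only for illustration.
KEYWORDS = {"confidential", "salary"}
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def keyword_hits(text):
    """Return the keywords that appear as whole words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return KEYWORDS & words


def regex_hits(text):
    """Return every substring that matches the SSN-like pattern."""
    return SSN_LIKE.findall(text)


def shingle_hashes(text, size=4):
    """Hash every run of `size` consecutive words in the text."""
    words = re.findall(r"\w+", text.lower())
    hashes = set()
    for i in range(len(words) - size + 1):
        chunk = " ".join(words[i:i + size])
        hashes.add(hashlib.sha256(chunk.encode("utf-8")).hexdigest())
    return hashes


def shared_fraction(known, candidate, size=4):
    """Fraction of the known text's hashed word runs found in the candidate."""
    known_hashes = shingle_hashes(known, size)
    if not known_hashes:
        return 0.0
    overlap = known_hashes & shingle_hashes(candidate, size)
    return len(overlap) / len(known_hashes)
```

Hashing short runs of words rather than the whole document means a sensitive sentence pasted into an email can still be found even when the text around it is different. The window size is a trade-off: smaller windows catch shorter fragments but match more often by accident.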

Once you have a corpus, you can train your rules against it. It sounds straightforward, but in some instances I have seen over a million false positives on a group of computers. Rules need tweaking, and that can be time-consuming work. Unfortunately, I have not seen much automation in this area, so that is something we are currently working on.
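One simple way to see which rules need tweaking is to run each of them over a labeled corpus and count how often it fires on documents that are not sensitive. The sketch below assumes a tiny hand-labeled corpus and two hypothetical rules; the names and data are made up for illustration and say nothing about any particular product.

```python
import re
from collections import Counter

# Hypothetical rules and a tiny hand-labeled corpus, for illustration only.
RULES = {
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "nine_digits": re.compile(r"\b\d{9}\b"),
}

# Each entry is (document text, is the document actually sensitive?).
CORPUS = [
    ("Employee SSN: 123-45-6789", True),
    ("Order number 987654321 shipped", False),
    ("Tracking id 555443333 delivered", False),
    ("Lunch menu for Friday", False),
]


def score_rules(rules, corpus):
    """Count true and false positives for each rule over labeled documents."""
    stats = {name: Counter() for name in rules}
    for text, is_sensitive in corpus:
        for name, pattern in rules.items():
            if pattern.search(text):
                stats[name]["tp" if is_sensitive else "fp"] += 1
    return stats


def precision(counts):
    """Share of a rule's hits that were truly sensitive (0.0 if no hits)."""
    hits = counts["tp"] + counts["fp"]
    return counts["tp"] / hits if hits else 0.0
```

Here the loose `nine_digits` rule fires only on order and tracking numbers, which is the kind of rule that produces floods of false positives at scale. Ranking rules by precision on the corpus gives a starting point for deciding which ones to tighten first.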

A good resource on pattern matching can be found here: http://www.cs.ucr.edu/~stelo/pattern.html
