Friday, January 25, 2008

Context and uniqueness: a method for finding the proverbial needle in the haystack. How do you find IP among large quantities of content that closely resembles IP but is not IP? This is the false positive problem of searching content for IP, or in other words, how to avoid a million-plus false positives when scanning a large body of content that contains both.

The problem of context is easy to see in this example: "He wanted more chips." Without knowing his location or situation, it is impossible to determine whether he wanted potato chips, wood chips, or poker chips. Only when you know the situation in which the want arose can you tell which type of chips he wanted. Now look at this example: "He was outside cooking meat in his smoker. He wanted more chips." Here it becomes apparent which type of chips he wanted.

Context copied from Merriam-Webster OnLine: http://www.m-w.com/dictionary/context
Etymology:
Middle English, weaving together of words, from Latin contextus connection of words, coherence, from contexere to weave together, from com- + texere to weave — more at technical
Date:
circa 1568
1 : the parts of a discourse that surround a word or passage and can throw light on its meaning
2 : the interrelated conditions in which something exists or occurs : environment, setting

Consider a piece of source code such as this (randomly chosen) sample from MSDN: http://msdn2.microsoft.com/en-us/library/ak5wyby1(VS.80).aspx

// attr_implements.idl
import "docobj.idl";
[ version(1.0), uuid(0ed71801-a1b6-3178-af3b-9431fc00185e) ]
library odod
{
importlib("stdole2.tlb");
importlib("olepro32.dll");

[
object,
uuid(1AECC9BB-2104-3723-98B8-7CC54722C7DD)
]
interface IBar1 {
[id(1)] HRESULT bar1();
};

[
dual,
uuid(1AECCABB-2104-3723-98B8-7CC54722C7DD)
]
interface IBar2 {
[id(1)] HRESULT bar2();
};

[
uuid(1AECC9CC-2104-3723-98B8-7CC54722C7DD)
]
dispinterface ISna {
properties:

methods:
[id(1)] HRESULT sna();
};

[
uuid(159A9BBB-E5F1-33F6-BEF5-6CFAD7A5933F),
version(1.0)
]
coclass CBar {
interface IBar1;
interface IBar2;
dispinterface ISna;
};
}

What could you use to distinguish this piece of publicly available code from internal code? Can you read context out of this text? What is unique? If you have a bunch of these kinds of files, they all start to look very much the same, since C++ is a very structured language and most developers reuse code, which means that a given segment can be found in many files. So the real question becomes: how can you find something unique in the internally developed code that you want to protect, as opposed to something that is publicly available? The same is of course true if you want to scan your internal code for open source code, or for code covered by copyright.

One thing this piece of code is almost devoid of is comments. Comments are a promising place to look for something unique, but the problem is that source code also tends to contain boilerplate comments, which are of course useless for identifying source code IP. So you have to do some intelligent searching to find what is truly unique.
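As a rough illustration of pulling comments out of a file and setting aside the boilerplate ones, here is a minimal Python sketch. The regular expression, the `BOILERPLATE_MARKERS` list, and the function names are my own assumptions for illustration. A real scanner would need a proper lexer, since this one does not skip comment-like text inside string literals.

```python
import re

# Matches C-style "//" line comments and "/* ... */" block comments.
# Simplification: comment markers inside string literals are not skipped.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)

# Hypothetical markers of boilerplate comments; tune these for your own code base.
BOILERPLATE_MARKERS = ("copyright", "all rights reserved", "license")


def extract_comments(source: str) -> list[str]:
    """Return the text of every comment in source, with markers removed."""
    comments = []
    for match in COMMENT_RE.finditer(source):
        text = match.group(0)
        # Drop the "//" prefix, or the "/*" and "*/" delimiters.
        text = text[2:] if text.startswith("//") else text[2:-2]
        # Remove the leading "*" decoration on block-comment lines.
        lines = [line.strip().lstrip("*").strip() for line in text.splitlines()]
        text = " ".join(line for line in lines if line)
        if text:
            comments.append(text)
    return comments


def drop_boilerplate(comments: list[str]) -> list[str]:
    """Keep only comments that contain none of the boilerplate markers."""
    return [
        c for c in comments
        if not any(marker in c.lower() for marker in BOILERPLATE_MARKERS)
    ]


sample = (
    "// attr_implements.idl\n"
    "/* Copyright (c) Example Corp.\n"
    " * All rights reserved. */\n"
    "int x; // retry the frobnicator handshake twice\n"
)
candidates = drop_boilerplate(extract_comments(sample))
```

With the made-up `sample` above, `candidates` keeps the file-name comment and the handshake comment and drops the copyright notice. A fixed marker list only catches the obvious boilerplate; the rest still has to be weeded out by checking how common each comment is.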

The method I propose is as follows:

1. Search through each file

2. Extract the comments

3. Search a commercially available search engine for each comment and record the number of hits

If there are no hits, the probability is high that the text combination is unique: mark as value 9

If the number of hits is low, the probability is medium to high, depending on the number of hits and their closeness to the initial search term: mark as value 5

If the number of hits is high, the probability is low: mark as value 1

If there is more than one comment, then add the values

4. Create search term including the high value terms

5. Test against corpus of known publicly available source code

6. Count false positives

If the number of false positives is greater than X, then discard the search term

7. Test against corpus of known internal code

8. Count false negatives

If the number of false negatives is greater than X, then go to step 4 and add to the search term
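The scoring and refinement steps above can be sketched in Python as follows. This is only one possible reading of the method, under several assumptions of mine: the search engine is represented by a `hit_count` callable supplied by the caller (no real search API is assumed), `LOW_HIT_LIMIT` is an arbitrary cutoff between "low" and "high", a search term is a list of phrases that matches a document containing any one of them, and "discard" means dropping the phrase that caused too many false positives.

```python
from typing import Callable, Sequence

LOW_HIT_LIMIT = 100  # hypothetical cutoff between a "low" and a "high" hit count


def score_comment(hits: int) -> int:
    """Step 3: map a search engine hit count to a uniqueness value."""
    if hits == 0:
        return 9  # no hits: probably unique
    if hits <= LOW_HIT_LIMIT:
        return 5  # few hits: possibly unique
    return 1      # many hits: probably boilerplate or public text


def score_file(comments: Sequence[str], hit_count: Callable[[str], int]) -> int:
    """Step 3, last rule: add the values of all comments in a file."""
    return sum(score_comment(hit_count(c)) for c in comments)


def matches(term: Sequence[str], document: str) -> bool:
    """A term matches a document if any of its phrases occurs in it."""
    return any(phrase in document for phrase in term)


def refine_search_term(
    comments: Sequence[str],
    hit_count: Callable[[str], int],
    public_corpus: Sequence[str],
    internal_corpus: Sequence[str],
    max_false_positives: int,
    max_false_negatives: int,
) -> tuple[list[str], int]:
    """Steps 4 to 8: grow a term from high value phrases, testing both corpora."""
    ranked = sorted(comments, key=lambda c: score_comment(hit_count(c)), reverse=True)
    term: list[str] = []
    false_negatives = len(internal_corpus)
    for phrase in ranked:
        candidate = term + [phrase]
        # Steps 5 and 6: public documents that match are false positives.
        false_positives = sum(matches(candidate, d) for d in public_corpus)
        if false_positives > max_false_positives:
            continue  # discard this phrase
        term = candidate
        # Steps 7 and 8: internal documents that do not match are false negatives.
        false_negatives = sum(not matches(term, d) for d in internal_corpus)
        if false_negatives <= max_false_negatives:
            break  # good enough, stop adding phrases
    return term, false_negatives


# Usage with made-up hit counts standing in for a real search engine.
hits = {
    "retry the frobnicator handshake twice": 0,
    "initialize variables": 50000,
    "flush quux cache before commit": 12,
}
public_docs = ["int main() { /* initialize variables */ }", "void f() {}"]
internal_docs = [
    "// retry the frobnicator handshake twice\nvoid a();",
    "// flush quux cache before commit\nvoid b();",
]
term, missed = refine_search_term(
    list(hits), hits.__getitem__, public_docs, internal_docs, 0, 0
)
```

The caller should check the returned false negative count, because the loop can run out of phrases before the threshold is met. Using separate limits for false positives and false negatives is a choice of the sketch; the method as described uses a single X for both.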

This method could also be used for identifying other types of IP, or business intelligence. I believe it would also be very helpful in identifying pre-release marketing material as well as financial or legal documents. For marketing material, you would have to look for unique phrases, words, or combinations thereof. For legal documents, you would have to look for what makes the document unique. Legal documents can also contain large amounts of boilerplate text, so this method would work well here too.
