# License Classifier v2

This is a substantial revision of the license classifier with a focus on improved accuracy and performance.

## Glossary

- corpus dictionary - contains all the unique tokens stored in the corpus of
documents to match. Any tokens in the target document that aren't in the corpus
dictionary are mapped to an invalid value.

- document - an internal-only data type that contains sequenced token information
for source or target content for matching.

- source content - a body of text that can be matched by the scanner.

- target content - the argument to Match that is scanned for matches with source
content.

- indexed document - an internal-only data type that maps a document to the
corpus dictionary, resulting in a compressed representation suitable for fast
text searching and mapping operations. An indexed document is necessarily
tightly coupled to its corpus.

- frequency table - a lookup table holding per-token counts of the number of
times a token appears in content. Used for fast filtering of target content
against different source contents.

- q-gram - a substring of content of length q tokens used to efficiently match
ranges of text. For background on the q-gram algorithms used, please see
[Indexing Methods for Approximate String Matching](https://users.dcc.uchile.cl/~gnavarro/ps/deb01.pdf).

- searchset - a data structure that uses q-grams to identify ranges of text in
the target that correspond to a range of text in the source. The searchset
algorithms compensate for the allowable error in matching text exactly, dealing
with additional or missing tokens.

## Migrating from v1

The API for the classifier versions is quite similar, but there are two key
distinctions to be aware of while migrating usages.

The confidence value for the v2 classifier is applied uniformly to results; it
will never return a match with confidence below the threshold.
In v1, MultipleMatch behaved this way, but NearestMatch would return a value
regardless of the match confidence. Users often verified that the confidence
was above the threshold, but this is no longer necessary.

The second change is that the classifier now returns all matches against the
supplied corpus. The v1 classifier allowed filtering on header matches via a
boolean field. This can be emulated by creating a license classifier with a
reduced corpus if matching against headers is not desired. Alternatively, the
user can use the MatchType field in the Match struct to filter out unwanted
matches.
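
The MatchType filtering mentioned above might look roughly like the following.
This is a minimal sketch, not the classifier's own code: the `Match` type here
is a hypothetical stand-in that models only a few fields, `dropMatchType` is an
illustrative helper name, and the `"Header"` value passed to it is an assumed
MatchType string that should be checked against the real package.

```go
package main

// Match is a hypothetical stand-in for the classifier's Match struct.
// Only the fields needed for filtering are modeled; the real struct may differ.
type Match struct {
	Name       string
	MatchType  string
	Confidence float64
}

// dropMatchType returns the matches whose MatchType differs from unwanted,
// preserving their original order. The input slice is not modified.
func dropMatchType(matches []Match, unwanted string) []Match {
	kept := make([]Match, 0, len(matches))
	for _, m := range matches {
		if m.MatchType != unwanted {
			kept = append(kept, m)
		}
	}
	return kept
}
```

Calling `dropMatchType(matches, "Header")` on the matches returned for a target
would approximate the v1 behavior of ignoring header matches, assuming header
matches carry that MatchType value. No extra confidence check is needed, since
v2 already applies the threshold to every result.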