# License Classifier v2

This is a substantial revision of the license classifier with a focus on improved accuracy and performance.

## Glossary

- corpus dictionary - contains all the unique tokens stored in the corpus of
documents to match. Any tokens in the target document that aren't in the corpus
dictionary are mapped to an invalid value.

- document - an internal-only data type that contains sequenced token information
for source or target content used in matching.

- source content - a body of text that can be matched by the scanner.

- target content - the argument to Match that is scanned for matches with source
content.

- indexed document - an internal-only data type that maps a document to the
corpus dictionary, resulting in a compressed representation suitable for fast
text searching and mapping operations. An indexed document is necessarily
tightly coupled to its corpus.

- frequency table - a lookup table holding per-token counts of the number of
times a token appears in content. Used for fast filtering of target content
against different source contents.

- q-gram - a substring of content of length q tokens used to efficiently match
ranges of text. For background on the q-gram algorithms used, please see
[Indexing Methods for Approximate String Matching](https://users.dcc.uchile.cl/~gnavarro/ps/deb01.pdf).

- searchset - a data structure that uses q-grams to identify ranges of text in
the target that correspond to a range of text in the source. The searchset
algorithms tolerate the allowable error in matching, handling additional or
missing tokens rather than requiring an exact match.

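The glossary terms above can be illustrated with a small, self-contained sketch. It is not the classifier's internal implementation: the names `dictionary`, `index`, `frequencies`, `qgrams`, and `invalidID` are hypothetical, and tokenization is assumed to be a naive lowercase split on whitespace. The sketch builds a corpus dictionary from source content, maps target content to token IDs (unknown tokens become an invalid value), counts per-token frequencies, and lists the q-grams of the indexed target together with their start offsets.

```go
package main

import (
	"fmt"
	"strings"
)

// invalidID marks tokens that are absent from the corpus dictionary.
// (Hypothetical name; the real classifier's sentinel may differ.)
const invalidID = -1

// dictionary maps each unique corpus token to a small integer ID.
type dictionary map[string]int

// buildDictionary collects the unique tokens of all source contents.
func buildDictionary(sources ...string) dictionary {
	d := dictionary{}
	for _, s := range sources {
		for _, tok := range strings.Fields(strings.ToLower(s)) {
			if _, ok := d[tok]; !ok {
				d[tok] = len(d)
			}
		}
	}
	return d
}

// index converts content into token IDs, a toy "indexed document".
// Tokens missing from the dictionary are mapped to invalidID.
func (d dictionary) index(content string) []int {
	var ids []int
	for _, tok := range strings.Fields(strings.ToLower(content)) {
		id, ok := d[tok]
		if !ok {
			id = invalidID
		}
		ids = append(ids, id)
	}
	return ids
}

// frequencies counts how often each valid token ID occurs.
func frequencies(ids []int) map[int]int {
	f := map[int]int{}
	for _, id := range ids {
		if id != invalidID {
			f[id]++
		}
	}
	return f
}

// qgrams returns every window of q consecutive token IDs, keyed by the
// window's printed form and mapped to the offsets where it starts.
func qgrams(ids []int, q int) map[string][]int {
	out := map[string][]int{}
	for i := 0; i+q <= len(ids); i++ {
		key := fmt.Sprint(ids[i : i+q])
		out[key] = append(out[key], i)
	}
	return out
}

func main() {
	dict := buildDictionary("permission is hereby granted free of charge")
	target := dict.index("Permission is hereby granted by acme")

	fmt.Println("indexed target:", target)
	fmt.Println("frequency table:", frequencies(target))
	fmt.Println("3-grams:", qgrams(target, 3))
}
```
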
## Migrating from v1

The APIs of the two classifier versions are quite similar, but there are two key
differences to be aware of when migrating usages.

The confidence threshold for the v2 classifier is applied uniformly to results;
it will never return a match whose confidence is below the threshold. In v1,
MultipleMatch behaved this way, but NearestMatch would return a value
regardless of the match confidence. Users often verified that the confidence
was above the threshold, but this is no longer necessary.

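The difference in threshold handling can be sketched as follows. This is a self-contained simulation, not the classifier's API: the `Match` type and the functions `nearestV1Style` and `matchV2Style` are hypothetical stand-ins that only mimic the behavior described above. In the v1 style the caller re-checks the confidence of the returned value; in the v2 style everything returned already meets the threshold, so that check can be dropped during migration.

```go
package main

import "fmt"

// Match is a hypothetical stand-in for a classifier result.
type Match struct {
	Name       string
	Confidence float64
}

// nearestV1Style mimics the v1 NearestMatch behavior described above:
// it returns the best candidate regardless of its confidence.
func nearestV1Style(candidates []Match) (Match, bool) {
	if len(candidates) == 0 {
		return Match{}, false
	}
	best := candidates[0]
	for _, c := range candidates[1:] {
		if c.Confidence > best.Confidence {
			best = c
		}
	}
	return best, true
}

// matchV2Style mimics the v2 behavior: only candidates whose confidence
// is at or above the threshold are returned.
func matchV2Style(candidates []Match, threshold float64) []Match {
	var kept []Match
	for _, c := range candidates {
		if c.Confidence >= threshold {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	const threshold = 0.8
	candidates := []Match{
		{Name: "MIT", Confidence: 0.95},
		{Name: "Apache-2.0", Confidence: 0.62},
	}

	// v1 style: the caller has to re-check the confidence itself.
	if m, ok := nearestV1Style(candidates); ok && m.Confidence >= threshold {
		fmt.Println("v1 accepted:", m.Name)
	}

	// v2 style: anything returned already meets the threshold.
	for _, m := range matchV2Style(candidates, threshold) {
		fmt.Println("v2 accepted:", m.Name)
	}
}
```
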
The second change is that the classifier now returns all matches against the
supplied corpus. The v1 classifier allowed filtering on header matches via a
boolean field. If matching against headers is not desired, this can be emulated
by creating a license classifier with a reduced corpus. Alternatively, the
user can use the MatchType field in the Match struct to filter out unwanted
matches.

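Filtering on the match type might look like the sketch below. The `Match` struct here is a local, simplified stand-in with only the fields needed for the example, and the type values `"License"` and `"Header"` are assumptions made for illustration; check the values your corpus actually produces before relying on them. The helper `withoutType` is likewise hypothetical.

```go
package main

import "fmt"

// Match is a simplified, hypothetical stand-in for the classifier's
// Match struct; only the fields used in this example are included.
type Match struct {
	Name      string
	MatchType string
}

// withoutType returns the matches whose MatchType differs from unwanted.
func withoutType(matches []Match, unwanted string) []Match {
	var kept []Match
	for _, m := range matches {
		if m.MatchType != unwanted {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	// Assumed example results; "License" and "Header" are illustrative values.
	results := []Match{
		{Name: "Apache-2.0", MatchType: "License"},
		{Name: "Apache-2.0", MatchType: "Header"},
		{Name: "MIT", MatchType: "License"},
	}

	// Emulate the v1 header filter by dropping header matches after the fact.
	for _, m := range withoutType(results, "Header") {
		fmt.Println(m.Name, m.MatchType)
	}
}
```
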