xref: /aosp_15_r20/external/rappor/doc/data-flow.md (revision 2abb31345f6c95944768b5222a9a5ed3fc68cc00)
1RAPPOR Data Flow
2================
3
4This doc explains the simulation tools and data formats in the [RAPPOR
5repository](https://github.com/google/rappor).  We'll focus on the code, and
6describe the algorithm only informally.  For details, see the [paper][].
7
8Overview
9--------
10
11Start with this command:
12
13    $ ./demo.sh run
14
15It takes a minute or so to run.  The dependencies listed in the
16[README][] must be installed.  At the end, it will say:
17
18    Wrote _tmp/report.html.  Open this in your browser.
19
20It should look like [this][example].
21
22The following diagram shows what processes and files are involved in the demo.
23Ovals represent **processes**; rectangles represent **data**.  The dotted lines
24denote components that are involved in the simulation, but wouldn't be used in
25a "real" setting.
26
27In most configurations, reporting (in blue) is done by client machines, while
28analysis (in green) is done by a server.
29
30<img src="data-flow.png" alt="Diagram of RAPPOR Data Flow" />
31
32In the simulation, reporting consists of these steps:
33
34  1. Generate simulated input data with different distributions.
35  2. Obscure each value with the RAPPOR privacy-preserving reporting mechanism.
36
37Analysis consists of these steps:
38
39  1. Aggregate the reports by summing bits (i.e. make a counting Bloom filter)
40  2. Come up with candidate strings, and hash them in the same manner as the
41  client.
42  3. Using the reports, RAPPOR parameters, and candidate strings as input,
43  infer the distribution of true values.  We don't see the values themselves.
44  We plot the true and inferred distributions side by side for comparison.
45
46This process is described in detail below.
47
481. Generating Simulated Input
49-----------------------------
50
51The `tests/gen_sim_input.py` tool generates CSV data, like this:
52
53<!-- TODO: a realistic data set would be nice? How could we generate one?  -->
54
55**exp.csv**
56
57    client, true_value
58    1,      v6
59    1,      v3
60    1,      v3
61    1,      v5
62    1,      v13
63    1,      v1
64    1,      v8
65    2,      v2
66    2,      v3
67    2,      v1
68    2,      v8
69    2,      v1
70    2,      v30
71    2,      v10
72    3,      v4
73    ...
74
75*(spaces added for clarity)*
76
77By default we generate 700,000 rows: 7 random values from `v1` to `v50` for
78each client.  These can be thought of as a variable being reported over time.
79
80We're simulating an environment where there are many RAPPOR clients, and a
81single server does the RAPPOR analysis on the accumulated data.
82
83The `client` is represented by an integer ID.  The `true_value` should **not**
84be sent over the network because we wish to preserve the client's privacy.
85
86
872. RAPPOR Reporting
88-------------------
89
90The `tests/rappor_sim.py` tool uses the Python client library
91(`client/python/rappor.py`) to obscure the `v1` .. `vN` strings.  We want to
92infer the distribution of these strings over the entire population, but we
93don't want to know any individual values.
94
95After the RAPPOR transformation, we get another CSV file with 700,000 rows.
96Each client is assigned a cohort.
97
98**exp_out.csv**
99
100    client, cohort, rappor
101    1,      63,     1111101011110111
102    1,      15,     1110110011111100
103    1,      12,     0110101111100101
104    1,       0,     1111100111110111
105    1,       3,     1001110111110011
106    1,      14,     1011111010110011
107    1,      33,     0111010100101011
108    2,      40,     0011011010101001
109    2,      35,     1010110101110100
110    2,      58,     1110110110111110
111    2,      38,     0010001111001010
112    2,       5,     1110111011100101
113    2,      36,     0111010100111111
114    2,      39,     0101101000101101
115    3,      32,     0011100111111110
116    ...
117
118*(spaces added for clarity)*
119
120We also get a one-row CSV file that contains the RAPPOR parameters:
121
122**exp_params.csv**
123
124    k,h,m,p,q,f
125    16,2,64,0.5,0.75,0.5
126
127These are described in the [paper][]. The parameters that the clients use
128must be known to the server, or the decoding will fail.  In addition, all
129clients must use the same parameters for a given variable.
130
131The `rappor_sim.py` process also writes these files:
132
133- `exp_hist.csv`: The true histogram of input values.  This is used only for
134  comparison.  In the real world you obviously won't have this.
135- `exp_true_inputs.txt`: A list of the unique values reported, e.g. `v1` ..
136  `v50`.  You won't have this either, in general.  To use RAPPOR, you must
137  supply *candidate strings*, described below.
138
1393. Report Aggregation
140---------------------
141
142`sum_bits.py` takes the `exp_out.csv` output, and produces the "counts" file:
143
144**exp_counts.csv**
145
146    11116,6273,6433,6347,6385,6290,6621,6359,6747,6623,6321,6696,6282,6652,6368,6286,6222
147    10861,6365,6263,6170,6258,6107,6633,6171,6226,6123,6286,6254,6408,6182,6442,6195,6187
148    ...
149
150The file has 64 rows, because the simulation has 64 cohorts by default (`m =
15164`).  This parameter should be adjusted based on the number of unique true
152values expected.  <!-- TODO: more detail -->
153
154There are 17 columns.  The left-most column is the total number of reports in
155the cohort.  The remaining 16 columns correspond to the `k = 16` bits in the
156Bloom filter.  Each column contains the number of reports with that bit set
157to 1.
158
159So, in general, the "counts" file is a `(k+1) * m` matrix.
160
1614. Candidate Strings
162--------------------
163
164In the simulation, we assume that the analyst will come up with a *superset* of
165the candidate strings.  This is done in the `more-candidates` /
166`print-candidates` functions in `demo.sh`.
167
168You can also test what happens if you omit true strings from the candidates, by
169editing the invocation of `print-candidates` in `run-dist`:
170
171    # Example of omitting true values.  Generate candidates from
172    # exp_true_inputs.txt, omitting values v1 and v2.
173
174    print-candidates $dist 'v1|v2'  > _tmp/${dist}_candidates.txt
175
176In general, coming up with candidates is an application- or metric-specific
177process.
178
179The candidates are hashed by `hash_candidates.py` to create the "map" file,
180before being passed to R for analysis.
181
182**exp_map.csv**
183
184    v1,8,13,30,22,37,37,53,53,77,67,89,86,97,97,118,128,139,136,157,<truncated>
185    v10,13,2,25,28,42,45,58,60,68,66,91,89,108,102,113,125,130,131,<truncated>
186
187The map file has one row per candidate.  In this case, there are 60 rows:
18850 for the true values and 10 for "fake" values, which make the candidates a
189superset of the true input.
190
191The left most column is the raw candidate string.  Then there are 128 more
192columns: for `m = 64` cohorts times `k = 2` hash functions in the Bloom filter.
193
194<!-- TODO: more detail about setting params?  Examples of coming up with
195candidate strings? -->
196
1975. RAPPOR Analysis
198------------------
199
200Once you have the `counts`, `params`, and `map` files, you can pass it to the
201`tests/analyze.R` tool, which is a small wrapper around the `analyze/R`
202library.
203
204You will get a plot of the true distribution vs. the distribution recovered
205with the RAPPOR privacy algorithm.
206
207[View the example output][example].
208
209You can change the simulation parameters and RAPPOR parameters via flags, and
210compare the resulting distributions.
211
212For example, if you expect more unique values from clients, you should also use
213more cohorts (i.e. raise `m`), to prevent hash function collisions from
214degrading the result quality.
215
216<!-- TODO:
217     - how to change flags
218     - more detail on what the various parameters do
219     - association analysis
220     - basic RAPPOR
221     - longitudinal privacy
222-->
223
224Conclusion
225----------
226
227RAPPOR allows you infer statistics about populations while preserving the
228privacy of individual clients.  In this doc, we walked through a simple demo.
229However, there are other variations of RAPPOR and settings in which you can use
230RAPPOR, which we'll write more about.
231
232Feel free to send feedback on this doc to
233[[email protected]](https://groups.google.com/forum/#!forum/rappor-discuss).
234
235
236[README]: https://github.com/google/rappor/blob/master/README.md
237[paper]: http://arxiv.org/abs/1407.6981
238[example]: http://google.github.io/rappor/examples/report.html
239
240