<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <style type="text/css">
      code { color: green; }
      pre { margin-left: 3em; }
    </style>
    <!-- INSERT LATCH JS -->
  </head>
  <body style="margin: 0 auto; width: 40em; text-align: left;">
    <!-- INSERT LATCH HTML -->
<h1>RAPPOR Data Flow</h1>

<p>This doc explains the simulation tools and data formats in the <a href="https://github.com/google/rappor">RAPPOR
repository</a>.  We'll focus on the code, and
describe the algorithm only informally.  For details, see the <a href="http://arxiv.org/abs/1407.6981">paper</a>.</p>

<h2>Overview</h2>

<p>Start with this command:</p>

<pre><code>$ ./demo.sh run
</code></pre>

<p>It takes a minute or so to run.  The dependencies listed in the
<a href="https://github.com/google/rappor/blob/master/README.md">README</a> must be installed.  At the end, it will say:</p>

<pre><code>Wrote _tmp/report.html.  Open this in your browser.
</code></pre>

<p>It should look like <a href="http://google.github.io/rappor/examples/report.html">this</a>.</p>

<p>The following diagram shows what processes and files are involved in the demo.
Ovals represent <strong>processes</strong>; rectangles represent <strong>data</strong>.  The dotted lines
denote components that are involved in the simulation, but wouldn't be used in
a "real" setting.</p>

<p>In most configurations, reporting (in blue) is done by client machines, while
analysis (in green) is done by a server.</p>

<p><img src="data-flow.png" alt="Diagram of RAPPOR Data Flow" /></p>

<p>In the simulation, reporting consists of these steps:</p>

<ol>
<li>Generate simulated input data with different distributions.</li>
<li>Obscure each value with the RAPPOR privacy-preserving reporting mechanism.</li>
</ol>

<p>Analysis consists of these steps:</p>

<ol>
<li>Aggregate the reports by summing bits (i.e. make a counting Bloom filter).</li>
<li>Come up with candidate strings, and hash them in the same manner as the
client.</li>
<li>Using the reports, RAPPOR parameters, and candidate strings as input,
infer the distribution of true values.  We don't see the values themselves.
We plot the true and inferred distributions side by side for comparison.</li>
</ol>

<p>This process is described in detail below.</p>

<h2>1. Generating Simulated Input</h2>

<p>The <code>tests/gen_sim_input.py</code> tool generates CSV data, like this:</p>

<!-- TODO: a realistic data set would be nice? How could we generate one?  -->

<p><strong>exp.csv</strong></p>

<pre><code>client, true_value
1,      v6
1,      v3
1,      v3
1,      v5
1,      v13
1,      v1
1,      v8
2,      v2
2,      v3
2,      v1
2,      v8
2,      v1
2,      v30
2,      v10
3,      v4
...
</code></pre>

<p><em>(spaces added for clarity)</em></p>

<p>By default we generate 700,000 rows: 7 random values from <code>v1</code> to <code>v50</code> for
each client.  These can be thought of as a variable being reported over time.</p>
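<p>The exact distributions are configurable; as a rough sketch of the format (uniform draws, with hypothetical constants much smaller than the demo's 700,000 rows), the generator does something like this:</p>

```python
import csv
import random

# Hypothetical constants; the demo uses 100,000 clients * 7 reports.
NUM_CLIENTS = 1000
REPORTS_PER_CLIENT = 7
NUM_VALUES = 50

def gen_sim_input(path):
    # Each client reports several values over time; here they are drawn
    # uniformly from v1..v50 (the real tool supports other distributions).
    with open(path, 'w') as f:
        w = csv.writer(f)
        w.writerow(['client', 'true_value'])
        for client in range(1, NUM_CLIENTS + 1):
            for _ in range(REPORTS_PER_CLIENT):
                w.writerow([client, 'v%d' % random.randint(1, NUM_VALUES)])

gen_sim_input('exp.csv')
```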

<p>We're simulating an environment where there are many RAPPOR clients, and a
single server does the RAPPOR analysis on the accumulated data.</p>

<p>The <code>client</code> is represented by an integer ID.  The <code>true_value</code> should <strong>not</strong>
be sent over the network because we wish to preserve the client's privacy.</p>

<h2>2. RAPPOR Reporting</h2>

<p>The <code>tests/rappor_sim.py</code> tool uses the Python client library
(<code>client/python/rappor.py</code>) to obscure the <code>v1</code> .. <code>vN</code> strings.  We want to
infer the distribution of these strings over the entire population, but we
don't want to know any individual values.</p>

<p>After the RAPPOR transformation, we get another CSV file with 700,000 rows.
Each client is assigned a cohort.</p>

<p><strong>exp_out.csv</strong></p>

<pre><code>client, cohort, rappor
1,      63,     1111101011110111
1,      15,     1110110011111100
1,      12,     0110101111100101
1,       0,     1111100111110111
1,       3,     1001110111110011
1,      14,     1011111010110011
1,      33,     0111010100101011
2,      40,     0011011010101001
2,      35,     1010110101110100
2,      58,     1110110110111110
2,      38,     0010001111001010
2,       5,     1110111011100101
2,      36,     0111010100111111
2,      39,     0101101000101101
3,      32,     0011100111111110
...
</code></pre>

<p><em>(spaces added for clarity)</em></p>

<p>We also get a one-row CSV file that contains the RAPPOR parameters:</p>

<p><strong>exp_params.csv</strong></p>

<pre><code>k,h,m,p,q,f
16,2,64,0.5,0.75,0.5
</code></pre>

<p>These are described in the <a href="http://arxiv.org/abs/1407.6981">paper</a>.  The parameters that the clients use
must be known to the server, or the decoding will fail.  In addition, all
clients must use the same parameters for a given variable.</p>
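<p>To make the parameters concrete, here is a sketch of the client-side encoding.  The hash scheme below is illustrative, not the one <code>rappor.py</code> actually uses (the real client also memoizes the permanent randomized response per value, for longitudinal privacy), but it shows where <code>k</code>, <code>h</code>, <code>p</code>, <code>q</code>, and <code>f</code> enter:</p>

```python
import hashlib
import random

def bloom_bits(value, cohort, num_hashes=2, num_bits=16):
    # Map a value to its Bloom filter bits for this cohort.  Each cohort
    # hashes differently, which spreads collisions across cohorts.
    bits = []
    for h in range(num_hashes):
        digest = hashlib.md5(('%d:%s:%d' % (cohort, value, h)).encode()).digest()
        bits.append(digest[0] % num_bits)
    return bits

def rappor_encode(value, cohort, k=16, h=2, p=0.5, q=0.75, f=0.5):
    # 1. Signal: set the Bloom filter bits for this value.
    b = [0] * k
    for bit in bloom_bits(value, cohort, h, k):
        b[bit] = 1
    # 2. Permanent randomized response: with probability f, replace each
    #    bit with a uniform random bit (so f/2 chance each of 0 and 1).
    prr = [bit if random.random() >= f else random.randint(0, 1) for bit in b]
    # 3. Instantaneous randomized response: report 1 with probability q
    #    if the permanent bit is 1, and with probability p if it is 0.
    return [1 if random.random() < (q if bit else p) else 0 for bit in prr]
```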

<p>The <code>rappor_sim.py</code> process also writes these files:</p>

<ul>
<li><code>exp_hist.csv</code>: The true histogram of input values.  This is used only for
comparison.  In the real world you obviously won't have this.</li>
<li><code>exp_true_inputs.txt</code>: A list of the unique values reported, e.g. <code>v1</code> ..
<code>v50</code>.  You won't have this either, in general.  To use RAPPOR, you must
supply <em>candidate strings</em>, described below.</li>
</ul>
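<p>The true histogram is just a count of each value in the simulated input; a minimal sketch, assuming the <code>exp.csv</code> format shown earlier:</p>

```python
import collections
import csv

def true_histogram(path):
    # Count how often each true value appears across all clients' reports.
    counts = collections.Counter()
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the 'client, true_value' header
        for _client, value in reader:
            counts[value.strip()] += 1
    return counts
```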

<h2>3. Report Aggregation</h2>

<p><code>sum_bits.py</code> takes the <code>exp_out.csv</code> output, and produces the "counts" file:</p>

<p><strong>exp_counts.csv</strong></p>

<pre><code>11116,6273,6433,6347,6385,6290,6621,6359,6747,6623,6321,6696,6282,6652,6368,6286,6222
10861,6365,6263,6170,6258,6107,6633,6171,6226,6123,6286,6254,6408,6182,6442,6195,6187
...
</code></pre>

<p>The file has 64 rows, because the simulation has 64 cohorts by default (<code>m =
64</code>).  This parameter should be adjusted based on the number of unique true
values expected.  <!-- TODO: more detail --></p>

<p>There are 17 columns.  The left-most column is the total number of reports in
the cohort.  The remaining 16 columns correspond to the <code>k = 16</code> bits in the
Bloom filter.  Each column contains the number of reports with that bit set
to 1.</p>

<p>So, in general, the "counts" file is an <code>m x (k+1)</code> matrix: one row per
cohort, with the report total followed by <code>k</code> bit counts.</p>
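<p>The aggregation itself is a simple sum over reports.  A sketch of the computation (assuming the leftmost character of each report string corresponds to the first bit-count column, which may differ from the actual tool's column order):</p>

```python
import csv

def sum_bits(reports_path, num_cohorts=64, num_bits=16):
    # One row per cohort: [total reports, count with bit 1 set, ...].
    counts = [[0] * (num_bits + 1) for _ in range(num_cohorts)]
    with open(reports_path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the 'client, cohort, rappor' header
        for _client, cohort, rappor in reader:
            row = counts[int(cohort)]
            row[0] += 1  # total number of reports in this cohort
            for i, bit in enumerate(rappor.strip()):
                if bit == '1':
                    row[i + 1] += 1
    return counts
```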

<h2>4. Candidate Strings</h2>

<p>In the simulation, we assume that the analyst will come up with a <em>superset</em> of
the candidate strings.  This is done in the <code>more-candidates</code> /
<code>print-candidates</code> functions in <code>demo.sh</code>.</p>

<p>You can also test what happens if you omit true strings from the candidates, by
editing the invocation of <code>print-candidates</code> in <code>run-dist</code>:</p>

<pre><code># Example of omitting true values.  Generate candidates from
# exp_true_inputs.txt, omitting values v1 and v2.

print-candidates $dist 'v1|v2'  &gt; _tmp/${dist}_candidates.txt
</code></pre>

<p>In general, coming up with candidates is an application- or metric-specific
process.</p>

<p>The candidates are hashed by <code>hash_candidates.py</code> to create the "map" file,
before being passed to R for analysis.</p>

<p><strong>exp_map.csv</strong></p>

<pre><code>v1,8,13,30,22,37,37,53,53,77,67,89,86,97,97,118,128,139,136,157,&lt;truncated&gt;
v10,13,2,25,28,42,45,58,60,68,66,91,89,108,102,113,125,130,131,&lt;truncated&gt;
</code></pre>

<p>The map file has one row per candidate.  In this case, there are 60 rows:
50 for the true values and 10 for "fake" values, which make the candidates a
superset of the true input.</p>

<p>The leftmost column is the raw candidate string.  Then there are 128 more
columns: one for each of the <code>m = 64</code> cohorts times the <code>h = 2</code> hash
functions used by the Bloom filter.</p>
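<p>The map layout can be sketched as follows.  The hash function here is hypothetical (the real map must be built with exactly the same hashing as the client), but it shows the shape: 2 positions per cohort, offset by <code>cohort * k</code> and 1-based, so each cohort's positions fall in its own range, matching the values seen in <code>exp_map.csv</code>:</p>

```python
import hashlib

def make_map_row(candidate, num_cohorts=64, num_hashes=2, num_bits=16):
    # Illustrative hash; a real map must match the client's hashing exactly.
    row = [candidate]
    for cohort in range(num_cohorts):
        for h in range(num_hashes):
            digest = hashlib.md5(
                ('%d:%s:%d' % (cohort, candidate, h)).encode()).digest()
            bit = digest[0] % num_bits
            # Offset by cohort * k, 1-based, so positions are unique per cohort.
            row.append(cohort * num_bits + bit + 1)
    return row
```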

<!-- TODO: more detail about setting params?  Examples of coming up with
candidate strings? -->

<h2>5. RAPPOR Analysis</h2>

<p>Once you have the <code>counts</code>, <code>params</code>, and <code>map</code> files, you can pass them to
the <code>tests/analyze.R</code> tool, which is a small wrapper around the <code>analyze/R</code>
library.</p>

<p>You will get a plot of the true distribution vs. the distribution recovered
with the RAPPOR privacy algorithm.</p>

<p><a href="http://google.github.io/rappor/examples/report.html">View the example output</a>.</p>

<p>You can change the simulation parameters and RAPPOR parameters via flags, and
compare the resulting distributions.</p>

<p>For example, if you expect more unique values from clients, you should also use
more cohorts (i.e. raise <code>m</code>), to prevent hash function collisions from
degrading the result quality.</p>

<!-- TODO:
     - how to change flags
     - more detail on what the various parameters do
     - association analysis
     - basic RAPPOR
     - longitudinal privacy
-->

<h2>Conclusion</h2>

<p>RAPPOR allows you to infer statistics about populations while preserving the
privacy of individual clients.  In this doc, we walked through a simple demo.
However, there are other variations of RAPPOR and settings in which you can use
RAPPOR, which we'll write more about.</p>

<p>Feel free to send feedback on this doc to
<a href="https://groups.google.com/forum/#!forum/rappor-discuss">[email protected]</a>.</p>
  </body>
</html>