xref: /aosp_15_r20/external/rappor/gh-pages/doc/data-flow.html (revision 2abb31345f6c95944768b5222a9a5ed3fc68cc00)
1*2abb3134SXin Li  <!DOCTYPE html>
2*2abb3134SXin Li  <html>
3*2abb3134SXin Li    <head>
4*2abb3134SXin Li      <meta charset="UTF-8">
5*2abb3134SXin Li      <style type="text/css">
6*2abb3134SXin Li        code { color: green; }
7*2abb3134SXin Li        pre { margin-left: 3em; }
8*2abb3134SXin Li      </style>
9*2abb3134SXin Li      <!-- INSERT LATCH JS -->
10*2abb3134SXin Li    </head>
11*2abb3134SXin Li    <body style="margin: 0 auto; width: 40em; text-align: left;">
12*2abb3134SXin Li      <!-- INSERT LATCH HTML -->
13*2abb3134SXin Li<h1>RAPPOR Data Flow</h1>
14*2abb3134SXin Li
15*2abb3134SXin Li<p>This doc explains the simulation tools and data formats in the <a href="https://github.com/google/rappor">RAPPOR
16*2abb3134SXin Lirepository</a>.  We'll focus on the code, and
17*2abb3134SXin Lidescribe the algorithm only informally.  For details, see the <a href="http://arxiv.org/abs/1407.6981">paper</a>.</p>
18*2abb3134SXin Li
19*2abb3134SXin Li<h2>Overview</h2>
20*2abb3134SXin Li
21*2abb3134SXin Li<p>Start with this command:</p>
22*2abb3134SXin Li
23*2abb3134SXin Li<pre><code>$ ./demo.sh run
24*2abb3134SXin Li</code></pre>
25*2abb3134SXin Li
26*2abb3134SXin Li<p>It takes a minute or so to run.  The dependencies listed in the
27*2abb3134SXin Li<a href="https://github.com/google/rappor/blob/master/README.md">README</a> must be installed.  At the end, it will say:</p>
28*2abb3134SXin Li
29*2abb3134SXin Li<pre><code>Wrote _tmp/report.html.  Open this in your browser.
30*2abb3134SXin Li</code></pre>
31*2abb3134SXin Li
32*2abb3134SXin Li<p>It should look like <a href="http://google.github.io/rappor/examples/report.html">this</a>.</p>
33*2abb3134SXin Li
34*2abb3134SXin Li<p>The following diagram shows what processes and files are involved in the demo.
35*2abb3134SXin LiOvals represent <strong>processes</strong>; rectangles represent <strong>data</strong>.  The dotted lines
36*2abb3134SXin Lidenote components that are involved in the simulation, but wouldn't be used in
37*2abb3134SXin Lia "real" setting.</p>
38*2abb3134SXin Li
39*2abb3134SXin Li<p>In most configurations, reporting (in blue) is done by client machines, while
40*2abb3134SXin Lianalysis (in green) is done by a server.</p>
41*2abb3134SXin Li
42*2abb3134SXin Li<p><img src="data-flow.png" alt="Diagram of RAPPOR Data Flow" /></p>
43*2abb3134SXin Li
44*2abb3134SXin Li<p>In the simulation, reporting consists of these steps:</p>
45*2abb3134SXin Li
46*2abb3134SXin Li<ol>
47*2abb3134SXin Li<li>Generate simulated input data with different distributions.</li>
48*2abb3134SXin Li<li>Obscure each value with the RAPPOR privacy-preserving reporting mechanism.</li>
49*2abb3134SXin Li</ol>
50*2abb3134SXin Li
51*2abb3134SXin Li<p>Analysis consists of these steps:</p>
52*2abb3134SXin Li
53*2abb3134SXin Li<ol>
54*2abb3134SXin Li<li>Aggregate the reports by summing bits (i.e. make a counting Bloom filter)</li>
55*2abb3134SXin Li<li>Come up with candidate strings, and hash them in the same manner as the
56*2abb3134SXin Liclient.</li>
57*2abb3134SXin Li<li>Using the reports, RAPPOR parameters, and candidate strings as input,
58*2abb3134SXin Liinfer the distribution of true values.  We don't see the values themselves.
59*2abb3134SXin LiWe plot the true and inferred distributions side by side for comparison.</li>
60*2abb3134SXin Li</ol>
61*2abb3134SXin Li
62*2abb3134SXin Li<p>This process is described in detail below.</p>
63*2abb3134SXin Li
64*2abb3134SXin Li<h2>1. Generating Simulated Input</h2>
65*2abb3134SXin Li
66*2abb3134SXin Li<p>The <code>tests/gen_sim_input.py</code> tool generates CSV data, like this:</p>
67*2abb3134SXin Li
68*2abb3134SXin Li<!-- TODO: a realistic data set would be nice? How could we generate one?  -->
69*2abb3134SXin Li
70*2abb3134SXin Li<p><strong>exp.csv</strong></p>
71*2abb3134SXin Li
72*2abb3134SXin Li<pre><code>client, true_value
73*2abb3134SXin Li1,      v6
74*2abb3134SXin Li1,      v3
75*2abb3134SXin Li1,      v3
76*2abb3134SXin Li1,      v5
77*2abb3134SXin Li1,      v13
78*2abb3134SXin Li1,      v1
79*2abb3134SXin Li1,      v8
80*2abb3134SXin Li2,      v2
81*2abb3134SXin Li2,      v3
82*2abb3134SXin Li2,      v1
83*2abb3134SXin Li2,      v8
84*2abb3134SXin Li2,      v1
85*2abb3134SXin Li2,      v30
86*2abb3134SXin Li2,      v10
87*2abb3134SXin Li3,      v4
88*2abb3134SXin Li...
89*2abb3134SXin Li</code></pre>
90*2abb3134SXin Li
91*2abb3134SXin Li<p><em>(spaces added for clarity)</em></p>
92*2abb3134SXin Li
93*2abb3134SXin Li<p>By default we generate 700,000 rows: 7 random values from <code>v1</code> to <code>v50</code> for
94*2abb3134SXin Lieach client.  These can be thought of as a variable being reported over time.</p>
95*2abb3134SXin Li
96*2abb3134SXin Li<p>We're simulating an environment where there are many RAPPOR clients, and a
97*2abb3134SXin Lisingle server does the RAPPOR analysis on the accumulated data.</p>
98*2abb3134SXin Li
99*2abb3134SXin Li<p>The <code>client</code> is represented by an integer ID.  The <code>true_value</code> should <strong>not</strong>
100*2abb3134SXin Libe sent over the network because we wish to preserve the client's privacy.</p>
101*2abb3134SXin Li
102*2abb3134SXin Li<h2>2. RAPPOR Reporting</h2>
103*2abb3134SXin Li
104*2abb3134SXin Li<p>The <code>tests/rappor_sim.py</code> tool uses the Python client library
105*2abb3134SXin Li(<code>client/python/rappor.py</code>) to obscure the <code>v1</code> .. <code>vN</code> strings.  We want to
106*2abb3134SXin Liinfer the distribution of these strings over the entire population, but we
107*2abb3134SXin Lidon't want to know any individual values.</p>
108*2abb3134SXin Li
109*2abb3134SXin Li<p>After the RAPPOR transformation, we get another CSV file with 700,000 rows.
110*2abb3134SXin LiEach client is assigned a cohort.</p>
111*2abb3134SXin Li
112*2abb3134SXin Li<p><strong>exp_out.csv</strong></p>
113*2abb3134SXin Li
114*2abb3134SXin Li<pre><code>client, cohort, rappor
115*2abb3134SXin Li1,      63,     1111101011110111
116*2abb3134SXin Li1,      15,     1110110011111100
117*2abb3134SXin Li1,      12,     0110101111100101
118*2abb3134SXin Li1,       0,     1111100111110111
119*2abb3134SXin Li1,       3,     1001110111110011
120*2abb3134SXin Li1,      14,     1011111010110011
121*2abb3134SXin Li1,      33,     0111010100101011
122*2abb3134SXin Li2,      40,     0011011010101001
123*2abb3134SXin Li2,      35,     1010110101110100
124*2abb3134SXin Li2,      58,     1110110110111110
125*2abb3134SXin Li2,      38,     0010001111001010
126*2abb3134SXin Li2,       5,     1110111011100101
127*2abb3134SXin Li2,      36,     0111010100111111
128*2abb3134SXin Li2,      39,     0101101000101101
129*2abb3134SXin Li3,      32,     0011100111111110
130*2abb3134SXin Li...
131*2abb3134SXin Li</code></pre>
132*2abb3134SXin Li
133*2abb3134SXin Li<p><em>(spaces added for clarity)</em></p>
134*2abb3134SXin Li
135*2abb3134SXin Li<p>We also get a one-row CSV file that contains the RAPPOR parameters:</p>
136*2abb3134SXin Li
137*2abb3134SXin Li<p><strong>exp_params.csv</strong></p>
138*2abb3134SXin Li
139*2abb3134SXin Li<pre><code>k,h,m,p,q,f
140*2abb3134SXin Li16,2,64,0.5,0.75,0.5
141*2abb3134SXin Li</code></pre>
142*2abb3134SXin Li
143*2abb3134SXin Li<p>These are described in the <a href="http://arxiv.org/abs/1407.6981">paper</a>. The parameters that the clients use
144*2abb3134SXin Limust be known to the server, or the decoding will fail.  In addition, all
145*2abb3134SXin Liclients must use the same parameters for a given variable.</p>
146*2abb3134SXin Li
147*2abb3134SXin Li<p>The <code>rappor_sim.py</code> process also writes these files:</p>
148*2abb3134SXin Li
149*2abb3134SXin Li<ul>
150*2abb3134SXin Li<li><code>exp_hist.csv</code>: The true histogram of input values.  This is used only for
151*2abb3134SXin Licomparison.  In the real world you obviously won't have this.</li>
152*2abb3134SXin Li<li><code>exp_true_inputs.txt</code>: A list of the unique values reported, e.g. <code>v1</code> ..
153*2abb3134SXin Li<code>v50</code>.  You won't have this either, in general.  To use RAPPOR, you must
154*2abb3134SXin Lisupply <em>candidate strings</em>, described below.</li>
155*2abb3134SXin Li</ul>
156*2abb3134SXin Li
157*2abb3134SXin Li<h2>3. Report Aggregation</h2>
158*2abb3134SXin Li
159*2abb3134SXin Li<p><code>sum_bits.py</code> takes the <code>exp_out.csv</code> output, and produces the "counts" file:</p>
160*2abb3134SXin Li
161*2abb3134SXin Li<p><strong>exp_counts.csv</strong></p>
162*2abb3134SXin Li
163*2abb3134SXin Li<pre><code>11116,6273,6433,6347,6385,6290,6621,6359,6747,6623,6321,6696,6282,6652,6368,6286,6222
164*2abb3134SXin Li10861,6365,6263,6170,6258,6107,6633,6171,6226,6123,6286,6254,6408,6182,6442,6195,6187
165*2abb3134SXin Li...
166*2abb3134SXin Li</code></pre>
167*2abb3134SXin Li
168*2abb3134SXin Li<p>The file has 64 rows, because the simulation has 64 cohorts by default (<code>m =
169*2abb3134SXin Li64</code>).  This parameter should be adjusted based on the number of unique true
170*2abb3134SXin Livalues expected.  <!-- TODO: more detail --></p>
171*2abb3134SXin Li
172*2abb3134SXin Li<p>There are 17 columns.  The left-most column is the total number of reports in
173*2abb3134SXin Lithe cohort.  The remaining 16 columns correspond to the <code>k = 16</code> bits in the
174*2abb3134SXin LiBloom filter.  Each column contains the number of reports with that bit set
175*2abb3134SXin Lito 1.</p>
176*2abb3134SXin Li
177*2abb3134SXin Li<p>So, in general, the "counts" file is a <code>(k+1) * m</code> matrix.</p>
178*2abb3134SXin Li
179*2abb3134SXin Li<h2>4. Candidate Strings</h2>
180*2abb3134SXin Li
181*2abb3134SXin Li<p>In the simulation, we assume that the analyst will come up with a <em>superset</em> of
182*2abb3134SXin Lithe candidate strings.  This is done in the <code>more-candidates</code> /
183*2abb3134SXin Li<code>print-candidates</code> functions in <code>demo.sh</code>.</p>
184*2abb3134SXin Li
185*2abb3134SXin Li<p>You can also test what happens if you omit true strings from the candidates, by
186*2abb3134SXin Liediting the invocation of <code>print-candidates</code> in <code>run-dist</code>:</p>
187*2abb3134SXin Li
188*2abb3134SXin Li<pre><code># Example of omitting true values.  Generate candidates from
189*2abb3134SXin Li# exp_true_inputs.txt, omitting values v1 and v2.
190*2abb3134SXin Li
191*2abb3134SXin Liprint-candidates $dist 'v1|v2'  &gt; _tmp/${dist}_candidates.txt
192*2abb3134SXin Li</code></pre>
193*2abb3134SXin Li
194*2abb3134SXin Li<p>In general, coming up with candidates is an application- or metric-specific
195*2abb3134SXin Liprocess.</p>
196*2abb3134SXin Li
197*2abb3134SXin Li<p>The candidates are hashed by <code>hash_candidates.py</code> to create the "map" file,
198*2abb3134SXin Libefore being passed to R for analysis.</p>
199*2abb3134SXin Li
200*2abb3134SXin Li<p><strong>exp_map.csv</strong></p>
201*2abb3134SXin Li
202*2abb3134SXin Li<pre><code>v1,8,13,30,22,37,37,53,53,77,67,89,86,97,97,118,128,139,136,157,&lt;truncated&gt;
203*2abb3134SXin Liv10,13,2,25,28,42,45,58,60,68,66,91,89,108,102,113,125,130,131,&lt;truncated&gt;
204*2abb3134SXin Li</code></pre>
205*2abb3134SXin Li
206*2abb3134SXin Li<p>The map file has one row per candidate.  In this case, there are 60 rows:
207*2abb3134SXin Li50 for the true values and 10 for "fake" values, which make the candidates a
208*2abb3134SXin Lisuperset of the true input.</p>
209*2abb3134SXin Li
210*2abb3134SXin Li<p>The left most column is the raw candidate string.  Then there are 128 more
211*2abb3134SXin Licolumns: for <code>m = 64</code> cohorts times <code>k = 2</code> hash functions in the Bloom filter.</p>
212*2abb3134SXin Li
213*2abb3134SXin Li<!-- TODO: more detail about setting params?  Examples of coming up with
214*2abb3134SXin Licandidate strings? -->
215*2abb3134SXin Li
216*2abb3134SXin Li<h2>5. RAPPOR Analysis</h2>
217*2abb3134SXin Li
218*2abb3134SXin Li<p>Once you have the <code>counts</code>, <code>params</code>, and <code>map</code> files, you can pass it to the
219*2abb3134SXin Li<code>tests/analyze.R</code> tool, which is a small wrapper around the <code>analyze/R</code>
220*2abb3134SXin Lilibrary.</p>
221*2abb3134SXin Li
222*2abb3134SXin Li<p>You will get a plot of the true distribution vs. the distribution recovered
223*2abb3134SXin Liwith the RAPPOR privacy algorithm.</p>
224*2abb3134SXin Li
225*2abb3134SXin Li<p><a href="http://google.github.io/rappor/examples/report.html">View the example output</a>.</p>
226*2abb3134SXin Li
227*2abb3134SXin Li<p>You can change the simulation parameters and RAPPOR parameters via flags, and
228*2abb3134SXin Licompare the resulting distributions.</p>
229*2abb3134SXin Li
230*2abb3134SXin Li<p>For example, if you expect more unique values from clients, you should also use
231*2abb3134SXin Limore cohorts (i.e. raise <code>m</code>), to prevent hash function collisions from
232*2abb3134SXin Lidegrading the result quality.</p>
233*2abb3134SXin Li
234*2abb3134SXin Li<!-- TODO:
235*2abb3134SXin Li     - how to change flags
236*2abb3134SXin Li     - more detail on what the various parameters do
237*2abb3134SXin Li     - association analysis
238*2abb3134SXin Li     - basic RAPPOR
239*2abb3134SXin Li     - longitudinal privacy
240*2abb3134SXin Li-->
241*2abb3134SXin Li
242*2abb3134SXin Li<h2>Conclusion</h2>
243*2abb3134SXin Li
244*2abb3134SXin Li<p>RAPPOR allows you infer statistics about populations while preserving the
245*2abb3134SXin Liprivacy of individual clients.  In this doc, we walked through a simple demo.
246*2abb3134SXin LiHowever, there are other variations of RAPPOR and settings in which you can use
247*2abb3134SXin LiRAPPOR, which we'll write more about.</p>
248*2abb3134SXin Li
249*2abb3134SXin Li<p>Feel free to send feedback on this doc to
250*2abb3134SXin Li<a href="https://groups.google.com/forum/#!forum/rappor-discuss">[email protected]</a>.</p>
251*2abb3134SXin Li    </body>
252*2abb3134SXin Li  </html>
253