1*2abb3134SXin Li <!DOCTYPE html> 2*2abb3134SXin Li <html> 3*2abb3134SXin Li <head> 4*2abb3134SXin Li <meta charset="UTF-8"> 5*2abb3134SXin Li <style type="text/css"> 6*2abb3134SXin Li code { color: green; } 7*2abb3134SXin Li pre { margin-left: 3em; } 8*2abb3134SXin Li </style> 9*2abb3134SXin Li <!-- INSERT LATCH JS --> 10*2abb3134SXin Li </head> 11*2abb3134SXin Li <body style="margin: 0 auto; width: 40em; text-align: left;"> 12*2abb3134SXin Li <!-- INSERT LATCH HTML --> 13*2abb3134SXin Li<h1>RAPPOR Data Flow</h1> 14*2abb3134SXin Li 15*2abb3134SXin Li<p>This doc explains the simulation tools and data formats in the <a href="https://github.com/google/rappor">RAPPOR 16*2abb3134SXin Lirepository</a>. We'll focus on the code, and 17*2abb3134SXin Lidescribe the algorithm only informally. For details, see the <a href="http://arxiv.org/abs/1407.6981">paper</a>.</p> 18*2abb3134SXin Li 19*2abb3134SXin Li<h2>Overview</h2> 20*2abb3134SXin Li 21*2abb3134SXin Li<p>Start with this command:</p> 22*2abb3134SXin Li 23*2abb3134SXin Li<pre><code>$ ./demo.sh run 24*2abb3134SXin Li</code></pre> 25*2abb3134SXin Li 26*2abb3134SXin Li<p>It takes a minute or so to run. The dependencies listed in the 27*2abb3134SXin Li<a href="https://github.com/google/rappor/blob/master/README.md">README</a> must be installed. At the end, it will say:</p> 28*2abb3134SXin Li 29*2abb3134SXin Li<pre><code>Wrote _tmp/report.html. Open this in your browser. 30*2abb3134SXin Li</code></pre> 31*2abb3134SXin Li 32*2abb3134SXin Li<p>It should look like <a href="http://google.github.io/rappor/examples/report.html">this</a>.</p> 33*2abb3134SXin Li 34*2abb3134SXin Li<p>The following diagram shows what processes and files are involved in the demo. 35*2abb3134SXin LiOvals represent <strong>processes</strong>; rectangles represent <strong>data</strong>. The dotted lines 36*2abb3134SXin Lidenote components that are involved in the simulation, but wouldn't be used in 37*2abb3134SXin Lia "real" setting.</p> 38*2abb3134SXin Li 39*2abb3134SXin Li<p>In most configurations, reporting (in blue) is done by client machines, while 40*2abb3134SXin Lianalysis (in green) is done by a server.</p> 41*2abb3134SXin Li 42*2abb3134SXin Li<p><img src="data-flow.png" alt="Diagram of RAPPOR Data Flow" /></p> 43*2abb3134SXin Li 44*2abb3134SXin Li<p>In the simulation, reporting consists of these steps:</p> 45*2abb3134SXin Li 46*2abb3134SXin Li<ol> 47*2abb3134SXin Li<li>Generate simulated input data with different distributions.</li> 48*2abb3134SXin Li<li>Obscure each value with the RAPPOR privacy-preserving reporting mechanism.</li> 49*2abb3134SXin Li</ol> 50*2abb3134SXin Li 51*2abb3134SXin Li<p>Analysis consists of these steps:</p> 52*2abb3134SXin Li 53*2abb3134SXin Li<ol> 54*2abb3134SXin Li<li>Aggregate the reports by summing bits (i.e. make a counting Bloom filter)</li> 55*2abb3134SXin Li<li>Come up with candidate strings, and hash them in the same manner as the 56*2abb3134SXin Liclient.</li> 57*2abb3134SXin Li<li>Using the reports, RAPPOR parameters, and candidate strings as input, 58*2abb3134SXin Liinfer the distribution of true values. We don't see the values themselves. 59*2abb3134SXin LiWe plot the true and inferred distributions side by side for comparison.</li> 60*2abb3134SXin Li</ol> 61*2abb3134SXin Li 62*2abb3134SXin Li<p>This process is described in detail below.</p> 63*2abb3134SXin Li 64*2abb3134SXin Li<h2>1. Generating Simulated Input</h2> 65*2abb3134SXin Li 66*2abb3134SXin Li<p>The <code>tests/gen_sim_input.py</code> tool generates CSV data, like this:</p> 67*2abb3134SXin Li 68*2abb3134SXin Li<!-- TODO: a realistic data set would be nice? How could we generate one? --> 69*2abb3134SXin Li 70*2abb3134SXin Li<p><strong>exp.csv</strong></p> 71*2abb3134SXin Li 72*2abb3134SXin Li<pre><code>client, true_value 73*2abb3134SXin Li1, v6 74*2abb3134SXin Li1, v3 75*2abb3134SXin Li1, v3 76*2abb3134SXin Li1, v5 77*2abb3134SXin Li1, v13 78*2abb3134SXin Li1, v1 79*2abb3134SXin Li1, v8 80*2abb3134SXin Li2, v2 81*2abb3134SXin Li2, v3 82*2abb3134SXin Li2, v1 83*2abb3134SXin Li2, v8 84*2abb3134SXin Li2, v1 85*2abb3134SXin Li2, v30 86*2abb3134SXin Li2, v10 87*2abb3134SXin Li3, v4 88*2abb3134SXin Li... 89*2abb3134SXin Li</code></pre> 90*2abb3134SXin Li 91*2abb3134SXin Li<p><em>(spaces added for clarity)</em></p> 92*2abb3134SXin Li 93*2abb3134SXin Li<p>By default we generate 700,000 rows: 7 random values from <code>v1</code> to <code>v50</code> for 94*2abb3134SXin Lieach client. These can be thought of as a variable being reported over time.</p> 95*2abb3134SXin Li 96*2abb3134SXin Li<p>We're simulating an environment where there are many RAPPOR clients, and a 97*2abb3134SXin Lisingle server does the RAPPOR analysis on the accumulated data.</p> 98*2abb3134SXin Li 99*2abb3134SXin Li<p>The <code>client</code> is represented by an integer ID. The <code>true_value</code> should <strong>not</strong> 100*2abb3134SXin Libe sent over the network because we wish to preserve the client's privacy.</p> 101*2abb3134SXin Li 102*2abb3134SXin Li<h2>2. RAPPOR Reporting</h2> 103*2abb3134SXin Li 104*2abb3134SXin Li<p>The <code>tests/rappor_sim.py</code> tool uses the Python client library 105*2abb3134SXin Li(<code>client/python/rappor.py</code>) to obscure the <code>v1</code> .. <code>vN</code> strings. We want to 106*2abb3134SXin Liinfer the distribution of these strings over the entire population, but we 107*2abb3134SXin Lidon't want to know any individual values.</p> 108*2abb3134SXin Li 109*2abb3134SXin Li<p>After the RAPPOR transformation, we get another CSV file with 700,000 rows. 110*2abb3134SXin LiEach client is assigned a cohort.</p> 111*2abb3134SXin Li 112*2abb3134SXin Li<p><strong>exp_out.csv</strong></p> 113*2abb3134SXin Li 114*2abb3134SXin Li<pre><code>client, cohort, rappor 115*2abb3134SXin Li1, 63, 1111101011110111 116*2abb3134SXin Li1, 15, 1110110011111100 117*2abb3134SXin Li1, 12, 0110101111100101 118*2abb3134SXin Li1, 0, 1111100111110111 119*2abb3134SXin Li1, 3, 1001110111110011 120*2abb3134SXin Li1, 14, 1011111010110011 121*2abb3134SXin Li1, 33, 0111010100101011 122*2abb3134SXin Li2, 40, 0011011010101001 123*2abb3134SXin Li2, 35, 1010110101110100 124*2abb3134SXin Li2, 58, 1110110110111110 125*2abb3134SXin Li2, 38, 0010001111001010 126*2abb3134SXin Li2, 5, 1110111011100101 127*2abb3134SXin Li2, 36, 0111010100111111 128*2abb3134SXin Li2, 39, 0101101000101101 129*2abb3134SXin Li3, 32, 0011100111111110 130*2abb3134SXin Li... 131*2abb3134SXin Li</code></pre> 132*2abb3134SXin Li 133*2abb3134SXin Li<p><em>(spaces added for clarity)</em></p> 134*2abb3134SXin Li 135*2abb3134SXin Li<p>We also get a one-row CSV file that contains the RAPPOR parameters:</p> 136*2abb3134SXin Li 137*2abb3134SXin Li<p><strong>exp_params.csv</strong></p> 138*2abb3134SXin Li 139*2abb3134SXin Li<pre><code>k,h,m,p,q,f 140*2abb3134SXin Li16,2,64,0.5,0.75,0.5 141*2abb3134SXin Li</code></pre> 142*2abb3134SXin Li 143*2abb3134SXin Li<p>These are described in the <a href="http://arxiv.org/abs/1407.6981">paper</a>. The parameters that the clients use 144*2abb3134SXin Limust be known to the server, or the decoding will fail. In addition, all 145*2abb3134SXin Liclients must use the same parameters for a given variable.</p> 146*2abb3134SXin Li 147*2abb3134SXin Li<p>The <code>rappor_sim.py</code> process also writes these files:</p> 148*2abb3134SXin Li 149*2abb3134SXin Li<ul> 150*2abb3134SXin Li<li><code>exp_hist.csv</code>: The true histogram of input values. This is used only for 151*2abb3134SXin Licomparison. In the real world you obviously won't have this.</li> 152*2abb3134SXin Li<li><code>exp_true_inputs.txt</code>: A list of the unique values reported, e.g. <code>v1</code> .. 153*2abb3134SXin Li<code>v50</code>. You won't have this either, in general. To use RAPPOR, you must 154*2abb3134SXin Lisupply <em>candidate strings</em>, described below.</li> 155*2abb3134SXin Li</ul> 156*2abb3134SXin Li 157*2abb3134SXin Li<h2>3. Report Aggregation</h2> 158*2abb3134SXin Li 159*2abb3134SXin Li<p><code>sum_bits.py</code> takes the <code>exp_out.csv</code> output, and produces the "counts" file:</p> 160*2abb3134SXin Li 161*2abb3134SXin Li<p><strong>exp_counts.csv</strong></p> 162*2abb3134SXin Li 163*2abb3134SXin Li<pre><code>11116,6273,6433,6347,6385,6290,6621,6359,6747,6623,6321,6696,6282,6652,6368,6286,6222 164*2abb3134SXin Li10861,6365,6263,6170,6258,6107,6633,6171,6226,6123,6286,6254,6408,6182,6442,6195,6187 165*2abb3134SXin Li... 166*2abb3134SXin Li</code></pre> 167*2abb3134SXin Li 168*2abb3134SXin Li<p>The file has 64 rows, because the simulation has 64 cohorts by default (<code>m = 169*2abb3134SXin Li64</code>). This parameter should be adjusted based on the number of unique true 170*2abb3134SXin Livalues expected. <!-- TODO: more detail --></p> 171*2abb3134SXin Li 172*2abb3134SXin Li<p>There are 17 columns. The left-most column is the total number of reports in 173*2abb3134SXin Lithe cohort. The remaining 16 columns correspond to the <code>k = 16</code> bits in the 174*2abb3134SXin LiBloom filter. Each column contains the number of reports with that bit set 175*2abb3134SXin Lito 1.</p> 176*2abb3134SXin Li 177*2abb3134SXin Li<p>So, in general, the "counts" file is a <code>(k+1) * m</code> matrix.</p> 178*2abb3134SXin Li 179*2abb3134SXin Li<h2>4. Candidate Strings</h2> 180*2abb3134SXin Li 181*2abb3134SXin Li<p>In the simulation, we assume that the analyst will come up with a <em>superset</em> of 182*2abb3134SXin Lithe candidate strings. This is done in the <code>more-candidates</code> / 183*2abb3134SXin Li<code>print-candidates</code> functions in <code>demo.sh</code>.</p> 184*2abb3134SXin Li 185*2abb3134SXin Li<p>You can also test what happens if you omit true strings from the candidates, by 186*2abb3134SXin Liediting the invocation of <code>print-candidates</code> in <code>run-dist</code>:</p> 187*2abb3134SXin Li 188*2abb3134SXin Li<pre><code># Example of omitting true values. Generate candidates from 189*2abb3134SXin Li# exp_true_inputs.txt, omitting values v1 and v2. 190*2abb3134SXin Li 191*2abb3134SXin Liprint-candidates $dist 'v1|v2' > _tmp/${dist}_candidates.txt 192*2abb3134SXin Li</code></pre> 193*2abb3134SXin Li 194*2abb3134SXin Li<p>In general, coming up with candidates is an application- or metric-specific 195*2abb3134SXin Liprocess.</p> 196*2abb3134SXin Li 197*2abb3134SXin Li<p>The candidates are hashed by <code>hash_candidates.py</code> to create the "map" file, 198*2abb3134SXin Libefore being passed to R for analysis.</p> 199*2abb3134SXin Li 200*2abb3134SXin Li<p><strong>exp_map.csv</strong></p> 201*2abb3134SXin Li 202*2abb3134SXin Li<pre><code>v1,8,13,30,22,37,37,53,53,77,67,89,86,97,97,118,128,139,136,157,<truncated> 203*2abb3134SXin Liv10,13,2,25,28,42,45,58,60,68,66,91,89,108,102,113,125,130,131,<truncated> 204*2abb3134SXin Li</code></pre> 205*2abb3134SXin Li 206*2abb3134SXin Li<p>The map file has one row per candidate. In this case, there are 60 rows: 207*2abb3134SXin Li50 for the true values and 10 for "fake" values, which make the candidates a 208*2abb3134SXin Lisuperset of the true input.</p> 209*2abb3134SXin Li 210*2abb3134SXin Li<p>The left most column is the raw candidate string. Then there are 128 more 211*2abb3134SXin Licolumns: for <code>m = 64</code> cohorts times <code>k = 2</code> hash functions in the Bloom filter.</p> 212*2abb3134SXin Li 213*2abb3134SXin Li<!-- TODO: more detail about setting params? Examples of coming up with 214*2abb3134SXin Licandidate strings? --> 215*2abb3134SXin Li 216*2abb3134SXin Li<h2>5. RAPPOR Analysis</h2> 217*2abb3134SXin Li 218*2abb3134SXin Li<p>Once you have the <code>counts</code>, <code>params</code>, and <code>map</code> files, you can pass it to the 219*2abb3134SXin Li<code>tests/analyze.R</code> tool, which is a small wrapper around the <code>analyze/R</code> 220*2abb3134SXin Lilibrary.</p> 221*2abb3134SXin Li 222*2abb3134SXin Li<p>You will get a plot of the true distribution vs. the distribution recovered 223*2abb3134SXin Liwith the RAPPOR privacy algorithm.</p> 224*2abb3134SXin Li 225*2abb3134SXin Li<p><a href="http://google.github.io/rappor/examples/report.html">View the example output</a>.</p> 226*2abb3134SXin Li 227*2abb3134SXin Li<p>You can change the simulation parameters and RAPPOR parameters via flags, and 228*2abb3134SXin Licompare the resulting distributions.</p> 229*2abb3134SXin Li 230*2abb3134SXin Li<p>For example, if you expect more unique values from clients, you should also use 231*2abb3134SXin Limore cohorts (i.e. raise <code>m</code>), to prevent hash function collisions from 232*2abb3134SXin Lidegrading the result quality.</p> 233*2abb3134SXin Li 234*2abb3134SXin Li<!-- TODO: 235*2abb3134SXin Li - how to change flags 236*2abb3134SXin Li - more detail on what the various parameters do 237*2abb3134SXin Li - association analysis 238*2abb3134SXin Li - basic RAPPOR 239*2abb3134SXin Li - longitudinal privacy 240*2abb3134SXin Li--> 241*2abb3134SXin Li 242*2abb3134SXin Li<h2>Conclusion</h2> 243*2abb3134SXin Li 244*2abb3134SXin Li<p>RAPPOR allows you infer statistics about populations while preserving the 245*2abb3134SXin Liprivacy of individual clients. In this doc, we walked through a simple demo. 246*2abb3134SXin LiHowever, there are other variations of RAPPOR and settings in which you can use 247*2abb3134SXin LiRAPPOR, which we'll write more about.</p> 248*2abb3134SXin Li 249*2abb3134SXin Li<p>Feel free to send feedback on this doc to 250*2abb3134SXin Li<a href="https://groups.google.com/forum/#!forum/rappor-discuss">[email protected]</a>.</p> 251*2abb3134SXin Li </body> 252*2abb3134SXin Li </html> 253