<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <style type="text/css">
      code { color: green; }
      pre { margin-left: 3em; }
    </style>
    <!-- INSERT LATCH JS -->
  </head>
  <body style="margin: 0 auto; width: 40em; text-align: left;">
    <!-- INSERT LATCH HTML -->

<h1>RAPPOR Data Flow</h1>

<p>This doc explains the simulation tools and data formats in the
<a href="https://github.com/google/rappor">RAPPOR repository</a>. We'll focus
on the code, and describe the algorithm only informally. For details, see the
<a href="http://arxiv.org/abs/1407.6981">paper</a>.</p>

<h2>Overview</h2>

<p>Start with this command:</p>

<pre><code>$ ./demo.sh run
</code></pre>

<p>It takes a minute or so to run. The dependencies listed in the
<a href="https://github.com/google/rappor/blob/master/README.md">README</a>
must be installed. At the end, it will say:</p>

<pre><code>Wrote _tmp/report.html. Open this in your browser.
</code></pre>

<p>It should look like <a href="http://google.github.io/rappor/examples/report.html">this</a>.</p>

<p>The following diagram shows what processes and files are involved in the demo.
Ovals represent <strong>processes</strong>; rectangles represent <strong>data</strong>. The dotted lines
denote components that are involved in the simulation, but wouldn't be used in
a "real" setting.</p>

<p>In most configurations, reporting (in blue) is done by client machines, while
analysis (in green) is done by a server.</p>

<p><img src="data-flow.png" alt="Diagram of RAPPOR Data Flow" /></p>

<p>In the simulation, reporting consists of these steps:</p>

<ol>
<li>Generate simulated input data with different distributions.</li>
<li>Obscure each value with the RAPPOR privacy-preserving reporting mechanism.</li>
</ol>

<p>Analysis consists of these steps:</p>

<ol>
<li>Aggregate the reports by summing bits (i.e. make a counting Bloom filter).</li>
<li>Come up with candidate strings, and hash them in the same manner as the
client.</li>
<li>Using the reports, RAPPOR parameters, and candidate strings as input,
infer the distribution of true values. We don't see the values themselves.
We plot the true and inferred distributions side by side for comparison.</li>
</ol>

<p>This process is described in detail below.</p>

<h2>1. Generating Simulated Input</h2>

<p>The <code>tests/gen_sim_input.py</code> tool generates CSV data, like this:</p>

<!-- TODO: a realistic data set would be nice? How could we generate one? -->

<p><strong>exp.csv</strong></p>

<pre><code>client, true_value
1, v6
1, v3
1, v3
1, v5
1, v13
1, v1
1, v8
2, v2
2, v3
2, v1
2, v8
2, v1
2, v30
2, v10
3, v4
...
</code></pre>

<p><em>(spaces added for clarity)</em></p>

<p>By default we generate 700,000 rows: 7 random values from <code>v1</code> to <code>v50</code> for
each client. These can be thought of as a variable being reported over time.</p>

<p>We're simulating an environment where there are many RAPPOR clients, and a
single server does the RAPPOR analysis on the accumulated data.</p>

<p>The <code>client</code> is represented by an integer ID. The <code>true_value</code> should <strong>not</strong>
be sent over the network because we wish to preserve the client's privacy.</p>
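<p>For illustration, here is a minimal sketch of how data of this shape could be
produced. It is <em>not</em> <code>tests/gen_sim_input.py</code>: the uniform draw over
<code>v1</code> .. <code>v50</code> and the output file name are assumptions made to keep the
example short (the real tool generates several different distributions).</p>

<pre><code># Hypothetical sketch; the real generator is tests/gen_sim_input.py.
import csv
import random

NUM_CLIENTS = 100000     # 100,000 clients x 7 values = 700,000 rows
VALUES_PER_CLIENT = 7    # each client reports 7 values over time
NUM_VALUES = 50          # true values are v1 .. v50

with open('exp.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['client', 'true_value'])
    for client in range(1, NUM_CLIENTS + 1):
        for _ in range(VALUES_PER_CLIENT):
            # Uniform choice here; the demo uses other distributions too.
            writer.writerow([client, 'v%d' % random.randint(1, NUM_VALUES)])
</code></pre>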
<h2>2. RAPPOR Reporting</h2>

<p>The <code>tests/rappor_sim.py</code> tool uses the Python client library
(<code>client/python/rappor.py</code>) to obscure the <code>v1</code> .. <code>vN</code> strings. We want to
infer the distribution of these strings over the entire population, but we
don't want to know any individual values.</p>

<p>After the RAPPOR transformation, we get another CSV file with 700,000 rows.
Each client is assigned a cohort.</p>

<p><strong>exp_out.csv</strong></p>

<pre><code>client, cohort, rappor
1, 63, 1111101011110111
1, 15, 1110110011111100
1, 12, 0110101111100101
1, 0, 1111100111110111
1, 3, 1001110111110011
1, 14, 1011111010110011
1, 33, 0111010100101011
2, 40, 0011011010101001
2, 35, 1010110101110100
2, 58, 1110110110111110
2, 38, 0010001111001010
2, 5, 1110111011100101
2, 36, 0111010100111111
2, 39, 0101101000101101
3, 32, 0011100111111110
...
</code></pre>

<p><em>(spaces added for clarity)</em></p>

<p>We also get a one-row CSV file that contains the RAPPOR parameters:</p>

<p><strong>exp_params.csv</strong></p>

<pre><code>k,h,m,p,q,f
16,2,64,0.5,0.75,0.5
</code></pre>

<p>These are described in the <a href="http://arxiv.org/abs/1407.6981">paper</a>. The parameters that the clients use
must be known to the server, or the decoding will fail. In addition, all
clients must use the same parameters for a given variable.</p>

<p>The <code>rappor_sim.py</code> process also writes these files:</p>

<ul>
<li><code>exp_hist.csv</code>: The true histogram of input values. This is used only for
comparison. In the real world you obviously won't have this.</li>
<li><code>exp_true_inputs.txt</code>: A list of the unique values reported, e.g. <code>v1</code> ..
<code>v50</code>. You won't have this either, in general. To use RAPPOR, you must
supply <em>candidate strings</em>, described below.</li>
</ul>
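<p>The details of the "obscure" step are easiest to see in code. Below is a
minimal sketch of the three stages the client applies to each value (Bloom
filter, permanent randomized response, instantaneous randomized response),
using the parameters shown above. This is <em>not</em> the real encoder in
<code>client/python/rappor.py</code>: the MD5-based hashing and the bit ordering are
assumptions made for illustration, so its output won't match <code>exp_out.csv</code>
bit for bit.</p>

<pre><code># Hypothetical sketch of RAPPOR encoding; the real client library is
# client/python/rappor.py.
import hashlib
import random

def encode(value, cohort, k=16, h=2, f=0.5, p=0.5, q=0.75):
    # 1. Bloom filter: h hash functions each set one bit in a k-bit filter.
    #    The cohort is mixed into the hash so cohorts use different bits.
    bloom = [0] * k
    for hash_no in range(h):
        hash_input = '%d:%d:%s' % (cohort, hash_no, value)
        digest = hashlib.md5(hash_input.encode()).digest()
        bloom[digest[0] % k] = 1

    # 2. Permanent randomized response (PRR): with probability f/2 force a
    #    bit to 1, with probability f/2 force it to 0, otherwise keep it.
    #    A real client memoizes this per value, for longitudinal privacy.
    prr = []
    for b in bloom:
        r = random.random()
        if r &lt; f / 2:
            prr.append(1)
        elif r &lt; f:
            prr.append(0)
        else:
            prr.append(b)

    # 3. Instantaneous randomized response (IRR): report 1 with probability
    #    q where the PRR bit is 1, and with probability p where it is 0.
    irr = [1 if random.random() &lt; (q if b else p) else 0 for b in prr]
    return ''.join(str(b) for b in irr)

# Prints a 16-character bit string, like the 'rappor' column above.
print(encode('v6', cohort=63))
</code></pre>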
<h2>3. Report Aggregation</h2>

<p><code>sum_bits.py</code> takes the <code>exp_out.csv</code> output, and produces the "counts" file:</p>

<p><strong>exp_counts.csv</strong></p>

<pre><code>11116,6273,6433,6347,6385,6290,6621,6359,6747,6623,6321,6696,6282,6652,6368,6286,6222
10861,6365,6263,6170,6258,6107,6633,6171,6226,6123,6286,6254,6408,6182,6442,6195,6187
...
</code></pre>

<p>The file has 64 rows, because the simulation has 64 cohorts by default
(<code>m = 64</code>). This parameter should be adjusted based on the number of unique
true values expected. <!-- TODO: more detail --></p>

<p>There are 17 columns. The left-most column is the total number of reports in
the cohort. The remaining 16 columns correspond to the <code>k = 16</code> bits in the
Bloom filter. Each column contains the number of reports with that bit set
to 1.</p>

<p>So, in general, the "counts" file is an <code>m * (k+1)</code> matrix, with one row per
cohort.</p>

<h2>4. Candidate Strings</h2>

<p>In the simulation, we assume that the analyst will come up with a <em>superset</em> of
the candidate strings. This is done in the <code>more-candidates</code> /
<code>print-candidates</code> functions in <code>demo.sh</code>.</p>

<p>You can also test what happens if you omit true strings from the candidates, by
editing the invocation of <code>print-candidates</code> in <code>run-dist</code>:</p>

<pre><code># Example of omitting true values. Generate candidates from
# exp_true_inputs.txt, omitting values v1 and v2.

print-candidates $dist 'v1|v2' > _tmp/${dist}_candidates.txt
</code></pre>

<p>In general, coming up with candidates is an application- or metric-specific
process.</p>

<p>The candidates are hashed by <code>hash_candidates.py</code> to create the "map" file,
before being passed to R for analysis.</p>

<p><strong>exp_map.csv</strong></p>

<pre><code>v1,8,13,30,22,37,37,53,53,77,67,89,86,97,97,118,128,139,136,157,&lt;truncated&gt;
v10,13,2,25,28,42,45,58,60,68,66,91,89,108,102,113,125,130,131,&lt;truncated&gt;
</code></pre>

<p>The map file has one row per candidate. In this case, there are 60 rows:
50 for the true values and 10 for "fake" values, which make the candidates a
superset of the true input.</p>

<p>The left-most column is the raw candidate string. Then there are 128 more
columns: one for each of the <code>m = 64</code> cohorts times <code>h = 2</code> hash functions in
the Bloom filter.</p>

<!-- TODO: more detail about setting params? Examples of coming up with
candidate strings? -->

<h2>5. RAPPOR Analysis</h2>

<p>Once you have the <code>counts</code>, <code>params</code>, and <code>map</code> files, you can pass them to the
<code>tests/analyze.R</code> tool, which is a small wrapper around the <code>analyze/R</code>
library.</p>

<p>You will get a plot of the true distribution vs. the distribution recovered
with the RAPPOR privacy algorithm.</p>

<p><a href="http://google.github.io/rappor/examples/report.html">View the example output</a>.</p>

<p>You can change the simulation parameters and RAPPOR parameters via flags, and
compare the resulting distributions.</p>

<p>For example, if you expect more unique values from clients, you should also use
more cohorts (i.e. raise <code>m</code>), to prevent hash function collisions from
degrading the result quality.</p>

<!-- TODO:
  - how to change flags
  - more detail on what the various parameters do
  - association analysis
  - basic RAPPOR
  - longitudinal privacy
-->

<h2>Conclusion</h2>

<p>RAPPOR allows you to infer statistics about populations while preserving the
privacy of individual clients. In this doc, we walked through a simple demo.
There are other variations of RAPPOR, and other settings in which you can use
it, which we'll write more about.</p>

<p>Feel free to send feedback on this doc to
<a href="https://groups.google.com/forum/#!forum/rappor-discuss">rappor-discuss@googlegroups.com</a>.</p>

  </body>
</html>