1RAPPOR Data Flow 2================ 3 4This doc explains the simulation tools and data formats in the [RAPPOR 5repository](https://github.com/google/rappor). We'll focus on the code, and 6describe the algorithm only informally. For details, see the [paper][]. 7 8Overview 9-------- 10 11Start with this command: 12 13 $ ./demo.sh run 14 15It takes a minute or so to run. The dependencies listed in the 16[README][] must be installed. At the end, it will say: 17 18 Wrote _tmp/report.html. Open this in your browser. 19 20It should look like [this][example]. 21 22The following diagram shows what processes and files are involved in the demo. 23Ovals represent **processes**; rectangles represent **data**. The dotted lines 24denote components that are involved in the simulation, but wouldn't be used in 25a "real" setting. 26 27In most configurations, reporting (in blue) is done by client machines, while 28analysis (in green) is done by a server. 29 30<img src="data-flow.png" alt="Diagram of RAPPOR Data Flow" /> 31 32In the simulation, reporting consists of these steps: 33 34 1. Generate simulated input data with different distributions. 35 2. Obscure each value with the RAPPOR privacy-preserving reporting mechanism. 36 37Analysis consists of these steps: 38 39 1. Aggregate the reports by summing bits (i.e. make a counting Bloom filter) 40 2. Come up with candidate strings, and hash them in the same manner as the 41 client. 42 3. Using the reports, RAPPOR parameters, and candidate strings as input, 43 infer the distribution of true values. We don't see the values themselves. 44 We plot the true and inferred distributions side by side for comparison. 45 46This process is described in detail below. 47 481. Generating Simulated Input 49----------------------------- 50 51The `tests/gen_sim_input.py` tool generates CSV data, like this: 52 53<!-- TODO: a realistic data set would be nice? How could we generate one? --> 54 55**exp.csv** 56 57 client, true_value 58 1, v6 59 1, v3 60 1, v3 61 1, v5 62 1, v13 63 1, v1 64 1, v8 65 2, v2 66 2, v3 67 2, v1 68 2, v8 69 2, v1 70 2, v30 71 2, v10 72 3, v4 73 ... 74 75*(spaces added for clarity)* 76 77By default we generate 700,000 rows: 7 random values from `v1` to `v50` for 78each client. These can be thought of as a variable being reported over time. 79 80We're simulating an environment where there are many RAPPOR clients, and a 81single server does the RAPPOR analysis on the accumulated data. 82 83The `client` is represented by an integer ID. The `true_value` should **not** 84be sent over the network because we wish to preserve the client's privacy. 85 86 872. RAPPOR Reporting 88------------------- 89 90The `tests/rappor_sim.py` tool uses the Python client library 91(`client/python/rappor.py`) to obscure the `v1` .. `vN` strings. We want to 92infer the distribution of these strings over the entire population, but we 93don't want to know any individual values. 94 95After the RAPPOR transformation, we get another CSV file with 700,000 rows. 96Each client is assigned a cohort. 97 98**exp_out.csv** 99 100 client, cohort, rappor 101 1, 63, 1111101011110111 102 1, 15, 1110110011111100 103 1, 12, 0110101111100101 104 1, 0, 1111100111110111 105 1, 3, 1001110111110011 106 1, 14, 1011111010110011 107 1, 33, 0111010100101011 108 2, 40, 0011011010101001 109 2, 35, 1010110101110100 110 2, 58, 1110110110111110 111 2, 38, 0010001111001010 112 2, 5, 1110111011100101 113 2, 36, 0111010100111111 114 2, 39, 0101101000101101 115 3, 32, 0011100111111110 116 ... 117 118*(spaces added for clarity)* 119 120We also get a one-row CSV file that contains the RAPPOR parameters: 121 122**exp_params.csv** 123 124 k,h,m,p,q,f 125 16,2,64,0.5,0.75,0.5 126 127These are described in the [paper][]. The parameters that the clients use 128must be known to the server, or the decoding will fail. In addition, all 129clients must use the same parameters for a given variable. 130 131The `rappor_sim.py` process also writes these files: 132 133- `exp_hist.csv`: The true histogram of input values. This is used only for 134 comparison. In the real world you obviously won't have this. 135- `exp_true_inputs.txt`: A list of the unique values reported, e.g. `v1` .. 136 `v50`. You won't have this either, in general. To use RAPPOR, you must 137 supply *candidate strings*, described below. 138 1393. Report Aggregation 140--------------------- 141 142`sum_bits.py` takes the `exp_out.csv` output, and produces the "counts" file: 143 144**exp_counts.csv** 145 146 11116,6273,6433,6347,6385,6290,6621,6359,6747,6623,6321,6696,6282,6652,6368,6286,6222 147 10861,6365,6263,6170,6258,6107,6633,6171,6226,6123,6286,6254,6408,6182,6442,6195,6187 148 ... 149 150The file has 64 rows, because the simulation has 64 cohorts by default (`m = 15164`). This parameter should be adjusted based on the number of unique true 152values expected. <!-- TODO: more detail --> 153 154There are 17 columns. The left-most column is the total number of reports in 155the cohort. The remaining 16 columns correspond to the `k = 16` bits in the 156Bloom filter. Each column contains the number of reports with that bit set 157to 1. 158 159So, in general, the "counts" file is a `(k+1) * m` matrix. 160 1614. Candidate Strings 162-------------------- 163 164In the simulation, we assume that the analyst will come up with a *superset* of 165the candidate strings. This is done in the `more-candidates` / 166`print-candidates` functions in `demo.sh`. 167 168You can also test what happens if you omit true strings from the candidates, by 169editing the invocation of `print-candidates` in `run-dist`: 170 171 # Example of omitting true values. Generate candidates from 172 # exp_true_inputs.txt, omitting values v1 and v2. 173 174 print-candidates $dist 'v1|v2' > _tmp/${dist}_candidates.txt 175 176In general, coming up with candidates is an application- or metric-specific 177process. 178 179The candidates are hashed by `hash_candidates.py` to create the "map" file, 180before being passed to R for analysis. 181 182**exp_map.csv** 183 184 v1,8,13,30,22,37,37,53,53,77,67,89,86,97,97,118,128,139,136,157,<truncated> 185 v10,13,2,25,28,42,45,58,60,68,66,91,89,108,102,113,125,130,131,<truncated> 186 187The map file has one row per candidate. In this case, there are 60 rows: 18850 for the true values and 10 for "fake" values, which make the candidates a 189superset of the true input. 190 191The left most column is the raw candidate string. Then there are 128 more 192columns: for `m = 64` cohorts times `k = 2` hash functions in the Bloom filter. 193 194<!-- TODO: more detail about setting params? Examples of coming up with 195candidate strings? --> 196 1975. RAPPOR Analysis 198------------------ 199 200Once you have the `counts`, `params`, and `map` files, you can pass it to the 201`tests/analyze.R` tool, which is a small wrapper around the `analyze/R` 202library. 203 204You will get a plot of the true distribution vs. the distribution recovered 205with the RAPPOR privacy algorithm. 206 207[View the example output][example]. 208 209You can change the simulation parameters and RAPPOR parameters via flags, and 210compare the resulting distributions. 211 212For example, if you expect more unique values from clients, you should also use 213more cohorts (i.e. raise `m`), to prevent hash function collisions from 214degrading the result quality. 215 216<!-- TODO: 217 - how to change flags 218 - more detail on what the various parameters do 219 - association analysis 220 - basic RAPPOR 221 - longitudinal privacy 222--> 223 224Conclusion 225---------- 226 227RAPPOR allows you infer statistics about populations while preserving the 228privacy of individual clients. In this doc, we walked through a simple demo. 229However, there are other variations of RAPPOR and settings in which you can use 230RAPPOR, which we'll write more about. 231 232Feel free to send feedback on this doc to 233[[email protected]](https://groups.google.com/forum/#!forum/rappor-discuss). 234 235 236[README]: https://github.com/google/rappor/blob/master/README.md 237[paper]: http://arxiv.org/abs/1407.6981 238[example]: http://google.github.io/rappor/examples/report.html 239 240