1# Private Join and Compute 2 3This project contains an implementation of the "Private Join and Compute" 4functionality. This functionality allows two users, each holding an input file, 5to privately compute the sum of associated values for records that have common 6identifiers. 7 8In more detail, suppose a Server has a file containing the following 9identifiers: 10 11Identifiers | 12----------- | 13Sam | 14Ada | 15Ruby | 16Brendan | 17 18And a Client has a file containing the following identifiers, paired with 19associated integer values: 20 21Identifiers | Associated Values 22----------- | ----------------- 23Ruby | 10 24Ada | 30 25Alexander | 5 26Mika | 35 27 28Then the Private Join and Compute functionality would allow the Client to learn 29that the input files had *2* identifiers in common, and that the associated 30values summed to *40*. It does this *without* revealing which specific 31identifiers were in common (Ada and Ruby in the example above), or revealing 32anything additional about the other identifiers in the two parties' data set. 33 34Private Join and Compute is a variant of the well-studied Private Set 35Intersection functionality. We sometimes also refer to Private Join and Compute 36as Private Intersection-Sum. 37 38## How to run the protocol 39 40In order to run Private Join and Compute, you need to install Bazel, if you 41don't have it already. 42[Follow the instructions for your platform on the Bazel website.](https://docs.bazel.build/versions/master/install.html) 43 44You also need to install Git, if you don't have it already. 45[Follow the instructions for your platform on the Git website.](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) 46 47Once you've installed Bazel and Git, open a Terminal and clone the Private Join 48and Compute repository into a local folder: 49 50```shell 51git clone https://github.com/google/private-join-and-compute.git 52``` 53 54Navigate into the `private-join-and-compute` folder you just created, and build 55the Private Join and Compute library and dependencies using Bazel: 56 57```bash 58cd private-join-and-compute 59bazel build //private_join_and_compute:all 60``` 61 62(All the following instructions must be run from inside the 63private-join-and-compute folder.) 64 65Next, generate some dummy data to run the protocol on: 66 67```shell 68bazel-bin/private_join_and_compute/generate_dummy_data --server_data_file=/tmp/dummy_server_data.csv \ 69--client_data_file=/tmp/dummy_client_data.csv 70``` 71 72This will create dummy data for the server and client at the specified 73locations. You can look at the files in `/tmp/dummy_server_data.csv` and 74`/tmp/dummy_client_data.csv` to see the dummy data that was generated. You can 75also change the size of the dummy data generated using additional flags. For 76example: 77 78```shell 79bazel-bin/private_join_and_compute/generate_dummy_data \ 80--server_data_file=/tmp/dummy_server_data.csv \ 81--client_data_file=/tmp/dummy_client_data.csv --server_data_size=1000 \ 82--client_data_size=1000 --intersection_size=200 --max_associated_value=100 83``` 84 85Once you've generated dummy data, you can start the server as follows: 86 87```shell 88bazel-bin/private_join_and_compute/server --server_data_file=/tmp/dummy_server_data.csv 89``` 90 91The server will load data from the specified file, and wait for a connection 92from the client. 93 94Once the server is running, you can start a client to connect to the server. 95Create a new terminal and navigate to the private-join-and-compute folder. Once 96there, run the following command to start the client: 97 98```shell 99bazel-bin/private_join_and_compute/client --client_data_file=/tmp/dummy_client_data.csv 100``` 101 102The client will connect to the server and execute the steps of the protocol 103sequentially. At the end of the protocol, the client will output the 104Intersection Size (the number of identifiers in common) and the Intersection Sum 105(the sum of associated values). If the protocol was successful, both the server 106and client will shut down. 107 108## Caveats 109 110Several caveats should be carefully considered before using Private Join and 111Compute. 112 113### Security Model 114 115Our protocol has security against honest-but-curious adversaries. This means 116that as long as both participants follow the protocol honestly, neither will 117learn more than the size of the intersection and the intersection-sum. However, 118if a participant deviates from the protocol, it is possible they could learn 119more than the prescribed information. For example, they could learn the specific 120identifiers in the intersection. If the underlying data is sensitive, we 121recommend performing a careful risk analysis before using Private Join and 122Compute, to ensure that neither party has an incentive to deviate from the 123protocol. The protocol can also be supplemented with external enforcement such 124as code audits to ensure that no party deviates from the protocol. 125 126### Maliciously Chosen Inputs 127 128We note that our protocol does not authenticate that parties use "real" input, 129nor does it prevent them from arbitrarily changing their input. We suggest 130careful analysis of whether any party has an incentive to lie about their 131inputs. This risk can also be mitigated by external enforcement such as code 132audits. 133 134### Leakage from the Intersection-Sum. 135 136While the Private Join and Compute functionality is supposed to reveal only the 137intersection-size and intersection-sum, it is possible that the intersection-sum 138itself could reveal something about which identifiers were in common. 139 140For example, if an identifier has a very unique associated integer values, then 141it may be easy to detect if that identifier was in the intersection simply by 142looking at the intersection-sum. One way this could happen is if one of the 143identifiers has a very large associated value compared to all other identifiers. 144In that case, if the intersection-sum is large, one could reasonably infer that 145that identifier was in the intersection. To mitigate this, we suggest scrubbing 146inputs to remove identifiers with "outlier" values. 147 148Another way that the intersection-sum may leak which identifiers are in the 149intersection is if the intersection is too small. This could make it easier to 150guess which combination of identifiers could be in the intersection in order to 151yield a particular intersection-sum. To mitigate this, one could abort the 152protocol if the intersection-size is below a certain threshold, or to add noise 153to the output of the protocol. 154 155(Note that these mitigations are not currently implemented in this open-source 156library.) 157 158## Disclaimers 159 160This is not an officially supported Google product. The software is provided 161as-is without any guarantees or warranties, express or implied. 162