{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.\n" ] } ], "source": [ "import matplotlib.pyplot as plt\n", "import arviz as az\n", "import numpy as np\n", "\n", "# add the directory containing the privugger code to the path\n", "import os, sys\n", "sys.path.append(os.path.join(\"../../\"))\n", "\n", "import privugger as pv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Example of using Privugger on OpenDP " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial shows how to use privugger on a program that uses the differential privacy library [OpenDP](https://github.com/opendp).\n", "\n", "## OpenDP program\n", "\n", "We consider a program that takes as input a dataset with the attributes age, sex, education, race, income and marital status. The program outputs the mean of the incomes and adds Laplacian noise to protect the individuals' privacy.\n", "\n", "For each attribute, the program takes a parameter of type array (int or float) with the attribute value for each individual in the dataset. For example, to model a dataset of size 2 where the first individual is 20 years old and the second is 40, we set the `age` input parameter as `age=[20,40]`. The remaining parameters are defined in the same way. The last parameter `N` indicates the number of records in the dataset.\n", "\n", "This way of defining the input may seem unnatural but, as we will see below, it provides a structured way to specify the prior over the program's inputs.\n", "\n", "Furthermore, note that the first lines of `dp_program` simply define a pandas dataframe. This snippet of code can be reused in other programs. The part of the code after the comment `## After here the...` can contain arbitrary code working on a pandas dataframe with the attributes defined in the parameters of the program." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def dp_program(age, sex, educ, race, income, married, N):\n", "    import opendp.smartnoise.core as sn\n", "    import pandas as pd\n", "\n", "    # assert that all vectors have the same size\n", "    assert age.size == sex.size == educ.size == race.size == income.size == married.size\n", "\n", "    ## Dataframe definition (can be automated)\n", "    temp_file = 'temp.csv'\n", "    var_names = [\"age\", \"sex\", \"educ\", \"race\", \"income\", \"married\"]\n", "    data = {\n", "        \"age\": age,\n", "        \"sex\": sex,\n", "        \"educ\": educ,\n", "        \"race\": race,\n", "        \"income\": income,\n", "        \"married\": married\n", "    }\n", "    df = pd.DataFrame(data, columns=var_names)\n", "\n", "    ## After here the program works on a pandas dataframe\n", "    df.to_csv(temp_file)\n", "    with sn.Analysis() as analysis:\n", "        # load data\n", "        data = sn.Dataset(path=temp_file, column_names=var_names)\n", "\n", "        # get mean income with Laplacian noise (epsilon=.1 arbitrarily chosen)\n", "        income_mean = sn.dp_mean(data=sn.to_float(data['income']),\n", "                                 privacy_usage={'epsilon': .1},\n", "                                 data_lower=0.,    # min income\n", "                                 data_upper=200.,  # max income\n", "                                 data_rows=N\n", "                                 )\n", "    analysis.release()\n", "    return np.float64(income_mean.value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Input specification\n", "\n", "The next step is to specify the prior knowledge of the attacker using probability distributions (i.e., the inputs are defined as _random variables_). 
In this example, we show how to specify each of the attributes that make up the dataset.\n", "\n", "The variable `N` defines the size of the dataset (`N_rv` is a point distribution with all probability mass concentrated at `N`; this is necessary because the input specification must be composed of random variables). In this example, we consider a dataset of size 150.\n", "\n", "For non-numeric attributes such as `sex`, `educ` or `race`, we use distributions over natural numbers with each number denoting a category. We treat them as nominal values (i.e., we assume there is no order relation $\\leq$ among them). For these attributes (and `married`) we specify a uniform distribution over all possible categories.\n", "\n", "For `age`, we set a binomial prior with support 0 to 120; this distribution gives the highest probability to ages close to 60 years. We remark that this prior may be refined by using statistical data about the age of the population.\n", "\n", "Finally, `income` is distributed according to a Normal distribution with mean 100 and standard deviation 5. This gives high probability to values close to 100 (i.e., 100k DKK)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "N = 150\n", "N_rv = pv.Constant('N', N)\n", "age = pv.Binomial('age', p=0.5, n=120, num_elements=N)\n", "sex = pv.DiscreteUniform('sex', 0, 2, num_elements=N)\n", "educ = pv.DiscreteUniform('educ', 0, 10, num_elements=N)\n", "race = pv.DiscreteUniform('race', 0, 50, num_elements=N)\n", "income = pv.Normal('income', mu=100, std=5, num_elements=N)\n", "married = pv.DiscreteUniform('married', 0, 1, num_elements=N)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset specification\n", "\n", "The input specification above is always wrapped into a `Dataset` object. This object is used as the input to the program." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "ds = pv.Dataset(input_specs=[age, sex, educ, race, income, married, N_rv])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Program specification\n", "\n", "The program specification takes the `Dataset` above and the program to analyze. To this end, we define a `Program` object including the `Dataset` and a Python function corresponding to the program to analyze (the input parameters of the function must match those of the `Dataset`). The first parameter of the `Program` constructor is the name of the output distribution (i.e., the distribution of the output of the program under analysis).\n", "\n", "The parameter `output_type` specifies the type of the output of the program; in this case it is a floating point number, since the program computes the mean income. The parameter `function` is the program defined above." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "program = pv.Program('output', dataset=ds, output_type=pv.Float, function=dp_program)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inference\n", "\n", "Lastly, we use the privugger interface to perform the inference. This is done by calling `infer` and specifying the `program`, the number of cores, the number of chains, the number of draws, and the backend (pymc3 in this example)."
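, "\n", "\n", "For reference, a minimal sketch of such a call is shown below. The keyword names (`cores`, `chains`, `draws`, `method`) and the result variable are illustrative assumptions based on the description above; check the privugger documentation for the exact signature.\n", "\n", "```python\n", "# Illustrative sketch only: keyword names and the number of draws are assumptions,\n", "# not the definitive privugger API. chains=2 and cores=4 match the sampling log below.\n", "trace = pv.infer(program, cores=4, chains=2, draws=1000, method='pymc3')\n", "```"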
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Multiprocess sampling (2 chains in 4 jobs)\n", "CompoundStep\n", ">CompoundStep\n", ">>Metropolis: [N]\n", ">>Metropolis: [married]\n", ">>Metropolis: [race]\n", ">>Metropolis: [educ]\n", ">>Metropolis: [sex]\n", ">>Metropolis: [age]\n", ">NUTS: [income]\n" ] }, { "data": { "text/html": [ "\n", "