{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.\n" ] } ], "source": [ "import matplotlib.pyplot as plt\n", "import arviz as az\n", "import numpy as np\n", "\n", "# add the directory two levels up to the path to access the privugger code\n", "import os, sys\n", "sys.path.append(os.path.join(\"../../\"))\n", "\n", "import privugger as pv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Example of using Privugger on OpenDP " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial shows how to use privugger on a program that uses the differential privacy library [OpenDP](https://github.com/opendp).\n", "\n", "## OpenDP program\n", "\n", "We consider a program that takes as input a dataset with the attributes age, sex, education, race, income and marriage status. The program outputs the mean of the incomes and adds Laplacian noise to protect the privacy of the individuals.\n", "\n", "For each attribute, the program takes an array parameter (of ints or floats) holding the attribute value of each individual in the dataset. For example, to model a dataset of size 2 where the first individual is 20 years old and the second is 40, we set the `age` input parameter to `age=[20,40]`. The remaining parameters are defined in the same way. The last parameter `N` indicates the number of records in the dataset.\n", "\n", "This way of defining the input may seem unnatural, but, as we will see below, it gives a structured way to specify the prior of the program.\n", "\n", "Furthermore, note that the first lines of `dp_program` simply define a pandas dataframe. This snippet of code can be adapted to other programs. The part of the code after the comment `## After here the...` can contain arbitrary code working on a pandas dataframe with the attributes defined in the parameters of the program." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def dp_program(age, sex, educ, race, income, married, N):\n", "    import opendp.smartnoise.core as sn\n", "    import pandas as pd\n", "\n", "    # assert that all vectors have the same size\n", "    assert age.size == sex.size == educ.size == race.size == income.size == married.size\n", "    \n", "    ## Dataframe definition (can be automated)\n", "    temp_file='temp.csv'\n", "    var_names = [\"age\", \"sex\", \"educ\", \"race\", \"income\", \"married\"]\n", "    data = {\n", "        \"age\": age,\n", "        \"sex\": sex,\n", "        \"educ\": educ,\n", "        \"race\": race,\n", "        \"income\": income,\n", "        \"married\": married\n", "    }\n", "    df = pd.DataFrame(data,columns=var_names)\n", "    \n", "    ## After here the program works on a pandas dataframe\n", "    df.to_csv(temp_file)\n", "    with sn.Analysis() as analysis:\n", "        # load data\n", "        data = sn.Dataset(path=temp_file,column_names=var_names)\n", "\n", "        # get mean income with laplacian noise (epsilon=.1 arbitrarily chosen)\n", "        income_mean = sn.dp_mean(data = sn.to_float(data['income']),\n", "                                 privacy_usage = {'epsilon': .1},\n", "                                 data_lower = 0., # min income\n", "                                 data_upper = 200., # max income\n", "                                 data_rows = N\n", "                                )\n", "    analysis.release()\n", "    return np.float64(income_mean.value) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Input specification\n", "\n", "The next step is to specify the prior knowledge of the attacker using probability distributions (i.e., the inputs are defined as _random variables_). In this example, we specify a prior for each of the attributes that make up the dataset.\n", "\n", "The variable `N` defines the size of the dataset (`N_rv` is a point distribution with all probability mass concentrated at `N`; this is necessary because the input specification must be composed of random variables). 
In this example, we consider a dataset of size 150.\n", "\n", "For non-numeric attributes such as `sex`, `educ` or `race` we use distributions over natural numbers, with each number denoting a category. We treat them as nominal values (i.e., we assume there is no order relation $\\leq$ among them). For these attributes (and `married`) we specify a uniform distribution over all possible categories.\n", "\n", "For `age`, we set a binomial prior with `n=120` and `p=0.5`, whose support is 0 to 120; this distribution gives the highest probability to ages close to 60. We remark that this prior may be refined by using statistical data about age.\n", "\n", "Finally, `income` is distributed according to a Normal distribution with mean 100 and standard deviation 5. This gives high probability to values close to 100 (i.e., 100k DKK)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "N = 150\n", "N_rv = pv.Constant('N', N)\n", "age = pv.Binomial('age', p=0.5, n=120, num_elements=N)\n", "sex = pv.DiscreteUniform('sex', 0,2,num_elements=N)\n", "educ = pv.DiscreteUniform('educ', 0,10, num_elements=N)\n", "race = pv.DiscreteUniform('race', 0,50, num_elements=N)\n", "income = pv.Normal('income', mu=100, std=5, num_elements=N)\n", "married = pv.DiscreteUniform('married', 0,1,num_elements=N)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset specification \n", "\n", "The input specification above must be wrapped in a `Dataset` object. This object is used as the input for the program." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "ds = pv.Dataset(input_specs = [age, sex, educ, race, income, married, N_rv])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Program specification\n", "\n", "The program specification takes the `Dataset` above and the program to analyze. 
To this end, we define a `Program` object from the `Dataset` and a Python function corresponding to the program to analyze (the input parameters of the function must match the input specifications in the `Dataset`). Finally, it is necessary to specify the type of the program's output. In this case, since the program computes the mean income, the output is a float. The first parameter of the `Program` constructor is the name of the output distribution (i.e., the distribution of the output of the program under analysis).\n", "\n", "In the call below, `output_type=pv.Float` declares this floating-point output and `function=dp_program` is the program defined above." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "program = pv.Program('output',dataset=ds, output_type=pv.Float, function=dp_program)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inference\n", "\n", "Lastly, we use the privug interface to perform the inference. This is done by calling `infer` and specifying the `program`, the number of cores, the number of chains, the number of draws, and the backend (PyMC3 in this example)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Multiprocess sampling (2 chains in 4 jobs)\n", "CompoundStep\n", ">CompoundStep\n", ">>Metropolis: [N]\n", ">>Metropolis: [married]\n", ">>Metropolis: [race]\n", ">>Metropolis: [educ]\n", ">>Metropolis: [sex]\n", ">>Metropolis: [age]\n", ">NUTS: [income]\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [42000/42000 02:59<00:00 Sampling 2 chains, 0 divergences]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Sampling 2 chains for 1_000 tune and 20_000 draw iterations (2_000 + 40_000 draws total) took 180 seconds.\n", "/home/pardo/.local/lib/python3.8/site-packages/arviz/stats/diagnostics.py:561: RuntimeWarning: invalid value encountered in double_scalars\n", " (between_chain_variance / within_chain_variance + num_samples - 1) / (num_samples)\n", "/home/pardo/.local/lib/python3.8/site-packages/xarray/core/nputils.py:227: RuntimeWarning: All-NaN slice encountered\n", " result = getattr(npmodule, name)(values, axis=axis, **kwargs)\n", "The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.\n", "The estimated number of effective samples is smaller than 200 for some parameters.\n" ] } ], "source": [ "trace = pv.infer(program, cores=4, chains=2, draws=20000, method='pymc3')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Privacy Risk Analysis\n", "\n", "### Mutual Information\n", "\n", "In this tutorial we quantify privacy risks using mutual information (privugger can also be used to compute many other leakage measures).\n", "\n", "We study the risks for the individual in the first record. We compute the mutual information between the output of the program (mean income + Laplace noise) and each attribute of that record.\n", "\n", "Since the mutual information estimator we use is not exact, we compute 100 estimates and use box plots to get an impression of the accuracy of the estimates." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "trace_length=20000\n", "attrs=['age','sex','race','educ','married','income']\n", "\n", "# flatten the chains into a single array of samples per variable\n", "trace_attr = lambda attr : np.concatenate(trace.posterior[attr],axis=0)\n", "\n", "# 100 mutual information estimates per attribute (first record only, index 0)\n", "y=[[pv.mi_sklearn([trace_attr(attr)[:trace_length,0], trace_attr('output')[:trace_length]],\n", "                   n_neigh=40,input_inferencedata=False)[0]\n", "    for attr in attrs] for _ in range(0,100)]\n", "\n", "# one box per attribute; boxplot places the boxes at positions 1..len(attrs)\n", "plt.boxplot(np.array(y),\n", "            showmeans=False, showfliers=False, patch_artist=True, vert=True)\n", "plt.xticks(range(1,len(attrs)+1), attrs)\n", "plt.xlabel('attr')\n", "plt.ylabel('$I(attr_0;output)$')\n", "plt.title(\"Mutual information between the output and each attribute of the first record\")\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the figure above, we observe that the mutual information between the output and each of the attributes is very low: $I(\\mathit{attr};\\mathit{output})\\leq 0.005$ for every attribute $\\mathit{attr}$." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }