{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.\n" ] } ], "source": [ "# disable FutureWarnings to have a cleaner output\n", "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "\n", "# move to previouse directory to access the privugger code\n", "import os, sys\n", "sys.path.append(os.path.join(\"../../\"))\n", "\n", "# external libraries used in the notebook\n", "import numpy as np\n", "import arviz as az\n", "import matplotlib.pyplot as plt\n", "\n", "# privugger library\n", "import privugger as pv" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Getting started with privugger\n", "\n", "This tutorial shows a simple example on using privugger to analyze a program that computes the average of a set of ages. Privugger is based on the privug method introduced in [this paper](https://arxiv.org/abs/2011.08742)-you may want to have a look at the paper if some of the concepts explained below are new to you." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Program to analyze\n", "\n", "Privugger is a tool to quantify privacy risks in python _programs_. Consequently, the very first step is to define the program we want to analyze.\n", "\n", "In this example, we consider a program computing the average of a list of ages. The program takes as input a `numpy` array whose elements are of type `Float` (each element correspond to an age)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def avg(ages):\n", " return (ages.sum()) / (ages.size)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "At first sight, one might think that releasing publicly the output of the program should no pose any privacy. 
Let's verify whether this is the case with Privugger." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Input specification (attacker knowledge about the input)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "The next step in using privugger is to specify the knowledge of our (assumed) attacker about the input of the program. In other words, what are our assumptions about what the attacker knows about the ages used in this program?\n", "\n", "In this example, we assume that the dataset contains the ages of 100 individuals, and that the attacker believes that the most likely age for the individuals is 35, but she is not certain about it.\n", "\n", "The above attacker knowledge can be modeled with a [Normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) with mean 35 and some standard deviation to account for the uncertainty (in this case we choose a standard deviation of 2)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# input spec\n", "ages = pv.Normal('ages',mu=35,std=2,num_elements=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After we define the specification of the input, we wrap everything into a _dataset_. Datasets are objects modeling the input of the programs we want to analyze. In this case, our program of interest takes an array of ages. Consequently, the dataset constructor takes as its only input the specification of this array (defined in the cell above).\n", "\n", "If the program under analysis had more parameters, the `input_specs` parameter below should have as many elements as parameters in the program. The elements in `input_specs` do not need to be arrays; they can also be scalars. For instance, if `ages` is defined with `num_elements=1`, then it models a single age." 
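, "\n", "\n", "Before wrapping the specification into a dataset, it can help to see what the `ages` prior above means concretely. The following is a plain `numpy` sketch, independent of privugger (the fixed seed is arbitrary and chosen only for reproducibility): we draw one dataset of 100 ages from the attacker's Normal(35, 2) belief and apply the program to it.\n", "\n", "```python\n", "import numpy as np\n", "\n", "rng = np.random.default_rng(0)  # fixed seed, illustration only\n", "\n", "# one draw from the attacker's prior: 100 ages ~ Normal(mu=35, std=2)\n", "ages_sample = rng.normal(loc=35, scale=2, size=100)\n", "\n", "# the program under analysis: the average of the sampled ages\n", "print(ages_sample.sum() / ages_sample.size)\n", "```\n", "\n", "The sampled average lands close to 35, since the standard deviation of the mean of 100 such draws is only 0.2.\n"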
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# create a privugger dataset\n", "ds = pv.Dataset(input_specs=[ages])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Program specification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is program specification. That is, we put together the program's code and the specification of the input. Additionally, we need to indicate the type of the output of the program.\n", "\n", "In the cell below, we create a `Program` object that encapsulates the program specification. The first parameter specifies the name of the distribution of the output of this program. The parameter `dataset` takes a privugger `Dataset` matching the signature of the program to analyze. This program is specified in `function` which may be a python method or a file containing the method to analyze. Finally, in `output_type` we specify the type of the output of the program.\n", "\n", "The program output distribution is automatically named `'output'`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "program = pv.Program('output',\n", " dataset=ds, \n", " output_type=pv.Float, \n", " function=avg)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Observation specification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this step, we may add an _observation_ to any of the program variables. Observations are predicates modeling binary inequalities. They have the form $a \\, R \\, b$ with $a, b \\in \\mathbb{N}$ or $a,b \\in \\mathbb{R}$ and $R \\in \\{<,>,=<,>=,==\\}$. Additionally, observation have a `precision` parameter that allow us to model the strength of the observation. This is important as, e.g., in continuous domains the probability of any single outcome is 0. 
A precision of 0 requires the observation to hold with probability 1; the larger the precision, the lower the probability required for the observation to hold.\n", "\n", "Here we add an observation saying that the output equals 44, with a precision of 0.1." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "program.add_observation('output==44', precision=0.1)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Inference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this step, we infer the distributions of all the variables we defined above. We may choose different backends for this purpose. Currently we support [PyMC3](https://docs.pymc.io/en/stable/) (used in this example) and [scipy](https://www.scipy.org/).\n", "\n", "The output of the `infer` method is an `InferenceData` object compatible with many statistics libraries for plotting or computing statistics (see the section below)." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Auto-assigning NUTS sampler...\n", "Initializing NUTS using jitter+adapt_diag...\n", "Initializing NUTS failed. Falling back to elementwise auto-assignment.\n", "Multiprocess sampling (2 chains in 4 jobs)\n", "Slice: [ages]\n" ] }, { "data": { "text/html": [ "\n", "