{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.\n" ] } ], "source": [ "import matplotlib.pyplot as plt\n", "import arviz as az\n", "import numpy as np\n", "\n", "import os, sys\n", "sys.path.append(os.path.join(\"../../\"))\n", "\n", "import privugger as pv\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Governor case study\n", "\n", "In this tutorial, we explore the famous case study presented by [Sweeney 2002](https://dl.acm.org/doi/10.1142/S0218488502001648), in which she re-identified the medical records of the governor of Massachusetts. The problem with the dataset was that records were \"anonymized\" simply by removing identifiers. Nowadays, it is well known that this is a naïve and insufficient method for anonymizing data. Yet many real applications still use this type of \"anonymization\".\n", "\n", "This tutorial shows how privugger can be used to analyze the uniqueness of records after identifiers have been removed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Naïve anonymization program (removing identifiers)\n", "\n", "We consider a program that takes a dataset with attributes: `name`, `zip`, `age`, `gender` and `diagnosis`. In this dataset, `name` is considered an identifier (an unrealistic assumption, since in reality many people can share the same name, but it serves the purpose of the tutorial).\n", "\n", "As you can see below, the program simply returns the dataset without names." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def remove_names(names, zips, ages, genders, diagnoses):\n", "    # Drop the identifier (`names`) and return the remaining attributes\n", "    return [zips, ages, genders, diagnoses]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Input specification\n", "\n", "Note that this specification is _compositional_: we define the different parts of the dataset separately and concatenate them at the bottom.\n", "\n", "We consider a dataset with 100 records (plus the victim's record, added below). At this point, we fix the attributes of the victim's record (from now on called Alice). Alice is 25 years old, lives in ZIP code 50 (a nominal placeholder for a real ZIP code), is female, and is ill. These values are defined as constants of the form `ALICE_XXX` below.\n", "\n", "We define a constant record (using point distributions) that we may add to the specification to ensure that Alice's record is in the dataset. This step is optional. In fact, when the analysis is performed without this record, we model an adversary who doesn't know whether Alice is in the dataset.\n", "\n", "The variables `xxx_others` contain the spec for the records of the other individuals in the dataset. We have 5 different names uniformly distributed, and ages are normally distributed with the most likely age being around 55. We consider 100 different ZIP codes uniformly distributed. Gender is modeled as binary and uniformly distributed (acknowledging that this is a simplification and more genders exist). As for the diagnosis, there is a 20% chance of being ill.\n", "\n", "The following lines show how to merge Alice's record with the rest of the dataset. This also exemplifies how to compositionally define input specifications.\n", "\n", "Finally, we create the `Dataset` object with the complete input spec." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "N = 100\n", "\n", "# Victim's record\n", "\n", "## Constant values for Alice's record\n", "ALICE_NAME = 0\n", "ALICE_AGE = 25.0\n", "ALICE_ZIP = 50\n", "ALICE_GENDER = 1\n", "ALICE_DIAGNOSIS = 1\n", "\n", "## Spec for Alice's record (all point distributions with the values above)\n", "alice_name = pv.Constant(\"alice_name\", ALICE_NAME, num_elements=1)\n", "alice_age = pv.Constant(\"alice_age\", ALICE_AGE, num_elements=1)\n", "alice_zip = pv.Constant(\"alice_zip\", ALICE_ZIP, num_elements=1)\n", "alice_gender = pv.Constant(\"alice_gender\", ALICE_GENDER, num_elements=1) # 1: female, 0: male\n", "alice_diagnosis = pv.Constant(\"alice_diagnosis\", ALICE_DIAGNOSIS, num_elements=1) # 0: healthy, 1: ill\n", "\n", "# Spec for the records of others\n", "names_others = pv.DiscreteUniform(\"names_others\", 0, 5, num_elements=N)\n", "ages_others = pv.Normal(\"ages_others\", mu=55.2, std=3.5, num_elements=N)\n", "zips_others = pv.DiscreteUniform(\"zips_others\", 0, 100, num_elements=N)\n", "genders_others = pv.Bernoulli(\"genders_others\", p=.5, num_elements=N)\n", "diagnoses_others = pv.Bernoulli(\"diagnoses_others\", p=.2, num_elements=N)\n", "\n", "# Merging all in a single dataset spec\n", "names = pv.concatenate(names_others, alice_name, \"discrete\")\n", "zips = pv.concatenate(zips_others, alice_zip, \"discrete\")\n", "ages = pv.concatenate(ages_others, alice_age, \"continuous\")\n", "genders = pv.concatenate(genders_others, alice_gender, \"discrete\")\n", "diagnoses = pv.concatenate(diagnoses_others, alice_diagnosis, \"discrete\")\n", "\n", "# Dataset spec\n", "ds = pv.Dataset(input_specs=[names, zips, ages, genders, diagnoses])" ] },
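{ "cell_type": "markdown", "metadata": {}, "source": [ "Before running the analysis, it can help to sanity-check the spec above. The next cell is a quick, privugger-independent sketch: it draws one synthetic dataset from approximately the same prior using plain numpy and counts how many records share Alice's quasi-identifiers (`zip`, rounded `age`, `gender`). The generator `rng` and the `*_sample` names are our own; only `N` and the `ALICE_XXX` constants come from the cell above. If the count is typically 1, Alice's record is unique on these attributes even after names are removed, which is exactly the uniqueness that the analysis quantifies over the full input distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch only: plain-numpy stand-ins for the privugger spec above.\n", "rng = np.random.default_rng(0)\n", "\n", "# One synthetic dataset: N \"others\" plus Alice's record.\n", "zips_sample = np.append(rng.integers(0, 101, size=N), ALICE_ZIP)      # ~ DiscreteUniform(0, 100), assuming inclusive bounds\n", "ages_sample = np.append(rng.normal(55.2, 3.5, size=N), ALICE_AGE)     # ~ Normal(mu=55.2, std=3.5)\n", "genders_sample = np.append(rng.binomial(1, .5, size=N), ALICE_GENDER) # ~ Bernoulli(p=.5)\n", "\n", "# How many records are indistinguishable from Alice on the quasi-identifiers?\n", "matches = ((zips_sample == ALICE_ZIP)\n", "           & (np.round(ages_sample) == ALICE_AGE)\n", "           & (genders_sample == ALICE_GENDER))\n", "print(\"Records matching Alice's quasi-identifiers:\", matches.sum())" ] },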
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Program specification\n", "\n", "The program specification takes the input specification above and the program to analyze, `remove_names`. We give the name `'output'` to the distribution of the output of the program. In this example, it is important to remark that the output of the program is a matrix of floats (i.e., a numeric dataset). In this matrix, each row models an attribute, and each column models a record in the dataset (recall that `remove_names` returns a list of attribute columns)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "program = pv.Program('output', dataset=ds, output_type=pv.Matrix(pv.Float), function=remove_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inference\n", "\n", "We use the pymc3 backend to perform the inference." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Multiprocess sampling (2 chains in 4 jobs)\n", "CompoundStep\n", ">CompoundStep\n", ">>Metropolis: [alice_diagnosis]\n", ">>Metropolis: [alice_gender]\n", ">>Metropolis: [alice_age]\n", ">>Metropolis: [alice_zip]\n", ">>Metropolis: [zips_others]\n", ">>Metropolis: [alice_name]\n", ">>Metropolis: [names_others]\n", ">NUTS: [ages_others]\n", ">BinaryGibbsMetropolis: [genders_others, diagnoses_others]\n" ] }