{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.\n" ] } ], "source": [ "import matplotlib.pyplot as plt\n", "import arviz as az\n", "import numpy as np\n", "\n", "import os, sys\n", "sys.path.append(os.path.join(\"../../\"))\n", "\n", "import privugger as pv\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# Governor case study\n", "\n", "In this tutorial, we explore the famous case study presented by [Sweeney 2002](https://dl.acm.org/doi/10.1142/S0218488502001648), in which she re-identified the medical records of the governor of Massachusetts. The problem with the dataset was that records were \"anonymized\" simply by removing identifiers. Nowadays, it is well known that this is a naïve and insufficient method for anonymizing data. Yet many real applications still use this type of \"anonymization\". \n", "\n", "This tutorial shows how privugger can be used to analyze the uniqueness of records after identifiers have been removed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Naïve anonymization program (removing identifiers)\n", "\n", "We consider a program that takes a dataset with the attributes `name`, `zip`, `age`, `gender` and `diagnosis`. In this dataset, `name` is considered an identifier (this is an unrealistic assumption, since in reality there can be many people with the same name, but it serves the purpose of the tutorial). \n", "\n", "As you can see below, the program simply returns a dataset without names." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def remove_names(names, zips, ages, genders, diagnoses):\n", "    output = []\n", "    output.append(zips)\n", "    output.append(ages)\n", "    output.append(genders)\n", "    output.append(diagnoses)\n", "    return output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Input specification\n", "\n", "Note that this specification is _compositional_: we define different parts of the dataset and concatenate them at the bottom.\n", "\n", "We consider a dataset with 100 records. At this point, we fix the attributes of the victim's record (from now on called Alice). Alice is 25 years old, lives in ZIP code 50 (this is just a nominal placeholder for a real zip code), is female and is ill. These values are defined as constants of the form `ALICE_XXX` below.\n", "\n", "We define a constant record (using point distributions) that we may add to the specification to ensure that Alice's record is in the dataset. This step is optional. In fact, when the analysis is performed without this record, we model an adversary who doesn't know whether Alice is in the dataset.\n", "\n", "The variables `xxx_others` contain the spec for the other records in the dataset. Names are drawn uniformly from 5 different values; ages are normally distributed, with the most likely age being around 55. We consider 100 different ZIP codes, uniformly distributed. There are two genders (a simplification, since more genders exist), also uniformly distributed. As for the diagnosis, there is a 20% chance of being ill.\n", "\n", "The following lines show how to merge Alice's record with the rest of the dataset. This also exemplifies how to define input specifications compositionally.\n", "\n", "Finally, we create the `Dataset` object with the complete input spec." 
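, "\n", "\n", "As a sanity check on this specification, the expected number of records that share one of Alice's attribute values can be worked out by hand. This is only a sketch under stated assumptions: the symbols $C_a$ and $p_a$ are notation introduced here (they are not part of privugger), Alice's record is always present, and the other $N$ records are independent draws from the distributions described above.\n", "\n", "$$\\mathbb{E}[C_a] = 1 + N \\cdot p_a$$\n", "\n", "Here $C_a$ counts the records that agree with Alice on attribute $a$, and $p_a$ is the probability that one of the other records takes Alice's value. With $N = 100$, this gives $1 + 100 \\cdot 0.5 = 51$ for `gender` and $1 + 100 \\cdot 0.2 = 21$ for `diagnosis`. For `zip`, assuming that the bounds of `DiscreteUniform` are inclusive (as they are in PyMC3, which would give 101 possible values), we get $1 + 100/101 \\approx 1.99$. For `age`, which is continuous, $p_a = 0$, so only Alice's own record is expected to match."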
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "N = 100\n", "\n", "# Victim's record\n", "\n", "## Constant values for Alice's record\n", "ALICE_NAME = 0\n", "ALICE_AGE = 25.0\n", "ALICE_ZIP = 50\n", "ALICE_GENDER = 1\n", "ALICE_DIAGNOSIS = 1\n", "\n", "## Spec for Alice's record (all point distributions with the values above)\n", "alice_name = pv.Constant(\"alice_name\",ALICE_NAME, num_elements=1)\n", "alice_age = pv.Constant(\"alice_age\", ALICE_AGE, num_elements=1)\n", "alice_zip = pv.Constant(\"alice_zip\",ALICE_ZIP, num_elements=1)\n", "alice_gender = pv.Constant(\"alice_gender\",ALICE_GENDER, num_elements=1) # 1: female, 0: male\n", "alice_diagnosis = pv.Constant(\"alice_diagnosis\", ALICE_DIAGNOSIS, num_elements=1) # 0: healthy, 1: ill\n", "\n", "# Spec for the records of others \n", "names_others = pv.DiscreteUniform(\"names_others\", 0, 5, num_elements=N)\n", "ages_others = pv.Normal(\"ages_others\", mu=55.2, std=3.5, num_elements=N)\n", "zips_others = pv.DiscreteUniform(\"zips_others\", 0, 100, num_elements=N)\n", "genders_others = pv.Bernoulli(\"genders_others\",p=.5,num_elements=N) \n", "diagnoses_others = pv.Bernoulli(\"diagnoses_others\", p=.2,num_elements=N)\n", "\n", "# Merging all in a single dataset spec\n", "names = pv.concatenate(names_others, alice_name, \"discrete\")\n", "zips = pv.concatenate(zips_others, alice_zip, \"discrete\")\n", "ages = pv.concatenate(ages_others, alice_age, \"continuous\")\n", "genders = pv.concatenate(genders_others, alice_gender, \"discrete\")\n", "diagnoses = pv.concatenate(diagnoses_others, alice_diagnosis, \"discrete\")\n", "\n", "# Dataset spec\n", "ds = pv.Dataset(input_specs = [names, zips, ages, genders, diagnoses])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Program specification\n", "\n", "The program specification takes the input specification above, and the program to analyze `remove_names`. 
We give the name `'output'` to the distribution of the output of the program. In this example, it is important to remark that the output of the program is a matrix of floats (i.e., a numeric dataset). In this matrix, each row models a row in the dataset, and each column models an attribute." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "program = pv.Program('output', dataset=ds, output_type=pv.Matrix(pv.Float), function=remove_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inference\n", "\n", "We use the pymc3 backend to perform the inference." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Multiprocess sampling (2 chains in 4 jobs)\n", "CompoundStep\n", ">CompoundStep\n", ">>Metropolis: [alice_diagnosis]\n", ">>Metropolis: [alice_gender]\n", ">>Metropolis: [alice_age]\n", ">>Metropolis: [alice_zip]\n", ">>Metropolis: [zips_others]\n", ">>Metropolis: [alice_name]\n", ">>Metropolis: [names_others]\n", ">NUTS: [ages_others]\n", ">BinaryGibbsMetropolis: [genders_others, diagnoses_others]\n" ] }, { "data": { "text/html": [ "\n", "
\n", " \n", " \n", " 100.00% [22000/22000 03:21<00:00 Sampling 2 chains, 0 divergences]\n", "
\n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Sampling 2 chains for 1_000 tune and 10_000 draw iterations (2_000 + 20_000 draws total) took 202 seconds.\n", "/home/pardo/.local/lib/python3.8/site-packages/arviz/stats/diagnostics.py:561: RuntimeWarning: invalid value encountered in double_scalars\n", " (between_chain_variance / within_chain_variance + num_samples - 1) / (num_samples)\n", "/home/pardo/.local/lib/python3.8/site-packages/xarray/core/nputils.py:227: RuntimeWarning: All-NaN slice encountered\n", " result = getattr(npmodule, name)(values, axis=axis, **kwargs)\n", "/home/pardo/.local/lib/python3.8/site-packages/xarray/core/nputils.py:227: RuntimeWarning: All-NaN slice encountered\n", " result = getattr(npmodule, name)(values, axis=axis, **kwargs)\n", "/home/pardo/.local/lib/python3.8/site-packages/xarray/core/nputils.py:227: RuntimeWarning: All-NaN slice encountered\n", " result = getattr(npmodule, name)(values, axis=axis, **kwargs)\n", "/home/pardo/.local/lib/python3.8/site-packages/xarray/core/nputils.py:227: RuntimeWarning: All-NaN slice encountered\n", " result = getattr(npmodule, name)(values, axis=axis, **kwargs)\n", "/home/pardo/.local/lib/python3.8/site-packages/xarray/core/nputils.py:227: RuntimeWarning: All-NaN slice encountered\n", " result = getattr(npmodule, name)(values, axis=axis, **kwargs)\n", "The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.\n", "The estimated number of effective samples is smaller than 200 for some parameters.\n" ] } ], "source": [ "trace = pv.infer(program, cores=4, draws=10000, method='pymc3', return_model=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Privacy Risk Analysis\n", "\n", "In this tutorial, we focus on uniqueness queries. 
That is, we quantify how unique Alice's record is based on her attribute values.\n", "\n", "### How many records share Alice's attribute values?\n", "\n", "First, we compute the average number of records that have the same values as Alice's record.\n", "\n", "Currently, privugger does not have built-in functions to compute this type of query, so we use the information in the trace directly." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Avg. number of rows with gov's name: 16.20235\n", "Avg. number of rows with gov's zip: 1.95685\n", "Avg. number of rows with gov's age: 1.0\n", "Avg. number of rows with gov's gender: 50.98105\n", "Avg. number of rows with gov's diagnosis: 21.01775\n" ] } ], "source": [ "trace_attr = lambda attr : np.concatenate([trace.posterior['output'][i][:,attr,:] for i in [0,1]])\n", "\n", "# Names are not part of the program output, so they are read from the input spec;\n", "# this covers only the N other records (Alice's own row is excluded)\n", "names_db = np.concatenate([trace.posterior['names_others'].values[i] for i in [0,1]])\n", "zips_db = trace_attr(0)\n", "ages_db = trace_attr(1)\n", "genders_db = trace_attr(2)\n", "diagnoses_db = trace_attr(3)\n", "\n", "\n", "print(\"Avg. number of rows with gov's name: \",sum([np.count_nonzero(names_db[i]==ALICE_NAME) for i in range(0,len(names_db))])/len(names_db))\n", "print(\"Avg. number of rows with gov's zip: \",sum([np.count_nonzero(zips_db[i]==ALICE_ZIP) for i in range(0,len(zips_db))])/len(zips_db))\n", "print(\"Avg. number of rows with gov's age: \",sum([np.count_nonzero(ages_db[i]==ALICE_AGE) for i in range(0,len(ages_db))])/len(ages_db)) # 1.0 is expected\n", "print(\"Avg. number of rows with gov's gender: \",sum([np.count_nonzero(genders_db[i]==ALICE_GENDER) for i in range(0,len(genders_db))])/len(genders_db))\n", "print(\"Avg. 
number of rows with gov's diagnosis: \",sum([np.count_nonzero(diagnoses_db[i]==ALICE_DIAGNOSIS) for i in range(0,len(diagnoses_db))])/len(diagnoses_db))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attribute-based uniqueness histograms\n", "\n", "A more interesting perspective is to plot histograms showing the probability that $n$ records share Alice's attribute values.\n", "\n", "Here we focus on the attributes `gender` and `zip`. The plots below show the probability of having $n \\in \\mathbb{N}$ records with: 1) the same gender as Alice, 2) the same zip code as Alice, 3) the same gender _and_ zip code as Alice." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "total_samples=len(genders_db)\n", "total_N=N\n", "\n", "count_gov_genders=np.array([np.count_nonzero(\n", "    (genders_db[i]==ALICE_GENDER)\n", ") for i in range(0,total_samples)])\n", "\n", "count_gov_zips=np.array([np.count_nonzero(\n", "    (zips_db[i]==ALICE_ZIP)\n", ") for i in range(0,total_samples)])\n", "\n", "count_gov_genders_zips=np.array([np.count_nonzero(\n", "    (zips_db[i]==ALICE_ZIP) &\n", "    (genders_db[i]==ALICE_GENDER)\n", ") for i in range(0,total_samples)])\n", "\n", "\n", "y1=[np.count_nonzero(count_gov_genders==i)/len(count_gov_genders) for i in range(0,total_N)]\n", "y2=[np.count_nonzero(count_gov_zips==i)/len(count_gov_zips) for i in range(0,total_N)]\n", "y3=[np.count_nonzero(count_gov_genders_zips==i)/len(count_gov_genders_zips) for i in range(0,total_N)]\n", "plt.bar(range(0,total_N),y1)\n", "plt.title(\"P(#(GENDER = ALICE_GENDER))\")\n", "plt.xlabel(\"# of records\")\n", "plt.show()\n", "plt.bar(range(0,total_N),y2)\n", "plt.title(\"P(#(ZIP = ALICE_ZIP))\")\n", "plt.xlim((-1,25))\n", "plt.xlabel(\"# of records\")\n", "plt.show()\n", "plt.bar(range(0,total_N),y3)\n", "plt.title(\"P(#(ZIP = ALICE_ZIP /\\ GENDER = ALICE_GENDER))\")\n", "plt.xlim((-1,15))\n", "plt.xlabel(\"# of records\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Discussion of results\n", "\n", "The results show that gender and zip code, each taken in isolation, have a low probability of uniquely identifying Alice in the dataset. However, when both are combined, the probability that only one record has those attribute values (namely Alice's own) is close to 60%. This level of risk may be unacceptable for some types of sensitive data." 
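, "\n", "\n", "The 60% figure can be cross-checked with a short calculation. This is only a sketch that relies on the assumptions of the input specification and on notation introduced here: $p_z$ and $p_g$ denote the probabilities that another record has Alice's zip code and Alice's gender, respectively, and the $N$ other records are taken to be independent. Alice is unique on both attributes exactly when none of the other records matches her on both:\n", "\n", "$$P(\\text{unique}) = (1 - p_z p_g)^N$$\n", "\n", "With $N = 100$, $p_g = 0.5$ and $p_z \\approx 1/100$, this gives $(1 - 0.005)^{100} \\approx 0.61$, which is in line with the value reported above."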
] } ], "metadata": { "kernelspec": { "argv": [ "python", "-m", "ipykernel_launcher", "-f", "{connection_file}" ], "display_name": "Python 3", "env": null, "interrupt_mode": "signal", "language": "python", "metadata": null, "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "name": "Governor.ipynb" }, "nbformat": 4, "nbformat_minor": 4 }