How to evaluate k-anonymity for a dataset which is only a sample/subset

Question

I work with a trajectory dataset that holds records from people using a certain ticketing app for public transportation. Each trajectory describes the route (i.e. an array of stations) taken on buses, trains, etc.

However, only a small share of passengers use this app (most buy regular tickets), so many trajectories are unique. As a result, most records violate even k-anonymity with k=3.

How should I account for the fact that only about 1% of passengers use this tracking app when trying to anonymize this dataset?

score -1 · Answer 1 · answered Jul 31 '23 at 15:50

There are several different options:

Segment rides: Instead of analyzing entire routes, only perform analysis on smaller segments that have enough rides.
Simulate trajectories: Use the current empirical data as seeds to simulate additional synthetic data.
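
To make the segment idea concrete, here is a minimal sketch that counts how many distinct riders share each contiguous station segment and keeps only segments supported by at least k riders. The function names, the rider ids, and the toy trajectories are made up for illustration; nothing here comes from the asker's dataset.

```python
from collections import defaultdict


def segment_support(trajectories, length):
    """Map each contiguous segment of `length` stations to the set of
    rider ids whose trajectory contains it."""
    riders = defaultdict(set)
    for rider_id, stations in trajectories.items():
        for i in range(len(stations) - length + 1):
            riders[tuple(stations[i:i + length])].add(rider_id)
    return riders


def k_anonymous_segments(trajectories, length, k):
    """Return {segment: rider_count} for segments shared by >= k riders."""
    support = segment_support(trajectories, length)
    return {seg: len(ids) for seg, ids in support.items() if len(ids) >= k}


# Toy data (hypothetical): rider id -> ordered list of stations.
trajectories = {
    "r1": ["A", "B", "C", "D"],
    "r2": ["A", "B", "C", "E"],
    "r3": ["X", "B", "C", "D"],
    "r4": ["A", "B", "F"],
}

safe = k_anonymous_segments(trajectories, length=2, k=3)
print(safe)
```

In this toy data every full-length route is unique, yet the two-station segments A-B and B-C are each shared by three riders, so analysis restricted to those segments would satisfy k=3 within the sample.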

Brian Spiering

This is a very, very low-quality answer, especially coming from an eminence in this community. – Multivac Jul 31 '23 at 16:45

score -1 · Answer 2 · answered Aug 02 '23 at 14:09

I agree with Brian. You can experiment with synthetic data. I suggest looking into SDV (Synthetic Data Vault), a Python library for generating synthetic data. It might make your life easier.
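
If you want to see the idea before pulling in a library, here is a rough, dependency-free stand-in. It is not SDV's API: it fits a first-order Markov chain on observed station-to-station transitions and samples new trajectories from it. The helper names and the toy trajectories are hypothetical. Keep in mind that such a generator can reproduce a real trajectory verbatim, so synthetic output still needs a privacy check of its own.

```python
import random
from collections import defaultdict

START, END = "<start>", "<end>"  # sentinel states, assumed not to be station names


def fit_transitions(trajectories):
    """For each state, collect the list of observed next states.
    Repeated entries encode the empirical transition frequencies."""
    nxt = defaultdict(list)
    for stations in trajectories:
        path = [START, *stations, END]
        for a, b in zip(path, path[1:]):
            nxt[a].append(b)
    return nxt


def sample_trajectory(nxt, rng, max_len=50):
    """Walk the chain from START until END is drawn or max_len is reached."""
    out, current = [], START
    while len(out) < max_len:
        current = rng.choice(nxt[current])
        if current == END:
            break
        out.append(current)
    return out


# Toy seed data (hypothetical).
real = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "E"],
    ["X", "B", "C", "D"],
]

rng = random.Random(0)  # fixed seed so runs are repeatable
transitions = fit_transitions(real)
synthetic = [sample_trajectory(transitions, rng) for _ in range(100)]
print(synthetic[:3])
```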

You can also consider injecting noise into the data, for example by making some records start at an earlier stop.
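
A minimal sketch of that kind of perturbation, assuming you know the ordered stop list of the line and that each trip is a contiguous run of stops along it. The function name, the parameters `p` and `max_shift`, and the stop names are all made up for illustration. Ad hoc noise like this does not by itself give a formal guarantee such as k-anonymity.

```python
import random


def perturb_boarding(trip, line, rng, p=0.3, max_shift=2):
    """With probability p, move the boarding stop 1..max_shift stops
    earlier along the line; the alighting stop is left unchanged."""
    start = line.index(trip[0])
    end = line.index(trip[-1])
    if rng.random() < p:
        # Clamp at the first stop of the line.
        start = max(0, start - rng.randint(1, max_shift))
    return line[start:end + 1]


# Hypothetical line and trip.
line = ["S1", "S2", "S3", "S4", "S5", "S6"]
trip = ["S4", "S5", "S6"]

rng = random.Random(42)
noisy = perturb_boarding(trip, line, rng, p=1.0)  # p=1.0 forces a shift
print(noisy)
```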

Another option is to survey passengers: have someone ask people as they get off at a stop how many stops they traveled to get there.
