
I work with a trajectory dataset that holds records from people using a ticketing app for public transportation. Each trajectory describes a route (i.e., an array of stations) travelled by bus, train, etc.

However, only a small share of passengers use this app (most buy regular tickets), so many trajectories are unique. As a result, most records violate even k=3 anonymity.

How should I account for the fact that only ~1% of passengers use this tracking app when anonymizing the dataset?

2 Answers

-1

There are several different options:

  1. Segment rides: Instead of analyzing entire routes, only perform analysis on smaller segments that have enough rides.

  2. Simulate trajectories: Use the current empirical data as seeds to simulate additional synthetic data.
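
As a rough illustration of option 1, here is a minimal sketch that splits each route into short station segments and keeps only the segments that show up in at least k rides. The function name `frequent_segments`, the segment length `n`, and the toy routes are all made up for this example. Note that it counts rides, not distinct passengers, so it is only a starting point.

```python
from collections import Counter

def frequent_segments(trajectories, n=2, k=3):
    """Return station n-grams that occur in at least k trajectories."""
    counts = Counter()
    for route in trajectories:
        # Use a set so each segment is counted once per trajectory;
        # a single ride that repeats a segment cannot reach k on its own.
        segments = {tuple(route[i:i + n]) for i in range(len(route) - n + 1)}
        counts.update(segments)
    return {seg: c for seg, c in counts.items() if c >= k}

# Toy data, invented for illustration only.
trajectories = [
    ["A", "B", "C", "D"],
    ["A", "B", "C"],
    ["B", "C", "E"],
    ["A", "B", "F"],
]
result = frequent_segments(trajectories, n=2, k=3)
print(result)
```

Here every full route is unique, but the segments A-B and B-C each appear in three rides, so they are the ones that would be kept for analysis.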

Brian Spiering
-1

I agree with Brian. You can experiment with synthetic data. I suggest looking into SDV (Synthetic Data Vault), a Python library. It might make your life easier.
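
If you want to see the idea before pulling in a library, here is a minimal sketch that does not use SDV at all. It swaps in a plain first-order Markov chain over stations, fitted on the observed routes and then sampled to produce new ones. The names `fit_transitions` and `sample_route`, the `<start>`/`<end>` markers, and the seed routes are assumptions for illustration. Keep in mind that a sampled route can still coincide with a real one, so this alone does not guarantee anonymity.

```python
import random
from collections import defaultdict

START, END = "<start>", "<end>"  # hypothetical boundary markers

def fit_transitions(trajectories):
    """Collect the observed next stations for every station."""
    transitions = defaultdict(list)
    for route in trajectories:
        path = [START, *route, END]
        for current, nxt in zip(path, path[1:]):
            transitions[current].append(nxt)
    return transitions

def sample_route(transitions, rng, max_len=50):
    """Walk the chain from START until END or max_len stations."""
    route, current = [], START
    while len(route) < max_len:
        # Duplicates in the list make frequent transitions more likely.
        current = rng.choice(transitions[current])
        if current == END:
            break
        route.append(current)
    return route

# Toy seed routes, invented for illustration only.
seeds = [["A", "B", "C"], ["A", "B", "D"], ["B", "C", "E"]]
rng = random.Random(0)  # fixed seed so the run is repeatable
model = fit_transitions(seeds)
synthetic = [sample_route(model, rng) for _ in range(5)]
print(synthetic)
```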

You can also consider injecting noise into the data, for example by making some records start at an earlier stop.
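
A minimal sketch of that kind of noise, assuming you know the ordered list of stations on the line (here a hypothetical `line` list) and that the route starts at a station on it. The function name `shift_start_earlier`, the probability `p`, and `max_shift` are made up for this example and would need tuning against whatever analysis you run on the data.

```python
import random

def shift_start_earlier(route, line, rng, p=0.3, max_shift=2):
    """With probability p, extend the route backwards along `line`."""
    if not route or rng.random() >= p:
        return list(route)
    # Position of the recorded boarding stop on the line.
    start = line.index(route[0])
    # Cannot move further back than the first station of the line.
    shift = rng.randint(0, min(max_shift, start))
    return line[start - shift:start] + list(route)

# Toy line and route, invented for illustration only.
line = ["S1", "S2", "S3", "S4", "S5"]
route = ["S3", "S4"]
rng = random.Random(1)
noisy = shift_start_earlier(route, line, rng, p=1.0)
print(noisy)
```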

Another option is to survey passengers. Have someone ask people as they get off at a stop how many stops they traveled to get there.