
I have a software engineering background and am relatively new to data engineering. I am building out tables/datamarts in our data lake for data scientists and analysts to use. We use Airflow for dependency management and scheduling. The tables I am building will hold a couple of years of historical data and will be backfilled using Airflow. I have an initial set of columns ready for consumption. As I get more requirements from partner teams, I have to keep adding columns to the datamart. However, since the script that generates the data produces both the old columns and the new ones, I end up recomputing all data for the existing columns (a few years' worth) every time I re-run the DAG (rough sketch below the list). This doesn't seem efficient. The alternatives I see are:

  • Create a new job every time there are new columns to add. I would then have multiple jobs for different groups of metrics, and multiple sets of files that consumers have to manually join on a key. This doesn't seem scalable.

  • Have all fields defined before starting implementation, i.e. a waterfall model. Again, that is not going to happen in this agile industry.
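
For context, here is roughly what the current job looks like. This is a minimal sketch only: the DAG id, table, column names, and the run_query helper are placeholders, but the structure matches what I described above. Every backfill run rewrites the whole partition, so every column, old and new, is recomputed for the full history.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_query(sql: str) -> None:
    # Placeholder: in the real job this submits the SQL to the lake engine
    # (Spark/Hive/Trino, etc.); printing it here keeps the sketch self-contained.
    print(sql)


def build_datamart_partition(ds, **_):
    # Rebuilds the partition for `ds` from scratch, recomputing ALL columns
    # (existing + newly added) even though the existing ones haven't changed.
    query = f"""
        INSERT OVERWRITE TABLE analytics.orders_datamart PARTITION (ds = '{ds}')
        SELECT
            order_id,          -- existing column
            customer_id,       -- existing column
            order_total,       -- existing column
            delivery_latency   -- newly requested column
        FROM raw.orders
        WHERE ds = '{ds}'
    """
    run_query(query)


with DAG(
    dag_id="orders_datamart",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # backfills a couple of years of history when re-run
) as dag:
    PythonOperator(
        task_id="build_partition",
        python_callable=build_datamart_partition,
    )
```

So whenever a new column like delivery_latency is added, re-running the DAG over the historical dates recomputes order_id, customer_id, and order_total for every partition as well.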

What are the industry best practices around this? I asked internally and was told I would have to re-run the whole job with both old and new columns.

Abdul R