I have a software engineering background and am relatively new to data engineering. I am building out tables/datamarts in our data lake for data scientists and analysts to use. We use Airflow for dependency management and scheduling. The tables I am building will hold a couple of years of historical data and will be backfilled using Airflow. I have an initial set of columns ready for consumption, but as I get more requirements from partner teams I have to keep adding columns to the datamart. Since the script that generates the data computes both the old columns and the new ones, I end up recomputing all existing columns (a few years' worth) every time I re-run the DAG. This doesn't seem efficient (a simplified sketch of the current job is below, after the alternatives). The alternatives I see are:
1. Create a new job every time new columns are needed. I would end up with multiple jobs for different groups of metrics, and multiple sets of files that consumers have to manually join by a key. This doesn't seem scalable.
2. Have all fields specified before starting implementation, i.e. a waterfall model. That isn't going to happen in this agile industry.
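For context, here is a stripped-down sketch of roughly what the current daily job looks like (the table, column, and DAG names are made up for illustration, and the SQL is just printed rather than submitted anywhere). The point is that the single SELECT produces every column for a partition, so any new column means re-running it over all historical partitions:

```python
# Simplified sketch of the current job. An Airflow backfill runs this once per
# historical day, so every column is recomputed for every partition whenever
# the SELECT changes. Names below are illustrative, not my real schema.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

BUILD_PARTITION_SQL = """
INSERT OVERWRITE TABLE analytics.customer_datamart PARTITION (ds = '{ds}')
SELECT
    customer_id,
    -- original columns
    SUM(order_total)         AS lifetime_spend,
    COUNT(DISTINCT order_id) AS order_count,
    -- every new requirement adds columns here, forcing a full historical re-run
    AVG(discount_pct)        AS avg_discount_pct
FROM raw.orders
WHERE ds <= '{ds}'
GROUP BY customer_id
"""

def build_partition(ds, **_):
    sql = BUILD_PARTITION_SQL.format(ds=ds)
    # in the real job this is submitted to the warehouse; printed here for illustration
    print(sql)

with DAG(
    dag_id="customer_datamart_daily",
    start_date=datetime(2022, 1, 1),   # ~2 years of history to backfill
    schedule_interval="@daily",
    catchup=True,
) as dag:
    PythonOperator(task_id="build_partition", python_callable=build_partition)
```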
What are the industry best practices around this? I asked internally and was told I would have to re-run the whole job with both the old and new columns.