Highest Voted 'data-engineering' Questions - Data Science Stack Exchange

5

votes

2 answers

ETL and Data Engineering - is it purely the knowledge of tools or is there theory behind it?

I would like to better understand what a good Data Engineer must know or what he does. Job descriptions mostly list tools that are required, such as Python. If it is possible to separate Data Engineering from Data Science, on what principles is…

asked Jul 09 '20 at 19:44

MindYB

51
3

5

votes

1 answer

Data engineering good and bad practice?

I'm a Data Analyst in a pretty big company and I'm having a really bad time with the data I'm being given. I spend about 70% of my time thinking about where to find the data and how to pull it instead of analyzing it. I have to pull from tables that…

sql data-engineering etl

asked Feb 23 '22 at 15:02

Marc

222
1
7

4

votes

2 answers

Data Engineering Stack - collect, transform and visualize geospatial data

I'm working on a side project where I collect geospatial data by web scraping and from the OSM API. I started with a simple Java application; however, I would like to turn it into a data flow, purely for learning purposes. Unfortunately, my knowledge about…

visualization geospatial tools data-engineering

asked May 12 '20 at 09:01

Forin

141
1

3

votes

2 answers

Storing Large dataset for processing and analysis of data

I am new to data engineering and want to know the best way to store more than 3000 GB of data for further processing and analysis. I am specifically looking for open-source resources. I have explored many data formats for storage. The…

dataset data-analysis data-formats processing data-engineering

asked Apr 16 '21 at 08:45

user14519285

41
2

2

votes

3 answers

Is there a cost associated with converting Koalas dataframe to Spark dataframe?

I know that pandas works "under the hood" with numpy arrays stored in dictionaries. In contrast, Koalas works with the underlying Spark framework. Does that mean that there is no extra cost associated with switching back and forth between Koalas and…

python performance pyspark distribution data-engineering

asked Feb 17 '20 at 10:11

DataBach

165
1
9

2

votes

2 answers

Loading models from external source

I have a 500MB model which I am committing to Git. That is a really bad practice, since the repository will become huge as newer model versions are added. It will also slow down all builds for deployments. I thought of using another repository that contains…

python machine-learning-model data-engineering

asked Feb 05 '20 at 07:21

room13

133
5

2

votes

1 answer

How to partition data effectively?

I have a pipeline which outputs model scores to s3. I need to partition the data by model_type and date. Which is the most efficient way to partition the data from the…

data-engineering

asked Mar 31 '22 at 18:49

CyberPunk

141
4

1

vote

0 answers

Best Technologies for Opening Large Sets of Sensor Time-Series Data to Analytics

My team is exploring options to create a robust "analytics" capability that is well-suited for our large quantities of sensor test data. I'd appreciate any suggestions for technologies that would perform well for my use case. About my data: For…

time-series data-mining data-engineering

asked Aug 03 '21 at 16:58

CrashLandon

11
2

1

vote

1 answer

Alternative to EC2 for running ML batch training jobs on AWS

We are building an ML pipeline on AWS, which will obviously require some heavy-compute components including preprocessing and batch training. Most of the pipeline is on Lambda, but Lambda is known to have time limits on how long a job can be run…

machine-learning pipelines aws data-engineering aws-lambda

asked Jun 29 '21 at 14:22

Cybernetic

770
1
4
10

1

vote

3 answers

Advice on where to continue in the field of data engineering and machine learning

I finished a 28-hour Machine learning with python (Basic course) on Udemy, and it was very beneficial. My aim is to understand what ML is and how to use its concepts while working with data. I am confused about where to continue. My…

machine-learning data-engineering coursera

asked Jul 20 '20 at 07:39

alim1990

163
7

1

vote

0 answers

Efficient way of adding new columns to datamart without reprocessing complete pipeline

I have a software engineering background and am relatively new to data engineering. I am building out tables/datamarts in our datalake for data scientists and analysts to use. We use Airflow for dependency management and scheduling. The tables that I…

apache-spark pipelines data-engineering

asked Apr 28 '20 at 01:58

Abdul R

11
1

1

vote

3 answers

How to sort a multi-level pandas data-frame by a particular column?

I would like to sort a multi-index pandas dataframe by a column, but I do not want the entire dataframe to be sorted at once; rather, I would like to sort by one of the indices. Here is an example of what I mean: Below is an example of a multi-index…

pandas data-cleaning python-3.x data-engineering

asked Jan 07 '20 at 22:25

user62198

1,091
4
15
32

1

vote

0 answers

Airbyte docker container not able to read local json file on MacOSX

I am just trying to test out Airbyte. I am loading a JSON file, and I want to do some data manipulation and export it back out to JSON, all done locally while running on Docker. Failed to load file:///tmp/airbyte_local/data_board.json:…

json data-engineering etl

asked Oct 31 '22 at 20:53

user3738936

111
2

1

vote

1 answer

Migrating legacy data in Kafka to use a schema registry to support a streaming data pipeline

(If this is not the correct Stack site for this question, please let me know; it seemed the best fit.) We currently have a 'legacy' system whose output consists of events coming in from a distributed application environment. We wish to set up a streaming…

data-engineering processing apache-kafka

asked Jun 10 '22 at 10:06

user991710

71
2
7

1

vote

2 answers

When Does Feature Selection Take Place?

I have a dataset with both categorical and numeric features, and I have to perform OneHotEncoding, Normalization and feature selection on it. In what order should I perform these steps on my data? I am new to Data Science,…

machine-learning classification feature-selection feature-engineering data-engineering

asked Sep 27 '21 at 23:46

Jainam Shroff

45
4

Questions tagged [data-engineering]