Questions tagged [etl]
18 questions
5
votes
2 answers
ETL and Data Engineering - is it purely the knowledge of tools or is there theory behind it?
I would like to better understand what a good Data Engineer must know or what he does. Job descriptions mostly list tools that are required, such as Python.
If it is possible to separate Data Engineering from Data Science, on what principles is…
MindYB
- 51
- 3
5
votes
1 answer
Data engineering good and bad practice?
I'm a Data Analyst in a pretty big company and I'm having a really bad time with the data I'm being given. I spend about 70% of my time thinking about where to find the data and how to pull it instead of analyzing it. I have to pull from tables that…
Marc
- 222
- 1
- 7
4
votes
1 answer
What is the best practice to test an ETL pipeline?
In traditional software development practice, before going into production, a piece of code should go through various stages of testing (unit test, integration test, user acceptance test) to secure the stability of the software.
An ETL pipeline, as a…
Costa
- 41
- 2
3
votes
1 answer
How to make R or Python as fast as SAS for ODBC Oracle queries?
I want to use R or Python to query big structured SQL-type data, but they are very slow compared to SAS.
I tried using R and Python to return a 1.3 million record Oracle ODBC passthrough query. The query took 8-15 seconds in SAS, 20-30 seconds in…
Sean McCarthy
- 221
- 3
- 9
2
votes
1 answer
Extracting and Mining PDF Data
I have a PDF file (admission application). I want to read/search the PDF and extract terms with similar meaning and then convert this data into a DataFrame to save as an xlsm file. HELP!
Keetj
- 21
- 2
2
votes
2 answers
Successful ETL Automation: Libraries, Review papers, Use Cases
I'm curious if anyone can point to some successful extract, transform, load (ETL) automation libraries, papers, or use cases for somewhat inhomogeneous data?
I would be interested to see any existing libraries dealing with scalable ETL solutions. …
AN6U5
- 6,798
- 1
- 24
- 42
1
vote
1 answer
PySpark for Big Data and RAM usage
I'm trying to figure out the best and most efficient method of handling ETL operations for big data. My question is this.
Say I have a table that is ~50 GB in size. In order to effectively transfer the data from this table from one source to another,…
Shaun
- 11
- 2
1
vote
2 answers
How to create a parquet file from a query to a mysql table
I'm updating a legacy ~ETL; at its core it exports some tables of the prod DB to S3, and the export contains a query. The export process generates a CSV file using the following logic:
res = sh.sed(
sh.mysql(
'-u',
settings_dict['USER'],
…
Carlos P Ceballos
- 111
- 1
- 6
1
vote
1 answer
What ETL technique should I use for text documents using Hadoop?
I have a school Big Data project where basically the teacher is going to give us a large number of text documents (from the Project Gutenberg data set) and he wants us to output the document where a "keyword" is most relevant; he also wants…
Sebastian Delgado
- 121
- 2
1
vote
0 answers
Airbyte docker container not able to read local json file on MacOSX
I am just trying to test out Airbyte: I am loading a JSON file, and I want to do some data manipulation and export it back out to JSON, all done locally while running on Docker.
Failed to load file:///tmp/airbyte_local/data_board.json:…
user3738936
- 111
- 2
0
votes
1 answer
Transitioning from a Python script for data transformation to BigQuery
So I have a dataset spread over multiple, ever-growing Excel files, all of which look…
Hamza
- 133
- 4
0
votes
0 answers
How to schedule importing data files from SFTP server located on compute engine instance into BigQuery?
What I want to achieve:
Transfer data files that arrive hourly from several different feeds onto an SFTP file server (located on a Compute Engine VM) into BigQuery, with real-time updates, effectively and cost-efficiently.
Context:
The software I am trying to…
Hamza
- 133
- 4
0
votes
0 answers
Integrating Dagster with Django ORM
I want to integrate Dagster into an ongoing Django project. Dagster runs outside of the Django context, so there is no way to directly access the Django ORM without calling django.setup() somewhere. I did it in the init of my app, but this is not…
0
votes
1 answer
Convert date into number - Apache PIG
Imagine that I have a field called date in this format: "yyyy-mm-dd" and I want to convert it to a number like "yyyymmdd". For that I'm trying to use this:
Data_ID = FOREACH File GENERATE…
João_testeSW
- 179
- 2
- 3
- 13
0
votes
2 answers
Data Science Project Data Workflow Structure
I'm in the middle of a marketing project on sales prediction with promotions. The client has very complex business processes, so the data needs a lot of preprocessing (joins, filters, etc.). I have organized the code in different…
ru.mp
- 63
- 1
- 1
- 7