Questions tagged [etl]

18 questions
5
votes
2 answers

ETL and Data Engineering - is it purely the knowledge of tools or is there theory behind it?

I would like to better understand what a good Data Engineer must know or what he does. Job descriptions mostly list tools that are required, such as Python. If it is possible to separate Data Engineering from Data Science, on what principles is…
MindYB
  • 51
  • 3
5
votes
1 answer

Data engineering good and bad practice?

I'm a Data Analyst in a pretty big company and I'm having a really bad time with the data I'm being given. I spend about 70% of my time thinking about where to find the data and how to pull it instead of analyzing it. I have to pull from tables that…
Marc
  • 222
  • 1
  • 7
4
votes
1 answer

What is the best practice to test an ETL pipeline?

In traditional software development practice, before going into production, a piece of code should go through various stages of testing (unit test, integration test, user acceptance test) to ensure the stability of the software. An ETL pipeline, as a…
Costa
  • 41
  • 2
3
votes
1 answer

How to make R or Python as fast as SAS for ODBC Oracle queries?

I want to use R or Python to query big structured SQL-type data, but they are very slow compared to SAS. I tried using R and Python to return a 1.3 million record Oracle ODBC passthrough query. The query took 8-15 seconds in SAS, 20-30 seconds in…
Sean McCarthy
  • 221
  • 3
  • 9
2
votes
1 answer

Extracting and Mining PDF Data

I have a PDF file (admission application). I want to read/search the PDF and extract terms with similar meaning and then convert this data into a DataFrame to save as an xlsm file. HELP!
Keetj
  • 21
  • 2
2
votes
2 answers

Successful ETL Automation: Libraries, Review papers, Use Cases

I'm curious if anyone can point to some successful extract, transform, load (ETL) automation libraries, papers, or use cases for somewhat inhomogeneous data? I would be interested to see any existing libraries dealing with scalable ETL solutions. …
AN6U5
  • 6,798
  • 1
  • 24
  • 42
1
vote
1 answer

PySpark for Big Data and RAM usage

I'm trying to figure out the best and most efficient method of handling ETL operations for big data. My question is this. Say I have a table that is ~50 GB in size. In order to effectively transfer the data from this table from one source to another,…
Shaun
  • 11
  • 2
1
vote
2 answers

How to create a Parquet file from a query to a MySQL table

Updating a legacy ~ETL; at its base it exports some tables of the prod DB to S3, the export contains a query. The export process generates a csv file using the following logic: res = sh.sed( sh.mysql( '-u', settings_dict['USER'], …
1
vote
1 answer

What ETL technique should I use for text documents using Hadoop?

I have a school Big Data project where basically the teacher is going to give us a large amount of text documents (from the Gutenberg project data set) and he wants us to give as output the document where a "keyword" is more relevant, he also wants…
1
vote
0 answers

Airbyte docker container not able to read local json file on MacOSX

I am just trying to test out Airbyte. I am loading a JSON file, and I want to do some data manipulation and export it back out to JSON. All done locally while running on Docker. Failed to load file:///tmp/airbyte_local/data_board.json:…
user3738936
  • 111
  • 2
0
votes
1 answer

Transitioning from a python script for data transformation to BigQuery

So I have a dataset spread over multiple and ever-growing Excel files all of which look…
Hamza
  • 133
  • 4
0
votes
0 answers

How to schedule importing data files from SFTP server located on compute engine instance into BigQuery?

What I want to achieve: Transfer data files that arrive hourly on an SFTP file server located on a compute engine VM from several different feeds into BigQuery with real-time updates effectively & cost-efficiently. Context: The software I am trying to…
0
votes
0 answers

Integrating Dagster with Django ORM

I want to integrate Dagster into an ongoing Django project. Dagster runs outside the Django context, so there is no way to directly access the Django ORM without calling django.setup() somewhere. I did it in the init of my app, but this is not…
0
votes
1 answer

Convert date into number - Apache PIG

Imagine that I have a field called date in this format: "yyyy-mm-dd" and I want to convert it to a number like "yyyymmdd". For that I'm trying to use this: Data_ID = FOREACH File GENERATE…
João_testeSW
  • 179
  • 2
  • 3
  • 13
0
votes
2 answers

Data Science Project Data Workflow Structure

I'm in the middle of a marketing project on sales prediction with promotions. The client has very complex business processes and so the data needs a lot of preprocessing (joins, filters, etc.). I have organized the code in different…
ru.mp
  • 63
  • 1
  • 1
  • 7