
How do you feed data from a data warehouse to Python for ad-hoc analysis?

My day-to-day work is answering ad-hoc questions, and 95% of the data I need lives in our data warehouse. I usually export query results from the warehouse to CSV file(s), then load those files in Python together with other sources for analysis.

I am about to start working with a very large data warehouse, where the same CSV approach may no longer be feasible.

Our data warehouse is in Redshift.

What is your experience feeding data into Python/R for analysis?

2 Answers

import pandas
import redshift_connector

# Connect to the cluster
conn = redshift_connector.connect(
    host='examplecluster.abc123xyz789.us-west-1.redshift.amazonaws.com',
    port=5439,
    database='dev',
    user='awsuser',
    password='my_password'
)

# Create a Cursor object
cursor = conn.cursor()

# Query and fetch the result set directly into a DataFrame
cursor.execute("select * from book")
result: pandas.DataFrame = cursor.fetch_dataframe()
print(result)
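If the full result set is too large for a single fetch, redshift_connector follows the Python DB-API, so you can stream the rows in fixed-size batches with fetchmany and build the DataFrame incrementally. A minimal sketch of that pattern (illustrated with sqlite3 so it runs standalone; the table and column names are hypothetical, and against Redshift the cursor calls are the same):

```python
import sqlite3
import pandas as pd

# Stand-in for a redshift_connector connection; the DB-API calls are identical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER, title TEXT)")
conn.executemany("INSERT INTO book VALUES (?, ?)",
                 [(i, f"title-{i}") for i in range(10)])

cursor = conn.cursor()
cursor.execute("SELECT * FROM book")

# Stream the result set in fixed-size batches instead of one huge fetch.
frames = []
while True:
    rows = cursor.fetchmany(4)  # batch size; tune to available memory
    if not rows:
        break
    frames.append(pd.DataFrame(rows, columns=["id", "title"]))

result = pd.concat(frames, ignore_index=True)
print(len(result))
```

Each batch can also be processed and discarded instead of concatenated, which keeps peak memory at one batch rather than the whole table.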
Kriti

There are lots of big-data libraries out there. PySpark and Hadoop are a couple you could use (I personally recommend PySpark for its Pythonic API).

If you want to stick to pandas (for some reason), you could sample the data for the initial analysis, then gradually increase the sample size in iterations until you hit your memory limit. Obviously this technique will require more time and effort!
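One way to do the sampling is to push it into SQL, so only the sampled rows ever leave the warehouse. A rough sketch of the grow-the-sample loop (illustrated with sqlite3 so it runs standalone; the table name is hypothetical, and ORDER BY RANDOM() also works in Redshift):

```python
import sqlite3
import pandas as pd

# Illustrated with sqlite3; against Redshift you would run the same
# query through a redshift_connector connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(1000)])

def sample(n):
    # Sample in SQL so only n rows are transferred and loaded into pandas.
    cur = conn.execute(f"SELECT * FROM events ORDER BY RANDOM() LIMIT {n}")
    return pd.DataFrame(cur.fetchall(), columns=["id"])

# Grow the sample across iterations until memory becomes the constraint.
for n in (10, 100, 1000):
    df = sample(n)
    print(n, len(df))
```

Note that ORDER BY RANDOM() forces a full sort on large tables, so on a real warehouse you may want a cheaper sampling predicate instead.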

I would go for the first technique!

Cheers!

spectre