Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Spark programming model to Python.

130 questions
35 votes, 6 answers

Merging multiple data frames row-wise in PySpark

I have 10 data frames of type pyspark.sql.dataframe.DataFrame, obtained from randomSplit as (td1, td2, td3, td4, td5, td6, td7, td8, td9, td10) = td.randomSplit([.1, .1, .1, .1, .1, .1, .1, .1, .1, .1], seed=100). Now I want to join 9 of the td's into a single…
krishna Prasad
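A common answer, sketched here rather than taken from the thread: the splits share one schema, so they can be folded back together with a row-wise union. `union_all` is a hypothetical helper name; on Spark < 2.0 the DataFrame method is `unionAll` rather than `union`.

```python
from functools import reduce

def union_all(dfs):
    """Fold a list of same-schema DataFrames into one, row-wise.

    Works on any objects with a pairwise `union` method, e.g.
    pyspark.sql.DataFrame (use `unionAll` on Spark < 2.0).
    """
    return reduce(lambda a, b: a.union(b), dfs)

# Against the question's splits, keeping td10 as a hold-out:
# combined = union_all([td1, td2, td3, td4, td5, td6, td7, td8, td9])
```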
15 votes, 3 answers

How to convert categorical data to numerical data in Pyspark

I am using an IPython notebook to work with pyspark applications. I have a CSV file with lots of categorical columns, used to determine whether the income falls under or over the 50k range. I would like to perform a classification algorithm taking all the…
SRS
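The usual PySpark tool here is `pyspark.ml.feature.StringIndexer`, which maps each distinct category to an integer by descending frequency. A plain-Python sketch of that mapping (the alphabetical tie-break used here is an assumption; `workclass` is a hypothetical column name):

```python
from collections import Counter

def string_index(values):
    """Map categories to 0..k-1, most frequent category first,
    ties broken alphabetically (mirrors StringIndexer's default
    frequencyDesc ordering)."""
    order = sorted(Counter(values).items(), key=lambda kv: (-kv[1], kv[0]))
    mapping = {cat: i for i, (cat, _) in enumerate(order)}
    return [mapping[v] for v in values]

# In PySpark itself:
# from pyspark.ml.feature import StringIndexer
# indexed = StringIndexer(inputCol='workclass',
#                         outputCol='workclass_idx').fit(df).transform(df)
```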
14 votes, 4 answers

Import csv file contents into pyspark dataframes

How can I import a .csv file into pyspark dataframes? I even tried to read the csv file in Pandas and then convert it to a spark dataframe using createDataFrame, but it is still showing an error. Can someone guide me through this? Also, please tell me…
neha
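For reference, the routes that typically come up (a sketch, with `people.csv` as a stand-in filename): native CSV reading arrived with Spark 2.0, while the Spark 1.x releases contemporary with this question rely on the `spark-csv` package.

```python
# Spark >= 2.0: native reader.
# df = spark.read.csv('people.csv', header=True, inferSchema=True)

# Spark 1.x (this question's era): the Databricks spark-csv package.
# df = (sqlContext.read.format('com.databricks.spark.csv')
#                 .options(header='true', inferSchema='true')
#                 .load('people.csv'))

# The pandas round-trip the asker attempted also works for small files:
# import pandas as pd
# df = spark.createDataFrame(pd.read_csv('people.csv'))
```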
12 votes, 3 answers

How do I set/get heap size for Spark (via Python notebook)

I'm using Spark (1.5.1) from an IPython notebook on a MacBook Pro. After installing Spark and Anaconda, I start IPython from a terminal by executing: IPYTHON_OPTS="notebook" pyspark. This opens a webpage listing all my IPython notebooks. I can…
Kai
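A sketch of the usual fix (hedged; exact flags depend on the Spark version): the driver heap is fixed when the JVM starts, so it has to be set before launching the notebook, not from inside it.

```shell
# Pass the driver heap at launch time; when used this way,
# PYSPARK_SUBMIT_ARGS must end with "pyspark-shell":
PYSPARK_SUBMIT_ARGS="--driver-memory 4g pyspark-shell" \
  IPYTHON_OPTS="notebook" pyspark

# Or set it persistently in conf/spark-defaults.conf:
#   spark.driver.memory    4g
#   spark.executor.memory  4g

# Reading the effective value back from the notebook (private
# attribute, so a convenience rather than a contract):
#   sc._conf.get('spark.driver.memory')
```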
12 votes, 3 answers

Issue with IPython/Jupyter on Spark (Unrecognized alias)

I am working on setting up a set of VMs to experiment with Spark before I go out and spend money on building up a cluster with some hardware. Quick note: I am an academic with a background in applied machine learning and work quite a bit in…
gcd
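The "Unrecognized alias" error typically comes from passing IPython-style options that newer Jupyter releases no longer accept. The replacement, and the only supported route from Spark 2.0 on (where IPYTHON_OPTS was removed), is the PYSPARK_DRIVER_PYTHON pair:

```shell
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark   # launches the notebook server with PySpark on the driver
```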
12 votes, 1 answer

Spark ALS: recommending for new users

The question: How do I predict the rating for a new user in an ALS model trained in Spark? (New = not seen during training time.) The problem: I'm following the official Spark ALS tutorial…
ciri
11 votes, 1 answer

PySpark dataframe repartition

What happens when we repartition a PySpark dataframe on a column, for example dataframe.repartition('id')? Does this move the data with the same 'id' to the same partition? How does the spark.sql.shuffle.partitions value affect the…
Nikhil Baby
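What the mechanics look like, sketched with a plain-Python stand-in for Spark's hash partitioner: `repartition('id')` shuffles rows so equal ids share a partition, and the output partition count defaults to `spark.sql.shuffle.partitions` (200 unless overridden). Note that Spark uses its own hash function, not Python's `hash`; this is only the placement rule.

```python
def partition_for(key, num_partitions=200):
    """Conceptual analogue of Spark's hash partitioning: equal keys
    always map to the same partition index in [0, num_partitions)."""
    return hash(key) % num_partitions

# dataframe.repartition('id') groups rows this way by the id column,
# so every row with id 42 lands in the same partition.
```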
11 votes, 3 answers

When does the cache expire for an RDD in pyspark?

We use .cache() on an RDD for persistent caching of a dataset. My concern is: when will this cache expire? dt = sc.parallelize([2, 3, 4, 5, 6]) dt.cache()
krishna Prasad
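The short answer (my summary, not taken from the thread): a cached RDD has no time-based expiry. It lives until `unpersist()` is called, the SparkContext stops, or executors evict partitions under memory pressure (LRU), in which case evicted partitions are recomputed from lineage on next use.

```python
# Assumes a live SparkContext `sc`; cache() is lazy.
# dt = sc.parallelize([2, 3, 4, 5, 6])
# dt.cache()       # only marks the RDD; nothing is stored yet
# dt.count()       # first action materializes the cached partitions
# dt.unpersist()   # the one deterministic way to release the cache
```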
10 votes, 1 answer

Spark, optimally splitting a single RDD into two

I have a large dataset that I need to split into groups according to specific parameters. I want the job to process as efficiently as possible. I can envision two ways of doing so. Option 1: create a map from the original RDD and filter. def…
j.a.gartner
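A sketch of the trade-off (my summary): Spark has no single-pass multi-output split for RDDs, so the split usually becomes two `filter()` passes over a cached parent; `randomSplit` is the only built-in multi-way split. The single-pass behavior people want, shown on a plain list for clarity:

```python
def split_once(items, pred):
    """One pass, two outputs -- what two rdd.filter() calls do in two
    passes. In Spark, cache the parent so it isn't recomputed:
        rdd.cache()
        pos = rdd.filter(pred)
        neg = rdd.filter(lambda x: not pred(x))
    """
    yes, no = [], []
    for item in items:
        (yes if pred(item) else no).append(item)
    return yes, no
```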
9 votes, 3 answers

How to run a pyspark application in the Windows 8 command prompt

I have a Python script written with a SparkContext and I want to run it. I tried to integrate IPython with Spark, but I could not do that. So, I tried to set the Spark path [ Installation folder/bin ] as an environment variable and called…
SRS
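A sketch of the usual Windows setup (paths are placeholders): point SPARK_HOME at the install, put its bin directory on PATH, then run standalone scripts through spark-submit rather than plain python.

```shell
:: cmd.exe session (placeholder install path):
set SPARK_HOME=C:\spark
set PATH=%PATH%;%SPARK_HOME%\bin

:: Run the script through spark-submit so the Spark runtime is wired up:
spark-submit your_script.py
```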
8 votes, 4 answers

How to select a particular column in Spark (pyspark)?

testPassengerId = test.select('PassengerId').map(lambda x: x.PassengerId) I want to select the PassengerId column and make an RDD of it. But .select is not working; it says 'RDD' object has no attribute 'select'.
dsl1990
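The error itself is the starting point: `.select` is a DataFrame method, so `test` here is already an RDD. The access pattern, with `collections.namedtuple` standing in for `pyspark.sql.Row`:

```python
from collections import namedtuple

Row = namedtuple('Row', ['PassengerId'])   # stand-in for pyspark.sql.Row
rows = [Row(1), Row(2), Row(3)]
ids = [row.PassengerId for row in rows]    # the map() body, per row

# On a real DataFrame (.rdd needed on Spark 2.x; 1.x allowed map directly):
# ids = test.select('PassengerId').rdd.map(lambda r: r.PassengerId)
# On an RDD of Rows, skip select and map directly:
# ids = test.map(lambda r: r.PassengerId)
```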
7 votes, 1 answer

Using Apache Spark to do ML. Keep getting serialization errors

So I'm using Spark to do sentiment analysis, and I keep getting errors with the serializers it uses (I think) to pass Python objects around. PySpark worker failed with exception: Traceback (most recent call last): File…
seashark97
7 votes, 1 answer

Pyspark: Filter dataframe based on separate specific conditions

How can I select only certain entries that match my condition and, from those entries, filter again using a regex? For instance, I have this data frame…
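A minimal sketch of combining the two stages. In PySpark the equivalent is a single `df.filter((condition) & df.col.rlike(pattern))`; the tuple fields and regex below are purely illustrative.

```python
import re

# Rows as (category, name) tuples; keep category 'A' whose name
# matches the regex -- the same logic as chaining filter + rlike.
rows = [('A', 'foo1'), ('A', 'bar2'), ('B', 'foo3')]
matched = [r for r in rows if r[0] == 'A' and re.match(r'foo\d', r[1])]
# matched == [('A', 'foo1')]
```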
6 votes, 2 answers

Model ensemble with Spark or Scikit Learn

I am using Spark MLlib to make predictions and I would like to know if it is possible to create your own custom Estimators. Here is a reproducible example of what I would like my model to do with the Spark API: from sklearn.datasets import load_diabetes import…
Robin Nicole
6 votes, 2 answers

Why is there a difference of "ML" vs "MLLIB" in Apache Spark's documentation?

I am trying to figure out which pyspark library to use with Word2Vec, and I'm presented with two options according to the pyspark…
Gabriel Fair
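The short answer (my summary, not taken from the thread): `pyspark.mllib` is the original RDD-based API and `pyspark.ml` is the DataFrame-based pipeline API; since Spark 2.0, `mllib` is in maintenance mode, so new code should prefer `pyspark.ml`. Both expose a Word2Vec:

```python
# RDD-based API (maintenance mode since Spark 2.0):
# from pyspark.mllib.feature import Word2Vec

# DataFrame-based pipeline API (actively developed; preferred):
# from pyspark.ml.feature import Word2Vec
```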