Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Spark programming model to Python.

130 questions
35 votes, 6 answers

Merging multiple data frames row-wise in PySpark

I have 10 data frames of type pyspark.sql.dataframe.DataFrame, obtained from randomSplit as (td1, td2, td3, td4, td5, td6, td7, td8, td9, td10) = td.randomSplit([.1, .1, .1, .1, .1, .1, .1, .1, .1, .1], seed=100). Now I want to join 9 of the td's into a single…
krishna Prasad
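A common answer, sketched here rather than taken from the thread: the splits share one schema, so they can be folded back together with a row-wise union. `union_all` is a hypothetical helper name; on Spark < 2.0 the DataFrame method is `unionAll` rather than `union`.

```python
from functools import reduce

def union_all(dfs):
    """Fold a list of same-schema DataFrames into one, row-wise.

    Works on any objects with a pairwise `union` method, e.g.
    pyspark.sql.DataFrame (use `unionAll` on Spark < 2.0).
    """
    return reduce(lambda a, b: a.union(b), dfs)

# Against the question's splits, keeping td10 as a hold-out:
# combined = union_all([td1, td2, td3, td4, td5, td6, td7, td8, td9])
```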
15 votes, 3 answers

How to convert categorical data to numerical data in Pyspark

I am using an IPython notebook to work with pyspark applications. I have a CSV file with lots of categorical columns, used to determine whether the income falls under or over the 50k range. I would like to perform a classification algorithm taking all the…
SRS
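The usual PySpark tool here is `pyspark.ml.feature.StringIndexer`, which maps each distinct category to an integer by descending frequency. A plain-Python sketch of that mapping (the alphabetical tie-break used here is an assumption; `workclass` is a hypothetical column name):

```python
from collections import Counter

def string_index(values):
    """Map categories to 0..k-1, most frequent category first,
    ties broken alphabetically (mirrors StringIndexer's default
    frequencyDesc ordering)."""
    order = sorted(Counter(values).items(), key=lambda kv: (-kv[1], kv[0]))
    mapping = {cat: i for i, (cat, _) in enumerate(order)}
    return [mapping[v] for v in values]

# In PySpark itself:
# from pyspark.ml.feature import StringIndexer
# indexed = StringIndexer(inputCol='workclass',
#                         outputCol='workclass_idx').fit(df).transform(df)
```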
14 votes, 4 answers

Import csv file contents into pyspark dataframes

How can I import a .csv file into pyspark dataframes? I even tried to read the csv file in Pandas and then convert it to a spark dataframe using createDataFrame, but it is still showing an error. Can someone guide me through this? Also, please tell me…
neha
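For reference, the routes that typically come up (a sketch, with `people.csv` as a stand-in filename): native CSV reading arrived with Spark 2.0, while the Spark 1.x releases contemporary with this question rely on the `spark-csv` package.

```python
# Spark >= 2.0: native reader.
# df = spark.read.csv('people.csv', header=True, inferSchema=True)

# Spark 1.x (this question's era): the Databricks spark-csv package.
# df = (sqlContext.read.format('com.databricks.spark.csv')
#                 .options(header='true', inferSchema='true')
#                 .load('people.csv'))

# The pandas round-trip the asker attempted also works for small files:
# import pandas as pd
# df = spark.createDataFrame(pd.read_csv('people.csv'))
```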
12 votes, 3 answers

How do I set/get heap size for Spark (via Python notebook)

I'm using Spark (1.5.1) from an IPython notebook on a MacBook Pro. After installing Spark and Anaconda, I start IPython from a terminal by executing: IPYTHON_OPTS="notebook" pyspark. This opens a webpage listing all my IPython notebooks. I can…
Kai
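A sketch of the usual fix (hedged; exact flags depend on the Spark version): the driver heap is fixed when the JVM starts, so it has to be set before launching the notebook, not from inside it.

```shell
# Pass the driver heap at launch time; when used this way,
# PYSPARK_SUBMIT_ARGS must end with "pyspark-shell":
PYSPARK_SUBMIT_ARGS="--driver-memory 4g pyspark-shell" \
  IPYTHON_OPTS="notebook" pyspark

# Or set it persistently in conf/spark-defaults.conf:
#   spark.driver.memory    4g
#   spark.executor.memory  4g

# Reading the effective value back from the notebook (private
# attribute, so a convenience rather than a contract):
#   sc._conf.get('spark.driver.memory')
```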
12 votes, 3 answers

Issue with IPython/Jupyter on Spark (Unrecognized alias)

I am working on setting up a set of VMs to experiment with Spark before I go out and spend money on building up a cluster with some hardware. Quick note: I am an academic with a background in applied machine learning and work quite a bit in…
gcd
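The "Unrecognized alias" error typically comes from passing IPython-style options that newer Jupyter releases no longer accept. The replacement, and the only supported route from Spark 2.0 on (where IPYTHON_OPTS was removed), is the PYSPARK_DRIVER_PYTHON pair:

```shell
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark   # launches the notebook server with PySpark on the driver
```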
12 votes, 1 answer

Spark ALS: recommending for new users

The question: How do I predict the rating for a new user in an ALS model trained in Spark? (New = not seen during training time.) The problem: I'm following the official Spark ALS tutorial…
ciri
11 votes, 1 answer

PySpark dataframe repartition

What happens when we repartition a PySpark dataframe on a column, for example dataframe.repartition('id')? Does this move the data with the same 'id' to the same partition? How does the spark.sql.shuffle.partitions value affect the…
Nikhil Baby
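What the mechanics look like, sketched with a plain-Python stand-in for Spark's hash partitioner: `repartition('id')` shuffles rows so equal ids share a partition, and the output partition count defaults to `spark.sql.shuffle.partitions` (200 unless overridden). Note that Spark uses its own hash function, not Python's `hash`; this is only the placement rule.

```python
def partition_for(key, num_partitions=200):
    """Conceptual analogue of Spark's hash partitioning: equal keys
    always map to the same partition index in [0, num_partitions)."""
    return hash(key) % num_partitions

# dataframe.repartition('id') groups rows this way by the id column,
# so every row with id 42 lands in the same partition.
```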
11 votes, 3 answers

When does the cache expire for an RDD in pyspark?

We use .cache() on an RDD for persistent caching of a dataset. My concern is: when will this cache expire? dt = sc.parallelize([2, 3, 4, 5, 6]) dt.cache()
krishna Prasad
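The short answer (my summary, not taken from the thread): a cached RDD has no time-based expiry. It lives until `unpersist()` is called, the SparkContext stops, or executors evict partitions under memory pressure (LRU), in which case evicted partitions are recomputed from lineage on next use.

```python
# Assumes a live SparkContext `sc`; cache() is lazy.
# dt = sc.parallelize([2, 3, 4, 5, 6])
# dt.cache()       # only marks the RDD; nothing is stored yet
# dt.count()       # first action materializes the cached partitions
# dt.unpersist()   # the one deterministic way to release the cache
```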
10 votes, 1 answer

Spark, optimally splitting a single RDD into two

I have a large dataset that I need to split into groups according to specific parameters. I want the job to process as efficiently as possible. I can envision two ways of doing so. Option 1: create a map from the original RDD and filter. def…
j.a.gartner
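A sketch of the trade-off (my summary): Spark has no single-pass multi-output split for RDDs, so the split usually becomes two `filter()` passes over a cached parent; `randomSplit` is the only built-in multi-way split. The single-pass behavior people want, shown on a plain list for clarity:

```python
def split_once(items, pred):
    """One pass, two outputs -- what two rdd.filter() calls do in two
    passes. In Spark, cache the parent so it isn't recomputed:
        rdd.cache()
        pos = rdd.filter(pred)
        neg = rdd.filter(lambda x: not pred(x))
    """
    yes, no = [], []
    for item in items:
        (yes if pred(item) else no).append(item)
    return yes, no
```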
9 votes, 3 answers

How to run a pyspark application in the Windows 8 command prompt

I have a Python script written with a SparkContext and I want to run it. I tried to integrate IPython with Spark, but I could not do that. So, I tried to set the Spark path [ Installation folder/bin ] as an environment variable and called…
SRS
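A sketch of the usual Windows setup (paths are placeholders): point SPARK_HOME at the install, put its bin directory on PATH, then run standalone scripts through spark-submit rather than plain python.

```shell
:: cmd.exe session (placeholder install path):
set SPARK_HOME=C:\spark
set PATH=%PATH%;%SPARK_HOME%\bin

:: Run the script through spark-submit so the Spark runtime is wired up:
spark-submit your_script.py
```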
8 votes, 4 answers

How to select a particular column in Spark (pyspark)?

testPassengerId = test.select('PassengerId').map(lambda x: x.PassengerId) I want to select the PassengerId column and make an RDD of it. But .select is not working; it says 'RDD' object has no attribute 'select'.
dsl1990
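The error itself is the starting point: `.select` is a DataFrame method, so `test` here is already an RDD. The access pattern, with `collections.namedtuple` standing in for `pyspark.sql.Row`:

```python
from collections import namedtuple

Row = namedtuple('Row', ['PassengerId'])   # stand-in for pyspark.sql.Row
rows = [Row(1), Row(2), Row(3)]
ids = [row.PassengerId for row in rows]    # the map() body, per row

# On a real DataFrame (.rdd needed on Spark 2.x; 1.x allowed map directly):
# ids = test.select('PassengerId').rdd.map(lambda r: r.PassengerId)
# On an RDD of Rows, skip select and map directly:
# ids = test.map(lambda r: r.PassengerId)
```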
7 votes, 1 answer

Using Apache Spark to do ML. Keep getting serialization errors

So I'm using Spark to do sentiment analysis, and I keep getting errors with the serializers it uses (I think) to pass Python objects around. PySpark worker failed with exception: Traceback (most recent call last): File…
seashark97
7 votes, 1 answer

Pyspark: Filter dataframe based on separate specific conditions

How can I select only certain entries that match my condition and, from those entries, filter again using a regex? For instance, I have this data frame…
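A minimal sketch of combining the two stages. In PySpark the equivalent is a single `df.filter((condition) & df.col.rlike(pattern))`; the tuple fields and regex below are purely illustrative.

```python
import re

# Rows as (category, name) tuples; keep category 'A' whose name
# matches the regex -- the same logic as chaining filter + rlike.
rows = [('A', 'foo1'), ('A', 'bar2'), ('B', 'foo3')]
matched = [r for r in rows if r[0] == 'A' and re.match(r'foo\d', r[1])]
# matched == [('A', 'foo1')]
```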
6 votes, 2 answers

Model ensemble with Spark or Scikit Learn

I am using Spark MLlib to make predictions and I would like to know if it is possible to create your own custom Estimators. Here is a reproducible example of what I would like my model to do with the Spark API: from sklearn.datasets import load_diabetes import…
Robin Nicole
6 votes, 2 answers

Why is there a difference of "ML" vs "MLLIB" in Apache Spark's documentation?

I am trying to figure out which pyspark library to use with Word2Vec, and I'm presented with two options according to the pyspark…
Gabriel Fair
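The short answer (my summary, not taken from the thread): `pyspark.mllib` is the original RDD-based API and `pyspark.ml` is the DataFrame-based pipeline API; since Spark 2.0, `mllib` is in maintenance mode, so new code should prefer `pyspark.ml`. Both expose a Word2Vec:

```python
# RDD-based API (maintenance mode since Spark 2.0):
# from pyspark.mllib.feature import Word2Vec

# DataFrame-based pipeline API (actively developed; preferred):
# from pyspark.ml.feature import Word2Vec
```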