Questions tagged [scala]

Scala is a high-level language that combines functional and object-oriented programming and runs on high-performance runtimes. It is built for implementing scalable solutions, which makes it a good fit for big data and data science workloads and for large-scale projects. Designed to express common programming patterns in a concise, elegant, and type-safe way, it fuses imperative and functional programming styles.

48 questions
15
votes
4 answers

Data Science Tools Using Scala

I know that Spark is fully integrated with Scala. Its use case is specifically for large data sets. Which other tools have good Scala support? Is Scala best suited for larger data sets? Or is it also suited for smaller data sets?
sheldonkreger
  • 1,169
  • 8
  • 20
14
votes
2 answers

How to calculate the mean of a dataframe column and find the top 10%

I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use SparkSQL to select…
the3rdNotch
  • 243
  • 1
  • 2
  • 7
6
votes
3 answers

Summarize and visualize a CSV in Java/Scala?

I would like to summarize (as in R) the contents of a CSV (possibly after loading it, or storing it somewhere, that's not a problem). The summary should contain the quartiles, mean, median, min and max of the data in a CSV file for each numeric…
Trylks
  • 178
  • 8
5
votes
1 answer

Saving Large Spark ML Pipeline to HDFS

I'm having trouble saving a large (relative to spark.rpc.message.maxSize) Spark ML pipeline to HDFS. Specifically, when I try to save the model to HDFS, it gives me an error related to Spark's maximum message size: scala> val mod =…
Thomas Cleberg
  • 1,505
  • 7
  • 22
5
votes
2 answers

What is the best deep learning library for Scala?

Does anyone have a recommendation for which libraries to use for deep learning?
Soerendip
  • 724
  • 1
  • 9
  • 16
5
votes
2 answers

Distributed k-means in Spark

I want to implement the K-means algorithm in Spark. I am looking for a starting point and I found Berkeley's naive implementation. However, is that distributed? I mean, I see no MapReduce operations. Or maybe, when submitted in Spark, the framework…
gsamaras
  • 291
  • 6
  • 15
4
votes
1 answer

Plotting libraries for Scala on Zeppelin

My main issue is that Zeppelin appears to limit the display of results to 1000. I know that I can change this number, but when I do, Zeppelin becomes slow. It also looks like Zeppelin's default plotting tool plots the first 1000…
Rami
  • 594
  • 1
  • 5
  • 16
4
votes
1 answer

Has anyone succeeded in finding a good Scala/Spark kernel for Jupyter?

The ones I've tried so far: Almond: works very well for just Scala, but you have to import dependencies, and it gets tedious after a while. Unfortunately, it can't run when using Spark with YARN instead of local mode. Spylon-kernel: the kernel connects, but…
4
votes
2 answers

How does MLlib in Spark select variables in logistic regression?

I have a question about MLlib in Spark (with Scala). I'm trying to understand how LogisticRegressionWithLBFGS and LogisticRegressionWithSGD work. I usually use SAS or R to do logistic regressions, but I now have to do it on Spark to be able to analyze…
4
votes
1 answer

How to set up multi-cluster Spark without Hadoop on Google Compute Engine

I'm new to Apache Spark. Is it possible to configure multi-cluster Spark without Hadoop? If so, can you please provide the steps? I would like to create clusters on Google Compute Engine (1 master, 1 worker).
user4290511
  • 101
  • 6
4
votes
3 answers

Scala vs Java if you're NOT going to use Spark?

I'm facing some indecision when choosing how to allocate my scarce learning time for the next few months between Scala and Java. I would like help objectively understanding the practical tradeoffs. The reason I am interested in Java is that I think…
Hack-R
  • 1,919
  • 1
  • 21
  • 34
3
votes
1 answer

Sampling with replacement, specify the probabilities

I am trying to do sampling with replacement in Scala/Spark, defining the probabilities for each class. This is how I would do it in R. # Vector to sample from x <- c("User1","User2","User3","User4","User5") # Occurrences from which to obtain…
Stefano
  • 31
  • 2
3
votes
3 answers

How to install Polynote on Windows?

I've been searching around the Internet for a while, but I have not been able to find detailed instructions on how to install Polynote (the polyglot notebook with first-class Scala support) on Windows with support for mixing multiple languages, Python and…
Pluviophile
  • 3,520
  • 11
  • 29
  • 49
3
votes
2 answers

Hashing trick with random forest in Scala

I am trying to apply the hashing trick and then train a random forest with Scala. I have the following code: val documents: RDD[Seq[String]] = sc.textFile("hdfs:///tmp/new_cromosoma12v2.csv").map(_.split(",").toSeq) val hashingTF = new HashingTF() val…
keira
  • 101
  • 1
  • 8
3
votes
3 answers

Task not serializable Error

import org.apache.spark.SparkContext import org.apache.spark.SparkConf import org.apache.spark.sql.cassandra.CassandraSQLContext object Test { val sparkConf = new SparkConf(true).set("spark.cassandra.connection.host", ) val…
Credosam
  • 81
  • 1
  • 10