Questions tagged [bigdata]

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.

458 questions
94
votes
12 answers

How big is big data?

Lots of people use the term big data in a rather commercial way, as a means of indicating that large datasets are involved in the computation, and therefore potential solutions must have good performance. Of course, big data always carry associated…
Rubens
  • 4,097
  • 5
  • 23
  • 42
59
votes
9 answers

How to deal with version control of large amounts of (binary) data

I am a PhD student of Geophysics and work with large amounts of image data (hundreds of GB, tens of thousands of files). I know svn and git fairly well and have come to value a project history, combined with the ability to easily work together and have…
Johann
  • 701
  • 1
  • 5
  • 5
55
votes
9 answers

Is the R language suitable for Big Data

R has many libraries which are aimed at Data Analysis (e.g. JAGS, BUGS, ARULES etc.), and is mentioned in popular textbooks such as: J. Kruschke, "Doing Bayesian Data Analysis"; B. Lantz, "Machine Learning with R". I've seen a guideline of 5TB for a…
akellyirl
  • 723
  • 1
  • 6
  • 9
48
votes
5 answers

Opening a 20GB file for analysis with pandas

I am currently trying to open a file with pandas and Python for machine learning purposes; it would be ideal for me to have it all in a DataFrame. The file is 18 GB and my RAM is 32 GB, but I keep getting memory errors. From your experience…
Hari Prasad
  • 491
  • 1
  • 5
  • 4
46
votes
12 answers

Data Science in C (or C++)

I'm an R language programmer. I'm also in the group of people who are considered Data Scientists but who come from academic disciplines other than CS. This works out well in my role as a Data Scientist; however, by starting my career in R and only…
Hack-R
  • 1,919
  • 1
  • 21
  • 34
40
votes
10 answers

Do I need to learn Hadoop to be a Data Scientist?

An aspiring data scientist here. I don't know anything about Hadoop, but as I have been reading about Data Science and Big Data, I see a lot of talk about Hadoop. Is it absolutely necessary to learn Hadoop to be a Data Scientist?
Pensu
  • 591
  • 1
  • 4
  • 8
36
votes
6 answers

How to do SVD and PCA with big data?

I have a large set of data (about 8GB). I would like to use machine learning to analyze it. So, I think that I should use SVD then PCA to reduce the data dimensionality for efficiency. However, MATLAB and Octave cannot load such a large…
David S.
  • 547
  • 2
  • 6
  • 8
28
votes
3 answers

Data Science Project Ideas

I don't know if this is the right place to ask this question, but a community dedicated to Data Science should be the most appropriate place in my opinion. I have just started with Data Science and Machine Learning. I am looking for long-term project…
Kevin Desai
  • 383
  • 1
  • 3
  • 4
28
votes
5 answers

Improve the speed of t-sne implementation in python for huge data

I would like to do dimensionality reduction on nearly 1 million vectors, each with 200 dimensions (doc2vec). I am using the TSNE implementation from the sklearn.manifold module for it, and the major problem is time complexity. Even with method = barnes_hut,…
chmodsss
  • 1,954
  • 2
  • 17
  • 37
21
votes
3 answers

Uses of NoSQL database in data science

How can NoSQL databases like MongoDB be used for data analysis? What features do they have that can make data analysis faster and more powerful?
10land
  • 369
  • 3
  • 10
17
votes
2 answers

Use liblinear on big data for semantic analysis

I use LIBSVM to train on data and predict classifications for a semantic analysis problem. But it has a performance issue on large-scale data, because semantic analysis is an n-dimensional problem. Last year, LIBLINEAR was released, and it can solve…
Puffin GDI
  • 283
  • 3
  • 15
14
votes
9 answers

Is Python suitable for big data

I read in the post "Is the R language suitable for Big Data" that big data constitutes 5TB, and while it does a good job of providing information about the feasibility of working with this type of data in R, it provides very little information about…
ragingSloth
  • 1,824
  • 3
  • 14
  • 15
14
votes
3 answers

When are p-values deceptive?

What are the data conditions that we should watch out for, where p-values may not be the best way of deciding statistical significance? Are there specific problem types that fall into this category?
user179
  • 143
  • 1
  • 4
14
votes
4 answers

Looking for example infrastructure stacks/workflows/pipelines

I'm trying to understand how all the "big data" components play together in a real-world use case, e.g. hadoop, mongodb/nosql, storm, kafka, ... I know that this is quite a wide range of tools used for different types, but I'd like to get to know…
14
votes
4 answers

Big data case study or use case example

I have read a lot of blogs/articles on how different types of industries are using Big Data analytics. But most of these articles fail to mention what kind of data these companies used, what the size of the data was, and what kind of tools and technologies they…
Brown_Dynamite
  • 241
  • 2
  • 6