
I have a 7 GB confidential dataset that I want to use for a machine learning application.

I tried:

Every package recommended for efficient dataset management in R, such as:

  • data.table,
  • ff,
  • and sqldf,

all with no success.

From what I read, data.table needs to load all the data into memory, so it obviously will not work since my computer only has 4 GB of RAM. ff leads to a memory error too.
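For illustration, here is a minimal sketch of the kind of read I mean (the file name and column names are made up); even when fread is restricted to a few columns via select, the result still has to fit entirely in RAM, which is exactly the problem:

```r
# Minimal sketch, not my exact code: fread with a column subset.
# "events.csv" and the column names are hypothetical.
library(data.table)

dt <- fread("events.csv", sep = ";",
            select = c("event_date", "geoloc", "event_type"))  # still a fully in-memory table
```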

So I decided to turn to a DBMS instead, and I tried:

  • MySQL, which managed to load my dataset in 2 hours and 21 minutes. Then I began my queries (I have a few queries to run to prepare my data before exporting a smaller set to R for the machine learning application), and I had to wait for hours before getting the message "The total number of locks exceeds the lock table size" (my query was just an UPDATE to extract the month from a date for each row; see the sketch after this list).
  • I read that PostgreSQL was similar to MySQL in performance, so I didn't try it.
  • I read that Redis was really performant but not at all suited to the kind of massive import I need here, so I didn't try it.
  • I tried MongoDB, the up-and-coming NoSQL solution that I hear about everywhere. Not only do I find it rather disturbing that mongoimport is so limited in options (I had to change all semicolons to commas with sed before I could import the data), but it also seems to be less performant than MySQL, since I launched the load yesterday and it is still running.
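Regarding the UPDATE that hit the lock-table limit: as far as I understand, the month does not need to be materialised at all; it can be computed on the fly inside the GROUP BY query, which avoids touching every row in one giant transaction. A sketch of what I mean through DBI (the RMySQL driver; the table and column names are hypothetical):

```r
# Sketch only: extract the month at query time instead of UPDATE-ing every row.
# The table "events" and the columns event_date / event_type are hypothetical.
library(DBI)

con <- dbConnect(RMySQL::MySQL(), dbname = "mydb",
                 user = "me", password = "secret")

monthly <- dbGetQuery(con, "
  SELECT MONTH(event_date) AS event_month,
         event_type,
         COUNT(*) AS n_events
  FROM events
  GROUP BY MONTH(event_date), event_type
")

dbDisconnect(con)
```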

What I can't try: the data are confidential, so I don't really want to rent space on an Azure or Amazon cloud solution. I am not sure the dataset is so big that I have to turn to a Hadoop solution, but maybe I am wrong about that.

Is there a performant open-source solution that I haven't tried that you would recommend for running SQL-like queries on a biggish dataset?


Edit: some more details about what I want to do with these data, to help you visualize them. They are events with a timestamp and a geolocation. I have 8 billion lines. Examples of what I want to do:

  • standardize the series identified by geolocation (for example, I need to compute means grouped by geolocation),
  • compute the average count of events by type of season, day, etc. (the usual GROUP BY SQL query); a chunked sketch is below.
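The second bullet, at least, can be computed in a single pass over the raw file by reading it chunk by chunk and accumulating per-group counts, so nothing bigger than one chunk ever sits in RAM. A rough base-R sketch, with a made-up file layout, column names and chunk size:

```r
# One-pass, chunked aggregation: average daily number of events per event type.
# Assumes a semicolon-separated file with a header containing at least
# event_date and event_type (hypothetical names).
chunk_size <- 1e6
con <- file("events.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ";")[[1]]

counts <- numeric(0)  # named running counts, one entry per (type, day)

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  chunk <- read.table(text = lines, sep = ";", col.names = header,
                      stringsAsFactors = FALSE)
  # count events per (event_type, day) within this chunk
  tab <- table(paste(chunk$event_type, as.Date(chunk$event_date), sep = "|"))
  new_keys <- setdiff(names(tab), names(counts))
  counts[new_keys] <- 0
  counts[names(tab)] <- counts[names(tab)] + as.numeric(tab)
}
close(con)

# Collapse the (type, day) counts and average over days for each type
parts <- do.call(rbind, strsplit(names(counts), "|", fixed = TRUE))
daily <- data.frame(event_type = parts[, 1], day = parts[, 2],
                    n = as.numeric(counts), stringsAsFactors = FALSE)
aggregate(n ~ event_type, data = daily, FUN = mean)
```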

Edit

As a start of an answer for those who, like me, have limited hardware: RSQLite seems to be a possibility (a sketch of the pattern is below). I am still interested in other people's experiences.
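For concreteness, this is roughly the pattern RSQLite allows: import the CSV chunk by chunk so that only one chunk is ever in RAM, index the grouping column, then run the GROUP BY queries in SQL. The table and column names below are made up and the chunk size is arbitrary:

```r
# Sketch of a chunked import into SQLite followed by SQL aggregation.
# "events.csv", the column names and the chunk size are hypothetical.
library(DBI)
library(RSQLite)

db <- dbConnect(SQLite(), "events.sqlite")

reader <- file("events.csv", open = "r")
header <- strsplit(readLines(reader, n = 1), ";")[[1]]
repeat {
  lines <- readLines(reader, n = 1e6)
  if (length(lines) == 0) break
  chunk <- read.table(text = lines, sep = ";", col.names = header,
                      stringsAsFactors = FALSE)
  dbWriteTable(db, "events", chunk, append = TRUE)  # appends chunk to the table on disk
}
close(reader)

# Indexing the grouping column makes the aggregations much faster
dbExecute(db, "CREATE INDEX IF NOT EXISTS idx_geoloc ON events (geoloc)")

# Mean per geolocation, computed inside SQLite rather than in R
means <- dbGetQuery(db, "
  SELECT geoloc, AVG(value) AS mean_value, COUNT(*) AS n
  FROM events
  GROUP BY geoloc
")

dbDisconnect(db)
```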

Stéphanie C
  • I suspect the answer to this question will depend somewhat on the analysis method you're hoping to use; so more detail there would be helpful. – conjectures Jun 22 '15 at 15:07
  • Ok I gave some examples, thank you for your interest @conjectures – Stéphanie C Jun 22 '15 at 15:27
  • out of interest, with mongodb did you try using ensureIndex: http://www.tutorialspoint.com/mongodb/mongodb_indexing.htm (the indexing process itself takes a while, but speeds up queries afterwards) – conjectures Jun 22 '15 at 15:43
  • Also, the second bullet is totally possible by reading the file one line at a time and adding to the running total for whichever groups the line belongs to, along with the corresponding group counts. – conjectures Jun 22 '15 at 15:45
  • No, I didn't try ensureIndex; to be honest it was almost my first experience with MongoDB, so it may well be possible to optimize. It's probable that I can do every query I want with Mongo, but I am worried about the time it will take, and if it is not the proper solution for my purpose I am ready to change. Thank you for the link! – Stéphanie C Jun 22 '15 at 16:03
  • Why not buy more RAM? – reinierpost Jun 24 '15 at 09:37
  • I tried MonetDB, which failed to load my big dataset, but I managed to load it with RSQLite in 2 hours. – Stéphanie C Jun 24 '15 at 09:30
  • @reinierpost: I will obviously not buy extra RAM for my work computer, but I am thinking about upgrading my personal hardware. Actually, before I struggled with these data I had no idea that 4 GB of RAM was very little. It is not clear to beginners what you can or cannot do depending on the software and hardware... – Stéphanie C Jun 24 '15 at 16:19
  • @Stéphanie C: I don't know who is paying for your computer, but if they are also paying for your time, getting them to add more RAM may actually be the most cost-effective solution for them - unless you plan to work with much larger data sets later on. – reinierpost Jun 25 '15 at 15:37
  • A friend of mine has been hired as a data scientist at a private company and got a 16 GB RAM computer; do you think that is a benchmark of the acceptable minimum for data science? @reinierpost – Stéphanie C Jun 27 '15 at 15:31
  • @Stéphanie C: I don't know, it entirely depends on the amounts of data you need to process and the complexity of the processing (e.g. selecting is usually cheap, joining expensive). – reinierpost Jun 27 '15 at 19:04

1 Answer


Analyzing 8 billion lines on a 4 GB computer is pretty silly, but you can try:

http://www.asdfree.com/2013/03/column-store-r-or-how-i-learned-to-stop.html
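To give an idea of what that column-store route looks like in practice, here is a sketch; it assumes the MonetDBLite package (an embedded MonetDB for R), whereas the linked post uses MonetDB.R against a standalone MonetDB server, and the table and column names are made up:

```r
# Sketch, assuming the MonetDBLite package; table and columns are hypothetical.
library(DBI)

# Embedded MonetDB database kept in a local directory (no server to install)
con <- dbConnect(MonetDBLite::MonetDBLite(), "~/monetdb_events")

# Toy data standing in for a real chunked import via dbWriteTable(..., append = TRUE)
dbWriteTable(con, "events", data.frame(geoloc = c("a", "a", "b"),
                                       value  = c(1, 3, 5)))

# Column-store aggregation: only the columns named in the query are read
dbGetQuery(con, "SELECT geoloc, AVG(value) AS mean_value FROM events GROUP BY geoloc")

dbDisconnect(con, shutdown = TRUE)
```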

  • Thank you for your answer, I will check that. I hope I am not that silly; I am a statistician and still have a lot to learn about computers from a hardware point of view (it is not clear to me why RAM is necessary when you don't need access to all the data at the same time, but maybe read and write speed essentially depends on RAM...). I have limited resources at work and at home, so I am wondering: how do people who take part in challenges like Kaggle as a hobby cope with these issues? Are they necessarily very well equipped? – Stéphanie C Jun 23 '15 at 09:10
  • I tried MonetDB with no success, but at least I succeeded in loading the data with RSQLite; thank you for answering though. – Stéphanie C Jun 24 '15 at 16:21
  • You are correct: the reason to add RAM is access speed, primarily for writing. – reinierpost Jun 25 '15 at 15:39