File format where column names are repeated on each row

Question

I have received a dataset in a text file with the following format:

col1=datac1r1,col2=datac2r1,col3=datac3r1
col1=datac1r2,col2=datac2r2,col3=datac3r2
col1=datac1r3,col2=datac2r3,col3=datac3r3
col1=datac1r4,col2=datac2r4,col3=datac3r4

Each row is a unique entry with comma-separated columns, except that the column name is repeated in every element.

I need to parse this in R and analyze it. I have worked with CSV files extensively, but I have never seen this format before.

Is this a standard format that I can import directly, or do I need to write a script to convert it to CSV?

aborruso · Accepted Answer · score 5 · 2019-07-20T07:29:32.343

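Use Miller (https://github.com/johnkerl/miller). Its default input format, DKVP, is exactly this layout of comma-separated key=value pairs, so only the output format needs to be specified: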

mlr --ocsv unsparsify input.txt

This produces the following CSV:

col1,col2,col3
datac1r1,datac2r1,datac3r1
datac1r2,datac2r2,datac3r2
datac1r3,datac2r3,datac3r3
datac1r4,datac2r4,datac3r4

edited Jul 20 '19 at 07:29

answered Jul 18 '19 at 12:17

Sorry for the confusion. I didn't intend to have () there. – tallharish Jul 19 '19 at 15:56
Ok, I have edited my script – aborruso Jul 19 '19 at 19:41
@tallharish please try it and let me know if it works now – aborruso Jul 20 '19 at 07:29
@tallharish please let me know if it works – aborruso Jul 24 '19 at 06:36
@tallharish did you try? – aborruso Aug 20 '19 at 08:44

score 0 · Answer 2 · answered Aug 20 '19 at 23:05

In a perfect world, Miller (aborruso's answer) would do it. However, I needed to clean the dataset as well, so I wrote the following function, which does it all. The key pieces are the extract_header() and remove_header() functions.

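clean_dataset <- function(dataset) {
  # Convert every column to character so strsplit() works on all cells
  dataset[] <- lapply(dataset, as.character)

  # The column name is the text before "=" in a cell
  extract_header <- function(x) {
    strsplit(x, "=")[[1]][1]
  }

  # Take the names from the first row; vapply() returns a character
  # vector, whereas lapply() would pass a list to names<-
  names(dataset) <- vapply(dataset[1, ], extract_header, character(1),
                           USE.NAMES = FALSE)

  # The value is the text after "=" in a cell (NA if there is none)
  remove_header <- function(x) {
    strsplit(x, "=")[[1]][2]
  }

  dataset <- as.data.frame(apply(dataset, MARGIN = c(1, 2), remove_header),
                           stringsAsFactors = FALSE)
  # Keep only the columns for which a name was extracted
  dataset[!is.na(names(dataset))]
}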

dataset <- clean_dataset(dataset)
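
A variant that parses the raw lines directly, without first loading them into a data frame, might look like the sketch below. It assumes that values contain no commas, and it splits each field at its first "=", so a value that itself contains "=" is kept whole. The function name parse_kv_lines and the file name input.txt are made up for illustration.

```r
# Parse lines of the form "col1=a,col2=b,col3=c" into a data frame.
# Assumption: values contain no commas. Keys may differ between rows.
parse_kv_lines <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]            # drop blank lines
  rows <- lapply(strsplit(lines, ",", fixed = TRUE), function(fields) {
    keys   <- sub("=.*$", "", fields)              # text before the first "="
    values <- sub("^[^=]*=", "", fields)           # text after the first "="
    setNames(values, keys)
  })
  all_keys <- unique(unlist(lapply(rows, names)))
  # Look up every key in every row; keys missing from a row become NA
  mat <- do.call(rbind, lapply(rows, function(r) unname(r[all_keys])))
  out <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(out) <- all_keys
  out
}

# Example with two rows from the question
lines <- c("col1=datac1r1,col2=datac2r1,col3=datac3r1",
           "col1=datac1r2,col2=datac2r2,col3=datac3r2")
df <- parse_kv_lines(lines)

# With a real file (hypothetical name):
# df <- parse_kv_lines(readLines("input.txt"))
```

All columns come back as character; type.convert(df, as.is = TRUE) can then be used to turn numeric-looking columns into numbers.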

If you could share your "real" input file, I'd like to try to process it – aborruso Aug 21 '19 at 07:48
