4

I have received a dataset as a text file in the following format:

col1=datac1r1,col2=datac2r1,col3=datac3r1
col1=datac1r2,col2=datac2r2,col3=datac3r2
col1=datac1r3,col2=datac2r3,col3=datac3r3
col1=datac1r4,col2=datac2r4,col3=datac3r4

Each row is a unique entry with comma-separated columns, except that the column name is repeated in every element.

I need to parse this in R and analyze it. I have worked with CSV files extensively, but I have never seen this format before.

Is this a standard format that I can import directly, or do I need to write a script to convert it to CSV?

tallharish

2 Answers

5

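You can use Miller (https://github.com/johnkerl/miller). This key=value layout is Miller's default input format (DKVP, delimited key-value pairs), so no input flag is needed:

mlr --ocsv unsparsify input.txt

This produces the following CSV: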

col1,col2,col3
datac1r1,datac2r1,datac3r1
datac1r2,datac2r2,datac3r2
datac1r3,datac2r3,datac3r3
datac1r4,datac2r4,datac3r4
aborruso
0

In a perfect world, Miller (aborruso's answer) would do it. However, I also needed to clean the dataset, so I wrote the following function, which does it all. The key pieces are the extract_header() and remove_header() helper functions.

clean_dataset <- function(dataset) {
  # Convert all elements to character (for now)
  dataset[] <- lapply(dataset, as.character)

  # Derive the header from the first row: the text before the "="
  extract_header <- function(x) {
    strsplit(x, "=")[[1]][1]
  }

  names(dataset) <- vapply(dataset[1, ], extract_header, character(1))

  # Strip the "name=" prefix from every element, keeping the text after it
  remove_header <- function(x) {
    strsplit(x, "=")[[1]][2]
  }

  dataset <- as.data.frame(apply(dataset, MARGIN = c(1, 2), remove_header))

  # Drop columns for which no header could be extracted
  dataset[!is.na(names(dataset))]
}

dataset <- clean_dataset(dataset)
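
For comparison, here is a minimal base-R sketch of the same idea that works on the raw lines instead of a data frame that has already been read in. It is only an illustration: the helper name parse_line and the path in the comment are hypothetical, and it assumes that every line has the same keys in the same order and that no value contains a comma.

```r
# Sample input in the key=value format from the question. For a real file,
# this could be replaced with readLines("input.txt") (hypothetical path).
lines <- c(
  "col1=datac1r1,col2=datac2r1,col3=datac3r1",
  "col1=datac1r2,col2=datac2r2,col3=datac3r2"
)

# Hypothetical helper: turn one line into a named character vector.
# Keys are the text before the first "=", values are everything after it.
parse_line <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  keys   <- sub("=.*$", "", fields)
  values <- sub("^[^=]*=", "", fields)
  setNames(values, keys)
}

rows <- lapply(lines, parse_line)

# Assumption: all lines share the same keys in the same order, so the
# column names can be taken from the first parsed line.
parsed <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

Unlike remove_header() above, this keeps any further "=" characters that appear inside a value, because only the prefix up to the first "=" is stripped.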
tallharish