DATA Step, Macro, Functions and more

Split comma separated string on file read

Reply
Contributor
Posts: 56

Split comma separated string on file read

I have a CSV file that contains a field that is a comma separated list. An example would be

 

C1      C2      C3        C4
Tab1   DB1    S1        "A,B,C,D,E"
Tab2   DB1    S1        "A,B,C"
Tab3   DB2    S1        "D,E"
Tab4   DB2    S3        "A,B,C,D,E,G"

The output desired would be

 

C1         C2           C3          C4
Tab1      DB1        S1           A
Tab1      DB1        S1           B
Tab1      DB1        S1           C
Tab1      DB1        S1           D
Tab1      DB1        S1           E
Tab2      DB1        S1           A
Tab2      DB1        S1           B
Tab2      DB1        S1           C

etc...

I am currently able to read the data into SAS with the 4 variables, and then split the C4 variable using SCAN function. 

 

However, I'm curious if this could be handled in the input data step, instead of doing it in two separate steps. Given the size of the input file being in the hundreds of millions of rows, only having to read through the data once would be very helpful.

 

I have tried a few variations of the construct..

 


data t;
  infile "naming_test.csv" missover dlm="," dsd;
  length NAME $ 50;
  input NAME $ @ ;
  var = 0;
  do while (name ne ' ');
    var+1; /* keep track of a counter for testing */
    output;
    input name $ @;
  end;
run;

However, I haven't been able to crack it. Thanks for your help!

PROC Star
Posts: 1,760

Re: Split comma separated string on file read

Like this?

data _null_;
  file "%sysfunc(pathname(WORK))\t.txt";
  put 'C1    ,  C2  ,    C3    ,    C4           ';
  put 'Tab1  , DB1  ,  S1      ,  "A,B,C,D,E"    ';
  put 'Tab2  , DB1  ,  S1      ,  "A,B,C"        ';
  put 'Tab3  , DB2  ,  S1      ,  "D,E"          ';
  put 'Tab4  , DB2  ,  S3      ,  "A,B,C,D,E,G"  ';
run;
data WANT;
  infile "%sysfunc(pathname(WORK))\t.txt" dlm=',' dsd firstobs=2;
  input C1 $ C2 $ C3 $ C4 : $1.;
run;
C1 C2 C3 C4
Tab1 DB1 S1 A
Tab2 DB1 S1 A
Tab3 DB2 S1 D
Tab4 DB2 S3 A

 

Trusted Advisor
Posts: 1,022

Re: Split comma separated string on file read

I didn't  know this before, but if you drop the DSD option on  INFILE, then the quoted comma-separated values will no longer be treated as a single value.  Which means that the first value ("A in data line 1) will be read in alone.  You can  then do a "DO WHILE" loop along the lines you contemplate:

 

dm 'clear log';
data t ;
  infile datalines  missover dlm="," ;
  input (c1-c3) (:$10.) C4 :$8. @;

  do while (C4^=' ');
    c4=compress(c4,'"');
    output;
    input c4 :$14. @;
  end;
datalines;
Tab1,   DB1,    S1,        "A,B,C,D,E"
Tab2,   DB1,    S1,        "A,B,C"
Tab3,   DB2,    S1,        "D,E"
Tab4,   DB2,    S3,        "A,B,C,D,E,G"
run;

This works fine as long as you don't have consecutive commas in your raw data as a way to indicate a missing value.

 

Super User
Posts: 7,854

Re: Split comma separated string on file read

data t;
infile "...." dlm="," dsd;
input c1 $ c2 $ c3 $ _c4 :$20.;
length c4 $1;
do i = 1 to countw(_c4,',');
  c4 = scan(_c4,i,',');
  output;
end;
drop i _c4;
run;

So you see you can do it in the input data step already.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Ask a Question
Discussion stats
  • 3 replies
  • 113 views
  • 0 likes
  • 4 in conversation