I have a CSV file that contains a field that is a comma separated list. An example would be
C1 C2 C3 C4
Tab1 DB1 S1 "A,B,C,D,E"
Tab2 DB1 S1 "A,B,C"
Tab3 DB2 S1 "D,E"
Tab4 DB2 S3 "A,B,C,D,E,G"
The output desired would be
C1 C2 C3 C4
Tab1 DB1 S1 A
Tab1 DB1 S1 B
Tab1 DB1 S1 C
Tab1 DB1 S1 D
Tab1 DB1 S1 E
Tab2 DB1 S1 A
Tab2 DB1 S1 B
Tab2 DB1 S1 C
etc...
I am currently able to read the data into SAS with the 4 variables, and then split the C4 variable using SCAN function.
However, I'm curious if this could be handled in the input data step, instead of doing it in two separate steps. Given the size of the input file being in the hundreds of millions of rows, only having to read through the data once would be very helpful.
I have tried a few variations of the construct..
data t;
infile "naming_test.csv" missover dlm="," dsd;
length NAME $ 50;
input NAME $ @ ;
var = 0;
do while (name ne ' ');
var+1; /* keep track of a counter for testing */
output;
input name $ @;
end;
run;
However, I haven't been able to crack it. Thanks for your help!
Like this?
data _null_;
file "%sysfunc(pathname(WORK))\t.txt";
put 'C1 , C2 , C3 , C4 ';
put 'Tab1 , DB1 , S1 , "A,B,C,D,E" ';
put 'Tab2 , DB1 , S1 , "A,B,C" ';
put 'Tab3 , DB2 , S1 , "D,E" ';
put 'Tab4 , DB2 , S3 , "A,B,C,D,E,G" ';
run;
data WANT;
infile "%sysfunc(pathname(WORK))\t.txt" dlm=',' dsd firstobs=2;
input C1 $ C2 $ C3 $ C4 : $1.;
run;
C1 | C2 | C3 | C4 |
---|---|---|---|
Tab1 | DB1 | S1 | A |
Tab2 | DB1 | S1 | A |
Tab3 | DB2 | S1 | D |
Tab4 | DB2 | S3 | A |
I didn't know this before, but if you drop the DSD option on INFILE, then the quoted comma-separated values will no longer be treated as a single value. Which means that the first value ("A in data line 1) will be read in alone. You can then do a "DO WHILE" loop along the lines you contemplate:
dm 'clear log';
data t ;
infile datalines missover dlm="," ;
input (c1-c3) (:$10.) C4 :$8. @;
do while (C4^=' ');
c4=compress(c4,'"');
output;
input c4 :$14. @;
end;
datalines;
Tab1, DB1, S1, "A,B,C,D,E"
Tab2, DB1, S1, "A,B,C"
Tab3, DB2, S1, "D,E"
Tab4, DB2, S3, "A,B,C,D,E,G"
run;
This works fine as long as you don't have consecutive commas in your raw data as a way to indicate a missing value.
data t;
infile "...." dlm="," dsd;
input c1 $ c2 $ c3 $ _c4 :$20.;
length c4 $1;
do i = 1 to countw(_c4,',');
c4 = scan(_c4,i,',');
output;
end;
drop i _c4;
run;
So you see you can do it in the input data step already.
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.