10-25-2017 10:30 PM
I have a CSV file that contains a field that is a comma separated list. An example would be
C1 C2 C3 C4 Tab1 DB1 S1 "A,B,C,D,E" Tab2 DB1 S1 "A,B,C" Tab3 DB2 S1 "D,E" Tab4 DB2 S3 "A,B,C,D,E,G"
The output desired would be
C1 C2 C3 C4 Tab1 DB1 S1 A Tab1 DB1 S1 B Tab1 DB1 S1 C Tab1 DB1 S1 D Tab1 DB1 S1 E Tab2 DB1 S1 A Tab2 DB1 S1 B Tab2 DB1 S1 C etc...
I am currently able to read the data into SAS with the 4 variables, and then split the C4 variable using SCAN function.
However, I'm curious if this could be handled in the input data step, instead of doing it in two separate steps. Given the size of the input file being in the hundreds of millions of rows, only having to read through the data once would be very helpful.
I have tried a few variations of the construct..
data t; infile "naming_test.csv" missover dlm="," dsd; length NAME $ 50; input NAME $ @ ; var = 0; do while (name ne ' '); var+1; /* keep track of a counter for testing */ output; input name $ @; end; run;
However, I haven't been able to crack it. Thanks for your help!
10-25-2017 11:19 PM
data _null_; file "%sysfunc(pathname(WORK))\t.txt"; put 'C1 , C2 , C3 , C4 '; put 'Tab1 , DB1 , S1 , "A,B,C,D,E" '; put 'Tab2 , DB1 , S1 , "A,B,C" '; put 'Tab3 , DB2 , S1 , "D,E" '; put 'Tab4 , DB2 , S3 , "A,B,C,D,E,G" '; run; data WANT; infile "%sysfunc(pathname(WORK))\t.txt" dlm=',' dsd firstobs=2; input C1 $ C2 $ C3 $ C4 : $1.; run;
10-25-2017 11:25 PM
I didn't know this before, but if you drop the DSD option on INFILE, then the quoted comma-separated values will no longer be treated as a single value. Which means that the first value ("A in data line 1) will be read in alone. You can then do a "DO WHILE" loop along the lines you contemplate:
dm 'clear log'; data t ; infile datalines missover dlm="," ; input (c1-c3) (:$10.) C4 :$8. @; do while (C4^=' '); c4=compress(c4,'"'); output; input c4 :$14. @; end; datalines; Tab1, DB1, S1, "A,B,C,D,E" Tab2, DB1, S1, "A,B,C" Tab3, DB2, S1, "D,E" Tab4, DB2, S3, "A,B,C,D,E,G" run;
This works fine as long as you don't have consecutive commas in your raw data as a way to indicate a missing value.
10-26-2017 03:27 AM
data t; infile "...." dlm="," dsd; input c1 $ c2 $ c3 $ _c4 :$20.; length c4 $1; do i = 1 to countw(_c4,','); c4 = scan(_c4,i,','); output; end; drop i _c4; run;
So you see you can do it in the input data step already.