BookmarkSubscribeRSS Feed
Quartz | Level 8

I have a CSV file that contains a field that is a comma separated list. An example would be


C1      C2      C3        C4
Tab1   DB1    S1        "A,B,C,D,E"
Tab2   DB1    S1        "A,B,C"
Tab3   DB2    S1        "D,E"
Tab4   DB2    S3        "A,B,C,D,E,G"

The output desired would be


C1         C2           C3          C4
Tab1      DB1        S1           A
Tab1      DB1        S1           B
Tab1      DB1        S1           C
Tab1      DB1        S1           D
Tab1      DB1        S1           E
Tab2      DB1        S1           A
Tab2      DB1        S1           B
Tab2      DB1        S1           C


I am currently able to read the data into SAS with the 4 variables, and then split the C4 variable using SCAN function. 


However, I'm curious if this could be handled in the input data step, instead of doing it in two separate steps. Given the size of the input file being in the hundreds of millions of rows, only having to read through the data once would be very helpful.


I have tried a few variations of the construct..


data t;
  infile "naming_test.csv" missover dlm="," dsd;
  length NAME $ 50;
  input NAME $ @ ;
  var = 0;
  do while (name ne ' ');
    var+1; /* keep track of a counter for testing */
    input name $ @;

However, I haven't been able to crack it. Thanks for your help!

Tourmaline | Level 20

Like this?

data _null_;
  file "%sysfunc(pathname(WORK))\t.txt";
  put 'C1    ,  C2  ,    C3    ,    C4           ';
  put 'Tab1  , DB1  ,  S1      ,  "A,B,C,D,E"    ';
  put 'Tab2  , DB1  ,  S1      ,  "A,B,C"        ';
  put 'Tab3  , DB2  ,  S1      ,  "D,E"          ';
  put 'Tab4  , DB2  ,  S3      ,  "A,B,C,D,E,G"  ';
data WANT;
  infile "%sysfunc(pathname(WORK))\t.txt" dlm=',' dsd firstobs=2;
  input C1 $ C2 $ C3 $ C4 : $1.;
C1 C2 C3 C4
Tab1 DB1 S1 A
Tab2 DB1 S1 A
Tab3 DB2 S1 D
Tab4 DB2 S3 A



I didn't  know this before, but if you drop the DSD option on  INFILE, then the quoted comma-separated values will no longer be treated as a single value.  Which means that the first value ("A in data line 1) will be read in alone.  You can  then do a "DO WHILE" loop along the lines you contemplate:


dm 'clear log';
data t ;
  infile datalines  missover dlm="," ;
  input (c1-c3) (:$10.) C4 :$8. @;

  do while (C4^=' ');
    input c4 :$14. @;
Tab1,   DB1,    S1,        "A,B,C,D,E"
Tab2,   DB1,    S1,        "A,B,C"
Tab3,   DB2,    S1,        "D,E"
Tab4,   DB2,    S3,        "A,B,C,D,E,G"

This works fine as long as you don't have consecutive commas in your raw data as a way to indicate a missing value.


The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

Super User
data t;
infile "...." dlm="," dsd;
input c1 $ c2 $ c3 $ _c4 :$20.;
length c4 $1;
do i = 1 to countw(_c4,',');
  c4 = scan(_c4,i,',');
drop i _c4;

So you see you can do it in the input data step already.


Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 4 in conversation