Separate a variable containing a list into multiple variables

raivester · Posted 04-26-2021 03:49 PM

I have a data set that has a variable that lists a set of states (that are separated by a space). The number of states listed varies by observation. Below is a simplified example:

state_list

obs1 MA NE AK

obs2 NE AL ME NJ

obs3 RI

obs4 NV CA UT OR

obs5 MA NV WA CO MN MI

What I would like to do is separate these out so that each state listed for an observation is under a separate variable (e.g., state1, state2, state3, etc), BUT I do not want to manually enter or count the maximum number of states listed within a given instance of the state_list variable. I don't know if the CALL SYMPUT routine can somehow be used? The actual data has many thousands of observations and we receive a new data set every year, so it really is not practical to manually go through and determine the length of the longest list in the state_list variable.

Any advice?

My resulting data set would look something like:

st1 st2 st3 st4 st5 st6

obs1 MA NE AK

obs2 NE AL ME NJ

obs3 RI

obs4 NV CA UT OR

obs5 MA NV WA CO MN MI

Reeza · Posted 04-26-2021 04:10 PM

Split each value out to it's own row and then use PROC TRANSPOSE.

Untested demo:

data long;
set have;

nWords = countw(state_list);

do i=1 to nWords;
state = scan(state_list, i);
output;
end;
run;

proc transpose data=long out=wide prefix = State;
by OBS;
var State;
ID i;
run;

Transposing data tutorials:
Long to Wide:
https://stats.idre.ucla.edu/sas/modules/how-to-reshape-data-long-to-wide-using-proc-transpose/

https://stats.idre.ucla.edu/sas/modules/reshaping-data-long-to-wide-using-the-data-step/

@raivester wrote:

I have a data set that has a variable that lists a set of states (that are separated by a space). The number of states listed varies by observation. Below is a simplified example:

state_list

obs1 MA NE AK

obs2 NE AL ME NJ

obs3 RI

obs4 NV CA UT OR

obs5 MA NV WA CO MN MI

What I would like to do is separate these out so that each state listed for an observation is under a separate variable (e.g., state1, state2, state3, etc), BUT I do not want to manually enter or count the maximum number of states listed within a given instance of the state_list variable. I don't know if the CALL SYMPUT routine can somehow be used? The actual data has many thousands of observations and we receive a new data set every year, so it really is not practical to manually go through and determine the length of the longest list in the state_list variable.

Any advice?

My resulting data set would look something like:

st1 st2 st3 st4 st5 st6

obs1 MA NE AK

obs2 NE AL ME NJ

obs3 RI

obs4 NV CA UT OR

obs5 MA NV WA CO MN MI

Tom · Posted 04-27-2021 01:57 AM

Use a DO loop with SCAN() and OUTPUT to convert to a tall format dataset.

Here is a method that works even when some observations have an empty list.

data have;
  input id $ state_list $50. ;
cards;
obs1 MA NE AK
obs2 NE AL ME NJ
obs3 RI
obs4 NV CA UT OR
obs5 MA NV WA CO MN MI
obs6
;

data want ;
  set have;
  do state_no=1 by 1 until (state_no >= countw(state_list,' '));
    state=scan(state_list,state_no,' ');
    output;
  end;
run;

Now you can use that dataset for most things. You can even use PROC REPORT to print it in the wide style.

proc report data=want missing;
  column id state,state_no;
  define id / group;
  define state_no / across 'STATE';
  define state / group ' ';
run;

Results:

Or if you really want that wide report as a dataset use PROC TRANSPOSE

proc transpose data=want out=wide(drop=_name_)  prefix=st ;
  by id;
  var state;
  id state_no;
run;

Result:

Ksharp · Posted 04-27-2021 08:46 AM

data have;
  input id $ state_list $50. ;
cards;
obs1 MA NE AK
obs2 NE AL ME NJ
obs3 RI
obs4 NV CA UT OR
obs5 MA NV WA CO MN MI
obs6
;
proc sql;
select max(countw(state_list)) into : n
 from have;
 quit;
data want;
 set have;
 array st{&n} $ 40;
 do i=1 to countw(state_list);
   st{i}=scan(state_list,i);
 end;
drop i;
run;

Or if you have a BIG table.

Separate a variable containing a list into multiple variables

Re: Separate a variable containing a list into multiple variables

Re: Separate a variable containing a list into multiple variables

Re: Separate a variable containing a list into multiple variables