I have a dataset with column names - co_investigator_s___: and co_i_residents___:. Each column name ends in a numeric value, the number of columns are unknown, and the columns are not in numerical order i.e. co_i_residents___205 comes before co_i_residents___176. For each column, when there is a value of 1, I want to create a new variable with an assigned value that is equal to the number at the end of the old variable. I then want to transpose the columns, remove rows with missing data, and create an output with new variables where the number of columns is equal to the highest number of columns that contain data. Below is an example of how the data look upon import and how I want the data to look after processing.
In my macros, I used the number 200 as a safeguard since I was uncertain of how to use a wildcard. I was also unsure how to implement a macro using proc transpose. I realize my code is terribly inefficient! I do not have enough knowledge with arrays, do statements, and proc transpose to make the syntax more efficient.
%macro Do_PI;
%Do I = 1 %To 200;
if co_investigator_s___&I = 1 then co_pi_&I = &I;
%END;
%mend Do_PI;
%macro Do_RES;
%Do I = 1 %To 200;
if co_i_residents___&I = 1 then co_res_&I = &I;
%END;
%mend Do_Loop;
data dataset1;
set sasdata.dataset;
%Do_PI;
%Do_RES;
run;
****************************************************************************;
*Process co_investigator_s___: vars;
proc sort data = dataset1;
by record_id;
proc transpose data = dataset1 out = pi2 (rename = (col1=PI_Code));
by record_id;
var co_pi_:;
run;
proc sort data = pi2;
by record_id PI_Code;
data pi3;
set pi2;
if PI_Code = . then delete;
run;
proc sort data = pi3;
by record_id PI_Code;
data pi4;
set pi3;
by record_id PI_Code;
retain PI_Count;
if first.record_id then PI_Count = 0;
PI_Count = PI_Count + 1;
run;
proc transpose data = pi4 out = pi5 (drop = _NAME_) prefix = co_pi_name;
by record_id;
var PI_Code;
ID PI_Count;
* format co_i_name: co_pi_.;
run;
****************************************************************************;
*Process co_i_residents___: vars;
proc transpose data = dataset1 out = res2 (rename = (col1=res_Code));
by record_id;
var co_res_:;
run;
proc sort data = res2;
by record_id res_Code;
data res3;
set res2;
if res_Code = . then delete;
run;
proc sort data = res3;
by record_id res_Code;
data res4;
set res3;
by record_id res_Code;
retain res_Count;
if first.record_id then res_Count = 0;
res_Count = res_Count + 1;
run;
proc transpose data = res4 out = res5 (drop = _NAME_) prefix = co_res_name;
by record_id;
var res_Code;
ID res_Count;
* format co_res_name: co_res_.;
run;
****************************************************************************;
*Merge datasets;
proc sort data = pi5;
by record_id;
proc sort data = res5;
by record_id;
data merged;
merge pi5 res5;
by record_id;
run;
I'm pretty sure you don't need macros, so I haven't tried to figure out your code. All of this can be done within an array (or several arrays) in a data step. The fact that the columns are not in numerical order doesn't affect anything.
It's not clear to me you need a transpose, in fact it looks as if you don't need one, because both the input data set and the output data set have sequential record IDs in each row. I don't understand how the input variable co_i_residents___205 doesn't result in 205 showing in one of the output variables.
If you would provide a portion of your data following these instructions, I'm sure I can show you how to do this.
I couldn't get past step one of those instructions, unfortunately.
The index into the array is not the value in want to store. Instead you want the value at the end of the NAME of the variable. You can use the VNAME() function to get the name.
Since you didn't provide data I made a trivial example. I put the prefix of the variable names into a macro name because the values were too long to type once, let alone multiple times.
%let prefix=co_i_residents___ ;
data have;
id+1;
input &prefix.205 &prefix.176 &prefix.89 ;
cards;
1 . 1
. . 1
. 1 .
;
You can use one pass thru the data to find the maximum number of values for any line (size of output array). Your example made it look like the values were only ever missing or 1, so I just counted the non-missing values.
data _null_;
do until(eof);
set have end=eof;
nvars=max(nvars,n(of &prefix:));
end;
call symputx('nvars',nvars);
run;
Now to process the data you need two arrays. One for the variables to check and one for the variables to create. You will need two pointers into the arrays. One that scans over all of the input variables and one that increments by one every time you find a value to store. You can keep that second one as a counter of how many values you found on this observation.
data want;
set have ;
array inlist &prefix: ;
array newvar [&nvars];
numvars=0;
do index=1 to dim(inlist);
if inlist[index] then do;
numvars + 1;
newvar[numvars] = input(
substr(vname(inlist[index])
,%length(&prefix)+1
)
,32.)
;
end;
end;
drop index &prefix:;
run;
Results:
Obs id newvar1 newvar2 numvars 1 1 205 89 2 2 2 89 . 1 3 3 176 . 1
I don't even think you need macro variables here.
You could declare your array as
array inlist co_investigator_s:;
Thank you very much for your input. I wanted to spend some time digesting your syntax. It definitely appears to get the job done! Since I need to run this code on several different sets of columns, I incorporated a macro.
data _null_;
do until(eof);
set sasdata.dataset end=eof;
num_coi_vars=max(num_coi_vars,n(of co_investigator_s___:));
num_res_vars=max(num_res_vars,n(of co_i_residents___:));
end;
call symputx('coi_col_total',num_coi_vars);
call symputx('res_col_total',num_res_vars);
run;
%macro array(prefix, array_name_old_vars, array_name_new_vars, column_count_macro);
%let prefix = &prefix;
drop &prefix: data_count index;
array &array_name_old_vars &prefix:;
array &array_name_new_vars [&column_count_macro];
data_count = 0;
do index=1 to dim(&array_name_old_vars);
if &array_name_old_vars[index] then do;
data_count + 1;
&array_name_new_vars[data_count] = input(substr(vname(&array_name_old_vars[index]),%length(&prefix)+1),32.);
end;
end;
%mend;
data want;
set sasdata.dataset;
%array(co_investigator_s___, OLD_coi, co_i, &coi_col_total);
%array(co_i_residents___, OLD_res, res, &res_col_total);
run;
One additional step that would be helpful is to figure out a way to allow the array that creates the new variables, to only create the number of new variables that matches the maximum count variable. Right now, the array creates new variables to match the number calculated in the symput macro. I then need to run a frequency to see which variables have no data so I can remove those columns from the dataset. Any suggestions? @Tom
proc freq data = want;
table co_i: res:;
run;
data want1;
set want;
drop co_pi8-co_pi&coi_col_total res5-res&res_col_total;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.