Solved: Re: Combine the variable values vertically

NewUsrStat

Hi guys,

suppose to have the following dataset:

data DB;
  input ID :$200. Reason1 :$200.Reason2 :$200.;
cards;
0001 Considered     Excluded
0001 Violated       Not_excluded
0001 Excluded       Not_Verified
0001 Excluded       Not_Verified   
run;

Is there a way to get the following?

data DB1;
  input ID :$200. Reason1 :$200.Reason2 :$200.;
cards;
0001 Considered,Violated,Excluded     Excluded,Not_excluded,Not_Verified
0001 Considered,Violated,Excluded     Excluded,Not_excluded,Not_Verified
0001 Considered,Violated,Excluded     Excluded,Not_excluded,Not_Verified   
run;

In other words I would like to combine into one row and by each variable "Reason*" all the values the variable(s) takes.

Note that, after the combination, I would like to repeat the combination the number of times the variable Reason* takes a value ID by ID.

In the real dataset I have Reason1 , Reason12, Reason3, ..., Reason10.

Can anyone help me please?

Tom

You will need to make NEW variable(s).

data want ;
  set have end=eof;
  length newreason $200 ;
  retain newreason;
  newreason = catx(',',newreason,reason);
  if eof; 
  drop reason;
run;

If the data is grouped by something, perhaps your ID variable, then this can be simplified by by placing the SET inside a DO loop and using a BY statement. This will eliminate the need for the RETAIN statement.

data want;
  do until (last.id);
     set have;
     by id;
     length newreason $200;
     newreason=catx(',',newreason,reason);
  end;
  drop reason;
run;

To output multiple copies you could change the DO statement to count the number of observations and then add a second DO loop around an output statement.

data want;
  do _n_=1 by 1 until (last.id);
     set have;
     by id;
     length newreason $200;
     newreason=catx(',',newreason,reason);
  end;
  do _n_=1 to _n_;
    output;
  end;
  drop reason;
run;

But if there are other variables besides ID and REASON you want to keep then instead just add another DO loop with and additional SET statement.

data want;
  do until (last.id);
     set have;
     by id;
     length newreason $200;
     newreason=catx(',',newreason,reason);
  end;
  do until (last.id);
     set have;
     by id;
     output;
  end;
  drop reason;
run;

And to handle multiple variables just use arrays.

data want;
  do until (last.id);
     set have;
     by id;
     array reason[10];
     array newreason[10] $200;
     do over reason;
       newreason=catx(',',newreason,reason);
     end;
  end;
  do until (last.id);
     set have;
     by id;
     output;
  end;
  drop reason1-reason10;
run;

If you don't like using the deprecated DO OVER syntax then just add explicit indexes instead.

     do _n_=1 to dim(reason);
       newreason[_n_]=catx(',',newreason[_n_],reason[_n_]);
     end;

View solution in original post

PaigeMiller

In my opinion, this is not a good idea, and it makes future programming difficult. See Maxim 19. Also, why do you need an output of three identical rows?

Maybe a different re-arrangement of the data would work better. Please tell us what analyses/plots/tables you would create once you have the data re-arranged as you want?

--
Paige Miller

NewUsrStat

Thank you very much for your help. The reason why I have to replicate rows is because the dataset contains other variables with different values and for the subsequent analysis I have to apply different types of filters.

PaigeMiller

@NewUsrStat wrote:
Thank you very much for your help. The reason why I have to replicate rows is because the dataset contains other variables with different values and for the subsequent analysis I have to apply different types of filters.

Unfortunately, this does not answer my question about what analyses/plots/tables are you trying to create.

--
Paige Miller

Tom

You will need to make NEW variable(s).

data want ;
  set have end=eof;
  length newreason $200 ;
  retain newreason;
  newreason = catx(',',newreason,reason);
  if eof; 
  drop reason;
run;

If the data is grouped by something, perhaps your ID variable, then this can be simplified by by placing the SET inside a DO loop and using a BY statement. This will eliminate the need for the RETAIN statement.

data want;
  do until (last.id);
     set have;
     by id;
     length newreason $200;
     newreason=catx(',',newreason,reason);
  end;
  drop reason;
run;

To output multiple copies you could change the DO statement to count the number of observations and then add a second DO loop around an output statement.

data want;
  do _n_=1 by 1 until (last.id);
     set have;
     by id;
     length newreason $200;
     newreason=catx(',',newreason,reason);
  end;
  do _n_=1 to _n_;
    output;
  end;
  drop reason;
run;

But if there are other variables besides ID and REASON you want to keep then instead just add another DO loop with and additional SET statement.

data want;
  do until (last.id);
     set have;
     by id;
     length newreason $200;
     newreason=catx(',',newreason,reason);
  end;
  do until (last.id);
     set have;
     by id;
     output;
  end;
  drop reason;
run;

And to handle multiple variables just use arrays.

data want;
  do until (last.id);
     set have;
     by id;
     array reason[10];
     array newreason[10] $200;
     do over reason;
       newreason=catx(',',newreason,reason);
     end;
  end;
  do until (last.id);
     set have;
     by id;
     output;
  end;
  drop reason1-reason10;
run;

If you don't like using the deprecated DO OVER syntax then just add explicit indexes instead.

     do _n_=1 to dim(reason);
       newreason[_n_]=catx(',',newreason[_n_],reason[_n_]);
     end;

NewUsrStat

Thank you very much for your help. Unfortunately the code that refers to the array does not work. But, anyway, how it become if the value of the variable is repeated identical?

Tom

If the values repeat then they will appear multiple times.

If the goal is to get the distinct values then you would need to make the logic more complex so that the current reason is only added when it does not already appear.

do _n_=1 to dim(reason);
  if not findw(newreason[_n_],reason[_n_],',','t') then 
   newreason[_n_]=catx(',',newreason[_n_],reason[_n_])
  ;
end;

If you don't care about the case then add the i modifier into addition to the t modifier in the FINDW() function call.

But with these extra complication it might be better to use multiple passes so the logic can be simpler and easier to follow. So first build the list of unique values. Then build the concatenations. And finally remerge it with the original data.

* Make TALL dataset with all of the reasons from all of the reason variables ;
proc transpose data=have out=step1;
   by id;
   var reason1-reason10;
run;
* Eliminate duplicates ;
proc sort data=step1 out=step2 nodupkey ;
   by id _name_ col1 ;
run;
* Concatenate ;
data step3;
   do until (last._name_);
      set step2;
      by id _name_;
      length list $200 ;
      list = catx(',',list,col1);
   end;
   drop col1;
run;
* Turn back into separate variables;
proc transpose data=step3 out=step4 ;
   by id;
   id _name_;
   var list ;
run;
* Combine with original data ;
data want;
   merge have(drop=reason1-reason10) step4;
   by id;
run;

dxiao2017

Hi @Tom , I think your solution is very complicated, I think a macro contains sql and data steps solve this question better (in that, this approach address the major issue of the question more directly, i.e., the desired outcome is unique combinations of all variations of reason1 and reason2 for an unique ID, so use sql select into; and data step to set a series of newly created dataset can be proper techniques, and these steps need to be repeated so have to use macro with do loop), what do you think?

My other comment for your solution is that, I do not think array is a proper technique to solve this question, because obviously all of the columns are characters, there is no any chance for you to do a calculation of numbers so no need to use array, also the major issue is to combine several different character values of a columns to one character value. Techniques for process character values and character strings are more suitable.

ballardw

@NewUsrStat wrote:
Thank you very much for your help. Unfortunately the code that refers to the array does not work. But, anyway, how it become if the value of the variable is repeated identical?

What does "array does not work"?

If code throws an error then provide the log with the code and all the messages. Copy the text from the log then on the forum open a text box in a message using the </> icon above the message window and paste the code.

If the result is not as expected then provide the data used in the form of a working data step and then provide more detail as to the expected result for the example data.

I am afraid that a generic "does not work" is very unhelpful in diagnosing problems.

dxiao2017

Hi @NewUsrStat , you can also use macro and sql steps to accomplish this goal, the code and output are as follows. This macro also works when you have more IDs (here I add only two more ids and you can have more) and more variations of reason1 and reason2 columns, this macro solves your question. This macro only has two simple sql steps and two simple data steps, easy to understand and easy to use. The final output contains unique id and the unique combinations of reason1 and reason2, if you want to display the reason1 and reason2 for all rows of an unique ID, just use if/then statement, such as if ID=0001 then reason1="copy the reason1 column here"; in a date step.

data DB;
  input ID :$200. Reason1 :$200.Reason2 :$200.;
cards;
0001 Considered     Excluded
0001 Violated       Not_excluded
0001 Excluded       Not_Verified
0001 Excluded       Not_Verified
0002 considered     excluded
0003 considered     excluded
0003 violated       not_excluded
;   
run;
%macro datawant;
proc sql noprint;
select distinct id
   into :id1-
   from db;
quit;
%put &sqlobs;
%do i=1 %to &sqlobs;
proc sql noprint;
select distinct reason1
   into :reason1&i separated by ','
   from db
   where id="&&id&i";
select distinct reason2
   into :reason2&i separated by ','
   from db
   where id="&&id&i";
quit;
%put &&reason1&i &&reason2&i;
data final&i;
   id="&&id&i";
   reason1="&&reason1&i";
   reason2="&&reason2&i";
run;
%end;
data finalfinal;
   set %do i=1 %to 3;
       final&i
       %end;
       ;
run;
proc print data=finalfinal;run;
%mend datawant;
%datawant;

Tom

Note that is a MUCH slower method since you are making (number of ID values) X (number of variables) passes through the original dataset.

The single data step method only makes ONE pass though the data.

The transpose process I posted later does make multiple passes, but the number of passes does not depend on either the number of ID values or the number of variables.

Not sure how something using the macro language can be consider simpler than something using normal SAS code. Same could be said for using SQL code instead of SAS code. Although the special SEPARATED BY syntax available for creating macro variables does help in this particular case.

PS There is no reason why you should only use NUMERIC variables in an ARRAY. This case is perfect example of where using an ARRAY of character variables makes sense. Remember the main purpose for the creation of the ARRAY statement was to allow the same statements to operate on different variables based on the index into the arrary.

dxiao2017

I think you are right Tom. Right now I am reviewing SAS adv. cert exam techniques so the first things comes to my mind is macro and sql. Nevertheless I think base techniques are more substantial for me (both for my own learning during my unemployment at the moment and in near future when I am lucky enough to get a job in the area), whenever I can code use base techniques (instead of adv. techniques) I should use base techniques (and always improve my skills and comprehend the syntax).

Catch up on SAS Innovate 2026