Hi guys,
suppose to have the following dataset:
data DB;
input ID :$200. Reason1 :$200.Reason2 :$200.;
cards;
0001 Considered Excluded
0001 Violated Not_excluded
0001 Excluded Not_Verified
0001 Excluded Not_Verified
run;
Is there a way to get the following?
data DB1;
input ID :$200. Reason1 :$200.Reason2 :$200.;
cards;
0001 Considered,Violated,Excluded Excluded,Not_excluded,Not_Verified
0001 Considered,Violated,Excluded Excluded,Not_excluded,Not_Verified
0001 Considered,Violated,Excluded Excluded,Not_excluded,Not_Verified
run;
In other words I would like to combine into one row and by each variable "Reason*" all the values the variable(s) takes.
Note that, after the combination, I would like to repeat the combination the number of times the variable Reason* takes a value ID by ID.
In the real dataset I have Reason1 , Reason12, Reason3, ..., Reason10.
Can anyone help me please?
You will need to make NEW variable(s).
data want ;
set have end=eof;
length newreason $200 ;
retain newreason;
newreason = catx(',',newreason,reason);
if eof;
drop reason;
run;
If the data is grouped by something, perhaps your ID variable, then this can be simplified by by placing the SET inside a DO loop and using a BY statement. This will eliminate the need for the RETAIN statement.
data want;
do until (last.id);
set have;
by id;
length newreason $200;
newreason=catx(',',newreason,reason);
end;
drop reason;
run;
To output multiple copies you could change the DO statement to count the number of observations and then add a second DO loop around an output statement.
data want;
do _n_=1 by 1 until (last.id);
set have;
by id;
length newreason $200;
newreason=catx(',',newreason,reason);
end;
do _n_=1 to _n_;
output;
end;
drop reason;
run;
But if there are other variables besides ID and REASON you want to keep then instead just add another DO loop with and additional SET statement.
data want;
do until (last.id);
set have;
by id;
length newreason $200;
newreason=catx(',',newreason,reason);
end;
do until (last.id);
set have;
by id;
output;
end;
drop reason;
run;
And to handle multiple variables just use arrays.
data want;
do until (last.id);
set have;
by id;
array reason[10];
array newreason[10] $200;
do over reason;
newreason=catx(',',newreason,reason);
end;
end;
do until (last.id);
set have;
by id;
output;
end;
drop reason1-reason10;
run;
If you don't like using the deprecated DO OVER syntax then just add explicit indexes instead.
do _n_=1 to dim(reason);
newreason[_n_]=catx(',',newreason[_n_],reason[_n_]);
end;
In my opinion, this is not a good idea, and it makes future programming difficult. See Maxim 19. Also, why do you need an output of three identical rows?
Maybe a different re-arrangement of the data would work better. Please tell us what analyses/plots/tables you would create once you have the data re-arranged as you want?
@NewUsrStat wrote:
Thank you very much for your help. The reason why I have to replicate rows is because the dataset contains other variables with different values and for the subsequent analysis I have to apply different types of filters.
Unfortunately, this does not answer my question about what analyses/plots/tables are you trying to create.
You will need to make NEW variable(s).
data want ;
set have end=eof;
length newreason $200 ;
retain newreason;
newreason = catx(',',newreason,reason);
if eof;
drop reason;
run;
If the data is grouped by something, perhaps your ID variable, then this can be simplified by by placing the SET inside a DO loop and using a BY statement. This will eliminate the need for the RETAIN statement.
data want;
do until (last.id);
set have;
by id;
length newreason $200;
newreason=catx(',',newreason,reason);
end;
drop reason;
run;
To output multiple copies you could change the DO statement to count the number of observations and then add a second DO loop around an output statement.
data want;
do _n_=1 by 1 until (last.id);
set have;
by id;
length newreason $200;
newreason=catx(',',newreason,reason);
end;
do _n_=1 to _n_;
output;
end;
drop reason;
run;
But if there are other variables besides ID and REASON you want to keep then instead just add another DO loop with and additional SET statement.
data want;
do until (last.id);
set have;
by id;
length newreason $200;
newreason=catx(',',newreason,reason);
end;
do until (last.id);
set have;
by id;
output;
end;
drop reason;
run;
And to handle multiple variables just use arrays.
data want;
do until (last.id);
set have;
by id;
array reason[10];
array newreason[10] $200;
do over reason;
newreason=catx(',',newreason,reason);
end;
end;
do until (last.id);
set have;
by id;
output;
end;
drop reason1-reason10;
run;
If you don't like using the deprecated DO OVER syntax then just add explicit indexes instead.
do _n_=1 to dim(reason);
newreason[_n_]=catx(',',newreason[_n_],reason[_n_]);
end;
If the values repeat then they will appear multiple times.
If the goal is to get the distinct values then you would need to make the logic more complex so that the current reason is only added when it does not already appear.
do _n_=1 to dim(reason);
if not findw(newreason[_n_],reason[_n_],',','t') then
newreason[_n_]=catx(',',newreason[_n_],reason[_n_])
;
end;
If you don't care about the case then add the i modifier into addition to the t modifier in the FINDW() function call.
But with these extra complication it might be better to use multiple passes so the logic can be simpler and easier to follow. So first build the list of unique values. Then build the concatenations. And finally remerge it with the original data.
* Make TALL dataset with all of the reasons from all of the reason variables ;
proc transpose data=have out=step1;
by id;
var reason1-reason10;
run;
* Eliminate duplicates ;
proc sort data=step1 out=step2 nodupkey ;
by id _name_ col1 ;
run;
* Concatenate ;
data step3;
do until (last._name_);
set step2;
by id _name_;
length list $200 ;
list = catx(',',list,col1);
end;
drop col1;
run;
* Turn back into separate variables;
proc transpose data=step3 out=step4 ;
by id;
id _name_;
var list ;
run;
* Combine with original data ;
data want;
merge have(drop=reason1-reason10) step4;
by id;
run;
Hi @Tom , I think your solution is very complicated, I think a macro contains sql and data steps solve this question better (in that, this approach address the major issue of the question more directly, i.e., the desired outcome is unique combinations of all variations of reason1 and reason2 for an unique ID, so use sql select into; and data step to set a series of newly created dataset can be proper techniques, and these steps need to be repeated so have to use macro with do loop), what do you think?
My other comment for your solution is that, I do not think array is a proper technique to solve this question, because obviously all of the columns are characters, there is no any chance for you to do a calculation of numbers so no need to use array, also the major issue is to combine several different character values of a columns to one character value. Techniques for process character values and character strings are more suitable.
@NewUsrStat wrote:
Thank you very much for your help. Unfortunately the code that refers to the array does not work. But, anyway, how it become if the value of the variable is repeated identical?
What does "array does not work"?
If code throws an error then provide the log with the code and all the messages. Copy the text from the log then on the forum open a text box in a message using the </> icon above the message window and paste the code.
If the result is not as expected then provide the data used in the form of a working data step and then provide more detail as to the expected result for the example data.
I am afraid that a generic "does not work" is very unhelpful in diagnosing problems.
Hi @NewUsrStat , you can also use macro and sql steps to accomplish this goal, the code and output are as follows. This macro also works when you have more IDs (here I add only two more ids and you can have more) and more variations of reason1 and reason2 columns, this macro solves your question. This macro only has two simple sql steps and two simple data steps, easy to understand and easy to use. The final output contains unique id and the unique combinations of reason1 and reason2, if you want to display the reason1 and reason2 for all rows of an unique ID, just use if/then statement, such as if ID=0001 then reason1="copy the reason1 column here"; in a date step.
data DB;
input ID :$200. Reason1 :$200.Reason2 :$200.;
cards;
0001 Considered Excluded
0001 Violated Not_excluded
0001 Excluded Not_Verified
0001 Excluded Not_Verified
0002 considered excluded
0003 considered excluded
0003 violated not_excluded
;
run;
%macro datawant;
proc sql noprint;
select distinct id
into :id1-
from db;
quit;
%put &sqlobs;
%do i=1 %to &sqlobs;
proc sql noprint;
select distinct reason1
into :reason1&i separated by ','
from db
where id="&&id&i";
select distinct reason2
into :reason2&i separated by ','
from db
where id="&&id&i";
quit;
%put &&reason1&i &&reason2&i;
data final&i;
id="&&id&i";
reason1="&&reason1&i";
reason2="&&reason2&i";
run;
%end;
data finalfinal;
set %do i=1 %to 3;
final&i
%end;
;
run;
proc print data=finalfinal;run;
%mend datawant;
%datawant;
Nearly 200 sessions are now available on demand with the SAS Innovate Digital Pass.
Explore Now →SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.