I have a following datatset with two character variable (subject, arm)
subject arm
------- ----
10001 PA
10001 CA
10001 CA
10002 NA
10002 PA
10002 NA
I need a following dataset where arm variable will be the combination of every unique value
for each subject.
The result dataset should be as below.:
subject arm
------- -----
10001 PA + CA
10002 NA + PA
I know it possible to do it using proc sort using nodupkey and then merging and using catx
function. But that will be a very long ugly code. Will you be kind enough
to help me with a more efficient code?
Thank you so much.
Something like this (untested):
data want;
set have;
by subject;
retain arm_new;
if first.subject then arm_new = '';
arm_new = catx(' + ',arm);
if last.subject then output;
run;
I won't say it would be very long and ugly if using proc sort first. but you do have other options.
data have;
input subject$ arm$50.;
cards;
10001 PA
10001 CA
10001 CA
10002 NA
10002 PA
10002 NA
;
proc sort data=have out=have1 nodupkey;
by subject arm;
run;
data want;
length arm $ 50;
do until (last.subject);
set have1 (rename= arm=_arm);
by subject;
arm=catx('+', arm, _arm);
end;
drop _arm;
run;
/*array : assume your data is at least clustered by subject*/
data want_array;
length arm $ 50;
array t(100)/*100 is arbitary, using Hash to obtain full dynamic*/ $ 2;
do _n_=1 by 1 until (last.subject);
set have (rename=arm=_arm);
by subject notsorted;
if _arm not in t then t(_n_)=_arm;
end;
arm=catx('+', of t(*));
keep subject arm;
run;
Haikuo
Hi,
May i request you to please explain the execution of a portion of your code :
/*array : assume your data is at least clustered by subject*/
data want_array;
length arm $ 50;
array t(100)/*100 is arbitary, using Hash to obtain full dynamic*/ $ 2;
do _n_=1 by 1 until (last.subject);
set have (rename=arm=_arm);
by subject notsorted;
if _arm not in t then t(_n_)=_arm;/*this part please? thanks*/
end;
arm=catx('+', of t(*));
keep subject arm;
run;
I can't find the official online doc myself at the moment,
if _arm not in t /*this t is referring array t*/
then t(_n_)=_arm;
it basically only allow distinct _arm entering array t.
Let me know if I have myself clear enough.
Regards,
Haikuo
The IN operator involving an array is documented here:
SAS(R) 9.4 Language Reference: Concepts, Fourth Edition
PG
proc sql;
create table want as
select distinct subject,arm from have
order by subject;
quit;
data need;
set want;
by subject;
retain arm;
arm=catx(' + ',lag(arm),arm);
if last.subject;
run;
Is this good enough?
data have;
input subject $ arm $;
datalines;
10001 PA
10001 CA
10001 CA
10002 NA
10002 PA
10002 NA
;
proc sort data=have; by subject arm; run;
data want(rename=dum=arm);
length dum $200;
do until(last.subject);
set have; by subject arm;
if last.arm then dum = catx(" + ", dum, arm);
end;
run;
proc print data=want noobs; run;
PG
PG Stats,
I know my SQL is rudimentary at best, but wouldn't something like this work, even on unsorted data?
proc sql noprint;
select subject, distinct arm separated by ' + ' as all_arm
from have
group by subject;
quit;
Astounding.. looks good.. but unfortunately given an error.
2079 proc sql noprint;
2080 select caseno, distinct drnm01 separated by ' + ' as all_arm
------
22
202
ERROR 22-322: Syntax error, expecting one of the following: a quoted string, !, !!, &, (, *, **,
+, ',', -, '.', /, <, <=, <>, =, >, >=, ?, AND, AS, BETWEEN, CONTAINS, EQ, EQT,
FORMAT, FROM, GE, GET, GT, GTT, IN, INFORMAT, INTO, IS, LABEL, LE, LEN, LENGTH,
LET, LIKE, LT, LTT, NE, NET, NOT, NOTIN, OR, TRANSCODE, ^, ^=, |, ||, ~, ~=.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
2081 from &RAWLIB..A02DR
2082 group by caseno;
2083 quit;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
separated by used when building a macro variable
This type of concatenation is only available when selecting INTO a macro variable. As in
Proc sql;
select distinct arm into :myArmList separated by " + "
from have;
quit;
PG
Actually, IMO your stated solution is the most elegant one. All others tend to be more complicated or obfuscated.
a proposal...
The code won't work on unsorted data...
With a a PROC SORT NODUPKEYS, the code will be more simple...
best regards
SR
I tried using hash objects and could not get it to work. But I believe the following code works using perl regular expressions:
data temp_1;
input subject $ arm $;
cards;
10000 NJ
10001 PA
10001 CA
10001 CA
10002 NA
10002 PA
10002 NA
10003 NJ
10004 NJ
;
data temp_2;
set temp_1 end=last;
length old_subject $ 5
arm_list $ 100;
retain old_subject
arm_list;
keep old_subject
arm_list;
rename old_subject = subject
arm_list = arm;
if _N_ = 1 then do;
old_subject = subject;
arm_list = arm;
end;
if old_subject = subject then do;
if ^prxmatch('/' || strip(arm) || '/', arm_list) then
arm_list = strip(arm_list) || ' + ' || strip(arm);
end;
else do;
arm_list = prxchange('s/^[ +]+(.*)/$1/', 1, arm_list);
output;
arm_list = arm;
old_subject = subject;
end;
if last then
output;
run;
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.