Hi. I have a dataset in a wide format where IDs have multiple observations. This was based on where an individual experienced an outcome, underwent demographic changes, group assignment, etc. where a new row was created. I have stripped the superfluous variables and created an abbreviated sample data below.
This abbreviated wide data set has study_id, an exposure variable, a group variable, and an indicator of presence for all months in the sample (1=present and contributing person-time, 0=absent), as below:
data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0
S004 unexposed d 0 0 0 1 1 1
;
RUN;
To prepare for a subsequent analysis, I would like to sum person time by month, exposed status, and group, as follows. I would also like to add a formatted monyy. variable so I can see where in the year each month occurs, as this dataset spans several years.
data want;
input month monyy. month_num exp $ group $ ptime;
format month monyy.;
datalines;
JAN10 1 exp a 1
JAN10 1 exp b 1
JAN10 1 unexp c 2
JAN10 1 unexp d 0
FEB10 2 exp a 1
FEB10 2 exp b 1
FEB10 2 unexp c 2
FEB10 2 unexp d 0
MAR10 3 exp a 1
MAR10 3 exp b 1
MAR10 3 unexp c 2
MAR10 3 unexp d 0
APR10 4 exp a 1
APR10 4 exp b 0
APR10 4 unexp c 2
APR10 4 unexp d 1
MAY10 5 exp a 1
MAY10 5 exp b 0
MAY10 5 unexp c 2
MAY10 5 unexp d 1
JUN10 6 exp a 0
JUN10 6 exp b 0
JUN10 6 unexp c 2
JUN10 6 unexp d 1
;
run;
I understand this may not be the most optimally organized dataset as it stands, but this is what I am working with at present. I may ask an additional question to see if some of this wide dataset formatting could have been bypassed using other start/stop date variables in the dataset, but I will do that in another thread when I have some time.
In its current form, does anyone have any ideas on the most optimal way to get from the multiple wide month variables in "have" to summed person time by month/exp/group as depicted in "want"?
Assuming I understood what you mean.
data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0
S004 unexposed d 0 0 0 1 1 1
;
RUN;
proc sql;
create table want as
select 'Jan10' as month,1 as month_num,exp,group,sum(mon1) as ptime from have group by exp,group
union all
select 'Feb10' as month,2 as month_num,exp,group,sum(mon2) as ptime from have group by exp,group
union all
select 'Mar10' as month,3 as month_num,exp,group,sum(mon3) as ptime from have group by exp,group
union all
select 'Apr10' as month,4 as month_num,exp,group,sum(mon4) as ptime from have group by exp,group
union all
select 'May10' as month,5 as month_num,exp,group,sum(mon5) as ptime from have group by exp,group
union all
select 'Jun10' as month,6 as month_num,exp,group,sum(mon6) as ptime from have group by exp,group
;
quit;
Assuming I understood what you mean.
data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0
S004 unexposed d 0 0 0 1 1 1
;
RUN;
proc sql;
create table want as
select 'Jan10' as month,1 as month_num,exp,group,sum(mon1) as ptime from have group by exp,group
union all
select 'Feb10' as month,2 as month_num,exp,group,sum(mon2) as ptime from have group by exp,group
union all
select 'Mar10' as month,3 as month_num,exp,group,sum(mon3) as ptime from have group by exp,group
union all
select 'Apr10' as month,4 as month_num,exp,group,sum(mon4) as ptime from have group by exp,group
union all
select 'May10' as month,5 as month_num,exp,group,sum(mon5) as ptime from have group by exp,group
union all
select 'Jun10' as month,6 as month_num,exp,group,sum(mon6) as ptime from have group by exp,group
;
quit;
@Ksharp This looks exactly like what I needed. Thank you!
One additional question with proc sql: is there a way to array this pattern through many months, or do I just need to copy the full select statement for every given month? I ask because in my study sample there are approximately 60 months.
Thank you.
There is no OOTB syntax in SQL for this.
But you can use other logic to make it more dynamic.
My go-to for this scenario would a PROC TRANSPOSE, fixing a proper sas date value and the sum using SQL.
data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
rowid = _n_;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0
S004 unexposed d 0 0 0 1 1 1
;
RUN;
proc transpose data=have
out=want1(rename=(col1=present))
name=month;
by study_id exp group rowid;
var mon1-mon6;
run;
%let startmon=01jan2010;
data want2;
set want1;
month_num = intnx('MONTH',"&startMon"d,input(substr(month,4),2.));
format month_num monyy.;
run;
proc sql;
create table want3 as
select study_id, exp, group, month_num, sum(present) as sumpres
from want2
group by 1, 2, 3, 4
;
quit;
@LinusH Your code worked pretty well with a few modifications.The proc sql code was initially summing 1 for each row, and each row was subject id versus grouping by month/exp/group. If I modified it like the below, it produced exactly what I wanted.
proc sql;
create table want3 as
select month_num, exp, group,sum(present) as sumpres
from want2
group by month_num, disposition, typenum
;
quit;
The one other thing I can't figure out is the month_num is all offset 1 month, so the first month is February and the last month ends one month after the desired month. I set the %let month back one month and that didn't work. Any ideas?
My bad, intnx offsets using the month no, but month1 is the same as your start month (Jan2010).
Sp solution is to subtract 1 from month:
data want2;
set want1;
month_n = input(substr(month,4),2.);
month_num = intnx('MONTH',"&startMon"d,month_n-1);
format month_num monyy.;
run;
proc sql;
create table want3 as
select month_num, month_n, exp, group, sum(present) as sumpres
from want2
group by 1, 2, 3, 4
order by 1, 2, 3, 4
;
quit;
Then I didn't get your naming of group variables, they didn't match the select clause...?
OK . Sure. If you have a variety of month variable。
Try this one:
data have; input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6; datalines; S001 exposed a 1 1 0 0 0 0 S001 exposed a 0 0 1 1 0 0 S001 exposed a 0 0 0 0 1 0 S002 exposed b 1 0 0 0 0 0 S002 exposed b 0 1 1 0 0 0 S002 unexposed c 0 0 0 1 1 1 S003 unexposed c 1 1 1 1 1 1 S004 unexposed c 1 1 1 0 0 0 S004 unexposed d 0 0 0 1 1 1 ; proc summary data=have nway; class exp group; var mon1-mon6; output out=temp sum=; run; proc transpose data=temp out=temp2; by exp group; var mon1-mon6; run; data temp3; set temp2(rename=(col1=ptime)); month_num=input(compress(_name_,,'kd'),best.); mon=scan(put(mdy(month_num,10,2024),worddate12.),1,','); drop _name_; run; proc sort data=temp3 out=want; by month_num exp group; run;
@Ksharp I tried this code and it appears this line:
mon=scan(put(mdy(month_num,10,2024),worddate12.),1,',');
results in the error message:
Invalid argument to function MDY(13,10,2024) at line 486 column 18.
Invalid argument to function MDY(14,10,2024) at line 486 column 18.
Invalid argument to function MDY(15,10,2024) at line 486 column 18.
and so forth.
I'm assuming this is because my months are numbered beyond 12. Any idea how to rectify?
Yes. I understood. If you have variables like mon1-mon24, then you need to consider the YEAR.
data have; input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6; datalines; S001 exposed a 1 1 0 0 0 0 S001 exposed a 0 0 1 1 0 0 S001 exposed a 0 0 0 0 1 0 S002 exposed b 1 0 0 0 0 0 S002 exposed b 0 1 1 0 0 0 S002 unexposed c 0 0 0 1 1 1 S003 unexposed c 1 1 1 1 1 1 S004 unexposed c 1 1 1 0 0 0 S004 unexposed d 0 0 0 1 1 1 ; proc summary data=have nway; class exp group; var mon1-mon6; output out=temp sum=; run; proc transpose data=temp out=temp2; by exp group; var mon1-mon6; run; data temp3; set temp2(rename=(col1=ptime)); month_num=input(compress(_name_,,'kd'),best.); mon=put( intnx('month','10dec2009'd,month_num,'s') ,monyy.); drop _name_; run; proc sort data=temp3 out=want; by month_num exp group; run;
@Ksharp That worked, thank you!
Another possibility - just stuffing everything into a multidimensional array and then spitting it out at the end - the SQL and FORMAT steps are just to deal with counting the number of groups and translating them to numbers. I realized afterward that this outputs some irrelevant rows, as it seems in your data, EXP and GROUP are mutually exclusive (e.g., there are no 'exposed' group 'c' people). If group 'a' is equivalent to group 'c' and group 'b' is equivalent to group 'd', then the data would be simpler if you just used groups 'a' and 'b' and used the value of EXP to stratify them. Not sure if that makes sense.
data have;
infile cards dsd truncover firstobs=1 dlm=' ';
length study_id $4 exp $10 group $1 mon1-mon6 3;
input study_id exp group mon1 mon2 mon3 mon4 mon5 mon6;
cards;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0
S004 unexposed d 0 0 0 1 1 1
;
run;
proc sql noprint;
create table grpnum as
select 'fgrpnum' as fmtname length=8, 'C' as type length=1,
group as start, monotonic() as label
from (select distinct group from have);
select count(*) into :ngrps from grpnum;
quit;
proc format cntlin=grpnum; run;
data want;
set have end=last;
array m {*} mon:;
array T {6,0:1,&ngrps} _temporary_;
do monnum=1 to dim(T,1);
T[monnum,(exp='exposed'),put(group,$fgrpnum.)*1]+m[monnum];
end;
if last then do;
do monnum=1 to dim(T,1);
do exposed=0 to 1;
do grpnum=1 to dim(T,3);
totmon=max(0,T[monnum,exposed,grpnum]);
output;
end;
end;
end;
end;
keep monnum exposed grpnum totmon;
run;
proc print data=want; run;
Convert from WIDE to TALL. So assuming MON1 means JAN2010 you could use a data step like this to transpose it while also generating your MONTH variable.
data have;
input study_id $ exp :$9. group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0
S004 unexposed d 0 0 0 1 1 1
;
data tall;
set have;
array mon[6];
start='01JAN2010'd ;
do month_num=1 to dim(mon);
month = intnx('month',start,month_num-1);
present = mon[month_num];
output;
end;
drop mon1-mon6 ;
format start month monyy7.;
run;
Now summaries are easy.
proc summary nway data=tall ;
class month month_num exp group ;
var present ;
output out=want sum=ptime;
run;
Results:
...
@Tom This worked perfectly, and was very succint. Thank you!!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.