Solved: Summing person-time from wide month variables into long form

sasgorilla · Posted 01-22-2025 01:36 AM

Hi. I have a dataset in a wide format where IDs have multiple observations. This was based on where an individual experienced an outcome, underwent demographic changes, group assignment, etc. where a new row was created. I have stripped the superfluous variables and created an abbreviated sample data below.

This abbreviated wide data set has study_id, an exposure variable, a group variable, and an indicator of presence for all months in the sample (1=present and contributing person-time, 0=absent), as below:

data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;
RUN;

To prepare for a subsequent analysis, I would like to sum person time by month, exposed status, and group, as follows. I would also like to add a formatted monyy. variable so I can see where in the year each month occurs, as this dataset spans several years.

data want;
input month monyy. month_num exp $ group $ ptime;
format month monyy.;
datalines;
JAN10 1 exp a 1
JAN10 1 exp b 1
JAN10 1 unexp c 2
JAN10 1 unexp d 0
FEB10 2 exp a 1
FEB10 2 exp b 1
FEB10 2 unexp c 2
FEB10 2 unexp d 0
MAR10 3 exp a 1
MAR10 3 exp b 1
MAR10 3 unexp c 2
MAR10 3 unexp d 0
APR10 4 exp a 1
APR10 4 exp b 0
APR10 4 unexp c 2 
APR10 4 unexp d 1
MAY10 5 exp a 1
MAY10 5 exp b 0
MAY10 5 unexp c 2
MAY10 5 unexp d 1
JUN10 6 exp a 0
JUN10 6 exp b 0
JUN10 6 unexp c 2 
JUN10 6 unexp d 1
;
run;

I understand this may not be the most optimally organized dataset as it stands, but this is what I am working with at present. I may ask an additional question to see if some of this wide dataset formatting could have been bypassed using other start/stop date variables in the dataset, but I will do that in another thread when I have some time.

In its current form, does anyone have any ideas on the most optimal way to get from the multiple wide month variables in "have" to summed person time by month/exp/group as depicted in "want"?

Ksharp · Posted 01-22-2025 01:55 AM

Assuming I understood what you mean.

data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;
RUN;
proc sql;
create table want as
select 'Jan10' as month,1 as month_num,exp,group,sum(mon1) as ptime from have group by exp,group
union all
select 'Feb10' as month,2 as month_num,exp,group,sum(mon2) as ptime from have group by exp,group
union all
select 'Mar10' as month,3 as month_num,exp,group,sum(mon3) as ptime from have group by exp,group
union all
select 'Apr10' as month,4 as month_num,exp,group,sum(mon4) as ptime from have group by exp,group
union all
select 'May10' as month,5 as month_num,exp,group,sum(mon5) as ptime from have group by exp,group
union all
select 'Jun10' as month,6 as month_num,exp,group,sum(mon6) as ptime from have group by exp,group
;
quit;

View solution in original post

Ksharp · Posted 01-22-2025 01:55 AM

Assuming I understood what you mean.

data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;
RUN;
proc sql;
create table want as
select 'Jan10' as month,1 as month_num,exp,group,sum(mon1) as ptime from have group by exp,group
union all
select 'Feb10' as month,2 as month_num,exp,group,sum(mon2) as ptime from have group by exp,group
union all
select 'Mar10' as month,3 as month_num,exp,group,sum(mon3) as ptime from have group by exp,group
union all
select 'Apr10' as month,4 as month_num,exp,group,sum(mon4) as ptime from have group by exp,group
union all
select 'May10' as month,5 as month_num,exp,group,sum(mon5) as ptime from have group by exp,group
union all
select 'Jun10' as month,6 as month_num,exp,group,sum(mon6) as ptime from have group by exp,group
;
quit;

sasgorilla · Posted 01-22-2025 02:01 AM

@Ksharp This looks exactly like what I needed. Thank you!

One additional question with proc sql: is there a way to array this pattern through many months, or do I just need to copy the full select statement for every given month? I ask because in my study sample there are approximately 60 months.

Thank you.

LinusH · Posted 01-22-2025 02:28 AM

There is no OOTB syntax in SQL for this.

But you can use other logic to make it more dynamic.

My go-to for this scenario would a PROC TRANSPOSE, fixing a proper sas date value and the sum using SQL.

data have;
	input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
	rowid = _n_;
	datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;
RUN;

proc transpose 	data=have 
				out=want1(rename=(col1=present))
				name=month;
	by study_id exp group rowid;
	var mon1-mon6;
run;

%let startmon=01jan2010;

data want2;
	set want1;
	month_num = intnx('MONTH',"&startMon"d,input(substr(month,4),2.));
	format month_num monyy.;
run;

proc sql;
	create table want3 as	
		select study_id, exp, group, month_num, sum(present) as sumpres
		from want2
		group by 1, 2, 3, 4
	;
quit;

Data never sleeps

sasgorilla · Posted 01-24-2025 12:22 AM

@LinusH Your code worked pretty well with a few modifications.The proc sql code was initially summing 1 for each row, and each row was subject id versus grouping by month/exp/group. If I modified it like the below, it produced exactly what I wanted.

proc sql;
	create table want3 as	
		select month_num, exp, group,sum(present) as sumpres
		from want2
		group by month_num, disposition, typenum
	;
quit;

The one other thing I can't figure out is the month_num is all offset 1 month, so the first month is February and the last month ends one month after the desired month. I set the %let month back one month and that didn't work. Any ideas?

LinusH · Posted 01-24-2025 05:43 AM

My bad, intnx offsets using the month no, but month1 is the same as your start month (Jan2010).

Sp solution is to subtract 1 from month:

data want2;
	set want1;
	month_n = input(substr(month,4),2.);
	month_num = intnx('MONTH',"&startMon"d,month_n-1);
	format month_num monyy.;
run;

proc sql;
	create table want3 as	
		select month_num, month_n, exp, group, sum(present) as sumpres
		from want2
		group by 1, 2, 3, 4
		order by 1, 2, 3, 4 
	;
quit;

Then I didn't get your naming of group variables, they didn't match the select clause...?

Data never sleeps

Ksharp · Posted 01-22-2025 03:28 AM

OK . Sure. If you have a variety of month variable。

Try this one:

data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;
proc summary data=have nway;
class exp group;
var mon1-mon6;
output out=temp sum=;
run;
proc transpose data=temp out=temp2;
by exp group;
var mon1-mon6;
run;
data temp3;
 set temp2(rename=(col1=ptime));
 month_num=input(compress(_name_,,'kd'),best.);
 mon=scan(put(mdy(month_num,10,2024),worddate12.),1,',');
 drop _name_;
run;
proc sort data=temp3 out=want;
by month_num exp group;
run;

sasgorilla · Posted 01-24-2025 12:26 AM

@Ksharp I tried this code and it appears this line:

mon=scan(put(mdy(month_num,10,2024),worddate12.),1,',');

results in the error message:

Invalid argument to function MDY(13,10,2024) at line 486 column 18.

Invalid argument to function MDY(14,10,2024) at line 486 column 18.

Invalid argument to function MDY(15,10,2024) at line 486 column 18.

and so forth.

I'm assuming this is because my months are numbered beyond 12. Any idea how to rectify?

Ksharp · Posted 01-24-2025 01:14 AM

Yes. I understood. If you have variables like mon1-mon24, then you need to consider the YEAR.

data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;
proc summary data=have nway;
class exp group;
var mon1-mon6;
output out=temp sum=;
run;
proc transpose data=temp out=temp2;
by exp group;
var mon1-mon6;
run;
data temp3;
 set temp2(rename=(col1=ptime));
 month_num=input(compress(_name_,,'kd'),best.);
 mon=put( intnx('month','10dec2009'd,month_num,'s') ,monyy.);
 drop _name_;
run;
proc sort data=temp3 out=want;
by month_num exp group;
run;

sasgorilla · Posted 01-24-2025 01:23 AM

@Ksharp That worked, thank you!

quickbluefish · Posted 01-22-2025 08:48 AM

Another possibility - just stuffing everything into a multidimensional array and then spitting it out at the end - the SQL and FORMAT steps are just to deal with counting the number of groups and translating them to numbers. I realized afterward that this outputs some irrelevant rows, as it seems in your data, EXP and GROUP are mutually exclusive (e.g., there are no 'exposed' group 'c' people). If group 'a' is equivalent to group 'c' and group 'b' is equivalent to group 'd', then the data would be simpler if you just used groups 'a' and 'b' and used the value of EXP to stratify them. Not sure if that makes sense.

data have;
infile cards dsd truncover firstobs=1 dlm=' ';
length study_id $4 exp $10 group $1 mon1-mon6 3;
input study_id exp group mon1 mon2 mon3 mon4 mon5 mon6;
cards;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0
S004 unexposed d 0 0 0 1 1 1
;
run;

proc sql noprint;
create table grpnum as
select 'fgrpnum' as fmtname length=8, 'C' as type length=1,
group as start, monotonic() as label
from (select distinct group from have);

select count(*) into :ngrps from grpnum;
quit;

proc format cntlin=grpnum; run;

data want;
set have end=last;
array m {*} mon:;
array T {6,0:1,&ngrps} _temporary_;
do monnum=1 to dim(T,1);
	T[monnum,(exp='exposed'),put(group,$fgrpnum.)*1]+m[monnum];
end;
if last then do;
	do monnum=1 to dim(T,1);
		do exposed=0 to 1;
			do grpnum=1 to dim(T,3);
				totmon=max(0,T[monnum,exposed,grpnum]);
				output;
			end;
		end;
	end;
end;
keep monnum exposed grpnum totmon;
run;

proc print data=want; run;

Tom · Posted 01-22-2025 10:34 AM

Convert from WIDE to TALL. So assuming MON1 means JAN2010 you could use a data step like this to transpose it while also generating your MONTH variable.

data have;
  input study_id $ exp :$9. group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;

data tall;
  set have;
  array mon[6];
  start='01JAN2010'd ;
  do month_num=1 to dim(mon);
    month = intnx('month',start,month_num-1);
    present = mon[month_num];
    output;
  end;
  drop mon1-mon6 ;
  format start month monyy7.;
run;

Now summaries are easy.

proc summary nway data=tall ;
  class month month_num exp group ;
  var present ;
  output out=want sum=ptime;
run;

Results:

...

sasgorilla · Posted 01-24-2025 12:32 AM

@Tom This worked perfectly, and was very succint. Thank you!!

Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Re: Summing person-time from wide month variables into long form

Registration is open

Registration is open

SAS Training: Just a Click Away