BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
sasgorilla
Pyrite | Level 9

Hi. I have a dataset in a wide format where IDs have multiple observations. This was based on where an individual experienced an outcome, underwent demographic changes, group assignment, etc. where a new row was created. I have stripped the superfluous variables and created an abbreviated sample data below.

This abbreviated wide data set has study_id, an exposure variable, a group variable, and an indicator of presence for all months in the sample (1=present and contributing person-time, 0=absent), as below:

data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;
RUN;

To prepare for a subsequent analysis, I would like to sum person time by month, exposed status, and group, as follows. I would also like to add a formatted monyy. variable so I can see where in the year each month occurs, as this dataset spans several years. 

data want;
input month monyy. month_num exp $ group $ ptime;
format month monyy.;
datalines;
JAN10 1 exp a 1
JAN10 1 exp b 1
JAN10 1 unexp c 2
JAN10 1 unexp d 0
FEB10 2 exp a 1
FEB10 2 exp b 1
FEB10 2 unexp c 2
FEB10 2 unexp d 0
MAR10 3 exp a 1
MAR10 3 exp b 1
MAR10 3 unexp c 2
MAR10 3 unexp d 0
APR10 4 exp a 1
APR10 4 exp b 0
APR10 4 unexp c 2 
APR10 4 unexp d 1
MAY10 5 exp a 1
MAY10 5 exp b 0
MAY10 5 unexp c 2
MAY10 5 unexp d 1
JUN10 6 exp a 0
JUN10 6 exp b 0
JUN10 6 unexp c 2 
JUN10 6 unexp d 1
;
run;

I understand this may not be the most optimally organized dataset as it stands, but this is what I am working with at present. I may ask an additional question to see if some of this wide dataset formatting could have been bypassed using other start/stop date variables in the dataset, but I will do that in another thread when I have some time.

 

In its current form, does anyone have any ideas on the most optimal way to get from the multiple wide month variables in "have" to summed person time by month/exp/group as depicted in "want"? 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

Assuming I understood what you mean.

 

data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;
RUN;
proc sql;
create table want as
select 'Jan10' as month,1 as month_num,exp,group,sum(mon1) as ptime from have group by exp,group
union all
select 'Feb10' as month,2 as month_num,exp,group,sum(mon2) as ptime from have group by exp,group
union all
select 'Mar10' as month,3 as month_num,exp,group,sum(mon3) as ptime from have group by exp,group
union all
select 'Apr10' as month,4 as month_num,exp,group,sum(mon4) as ptime from have group by exp,group
union all
select 'May10' as month,5 as month_num,exp,group,sum(mon5) as ptime from have group by exp,group
union all
select 'Jun10' as month,6 as month_num,exp,group,sum(mon6) as ptime from have group by exp,group
;
quit;

View solution in original post

12 REPLIES 12
Ksharp
Super User

Assuming I understood what you mean.

 

data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;
RUN;
proc sql;
create table want as
select 'Jan10' as month,1 as month_num,exp,group,sum(mon1) as ptime from have group by exp,group
union all
select 'Feb10' as month,2 as month_num,exp,group,sum(mon2) as ptime from have group by exp,group
union all
select 'Mar10' as month,3 as month_num,exp,group,sum(mon3) as ptime from have group by exp,group
union all
select 'Apr10' as month,4 as month_num,exp,group,sum(mon4) as ptime from have group by exp,group
union all
select 'May10' as month,5 as month_num,exp,group,sum(mon5) as ptime from have group by exp,group
union all
select 'Jun10' as month,6 as month_num,exp,group,sum(mon6) as ptime from have group by exp,group
;
quit;
sasgorilla
Pyrite | Level 9

@Ksharp This looks exactly like what I needed. Thank you! 

One additional question with proc sql: is there a way to array this pattern through many months, or do I just need to copy the full select statement for every given month? I ask because in my study sample there are approximately 60 months. 

 

Thank you. 

LinusH
Tourmaline | Level 20

There is no OOTB syntax in SQL for this.

But you can use other logic to make it more dynamic.

My go-to for this scenario would a PROC TRANSPOSE, fixing a proper sas date value and the sum using SQL.

data have;
	input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
	rowid = _n_;
	datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;
RUN;

proc transpose 	data=have 
				out=want1(rename=(col1=present))
				name=month;
	by study_id exp group rowid;
	var mon1-mon6;
run;

%let startmon=01jan2010;

data want2;
	set want1;
	month_num = intnx('MONTH',"&startMon"d,input(substr(month,4),2.));
	format month_num monyy.;
run;

proc sql;
	create table want3 as	
		select study_id, exp, group, month_num, sum(present) as sumpres
		from want2
		group by 1, 2, 3, 4
	;
quit; 
Data never sleeps
sasgorilla
Pyrite | Level 9

@LinusH Your code worked pretty well with a few modifications.The proc sql code was initially summing 1 for each row, and each row was subject id versus grouping by month/exp/group. If I modified it like the below, it produced exactly what I wanted. 

 

proc sql;
	create table want3 as	
		select month_num, exp, group,sum(present) as sumpres
		from want2
		group by month_num, disposition, typenum
	;
quit; 

The one other thing I can't figure out is the month_num is all offset 1 month, so the first month is February and the last month ends one month after the desired month. I set the %let month back one month and that didn't work. Any ideas? 

 

LinusH
Tourmaline | Level 20

My bad, intnx offsets using the month no, but month1 is the same as your start month (Jan2010).

Sp solution is to subtract 1 from month:

data want2;
	set want1;
	month_n = input(substr(month,4),2.);
	month_num = intnx('MONTH',"&startMon"d,month_n-1);
	format month_num monyy.;
run;

proc sql;
	create table want3 as	
		select month_num, month_n, exp, group, sum(present) as sumpres
		from want2
		group by 1, 2, 3, 4
		order by 1, 2, 3, 4 
	;
quit;

Then I didn't get your naming of group variables, they didn't match the select clause...?

Data never sleeps
Ksharp
Super User

OK . Sure. If you have a variety of month variable。

Try this one:

 

 

data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;
proc summary data=have nway;
class exp group;
var mon1-mon6;
output out=temp sum=;
run;
proc transpose data=temp out=temp2;
by exp group;
var mon1-mon6;
run;
data temp3;
 set temp2(rename=(col1=ptime));
 month_num=input(compress(_name_,,'kd'),best.);
 mon=scan(put(mdy(month_num,10,2024),worddate12.),1,',');
 drop _name_;
run;
proc sort data=temp3 out=want;
by month_num exp group;
run;

 

sasgorilla
Pyrite | Level 9

@Ksharp I tried this code and it appears this line: 

mon=scan(put(mdy(month_num,10,2024),worddate12.),1,',');

results in the error message: 

Invalid argument to function MDY(13,10,2024) at line 486 column 18.

Invalid argument to function MDY(14,10,2024) at line 486 column 18.

Invalid argument to function MDY(15,10,2024) at line 486 column 18.

 

and so forth. 

 

I'm assuming this is because my months are numbered beyond 12. Any idea how to rectify? 

Ksharp
Super User

Yes. I understood. If you have variables like  mon1-mon24, then you need to consider the YEAR.

 

data have;
input study_id $ exp $ group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;
proc summary data=have nway;
class exp group;
var mon1-mon6;
output out=temp sum=;
run;
proc transpose data=temp out=temp2;
by exp group;
var mon1-mon6;
run;
data temp3;
 set temp2(rename=(col1=ptime));
 month_num=input(compress(_name_,,'kd'),best.);
 mon=put( intnx('month','10dec2009'd,month_num,'s') ,monyy.);
 drop _name_;
run;
proc sort data=temp3 out=want;
by month_num exp group;
run;
sasgorilla
Pyrite | Level 9

@Ksharp That worked, thank you!

quickbluefish
Barite | Level 11

Another possibility - just stuffing everything into a multidimensional array and then spitting it out at the end - the SQL and FORMAT steps are just to deal with counting the number of groups and translating them to numbers.  I realized afterward that this outputs some irrelevant rows, as it seems in your data, EXP and GROUP are mutually exclusive (e.g., there are no 'exposed' group 'c' people).  If group 'a' is equivalent to group 'c' and group 'b' is equivalent to group 'd', then the data would be simpler if you just used groups 'a' and 'b' and used the value of EXP to stratify them.  Not sure if that makes sense.

data have;
infile cards dsd truncover firstobs=1 dlm=' ';
length study_id $4 exp $10 group $1 mon1-mon6 3;
input study_id exp group mon1 mon2 mon3 mon4 mon5 mon6;
cards;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0
S004 unexposed d 0 0 0 1 1 1
;
run;

proc sql noprint;
create table grpnum as
select 'fgrpnum' as fmtname length=8, 'C' as type length=1,
group as start, monotonic() as label
from (select distinct group from have);

select count(*) into :ngrps from grpnum;
quit;

proc format cntlin=grpnum; run;

data want;
set have end=last;
array m {*} mon:;
array T {6,0:1,&ngrps} _temporary_;
do monnum=1 to dim(T,1);
	T[monnum,(exp='exposed'),put(group,$fgrpnum.)*1]+m[monnum];
end;
if last then do;
	do monnum=1 to dim(T,1);
		do exposed=0 to 1;
			do grpnum=1 to dim(T,3);
				totmon=max(0,T[monnum,exposed,grpnum]);
				output;
			end;
		end;
	end;
end;
keep monnum exposed grpnum totmon;
run;

proc print data=want; run;
Tom
Super User Tom
Super User

Convert from WIDE to TALL.  So assuming MON1 means JAN2010 you could use a data step like this to transpose it while also generating your MONTH variable.

data have;
  input study_id $ exp :$9. group $ mon1 mon2 mon3 mon4 mon5 mon6;
datalines;
S001 exposed a 1 1 0 0 0 0
S001 exposed a 0 0 1 1 0 0
S001 exposed a 0 0 0 0 1 0
S002 exposed b 1 0 0 0 0 0
S002 exposed b 0 1 1 0 0 0
S002 unexposed c 0 0 0 1 1 1
S003 unexposed c 1 1 1 1 1 1
S004 unexposed c 1 1 1 0 0 0 
S004 unexposed d 0 0 0 1 1 1
;

data tall;
  set have;
  array mon[6];
  start='01JAN2010'd ;
  do month_num=1 to dim(mon);
    month = intnx('month',start,month_num-1);
    present = mon[month_num];
    output;
  end;
  drop mon1-mon6 ;
  format start month monyy7.;
run;

Now  summaries are easy.

proc summary nway data=tall ;
  class month month_num exp group ;
  var present ;
  output out=want sum=ptime;
run;

Results:

Tom_0-1737560086348.png

...

sasgorilla
Pyrite | Level 9

@Tom This worked perfectly, and was very succint. Thank you!!

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 2165 views
  • 5 likes
  • 5 in conversation