Hi all,
Thanks for all your help so far, this is the last step that I've been trying to figure out and was wondering if there's an efficient way to go about doing this.
Currently I have data set up in this format:
ID EndDate Value Days
1 1/1/00 2 1
1 1/2/00 2 2
1 2/1/00 3 3
1 etc 3 4
1 etc 3 5
1
2 1/1/00 2 1
2 1/2/00 2 2
2 2/1/00 3 3
2 etc 3 4
2 etc 3 5
The end product is going to be
ID Month1Sum Month1Avg Month2Sum Month2Avg
1
2
3
4
etc
So in essence, what I'm trying to do is for "Days" 1-30, to get the sum and avg of the "Value" column as two new variables, then 31-60, etc. This is a slice of code to show how the breakdown of months would work m12 would be the enddate-30.
Days,Days+30,m1);
Days+31,Days+60,m2);
Days+61,Days+90,m3);
Days+91,Days+120,m4);
Days+121,Days+150),m5);
Days+151,Days+180),m6);
Days+181,Days+210),m7);
Days+211,Days+240),m8);
Days+241,Days+270),m9);
Days+271,Days+300),m10);
Days+301,Days+330),m11);
Thanks again for all your help!
For completeness I include the entire code now, together with the generated values for 10 IDs.
data have;
do Days=1 to 300;
do id=1 to 10;
Value=int(30*ranuni(1)/1);
output;
end;
end;
run;
data have2;
format id 8. Days best4.;
set have;
month=int(Days/30)+1;
if Days<=30 then month=1;
else if Days<=60 then month=2;
else if Days<=90 then month=3;
else if Days<=120 then month=4;
else if Days<=150 then month=5;
else if Days<=180 then month=6;
else if Days<=210 then month=7;
else if Days<=240 then month=8;
else if Days<=270 then month=9;
else if Days<=300 then month=10;
else month=11; *just in case *;
if month=1 then Monthval1=Value;
else if month=2 then Monthval2=Value;
else if month=3 then Monthval3=Value;
else if month=4 then Monthval4=Value;
else if month=5 then Monthval5=Value;
else if month=6 then Monthval6=Value;
else if month=7 then Monthval7=Value;
else if month=8 then Monthval8=Value;
else if month=9 then Monthval9=Value;
else if month=10 then Monthval10=Value;
else Monthval11=Value;
run;
proc sort data=have2;
by id days;
run;
%let months=Monthval1 Monthval2 Monthval3 Monthval4 Monthval5
Monthval6 Monthval7 Monthval8 Monthval9 Monthval10;
%let sum_months=sum_Monthval1 sum_Monthval2 sum_Monthval3 sum_Monthval4 sum_Monthval5
sum_Monthval6 sum_Monthval7 sum_Monthval8 sum_Monthval9 sum_Monthval10;
%let avg_months=avg_Monthval1 avg_Monthval2 avg_Monthval3 avg_Monthval4 avg_Monthval5
avg_Monthval6 avg_Monthval7 avg_Monthval8 avg_Monthval9 avg_Monthval10;
proc summary data=have2 ;
var Monthval1 -- Monthval10;
by id;
output out=monthly_stats (drop=_FREQ_ _TYPE_)
sum=&sum_months mean=&avg_months;
run;
data want;
format id best4. sum_Monthval1 best8. avg_Monthval1 8.2
sum_Monthval2 best8. avg_Monthval2 8.2
sum_Monthval3 best8. avg_Monthval3 8.2
sum_Monthval4 best8. avg_Monthval4 8.2
sum_Monthval5 best8. avg_Monthval5 8.2
sum_Monthval6 best8. avg_Monthval6 8.2
sum_Monthval7 best8. avg_Monthval7 8.2
sum_Monthval8 best8. avg_Monthval8 8.2
sum_Monthval9 best8. avg_Monthval9 8.2
sum_Monthval10 best8. avg_Monthval10 8.2;
set monthly_stats;
run;
options nocenter nodate;
proc print data=want uniform noobs;
run;
We need more precise definition of Days 1-30. Does the first column "EndDate" determine the number of days, i.e. are these days in a calendar month specified by EndDate?
The "days" column is calculated using two columns of info. One is a startdate and the other is the enddate. I left it out since I've already calculated the "days" column using that, but if needed you can code it with start/enddate variables. The thing is, the start and end dates are approximately 10 months apart and I'm looking at 30 day intervals rather than months. Meaning, I'm looking at 300 days prior to the end date, (the 300 days is not uniform for all) and trying to figure out the sum/avg value for the first 30 days per ID, then the next 30 days, etc.
I did not check this code... I gave you two solutions for the month calculation, one the calendar month (commented out) and then another which should give you 30 day months. Also, I dropped the variable _name_ from the output datasets only to simplify the merge. Me, I'd leave it in for a while to double check the transposition. I'm not a transposing whiz, so there might be a way to optimize here, too.
proc sql;
create table add_month as
select
id,
end_date,
/* intck("month", "31Dec1999"d, end_date) as month, */
int((end_date - "01Jan2000"d) / 30) + 1 as month,
value,
days
from
original_data;
create table monthly_values as
select
id,
month,
"Month" || put(month, z2.0) || "Sum" as sum_ID,
"Month" || put(month, z2.0) || "Avg" as avg_ID,
sum(value) as sum_value,
avg(value) as avg_value
from
add_month
group by
month
order by
id,
month;
quit;
proc transpose data=monthly_values out=sum_value (drop=_name_);
id sum_ID;
var sum_value;
by id;
run;
proc transpose data=monthly_values out=avg_value (drop=_name_);
id avg_ID;
var avg_value;
by id;
run;
data desired_data;
merge sum_value avg_value;
by id;
run;
Hope this helps,
Tish
Does your date criteria, of 30 day range change per id or is it from a fixed start date and everything is in reference to that?
It resets per ID and it is all in chronological order. So it would go 1,2,3.... then reset for the next ID.
example:
data have;
do id=1,2;
do days=1 to 90;
value=days+id;
output;
end;
end;
data temp;
set have;
if days<31 then group=1;
else if days<61 then group=2;
else group=3;
run;
proc sql;
create table temp as
select id,group ,sum(value) as sum, mean(value) as mean
from temp
group by id,group
order by id,group;
quit;
data want(drop=group mean sum);
retain month1sum month2sum month3sum month1mean month2mean month3mean;
set temp;
by id;
array _sum(*) month1sum month2sum month3sum;
array _mean(*) month1mean month2mean month3mean;
_sum(group)=sum;
_mean(group)=mean;
if last.id then
output;
run;
proc print;run;
Obs month1sum month2sum month3sum month1mean month2mean month3mean id
1 495 1395 2295 16.5 46.5 76.5 1
2 525 1425 2325 17.5 47.5 77.5 2
Linlin
LinLin:
Thanks for this code, but I currently have around 200k ID's to work with and was wondering if there was any way to make it more streamlined where it would auto detect the number of ID's.
Reeza:
Would this work if I needed to submit it into a macro because I need to eventually pull about 120 sets of values since I'm looking for monthly and trimester level values? And to clarify, each ID has it's own start/end dates.
I'm not sure where the start and end dates come in to play. The proc means should be versatile and fine in a macro as long as the days definition is the same across variables, ie all 0 to 30 belong to month 1.
My thought was that if the ranges are fixed, you could use a format for the dates to specify them as well, but a bit unclear about what you're trying to do.
Maybe explaining your data and the final end result and or business context may help, if you can.
Okay, I'm currently trying to use measurements taken during pregnancy (end date = date of delivery) to quantify how much of each measurement was taken for the periods on a monthly and trimester basis. eg. how much cereal a person ate in the first month of pregnancy, 2nd, etc. as well as during first trimester, etc.
I have several items I'm analyzing, eg. cereal, milk, juice, etc. and the final dataset would have each person and variables for each item and the values for each month and trimester on one row for ease of analysis.
Currently, I'm trying to finalize the processing of lets say cereal, so I can create a macro to deal with all of the other items in the same manner without having to manually do each one. The days are defined by date of delivery minus the length of the pregnancy so that's how they're involved in this data.
So you have a start_date, end_date and a measurement_date? and then values?
Is your data like this :
start_date end_date measurement_date cereal juice milk
or like this:
start_date end_date measurement_date type value
Oh, sorry, so what we have are monthly measurements instead of daily. Therefore, I broke it out so that each monthly measurement is divided into the number of days in that month so I have a daily measurement.
From there, I'm trying to put it back together by month (not calendar, but of pregnancy) so that I can see the values for the first, second, third months of pregnancy etc.
I have:
ID start_date end_date day(this is the count that starts at 1 on the start date and goes up to delivery date, then resets for the next ID) value
Each item is in a separate dataset as of now, so I'll eventually need to merge all the data back together once I get the monthly and trimester measurements set up.
So the final dataset would look like:
ID Item1Month1sum, item1month1avg, item1month2sum/avg,.... item2month1sum/avg.... etc
Thanks again for your help!
Well, I think what I've suggested will work fine, create the format and then go from there with proc means. It'll reduce the hard coding a lot.
proc format;
value month_fmt
0 - 30 = 1
31 - 60 = 2
61 - 90 = 3
91 - 120 = 4
121 - 150 = 5
151 - 180 = 6
181 - 210 = 7
211 - 240 = 8
241 - 270 = 9
271 - 300 = 10
301 - 330 = 11;
run;
proc means data=have;
class id days;
format days month_fmt.;
var value;
output out=sample n= mean=/autoname;
run;
andp: Linlin's code will work for any number of id's, she just had to generate test data. I understand you already have data; so you don't need the first data step. If you name your variables sum_Month1, sum_Month2, ..., sum_Month10, so that the consecutive number is at the end of the variable name, then you can sometimes use syntax sum_Month1 -- sum_Month10 instead the explicit list of all vars, for instance in defining an array to summarize over a trimester or whatever.
data have;
informat ID best4. EndDate mmddyy8. Value best8. Days 3.;
input ID EndDate Value Days;
format EndDate date9.;
cards;
1 1/1/00 2 1
1 1/2/00 2 2
1 2/1/00 3 3
1 2/1/00 3 4
1 2/1/00 3 5
2 1/1/00 4 1
2 1/2/00 5 2
2 2/1/00 6 3
2 2/1/00 7 4
2 2/1/00 3 5
3 1/1/00 4 1
3 1/2/00 4 2
3 2/1/00 3 3
3 2/1/00 2 4
3 2/1/00 1 5
;
run;
Data have2;
format id 8. Days best4.;
set have;
if mod(Days,30)=1 then Monthval1=Value;
else if mod(Days,30)=2 then Monthval2=Value;
else if mod(Days,30)=3 then Monthval3=Value;
else if mod(Days,30)=4 then Monthval4=Value;
else if mod(Days,30)=5 then Monthval5=Value;
*more Monthvaln definitions here*;
run;
proc sort data=have2;
by id days;
run;
%let months=Monthval1 Monthval2 Monthval3 Monthval4 Monthval5;
%let sum_months=sum_Monthval1 sum_Monthval2 sum_Monthval3 sum_Monthval4 sum_Monthval5;
%let avg_months=avg_Monthval1 avg_Monthval2 avg_Monthval3 avg_Monthval4 avg_Monthval5;
proc summary data=have2 ;
var Monthval1 -- Monthval5;
by id;
output out=monthly_stats (drop=_FREQ_ _TYPE_)
sum=&sum_months mean=&avg_months;
run;
* this is the output:
s s s s s a a a a a
u u u u u v v v v v
m m m m m g g g g g
_ _ _ _ _ _ _ _ _ _
M M M M M M M M M M
o o o o o o o o o o
n n n n n n n n n n
t t t t t t t t t t
h h h h h h h h h h
v v v v v v v v v v
O a a a a a a a a a a
b i l l l l l l l l l l
s d 1 2 3 4 5 1 2 3 4 5
1 1 2 2 3 3 3 2 2 3 3 3
2 2 4 5 6 7 3 4 5 6 7 3
3 3 4 4 3 2 1 4 4 3 2 1
*;
Hi Dorota,
Thanks for the response, I just ran your code with the necessary modifications and noticed that it was summing the months in an odd fashion. Instead of summing just month1, the month1 variable has a value in it every 30 days, and likewise for the rest of the months so that every 30 days each month has a value. I'm wondering if it's the mod() function that's causing this and if there's a way to modify that portion so it sums the first 30 days into one month instead of one day every 30 days.
Tish:
Is there a way to write your code in a way that I could specify the length of the months? I'm not working with all months at 30 days so I was wondering, even if it takes more writing, to specify each month length?
Thanks again!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.