Hi all,
I am hoping to get your help with the following:
I have a database that looks like this
year | sex | are a | status | incidence | prevalence |
2000 | F | ON | 0 | 5 | 200 |
2001 | F | ON | 0 | 6 | 300 |
2002 | F | ON | 0 | 5 | 350 |
2003 | F | ON | 0 | 5 | 201 |
2004 | F | ON | 0 | 4 | 222 |
2005 | F | ON | 0 | 2 | 333 |
and I want to 'mush' together rows to get this :
year | sex | are a | status | incidence | prevalence |
2000/2002 | F | ON | 0 | 16 | 850 |
2003/2005 | F | ON | 0 | 11 | 756 |
Note that I am simply adding the years. I have several thousands of these so doing it by hand in excel is not very practical.
Any idea of how to do this in SAS?
Thank you in advance!
@K_S wrote:
Hi all,
I am hoping to get your help with the following:
I have a database that looks like this
year sex are a status incidence prevalence 2000 F ON 0 5 200 2001 F ON 0 6 300 2002 F ON 0 5 350 2003 F ON 0 5 201 2004 F ON 0 4 222 2005 F ON 0 2 333
and I want to 'mush' together rows to get this :
year sex are a status incidence prevalence 2000/2002 F ON 0 16 850 2003/2005 F ON 0 11 756
Note that I am simply adding the years. I have several thousands of these so doing it by hand in excel is not very practical.
Any idea of how to do this in SAS?
Thank you in advance!
"Adding the years" ... do you mean adding the incidence and prevalence within years?
You can create a format that groups together the different years, and then run PROC SUMMARY on the formatted years to get the sums.
For example:
proc format; value yearf 2000-2002='2000/2002' 2003-2005='2003/2005'; run; proc summary nway data=have; class year; format year yearf.; var incidence prevalence; output out=want sum=; run;
data have;
input year sex $ are_a $ status $ incidence prevalence;
datalines;
2000 F ON 0 5 200
2001 F ON 0 6 300
2002 F ON 0 5 350
2003 F ON 0 5 201
2004 F ON 0 4 222
2005 F ON 0 2 333
;
proc sql;
create table want(drop=k) as
select distinct catx('/',min(year),max(year)) as year, sex , are_a ,status ,round(year/3) as k, sum(incidence) as sum_incidence, sum(prevalence) as sum_prevalence
from have
group by k;
quit;
data have;
input year sex $ area $ status incidence prevalence;
cards;
2000 F ON 0 5 200
2001 F ON 0 6 300
2002 F ON 0 5 350
2003 F ON 0 5 201
2004 F ON 0 4 222
2005 F ON 0 2 333
;
proc sql noprint;
create table want1 as
select catx('/',min(year),max(year)) as year_,
sex,
area,
status,
sum(incidence) as sum_incidence1,
sum(prevalence) as sum_prevalence1
from have
where year in (2000:2002);
create table want2 as
select catx('/',min(year),max(year)) as year_,
sex,
area,
status,
sum(incidence) as sum_incidence2,
sum(prevalence) as sum_prevalence2
from have
where year in (2003:2005);
quit;
data tmp;
set want1 (rename = (sum_incidence1 = incidence sum_prevalence1 = prevalence year_ = year))
want2 (rename = (sum_incidence2 = incidence sum_prevalence2 = prevalence year_ = year));
run;
data want;
set tmp;
by year notsorted;
if last.year;
run;
data have;
input year sex $ are_a $ status $ incidence prevalence;
datalines;
2000 F ON 0 5 200
2001 F ON 0 6 300
2002 F ON 0 5 350
2003 F ON 0 5 201
2004 F ON 0 4 222
2005 F ON 0 2 333
;
data want;
retain _year;
if 0 then set have;
sum_incidence=0;
sum_prevalence=0;
do _n_=1 by 1 until(_n_=3);
set have;
by year;
sum_incidence+incidence;
sum_prevalence+prevalence;
if _n_=1 then temp=year;
else if _n_=3 then do;
_year=catx('/',temp,year);
output;
end;
end;
drop year temp incidence prevalence;
run;
@novinosrin wrote:
data have;
input year sex $ are_a $ status $ incidence prevalence;
datalines;
2000 F ON 0 5 200
2001 F ON 0 6 300
2002 F ON 0 5 350
2003 F ON 0 5 201
2004 F ON 0 4 222
2005 F ON 0 2 333
;
data want;
retain _year;
if 0 then set have;
sum_incidence=0;
sum_prevalence=0;
do _n_=1 by 1 until(_n_=3);
set have;
by year;
sum_incidence+incidence;
sum_prevalence+prevalence;
if _n_=1 then temp=year;
else if _n_=3 then do;
_year=catx('/',temp,year);
output;
end;
end;
drop year temp incidence prevalence;
run;
Assuming the year combinations wanted are always three-year combinations, this works fine. It also assumes that the final year is some multiple of 3 years after the initial year.
I find it opposite to my way of thinking to perform these types of summations in a data step, you have to create your own looping and end of loop conditions; when PROC MEANS and PROC SUMMARY were built to do this.
I don't know what the speed implications are for large data sets (which of course this example is not), however I would imagine that using a PROC would be faster for large data sets, but I am not aware of any study that shows this.
@PaigeMiller You are absolutely right. I would like the OP to clarify the grouping sets of years that he/she wants in the required output. I was merely having fun with Sql and datastep approaches as it didn't take more than a couple of mins. lol
I am not sure I understand what I have to clarify, so please forgive me if I am off topic.
I have about 10 years worth of data for 10 different geographic areas. I am trying to get disease prevalence estimates but inctea of yearly estimates, i want to get estimates based on 3 year avearegs.
I hope this clarifies. Also I need to keep the in-between columns, so they cannot be eliminated.
Thank you!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.