Hello,
I'm using Base SAS v9.3.
I have data set HAVE as follows:
data HAVE;
input Dept $ CM_start CM_end Year Count;
length dept $ 4;
datalines;
M82T 1 3 2005 789
M82T 4 5 2005 789
M82T 7 9 2005 789
M82T 10 11 2006 985
M82T 12 14 2006 985
UTYD 5 7 2008 980
UTYD 8 9 2008 980
UTYD 10 15 2011 784
UTYD 16 17 2011 784
UTYD 18 20 2011 784
SRT6 5 7 2008 452
SRT6 8 9 2008 452
SRT6 10 15 2008 452
SRT6 16 17 2008 452
SRT6 18 20 2008 452
;
run;
I want data set WANT that collapses the rows into one record for all rows that have the same DEPT, YEAR and COUNT.
BUT, I also want CM_START to represent the lowest CM value and CM_END to represent the highest from that DEPT/YEAR/COUNT group.
For example:
Dept | CM_start | CM_end | Year | Count |
M82T | 1 | 9 | 2005 | 789 |
M82T | 10 | 14 | 2006 | 985 |
UTYD | 5 | 9 | 2008 | 980 |
UTYD | 10 | 20 | 2011 | 784 |
SRT6 | 5 | 20 | 2008 | 452 |
I feel like this is a multi-step process, that might incorporate by-group processing, with use of first and last but can't for the life of me figure it out.
Thanks for your time!
Does it need to be so complicated, why not just:?
proc sql;
create table want as
select dept, min(cm_start) as cm_start, max(cm_end) as cm_end, year, count
from have
group by dept, year, count
;
quit;
//Fredrik
Something like:
data have;
length dept $ 4;
input dept $ cm_start cm_end year count;
datalines;
M82T 1 3 2005 789
M82T 4 5 2005 789
M82T 7 9 2005 789
M82T 10 11 2006 985
M82T 12 14 2006 985
UTYD 5 7 2008 980
UTYD 8 9 2008 980
UTYD 10 15 2011 784
UTYD 16 17 2011 784
UTYD 18 20 2011 784
SRT6 5 7 2008 452
SRT6 8 9 2008 452
SRT6 10 15 2008 452
SRT6 16 17 2008 452
SRT6 18 20 2008 452
;
run;
data want;
set have;
by dept year count notsorted;
retain f_start f_year f_count;
if first.count then do;
f_start=cm_start;
f_year=year;
f_count=count;
end;
if last.count then do;
cm_start=f_start;
year=f_year;
count=f_count;
output;
end;
run;
Does it need to be so complicated, why not just:?
proc sql;
create table want as
select dept, min(cm_start) as cm_start, max(cm_end) as cm_end, year, count
from have
group by dept, year, count
;
quit;
//Fredrik
It depends, if there is a lot of data sql is not the best procedure to process it - been shown in other posts. SQL isn't also known by everyone. Everyone however who uses SAS knows Base SAS datastep language, and to be perfectly honest, there is very little difference between the two codes there anyways, a by group, you use a min/max function, I use a retained variable, big whoop.
What would be the expected result of data like this
bob42 1 3 2009 123
bob42 5 8 2009 123
bob42 9 10 2009 123
or is it sure that there are no breaks between cm_end and the next cm_start?
This is a valid point @andreas_lds and one I should have included in my post (apologies). The data provider assures me that there are no breaks and review of the data indicates this is true. Of course, that doesn't mean that an error could occur in pre-processing in the future. I believe (perhaps wrongly) that @FredrikE's code would account for such an instance and creating separate rows where a break occurs. This is sufficient for my purposes.
Here's a proc summary solution
proc summary data=have nway;
class Dept year count;
output out=want(drop=_:) min(CM_start)=CM_start max(CM_end)=CM_end;
run;quit;
proc print;run;
Thanks @himalayan. This worked well, and I don't regularly use proc summary so it was nice to see an alternative method! It was however a bit slower to process than the SQL method, so have accepted that solution.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.