Hello there, I recently get stuck on how to do data aggregation, any input would be appreciated.
The table I would like to do aggregation has multiple lines of info for users with different IDs
.
I would like to summarize the data into only one line of all desired info for each member. However, different rules in choosing the right value for different variables makes it hard for me.
Example: Original table and desired outcome for the table is listed below (only one member is shown here):
ID | Name | DOB | Event1 | Time1-A | Time1-B | Event2 | Time2-A | Time2-B | Event3 | Time3-A | Time3-B | Time3-C |
123 | JaneD | ABC | CCC | 1/1/2017 | 1/3/2017 | . | . | . | . | . | . | . |
345 | JaneD | ABC | NNN | 3/3/2017 | 3/7/2017 | . | . | . | QQQ | 4/3/2017 | 3/5/2017 | 3/9/2017 |
456 | JaneD | ABC | . | . | . | NNN | 11/26/2016 | 2/9/2017 | NNN | 5/17/2017 | 4/7/2017 | 5/20/2017 |
to:
Name | DOB | Event1 | Time1 | Event2 | Time2 | Event3 | Time3 |
JaneD | ABC | CCC | 1/3/2017 | NNN | 2/9/2017 | QQQ | 4/3/2017 |
Rules are:
Event123
has to be selected based on the priority of CCC>QQQ>NNN
.TimeABC etc.
have to match Event and always choose the max value between A&B
.I tried proc
summary but it won't always have the right match between Event and max time.
Can anyone help on what aggregation method might be useful in summarizing info when too many variables are involved?
Thanks!
1) Break the row into separated events either by proc transpose or by a data step:
data temp;
set have;
keep ID NAME DOB EventX PRIORITY TimeX;
array ev Evwnt1-Event3;
array tm Time_1A Time_1B Time_2A Time_2B TIme_3A TIme_3B;
do i=1 to 3;
EventX = ev(i);
/* add PRIORITY variable */
if not missing(EventX) then do;
TimeX = tm(i);
output;
end;
end;
run;
Proc sort data=temp; by Name Priority TimeX; run;
How does Time_3C of Enent3 play with the rules ?
2) Add a new variable, the PRIORITY code, assigned according to event code.
You can add it in same above step, either using a format or a CASE statement.
3) Sort the TEMP dataset by Name PRIORITY and TimeX.
4) Select desired event per Name ant transpose into desired row format, using tool
depending on output type: Dataset or Report.
1. Post more test cases that cover all scenarios. Otheriwse we're going to code something and you're going to come back and say it doesn't work.
2. Convert CCC/QQQ/NNN to numbers. CCC = 3 QQQ = 2 NNN = 1. Then you can take the max as need and convert at the end.
3. Look at ARRAYS -> 3, one for each series, Event1-3, TIMEA1-3, TIMEB1-3 so you can loop through as necessary.
Can you more clearly define define how the rules are implemented for dates, I didn't quite get that.
As mentioned on your original post, a data step is likely warranted.
Another option is to transpose your data and deal with it in a long form and then transposing back.
The following could get you a start.
data want;
set have;
array x{*} event:;
array y{*} time1:;
array z{*} time2:;
c=whichc('CCC',of x{*});
q=whichc('QQQ',of x{*});
n=whichc('NNN',of x{*});
if c then do;
want_event=x{c};
want_time=max(y{c},z{c});
end;
else if q then do;
want_event=x{q};
want_time=max(y{q},z{q});
end;
else if n then do;
want_event=x{n};
want_time=max(y{n},z{n});
end;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.