Hello,
I have a dataset such that for each value of 'acc', I want to select the row with the highest 'DT' value where flag='Y'. Below is the input dataset:
Form | ID | DT | acc | flag |
60663233 | 2210 | 01Oct2019 0:00:00.000 | test | Y |
60699093 | 2210 | 03Oct2019 0:00:00.000 | test | |
60699143 | 2210 | 03Oct2019 0:00:00.000 | test | |
60699213 | 2210 | 03Oct2019 0:00:00.000 | test | Y |
60699983 | 2210 | 03Oct2019 0:00:00.000 | test | |
60699998 | 2210 | 03Oct2019 0:00:00.000 | test | |
60700023 | 2210 | 03Oct2019 0:00:00.000 | test | |
61120139 | 2317 | 12Nov2019 0:00:00.000 | test1 | Y |
This is what I've come up with so far:
proc sql;
create table abc as
select *
from sample where flag = 'Y' having dt = max(dt);
quit;
But this gives me the following output:
Form | ID | DT | acc | flag |
61120139 | 2317 | 12Nov2019 0:00:00.000 | test1 | Y |
But I'm trying to get the following output: I don't need the max value of DT for the entire table, but looking to get the max DT value per acc.
Form | ID | DT | acc | flag |
60699213 | 2210 | 03Oct2019 0:00:00.000 | test | Y |
61120139 | 2317 | 12Nov2019 0:00:00.000 | test1 | Y |
Appreciate your help and time 🙂
Thank you.
Add a
group by acc;
proc sql;
create table abc as
select *
from sample where flag = 'Y'
group by acc
having dt = max(dt);
quit;
I don't understand this. Acc is constand besides one obs?
Add a
group by acc;
proc sql;
create table abc as
select *
from sample where flag = 'Y'
group by acc
having dt = max(dt);
quit;
The SQL clauses are as follows:
1. SELECT- MUST
2. FROM-MUST
3. WHERE (IF ANY)
4. GROUP BY (IF ANY)
5. HAVING (IF A GROUPED FILTER IS REQUIRED)
6. ORDER BY (IF YOU WANT TO ORDER)
Courtesy: @PGStats aka prodigy genius stats who corrected me 2 years ago. I took those notes seriously after 🙂
Here is the test:
data have;
infile cards truncover;
input Form ID @21 DT datetime21. acc $ flag $;
format dt datetime20.;
cards;
60663233 2210 01Oct2019 0:00:00.000 test Y
60699093 2210 03Oct2019 0:00:00.000 test
60699143 2210 03Oct2019 0:00:00.000 test
60699213 2210 03Oct2019 0:00:00.000 test Y
60699983 2210 03Oct2019 0:00:00.000 test
60699998 2210 03Oct2019 0:00:00.000 test
60700023 2210 03Oct2019 0:00:00.000 test
61120139 2317 12Nov2019 0:00:00.000 test1 Y
;
proc sql;
create table abc as
select *
from have
where flag = 'Y'
group by acc
having dt = max(dt);
quit;
proc print noobs;run;
Form | ID | DT | acc | flag |
---|---|---|---|---|
60699213 | 2210 | 03OCT2019:00:00:00 | test | Y |
61120139 | 2317 | 12NOV2019:00:00:00 | test1 | Y |
If your data are already sorted by ACC:
data abc;
infile datalines missover;
input Form ID DT :datetime21.3 acc :$5. flag :$1. ;
format dt datetime24.3;
datalines;
60663233 2210 01Oct2019:0:00:00.000 test Y
60699093 2210 03Oct2019:0:00:00.000 test
60699143 2210 03Oct2019:0:00:00.000 test
60699213 2210 03Oct2019:0:00:00.000 test Y
60699983 2210 03Oct2019:0:00:00.000 test
60699998 2210 03Oct2019:0:00:00.000 test
60700023 2210 03Oct2019:0:00:00.000 test
61120139 2317 12Nov2019:0:00:00.000 test1 Y
run;
data want (drop=_:);
do until (last.acc);
set abc;
where flag='Y';
by acc;
_maxdt=max(dt,_maxdt);
end;
do until (last.acc);
set abc;
where flag='Y';
by acc;
if dt=_maxdt then put (_all_) (=); *output;
end;
run;
This program doesn't bother reading in observations unless it has flag='Y'. The where flag="Y"; statements outsource the filtering process to the data engine, so only the qualifyi8ng records are seen by the data step.
A non-SQL way you can approach this with depends on whether your input is sorted by ACC or not. If it is sorted:
data have ;
input form id dt :datetime18. acc $ flag $ ;
format dt datetime. ;
cards ;
60663233 2210 01Oct2019:00:00:00 test Y
60699093 2210 03Oct2019:00:00:00 test .
60699143 2210 03Oct2019:00:00:00 test .
60699213 2210 03Oct2019:00:00:00 test Y
60699983 2210 03Oct2019:00:00:00 test .
60699998 2210 03Oct2019:00:00:00 test .
60700023 2210 03Oct2019:00:00:00 test .
61120139 2317 12Nov2019:00:00:00 test1 Y
;
run ;
data want ;
do until (last.acc) ;
set have (where=(flag="Y")) curobs = _q ;
by acc ;
if dt < _n_ then continue ;
_n_ = dt ;
_s = _q ;
end ;
set have point = _s ;
run ;
If it is not sorted:
data want ;
dcl hash h () ;
h.definekey ("acc") ;
h.definedata ("_n_", "_s") ;
h.definedone () ;
do until (z) ;
set have curobs = _q end = z ;
where flag = "Y" ;
if h.find() ne 0 then call missing (_n_, _s) ;
if dt <= _n_ then continue ;
_n_ = dt ;
_s = _q ;
h.replace() ;
end ;
dcl hiter hi ("h") ;
do while (hi.next() = 0) ;
set have point = _s ;
output ;
end ;
run ;
In both cases, if there's a tie between two "Y" records with the same maximal date, the first occurrence is selected. Note that self-merging SQL offered by @novinosrin won't handle this and instead return all the tied records.
Kind regards
PaulD.
Guru @hashman Yes fully agree and point well taken. Yet another exuberance of "attention to detail". Kudos!
My thoughts & assumptions:
1. if your dataset doesn't have any more variables than the ones listed and the values are truly representative of your real, select distinct * should handle the duplicates and will pick only instance.
2. Point 1 essentially means you could have extra records of max dt and relies on this consistency. Strictly speaking, if even FORMNO changes for the same records that tied with the max dates, the select distinct * will of course fail.
3. If what Hashman pointed happens to be true, his solution is bullet proof and robust.
Hashman's diligence should have opened your eyes too to carefully examine the ties and associated variables in a record. Please make sure you keep note of it.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.