Hello,
I have a dataset such that for each value of 'acc', I want to select the row with the highest 'DT' value where flag='Y'. Below is the input dataset:
| Form | ID | DT | acc | flag | 
| 60663233 | 2210 | 01Oct2019 0:00:00.000 | test | Y | 
| 60699093 | 2210 | 03Oct2019 0:00:00.000 | test | |
| 60699143 | 2210 | 03Oct2019 0:00:00.000 | test | |
| 60699213 | 2210 | 03Oct2019 0:00:00.000 | test | Y | 
| 60699983 | 2210 | 03Oct2019 0:00:00.000 | test | |
| 60699998 | 2210 | 03Oct2019 0:00:00.000 | test | |
| 60700023 | 2210 | 03Oct2019 0:00:00.000 | test | |
| 61120139 | 2317 | 12Nov2019 0:00:00.000 | test1 | Y | 
This is what I've come up with so far:
proc sql;
create table abc as
select * 
from sample where flag = 'Y' having dt = max(dt);
quit;But this gives me the following output:
| Form | ID | DT | acc | flag | 
| 61120139 | 2317 | 12Nov2019 0:00:00.000 | test1 | Y | 
But I'm trying to get the following output: I don't need the max value of DT for the entire table, but looking to get the max DT value per acc.
| Form | ID | DT | acc | flag | 
| 60699213 | 2210 | 03Oct2019 0:00:00.000 | test | Y | 
| 61120139 | 2317 | 12Nov2019 0:00:00.000 | test1 | Y | 
Appreciate your help and time 🙂
Thank you.
Add a
group by acc;proc sql;
create table abc as
select * 
from sample where flag = 'Y' 
group by acc
having dt = max(dt);
quit;
I don't understand this. Acc is constand besides one obs?
Add a
group by acc;proc sql;
create table abc as
select * 
from sample where flag = 'Y' 
group by acc
having dt = max(dt);
quit;
The SQL clauses are as follows:
1. SELECT- MUST
2. FROM-MUST
3. WHERE (IF ANY)
4. GROUP BY (IF ANY)
5. HAVING (IF A GROUPED FILTER IS REQUIRED)
6. ORDER BY (IF YOU WANT TO ORDER)
Courtesy: @PGStats aka prodigy genius stats who corrected me 2 years ago. I took those notes seriously after 🙂
Here is the test:
data have;
infile cards truncover;
input Form	ID @21	DT datetime21. acc $	flag $;
format dt datetime20.;
cards;
60663233    2210   	01Oct2019 0:00:00.000	test	Y
60699093	2210	03Oct2019 0:00:00.000	test	 
60699143	2210	03Oct2019 0:00:00.000	test	 
60699213	2210	03Oct2019 0:00:00.000	test	Y
60699983	2210	03Oct2019 0:00:00.000	test	 
60699998	2210	03Oct2019 0:00:00.000	test	 
60700023	2210	03Oct2019 0:00:00.000	test	 
61120139	2317	12Nov2019 0:00:00.000	test1	Y
;
proc sql;
create table abc as
select * 
from have 
where flag = 'Y' 
group by acc
having dt = max(dt);
quit;
proc print noobs;run;| Form | ID | DT | acc | flag | 
|---|---|---|---|---|
| 60699213 | 2210 | 03OCT2019:00:00:00 | test | Y | 
| 61120139 | 2317 | 12NOV2019:00:00:00 | test1 | Y | 
If your data are already sorted by ACC:
data abc;
  infile datalines missover;
  input Form ID DT :datetime21.3 acc :$5. flag :$1. ;
  format dt datetime24.3;
datalines;
60663233 2210 01Oct2019:0:00:00.000 test Y 
60699093 2210 03Oct2019:0:00:00.000 test   
60699143 2210 03Oct2019:0:00:00.000 test   
60699213 2210 03Oct2019:0:00:00.000 test Y 
60699983 2210 03Oct2019:0:00:00.000 test   
60699998 2210 03Oct2019:0:00:00.000 test   
60700023 2210 03Oct2019:0:00:00.000 test   
61120139 2317 12Nov2019:0:00:00.000 test1 Y 
run;
data want (drop=_:);
  do until (last.acc);
    set abc;
    where flag='Y';
    by acc;
    _maxdt=max(dt,_maxdt);
  end;
  do until (last.acc);
    set abc;
    where flag='Y';
    by acc;
    if dt=_maxdt then put (_all_) (=); *output;
  end;
run;This program doesn't bother reading in observations unless it has flag='Y'. The where flag="Y"; statements outsource the filtering process to the data engine, so only the qualifyi8ng records are seen by the data step.
A non-SQL way you can approach this with depends on whether your input is sorted by ACC or not. If it is sorted:
data have ;                                                                                                                             
  input form id dt :datetime18. acc $ flag $ ;                                                                                          
  format dt datetime. ;                                                                                                                 
  cards ;                                                                                                                               
60663233  2210  01Oct2019:00:00:00  test   Y                                                                                            
60699093  2210  03Oct2019:00:00:00  test   .                                                                                            
60699143  2210  03Oct2019:00:00:00  test   .                                                                                            
60699213  2210  03Oct2019:00:00:00  test   Y                                                                                            
60699983  2210  03Oct2019:00:00:00  test   .                                                                                            
60699998  2210  03Oct2019:00:00:00  test   .                                                                                            
60700023  2210  03Oct2019:00:00:00  test   .                                                                                            
61120139  2317  12Nov2019:00:00:00  test1  Y                                                                                            
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want ;                                                                                                                             
  do until (last.acc) ;                                                                                                                 
    set have (where=(flag="Y")) curobs = _q ;                                                                                           
    by acc ;                                                                                                                            
    if dt < _n_ then continue ;                                                                                                         
    _n_ = dt ;                                                                                                                          
    _s  = _q ;                                                                                                                          
  end ;                                                                                                                                 
  set have point = _s ;                                                                                                                 
run ;               
If it is not sorted:
data want ;                                                                                                                             
  dcl hash h () ;                                                                                                                       
  h.definekey ("acc") ;                                                                                                                 
  h.definedata ("_n_", "_s") ;                                                                                                          
  h.definedone () ;                                                                                                                     
  do until (z) ;                                                                                                                        
    set have curobs = _q end = z ;                                                                                                      
    where flag = "Y" ;                                                                                                                  
    if h.find() ne 0 then call missing (_n_, _s) ;                                                                                      
    if dt <= _n_ then continue ;                                                                                                        
    _n_ = dt ;                                                                                                                          
    _s  = _q ;                                                                                                                          
    h.replace() ;                                                                                                                       
  end ;                                                                                                                                 
  dcl hiter hi ("h") ;                                                                                                                  
  do while (hi.next() = 0) ;                                                                                                            
    set have point = _s ;                                                                                                               
    output ;                                                                                                                            
  end ;                                                                                                                                 
run ;                     
In both cases, if there's a tie between two "Y" records with the same maximal date, the first occurrence is selected. Note that self-merging SQL offered by @novinosrin won't handle this and instead return all the tied records.
Kind regards
PaulD.
Guru @hashman Yes fully agree and point well taken. Yet another exuberance of "attention to detail". Kudos!
My thoughts & assumptions:
1. if your dataset doesn't have any more variables than the ones listed and the values are truly representative of your real, select distinct * should handle the duplicates and will pick only instance.
2. Point 1 essentially means you could have extra records of max dt and relies on this consistency. Strictly speaking, if even FORMNO changes for the same records that tied with the max dates, the select distinct * will of course fail.
3. If what Hashman pointed happens to be true, his solution is bullet proof and robust.
Hashman's diligence should have opened your eyes too to carefully examine the ties and associated variables in a record. Please make sure you keep note of it.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
