Solved: Re: observation selection based on max value of another column

AJ_Brien · Posted 11-14-2019 03:36 PM

Hello,

I have a dataset such that for each value of 'acc', I want to select the row with the highest 'DT' value where flag='Y'. Below is the input dataset:

Form	ID	DT	acc	flag
60663233	2210	01Oct2019 0:00:00.000	test	Y
60699093	2210	03Oct2019 0:00:00.000	test
60699143	2210	03Oct2019 0:00:00.000	test
60699213	2210	03Oct2019 0:00:00.000	test	Y
60699983	2210	03Oct2019 0:00:00.000	test
60699998	2210	03Oct2019 0:00:00.000	test
60700023	2210	03Oct2019 0:00:00.000	test
61120139	2317	12Nov2019 0:00:00.000	test1	Y

This is what I've come up with so far:

proc sql;
create table abc as
select * 
from sample where flag = 'Y' having dt = max(dt);
quit;

But this gives me the following output:

Form	ID	DT	acc	flag
61120139	2317	12Nov2019 0:00:00.000	test1	Y

But I'm trying to get the following output: I don't need the max value of DT for the entire table, but looking to get the max DT value per acc.

Form	ID	DT	acc	flag
60699213	2210	03Oct2019 0:00:00.000	test	Y
61120139	2317	12Nov2019 0:00:00.000	test1	Y

Appreciate your help and time 🙂

Thank you.

novinosrin · Posted 11-14-2019 03:44 PM

Add a

group by acc;

proc sql;
create table abc as
select * 
from sample where flag = 'Y' 
group by acc
having dt = max(dt);
quit;

View solution in original post

PeterClemmensen · Posted 11-14-2019 03:44 PM

I don't understand this. Acc is constand besides one obs?

The DATA to DATA Step Macro
Blog: SASnrd

AJ_Brien · Posted 11-14-2019 03:48 PM

so diff forms are submitted for a certain 'acc' value on different dates 'dt', hence you see multiple entries with different form values for the same account but on different dates. So I need to look for the most recent form submission for every acc value. hope this provides some context.

novinosrin · Posted 11-14-2019 03:44 PM

Add a

group by acc;

proc sql;
create table abc as
select * 
from sample where flag = 'Y' 
group by acc
having dt = max(dt);
quit;

AJ_Brien · Posted 11-14-2019 03:49 PM

I tried that earlier, that gives an error:

ERROR 22-322: Syntax error, expecting one of the following: ;, !, !!, &, *, **, +, -, /, <, <=, <>, =, >, >=, AND, EQ, EQT, EXCEPT,
GE, GET, GT, GTT, INTERSECT, LE, LET, LT, LTT, NE, NET, NOT, OR, ORDER, OUTER, UNION, ^, ^=, |, ||, ~, ~=.

ERROR 202-322: The option or parameter is not recognized and will be ignored.

novinosrin · Posted 11-14-2019 03:55 PM

The SQL clauses are as follows:

1. SELECT- MUST

2. FROM-MUST

3. WHERE (IF ANY)

4. GROUP BY (IF ANY)

5. HAVING (IF A GROUPED FILTER IS REQUIRED)

6. ORDER BY (IF YOU WANT TO ORDER)

Courtesy: @PGStats aka prodigy genius stats who corrected me 2 years ago. I took those notes seriously after 🙂

Here is the test:



data have;
infile cards truncover;
input Form	ID @21	DT datetime21. acc $	flag $;
format dt datetime20.;
cards;
60663233    2210   	01Oct2019 0:00:00.000	test	Y
60699093	2210	03Oct2019 0:00:00.000	test	 
60699143	2210	03Oct2019 0:00:00.000	test	 
60699213	2210	03Oct2019 0:00:00.000	test	Y
60699983	2210	03Oct2019 0:00:00.000	test	 
60699998	2210	03Oct2019 0:00:00.000	test	 
60700023	2210	03Oct2019 0:00:00.000	test	 
61120139	2317	12Nov2019 0:00:00.000	test1	Y
;

proc sql;
create table abc as
select * 
from have 
where flag = 'Y' 
group by acc
having dt = max(dt);
quit;

proc print noobs;run;

Form	ID	DT	acc	flag
60699213	2210	03OCT2019:00:00:00	test	Y
61120139	2317	12NOV2019:00:00:00	test1	Y

AJ_Brien · Posted 11-14-2019 03:56 PM

this is excellent piece, I will note it down 🙂
thank you!

AJ_Brien · Posted 11-14-2019 03:50 PM

nevermind, I was adding group by after having instead of before. This works. Thank you!!

mkeintz · Posted 11-14-2019 03:50 PM

If your data are already sorted by ACC:

data abc;
  infile datalines missover;
  input Form ID DT :datetime21.3 acc :$5. flag :$1. ;
  format dt datetime24.3;
datalines;
60663233 2210 01Oct2019:0:00:00.000 test Y 
60699093 2210 03Oct2019:0:00:00.000 test   
60699143 2210 03Oct2019:0:00:00.000 test   
60699213 2210 03Oct2019:0:00:00.000 test Y 
60699983 2210 03Oct2019:0:00:00.000 test   
60699998 2210 03Oct2019:0:00:00.000 test   
60700023 2210 03Oct2019:0:00:00.000 test   
61120139 2317 12Nov2019:0:00:00.000 test1 Y 
run;

data want (drop=_:);
  do until (last.acc);
    set abc;
    where flag='Y';
    by acc;
    _maxdt=max(dt,_maxdt);
  end;
  do until (last.acc);
    set abc;
    where flag='Y';
    by acc;
    if dt=_maxdt then put (_all_) (=); *output;
  end;
run;

This program doesn't bother reading in observations unless it has flag='Y'. The where flag="Y"; statements outsource the filtering process to the data engine, so only the qualifyi8ng records are seen by the data step.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hashman · Posted 11-14-2019 04:39 PM

@AJ_Brien:

A non-SQL way you can approach this with depends on whether your input is sorted by ACC or not. If it is sorted:

data have ;                                                                                                                             
  input form id dt :datetime18. acc $ flag $ ;                                                                                          
  format dt datetime. ;                                                                                                                 
  cards ;                                                                                                                               
60663233  2210  01Oct2019:00:00:00  test   Y                                                                                            
60699093  2210  03Oct2019:00:00:00  test   .                                                                                            
60699143  2210  03Oct2019:00:00:00  test   .                                                                                            
60699213  2210  03Oct2019:00:00:00  test   Y                                                                                            
60699983  2210  03Oct2019:00:00:00  test   .                                                                                            
60699998  2210  03Oct2019:00:00:00  test   .                                                                                            
60700023  2210  03Oct2019:00:00:00  test   .                                                                                            
61120139  2317  12Nov2019:00:00:00  test1  Y                                                                                            
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want ;                                                                                                                             
  do until (last.acc) ;                                                                                                                 
    set have (where=(flag="Y")) curobs = _q ;                                                                                           
    by acc ;                                                                                                                            
    if dt < _n_ then continue ;                                                                                                         
    _n_ = dt ;                                                                                                                          
    _s  = _q ;                                                                                                                          
  end ;                                                                                                                                 
  set have point = _s ;                                                                                                                 
run ;

If it is not sorted:

data want ;                                                                                                                             
  dcl hash h () ;                                                                                                                       
  h.definekey ("acc") ;                                                                                                                 
  h.definedata ("_n_", "_s") ;                                                                                                          
  h.definedone () ;                                                                                                                     
  do until (z) ;                                                                                                                        
    set have curobs = _q end = z ;                                                                                                      
    where flag = "Y" ;                                                                                                                  
    if h.find() ne 0 then call missing (_n_, _s) ;                                                                                      
    if dt <= _n_ then continue ;                                                                                                        
    _n_ = dt ;                                                                                                                          
    _s  = _q ;                                                                                                                          
    h.replace() ;                                                                                                                       
  end ;                                                                                                                                 
  dcl hiter hi ("h") ;                                                                                                                  
  do while (hi.next() = 0) ;                                                                                                            
    set have point = _s ;                                                                                                               
    output ;                                                                                                                            
  end ;                                                                                                                                 
run ;

In both cases, if there's a tie between two "Y" records with the same maximal date, the first occurrence is selected. Note that self-merging SQL offered by @novinosrin won't handle this and instead return all the tied records.

Kind regards

PaulD.

novinosrin · Posted 11-14-2019 05:00 PM

Guru @hashman Yes fully agree and point well taken. Yet another exuberance of "attention to detail". Kudos!

@AJ_Brien

My thoughts & assumptions:

1. if your dataset doesn't have any more variables than the ones listed and the values are truly representative of your real, select distinct * should handle the duplicates and will pick only instance.

2. Point 1 essentially means you could have extra records of max dt and relies on this consistency. Strictly speaking, if even FORMNO changes for the same records that tied with the max dates, the select distinct * will of course fail.

3. If what Hashman pointed happens to be true, his solution is bullet proof and robust.

Hashman's diligence should have opened your eyes too to carefully examine the ties and associated variables in a record. Please make sure you keep note of it.

SAS Training: Just a Click Away