Function for choosing closest observation to a value

Reply
New Contributor
Posts: 3

Function for choosing closest observation to a value

Hello,

 

I was wondering if there is a function in SAS to select the observation where the value of the variable is closest to value x within a certain group.

 

I am working with financial data, so for example, say i have company YYYY on 03/23/18. I may have 50+ observations within that group. I need to select three observations out of that group, where the value of variable A is closest to x, y, and z. I am planning to put this code inside a macro so that is is easy to change around values of x, y, and z for various tests. 

 

I would like for the output dataset to be all variables currently in my dataset but just those three observations for each company/date combination.

 

The only way I can think of is adding columns with the difference between x, y, z and the value of variable A and selecting the minimum. Is this really the most efficient way?

 

Please let me know if anything is unclear and thank you in advance!

Super User
Posts: 13,552

Re: Function for choosing closest observation to a value

Example input data

Example output data

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

 

Are the variables you need "closest to A" character or numeric? (goes to how to select "closest") If you have ties for "distance" between the values do you have a rule for selecting the lowest, largest or random between them? Do the three selected x, y and z values have to be unique or may one repeat (some value on different observations)?

SAS Super FREQ
Posts: 4,242

Re: Function for choosing closest observation to a value

There are many definitions of distance. The results will depend on which you use.

 

For Euclidean distance, you can use the EUCLID function in the DATA step. Suppose that you want to find students whose (Age, Height, Weight) values are closest to the targets (13, 62, 100). You can compute the Euclidean distance as follows:

 

data Closest;
/* target (Age, Height, Weight) = (13, 62, 100) */
set Sashelp.Class;
EuclidDist = Euclid(Age-13, Height-62, Weight-100);
run;

proc sort data=Closest;
by EuclidDist;
run;

proc print data=Closest(obs=3); /* 3 closest students */
run;

For multivariate distances, the more reasonable choice of distance is the Mahalanobis distance, which takes into account correlations between variables. There are several ways to compute the Mahalanobis distance in SAS/STAT software. You can also use the MAHALANOBIS function in SAS/IML, as follows:

 

proc iml;
use Sashelp.Class;
read all var _NUM_ into X[colname=VarNames];
read all var "Name";
close;

target = {13 62 100};
MD = Mahalanobis(X, target);
print MD X[c=VarNames];

call sortndx(idx, MD);
Y = MD[idx,] || X[idx,];
Y = Y[1:3, ]; /* only 3 closest */
print Y[colname=("Dist" || varNames) rowname=(Name[idx,])];
Trusted Advisor
Posts: 1,339

Re: Function for choosing closest observation to a value

You want to rank the 50 observations for the absolute difference of variable A against 3 specified values - i.e. 3 rankings.  And you want to keep the records for RANK1=1  (rank of A vs the first specified value), RANK2=1 and RANK3=1:

 

Here an example using the IBM daily stock records in SASHELP.STOCKS, where the criterion variable is VOLUME:

 

data ibm_stocks;
  set sashelp.stocks;
  where stock='IBM';
run;

data need /view=need;
  set ibm_stocks;
  array dist {*} dist1-dist3;
  array values {3} _temporary_ (4700000,11600000,5500000);

  do i=1 to 3;
    dist{i}=abs(volume-values{i});
  end;
run;

proc rank data=need out=want (where=(rank1=1 or rank2=1 or rank3=1));
  var dist1 dist2 dist3;
  ranks rank1 rank2 rank3;
run;

The data set NEED is a data set VIEW, not a data set FILE.  I.e. it is not processed until a subsequent step call for data set NEED.  As a result, its content is not written to disk, but rather streamed to the calling step  (the PROC RANK in this case).

 

 

I have used specified volumes of 4,700,000;   11,600,000; and 5,500,000.

 

Of course, it's possible that one record could be closest to 2 (or more) target values.  But I assume you are unlikely to issue such targets.

Esteemed Advisor
Posts: 5,530

Re: Function for choosing closest observation to a value

I would start with something simple like this, and come back here if it isn't efficient enough

 

%let x=90;
%let y=110;
%let z=125;

%let A=weight;

proc sql;
select *
from sashelp.class
group by sex
having 
    abs(&A-&x) = min(abs(&A-&x)) or 
    abs(&A-&y) = min(abs(&A-&y)) or 
    abs(&A-&z) = min(abs(&A-&z));
quit;
                    Name      Sex       Age    Height    Weight
                    -------------------------------------------
                    Judy      F          14      64.3        90
                    Mary      F          15      66.5       112
                    Janet     F          15      62.5     112.5
                    William   M          15      66.5       112
                    Thomas    M          11      57.5        85
                    Robert    M          12      64.8       128
PG
New Contributor
Posts: 3

Re: Function for choosing closest observation to a value

Thank you everyone, I really appreciate the help! I tried the simple SQL code and it worked well, I just need to refine my results a little bit more now that I'm seeing the output. 

this is the sql code that I ran from PG's answer: 

 

%MACRO value(year=,mon=,a=,t=,d=,dataset=);
proc sql;
create table mylib.&dataset._&year as
select *
from mylib.master_&year
group by security_id, date, mat
having
abs(&mon-&a) = min(abs(&mon-&a)) or
abs(&mon-&o) = min(abs(&mon-&o)) or
abs(&mon-&d) = min(abs(&mon-&d));
quit;
%MEND value;

 

this is the dataset that I end up with:

 

data mylib.monvalues_2010;
infile datalines dsd truncover;
input dateSmiley Very HappyATE9. SECURITY_ID:32. c_vol:32. mat:32. p_vol:32. mon:32.;
datalines4;
04JAN2010,100892,,12,1.544622,0.5255413075
04JAN2010,100892,,12,0.602076,0.7988227875
04JAN2010,100892,0.301887,12,0.285932,1.0090393105
04JAN2010,100892,,47,0.749772,0.5255413075
04JAN2010,100892,0.351766,47,0.420842,0.7988227875
04JAN2010,100892,0.304283,47,0.30528,1.0090393105
04JAN2010,100892,0.383761,138,0.398305,0.7988227875
04JAN2010,100892,,138,0.524082,0.5255413075
04JAN2010,100892,0.322661,138,0.333717,1.0090393105
04JAN2010,100892,0.326735,229,0.336752,1.0090393105
04JAN2010,100892,0.362469,229,0.378213,0.8408660921
04JAN2010,100892,,229,0.494307,0.5255413075
04JAN2010,100892,,383,0.474313,0.5255413075
04JAN2010,100892,0.325869,383,0.338696,1.0510826151
04JAN2010,100892,0.367339,383,0.37962,0.8408660921
04JAN2010,100892,0.369302,747,0.394957,0.8408660921
04JAN2010,100892,0.38087,747,0.432982,0.6306495691
04JAN2010,100892,0.348698,747,0.361181,1.0510826151
05JAN2010,100892,,11,0.635303,0.7967313585
05JAN2010,100892,0.300998,11,0.301864,1.0010214505
05JAN2010,100892,,11,1.680322,0.5107252298
05JAN2010,100892,,46,0.786692,0.5107252298
05JAN2010,100892,,46,0.40056,0.7967313585
05JAN2010,100892,0.305781,46,0.309152,1.0010214505
05JAN2010,100892,0.321738,137,0.322548,1.0010214505
05JAN2010,100892,0.380163,137,0.391764,0.7967313585
05JAN2010,100892,,137,0.544914,0.5107252298
;;;;

 

In one of my macro scenarios i need to add keep the min observation within the interval. I can use an if-then macro but i'm not sure how to add it to this sql code. For the example values d=0.5, t=0.8, a=1, I need to choose the observations with the values closest to 1 and 0.8 but within [0.8,1], and the value closest to 0.5 but within [0.5, 1]. 

 

Additionally, c_vol can't be missing for the observation that is selected as the closest to a (but it definitely can be missing for the other values selected), so I need to select the closest to a with a value for c_vol. This doesn't come up in the first 25 obs above, but it does occur in the dataset.

 

 

Esteemed Advisor
Posts: 5,530

Re: Function for choosing closest observation to a value

This what your scenario would translate to

 

proc sql;
create table mylib.&dataset._&year as

select *
from mylib.master_&year
where &mon between &t and &a
group by security_id, date, mat
having 
    &mon = min(&mon) or 
    &mon = max(&mon) 

union

select *
from mylib.master_&year
where c_vol is not missing and &mon between &d and &a
group by security_id, date, mat
having &mon = max(&mon);
quit;

 

(untested)

  

Suggestion - Test each scenario with a small example data set and experiment.

PG
SAS Super FREQ
Posts: 4,242

Re: Function for choosing closest observation to a value

For a discussion of various definitions of distances and how to compute each in SAS, see "Find the distances between observations and a target value".

Ask a Question
Discussion stats
  • 7 replies
  • 198 views
  • 3 likes
  • 5 in conversation