BookmarkSubscribeRSS Feed
ag8711
Calcite | Level 5

Hello,

 

I was wondering if there is a function in SAS to select the observation where the value of the variable is closest to value x within a certain group.

 

I am working with financial data, so for example, say i have company YYYY on 03/23/18. I may have 50+ observations within that group. I need to select three observations out of that group, where the value of variable A is closest to x, y, and z. I am planning to put this code inside a macro so that is is easy to change around values of x, y, and z for various tests. 

 

I would like for the output dataset to be all variables currently in my dataset but just those three observations for each company/date combination.

 

The only way I can think of is adding columns with the difference between x, y, z and the value of variable A and selecting the minimum. Is this really the most efficient way?

 

Please let me know if anything is unclear and thank you in advance!

9 REPLIES 9
ballardw
Super User

Example input data

Example output data

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

 

Are the variables you need "closest to A" character or numeric? (goes to how to select "closest") If you have ties for "distance" between the values do you have a rule for selecting the lowest, largest or random between them? Do the three selected x, y and z values have to be unique or may one repeat (some value on different observations)?

Rick_SAS
SAS Super FREQ

There are many definitions of distance. The results will depend on which you use.

 

For Euclidean distance, you can use the EUCLID function in the DATA step. Suppose that you want to find students whose (Age, Height, Weight) values are closest to the targets (13, 62, 100). You can compute the Euclidean distance as follows:

 

data Closest;
/* target (Age, Height, Weight) = (13, 62, 100) */
set Sashelp.Class;
EuclidDist = Euclid(Age-13, Height-62, Weight-100);
run;

proc sort data=Closest;
by EuclidDist;
run;

proc print data=Closest(obs=3); /* 3 closest students */
run;

For multivariate distances, the more reasonable choice of distance is the Mahalanobis distance, which takes into account correlations between variables. There are several ways to compute the Mahalanobis distance in SAS/STAT software. You can also use the MAHALANOBIS function in SAS/IML, as follows:

 

proc iml;
use Sashelp.Class;
read all var _NUM_ into X[colname=VarNames];
read all var "Name";
close;

target = {13 62 100};
MD = Mahalanobis(X, target);
print MD X[c=VarNames];

call sortndx(idx, MD);
Y = MD[idx,] || X[idx,];
Y = Y[1:3, ]; /* only 3 closest */
print Y[colname=("Dist" || varNames) rowname=(Name[idx,])];
bharath86
Obsidian | Level 7

@Rick_SAS 

 

What if i know the age (For ex 24) and I just want to keep the max height row for every subject.  

mkeintz
PROC Star

You want to rank the 50 observations for the absolute difference of variable A against 3 specified values - i.e. 3 rankings.  And you want to keep the records for RANK1=1  (rank of A vs the first specified value), RANK2=1 and RANK3=1:

 

Here an example using the IBM daily stock records in SASHELP.STOCKS, where the criterion variable is VOLUME:

 

data ibm_stocks;
  set sashelp.stocks;
  where stock='IBM';
run;

data need /view=need;
  set ibm_stocks;
  array dist {*} dist1-dist3;
  array values {3} _temporary_ (4700000,11600000,5500000);

  do i=1 to 3;
    dist{i}=abs(volume-values{i});
  end;
run;

proc rank data=need out=want (where=(rank1=1 or rank2=1 or rank3=1));
  var dist1 dist2 dist3;
  ranks rank1 rank2 rank3;
run;

The data set NEED is a data set VIEW, not a data set FILE.  I.e. it is not processed until a subsequent step call for data set NEED.  As a result, its content is not written to disk, but rather streamed to the calling step  (the PROC RANK in this case).

 

 

I have used specified volumes of 4,700,000;   11,600,000; and 5,500,000.

 

Of course, it's possible that one record could be closest to 2 (or more) target values.  But I assume you are unlikely to issue such targets.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PGStats
Opal | Level 21

I would start with something simple like this, and come back here if it isn't efficient enough

 

%let x=90;
%let y=110;
%let z=125;

%let A=weight;

proc sql;
select *
from sashelp.class
group by sex
having 
    abs(&A-&x) = min(abs(&A-&x)) or 
    abs(&A-&y) = min(abs(&A-&y)) or 
    abs(&A-&z) = min(abs(&A-&z));
quit;
                    Name      Sex       Age    Height    Weight
                    -------------------------------------------
                    Judy      F          14      64.3        90
                    Mary      F          15      66.5       112
                    Janet     F          15      62.5     112.5
                    William   M          15      66.5       112
                    Thomas    M          11      57.5        85
                    Robert    M          12      64.8       128
PG
ag8711
Calcite | Level 5

Thank you everyone, I really appreciate the help! I tried the simple SQL code and it worked well, I just need to refine my results a little bit more now that I'm seeing the output. 

this is the sql code that I ran from PG's answer: 

 

%MACRO value(year=,mon=,a=,t=,d=,dataset=);
proc sql;
create table mylib.&dataset._&year as
select *
from mylib.master_&year
group by security_id, date, mat
having
abs(&mon-&a) = min(abs(&mon-&a)) or
abs(&mon-&o) = min(abs(&mon-&o)) or
abs(&mon-&d) = min(abs(&mon-&d));
quit;
%MEND value;

 

this is the dataset that I end up with:

 

data mylib.monvalues_2010;
infile datalines dsd truncover;
input date:DATE9. SECURITY_ID:32. c_vol:32. mat:32. p_vol:32. mon:32.;
datalines4;
04JAN2010,100892,,12,1.544622,0.5255413075
04JAN2010,100892,,12,0.602076,0.7988227875
04JAN2010,100892,0.301887,12,0.285932,1.0090393105
04JAN2010,100892,,47,0.749772,0.5255413075
04JAN2010,100892,0.351766,47,0.420842,0.7988227875
04JAN2010,100892,0.304283,47,0.30528,1.0090393105
04JAN2010,100892,0.383761,138,0.398305,0.7988227875
04JAN2010,100892,,138,0.524082,0.5255413075
04JAN2010,100892,0.322661,138,0.333717,1.0090393105
04JAN2010,100892,0.326735,229,0.336752,1.0090393105
04JAN2010,100892,0.362469,229,0.378213,0.8408660921
04JAN2010,100892,,229,0.494307,0.5255413075
04JAN2010,100892,,383,0.474313,0.5255413075
04JAN2010,100892,0.325869,383,0.338696,1.0510826151
04JAN2010,100892,0.367339,383,0.37962,0.8408660921
04JAN2010,100892,0.369302,747,0.394957,0.8408660921
04JAN2010,100892,0.38087,747,0.432982,0.6306495691
04JAN2010,100892,0.348698,747,0.361181,1.0510826151
05JAN2010,100892,,11,0.635303,0.7967313585
05JAN2010,100892,0.300998,11,0.301864,1.0010214505
05JAN2010,100892,,11,1.680322,0.5107252298
05JAN2010,100892,,46,0.786692,0.5107252298
05JAN2010,100892,,46,0.40056,0.7967313585
05JAN2010,100892,0.305781,46,0.309152,1.0010214505
05JAN2010,100892,0.321738,137,0.322548,1.0010214505
05JAN2010,100892,0.380163,137,0.391764,0.7967313585
05JAN2010,100892,,137,0.544914,0.5107252298
;;;;

 

In one of my macro scenarios i need to add keep the min observation within the interval. I can use an if-then macro but i'm not sure how to add it to this sql code. For the example values d=0.5, t=0.8, a=1, I need to choose the observations with the values closest to 1 and 0.8 but within [0.8,1], and the value closest to 0.5 but within [0.5, 1]. 

 

Additionally, c_vol can't be missing for the observation that is selected as the closest to a (but it definitely can be missing for the other values selected), so I need to select the closest to a with a value for c_vol. This doesn't come up in the first 25 obs above, but it does occur in the dataset.

 

 

PGStats
Opal | Level 21

This what your scenario would translate to

 

proc sql;
create table mylib.&dataset._&year as

select *
from mylib.master_&year
where &mon between &t and &a
group by security_id, date, mat
having 
    &mon = min(&mon) or 
    &mon = max(&mon) 

union

select *
from mylib.master_&year
where c_vol is not missing and &mon between &d and &a
group by security_id, date, mat
having &mon = max(&mon);
quit;

 

(untested)

  

Suggestion - Test each scenario with a small example data set and experiment.

PG
Rick_SAS
SAS Super FREQ

For a discussion of various definitions of distances and how to compute each in SAS, see "Find the distances between observations and a target value".

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 4081 views
  • 4 likes
  • 7 in conversation