Hello,
I was wondering if there is a function in SAS to select the observation where the value of the variable is closest to value x within a certain group.
I am working with financial data, so for example, say i have company YYYY on 03/23/18. I may have 50+ observations within that group. I need to select three observations out of that group, where the value of variable A is closest to x, y, and z. I am planning to put this code inside a macro so that is is easy to change around values of x, y, and z for various tests.
I would like for the output dataset to be all variables currently in my dataset but just those three observations for each company/date combination.
The only way I can think of is adding columns with the difference between x, y, z and the value of variable A and selecting the minimum. Is this really the most efficient way?
Please let me know if anything is unclear and thank you in advance!
Example input data
Example output data
Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.
Are the variables you need "closest to A" character or numeric? (goes to how to select "closest") If you have ties for "distance" between the values do you have a rule for selecting the lowest, largest or random between them? Do the three selected x, y and z values have to be unique or may one repeat (some value on different observations)?
There are many definitions of distance. The results will depend on which you use.
For Euclidean distance, you can use the EUCLID function in the DATA step. Suppose that you want to find students whose (Age, Height, Weight) values are closest to the targets (13, 62, 100). You can compute the Euclidean distance as follows:
data Closest;
/* target (Age, Height, Weight) = (13, 62, 100) */
set Sashelp.Class;
EuclidDist = Euclid(Age-13, Height-62, Weight-100);
run;
proc sort data=Closest;
by EuclidDist;
run;
proc print data=Closest(obs=3); /* 3 closest students */
run;
For multivariate distances, the more reasonable choice of distance is the Mahalanobis distance, which takes into account correlations between variables. There are several ways to compute the Mahalanobis distance in SAS/STAT software. You can also use the MAHALANOBIS function in SAS/IML, as follows:
proc iml;
use Sashelp.Class;
read all var _NUM_ into X[colname=VarNames];
read all var "Name";
close;
target = {13 62 100};
MD = Mahalanobis(X, target);
print MD X[c=VarNames];
call sortndx(idx, MD);
Y = MD[idx,] || X[idx,];
Y = Y[1:3, ]; /* only 3 closest */
print Y[colname=("Dist" || varNames) rowname=(Name[idx,])];
What if i know the age (For ex 24) and I just want to keep the max height row for every subject.
Please state your question in a new thread of your own. Supply example data in usable form, and show the expected outcome.
Also tell where the conceptual difference is to this: https://communities.sas.com/t5/Statistical-Procedures/Assign-category-to-max-age-below-24Months/m-p/...
You want to rank the 50 observations for the absolute difference of variable A against 3 specified values - i.e. 3 rankings. And you want to keep the records for RANK1=1 (rank of A vs the first specified value), RANK2=1 and RANK3=1:
Here an example using the IBM daily stock records in SASHELP.STOCKS, where the criterion variable is VOLUME:
data ibm_stocks;
set sashelp.stocks;
where stock='IBM';
run;
data need /view=need;
set ibm_stocks;
array dist {*} dist1-dist3;
array values {3} _temporary_ (4700000,11600000,5500000);
do i=1 to 3;
dist{i}=abs(volume-values{i});
end;
run;
proc rank data=need out=want (where=(rank1=1 or rank2=1 or rank3=1));
var dist1 dist2 dist3;
ranks rank1 rank2 rank3;
run;
The data set NEED is a data set VIEW, not a data set FILE. I.e. it is not processed until a subsequent step call for data set NEED. As a result, its content is not written to disk, but rather streamed to the calling step (the PROC RANK in this case).
I have used specified volumes of 4,700,000; 11,600,000; and 5,500,000.
Of course, it's possible that one record could be closest to 2 (or more) target values. But I assume you are unlikely to issue such targets.
I would start with something simple like this, and come back here if it isn't efficient enough
%let x=90;
%let y=110;
%let z=125;
%let A=weight;
proc sql;
select *
from sashelp.class
group by sex
having
abs(&A-&x) = min(abs(&A-&x)) or
abs(&A-&y) = min(abs(&A-&y)) or
abs(&A-&z) = min(abs(&A-&z));
quit;
Name Sex Age Height Weight ------------------------------------------- Judy F 14 64.3 90 Mary F 15 66.5 112 Janet F 15 62.5 112.5 William M 15 66.5 112 Thomas M 11 57.5 85 Robert M 12 64.8 128
Thank you everyone, I really appreciate the help! I tried the simple SQL code and it worked well, I just need to refine my results a little bit more now that I'm seeing the output.
this is the sql code that I ran from PG's answer:
%MACRO value(year=,mon=,a=,t=,d=,dataset=);
proc sql;
create table mylib.&dataset._&year as
select *
from mylib.master_&year
group by security_id, date, mat
having
abs(&mon-&a) = min(abs(&mon-&a)) or
abs(&mon-&o) = min(abs(&mon-&o)) or
abs(&mon-&d) = min(abs(&mon-&d));
quit;
%MEND value;
this is the dataset that I end up with:
data mylib.monvalues_2010;
infile datalines dsd truncover;
input date:DATE9. SECURITY_ID:32. c_vol:32. mat:32. p_vol:32. mon:32.;
datalines4;
04JAN2010,100892,,12,1.544622,0.5255413075
04JAN2010,100892,,12,0.602076,0.7988227875
04JAN2010,100892,0.301887,12,0.285932,1.0090393105
04JAN2010,100892,,47,0.749772,0.5255413075
04JAN2010,100892,0.351766,47,0.420842,0.7988227875
04JAN2010,100892,0.304283,47,0.30528,1.0090393105
04JAN2010,100892,0.383761,138,0.398305,0.7988227875
04JAN2010,100892,,138,0.524082,0.5255413075
04JAN2010,100892,0.322661,138,0.333717,1.0090393105
04JAN2010,100892,0.326735,229,0.336752,1.0090393105
04JAN2010,100892,0.362469,229,0.378213,0.8408660921
04JAN2010,100892,,229,0.494307,0.5255413075
04JAN2010,100892,,383,0.474313,0.5255413075
04JAN2010,100892,0.325869,383,0.338696,1.0510826151
04JAN2010,100892,0.367339,383,0.37962,0.8408660921
04JAN2010,100892,0.369302,747,0.394957,0.8408660921
04JAN2010,100892,0.38087,747,0.432982,0.6306495691
04JAN2010,100892,0.348698,747,0.361181,1.0510826151
05JAN2010,100892,,11,0.635303,0.7967313585
05JAN2010,100892,0.300998,11,0.301864,1.0010214505
05JAN2010,100892,,11,1.680322,0.5107252298
05JAN2010,100892,,46,0.786692,0.5107252298
05JAN2010,100892,,46,0.40056,0.7967313585
05JAN2010,100892,0.305781,46,0.309152,1.0010214505
05JAN2010,100892,0.321738,137,0.322548,1.0010214505
05JAN2010,100892,0.380163,137,0.391764,0.7967313585
05JAN2010,100892,,137,0.544914,0.5107252298
;;;;
In one of my macro scenarios i need to add keep the min observation within the interval. I can use an if-then macro but i'm not sure how to add it to this sql code. For the example values d=0.5, t=0.8, a=1, I need to choose the observations with the values closest to 1 and 0.8 but within [0.8,1], and the value closest to 0.5 but within [0.5, 1].
Additionally, c_vol can't be missing for the observation that is selected as the closest to a (but it definitely can be missing for the other values selected), so I need to select the closest to a with a value for c_vol. This doesn't come up in the first 25 obs above, but it does occur in the dataset.
This what your scenario would translate to
proc sql;
create table mylib.&dataset._&year as
select *
from mylib.master_&year
where &mon between &t and &a
group by security_id, date, mat
having
&mon = min(&mon) or
&mon = max(&mon)
union
select *
from mylib.master_&year
where c_vol is not missing and &mon between &d and &a
group by security_id, date, mat
having &mon = max(&mon);
quit;
(untested)
Suggestion - Test each scenario with a small example data set and experiment.
For a discussion of various definitions of distances and how to compute each in SAS, see "Find the distances between observations and a target value".
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.