Greetings,
I have a data set with 904 observations.
All of the observations have a distance field, that shows the distance from a particular point. (The range of the distances is 23 miles to 3,807 miles)
I am attempting to run this data set through a macro do loop, that increments the mileage requirements by 5 miles, in order to find the approximately 100 nearest values.
If there are 99 values within a certain mileage range, and 103 within the previous range + 5 - then I want to keep the 103 observations.
The code I am currently trying is this:
%macro FILTER_OBSERVATIONS(howmany);
%LET SAMPLE_SIZE=0;
%LET MILES=0;
%DO %UNTIL (&SAMPLE_SIZE >= &howmany);
DATA CIP_DATA;
RETAIN DISTANCE /*USE*/ ;
SET Ps_final_cip4;
if cip4='52.02';
RUN;
%IF 0 < DISTANCE <= &MILES %THEN %DO;
%LET SAMPLE_SIZE=%EVAL(&SAMPLE_SIZE + 1);
*USE=1;
%END;
%LET MILES=%EVAL(&MILES + 5);
%END;
%mend FILTER_OBSERVATIONS;
I have a feeling that there is A LOT wrong with the above code, so please, let me know how I should go about doing this.
You should be able do it with a simple data step. If you want to make the target number of observations or granularity variable then replace the hard coded 100 and 5 to macro variable references.
data want ;
if _n_=1 then do;
if 100 < nobs then do;
set have (keep=distance rename=(distance=cutoff) firstobs=100);
put cutoff=;
cutoff = 5*ceil(cutoff/5);
end;
else cutoff = constant('big');
put cutoff=;
retain cutoff ;
drop cutoff ;
end;
set have nobs=nobs ;
by distance;
if distance <= cutoff;
run;
I'm curious - why not just sort the data by "distance" and output the first 100 observations?
Well, because the obs directly after 100 could be close enough to also be considered.
Say for example that obs 100 is 403.25 miles away, obs 101 is 403.26, obs 102 is 404.10 obs 103 is 406.07. We are find dropping obs 103 and higher, but would want to keep 101 and 102. Starting at 0 miles and incrementing by 5 miles until we get a number >= 100.
We want at least 100 observations, but want to stop once at least 100 is reached - in increments of 5 miles.
You should be able do it with a simple data step. If you want to make the target number of observations or granularity variable then replace the hard coded 100 and 5 to macro variable references.
data want ;
if _n_=1 then do;
if 100 < nobs then do;
set have (keep=distance rename=(distance=cutoff) firstobs=100);
put cutoff=;
cutoff = 5*ceil(cutoff/5);
end;
else cutoff = constant('big');
put cutoff=;
retain cutoff ;
drop cutoff ;
end;
set have nobs=nobs ;
by distance;
if distance <= cutoff;
run;
Tom,
Can you please explain what the constant('big') is doing?
Thank you for the answer, it appears to do exactly what I want.
Greg
Look up the CONSTANT() function in the help pages. Or read this blog entry. Constant function in SAS | Statistical Programming
If there are less than 100 observations in the source dataset the program uses the CONSTANT() function to set CUTOFF to the maximum possible number so that the IF statement does not eliminate any observations.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
