Hi wonderful SAS community,
I cant seem to figure out how to subset my dataset for the middle 45-70%. I have 107 variables and about 2000 rows (observations). I need to sort observations by variable X and select 45%-70% percentile of the dataset (that is only observations 900 to 1400). But i do not want to put observations 900 or 1400 as i may need to play with the percentile thus i need a way to figure out how to constrain dataset by %.
Any help will be apprecaited.
Thanks,
Tita
Here's a program that may be "close enough".
data want;
set have nobs=_nobs_;
if (0.45 * _nobs_ <= _n_ <= 0.70 * _nobs_);
run;
It assumes your data set is sorted, and takes all observations falling in the right range, using the observation count. Some notes:
Is this close enough?
Use PROC RANK to calcuate the percentiles - group=100 and then you can filter from there?
It looks like I cannot choose customized percent?
I need this (very rough)
Data Data1;
set Data0;
sort by Variable X;
keep observations between 0.45*Data0 and 0.7*Data0;
run;
I looked at the proc rank, but couldn't find how it can do this. Is it possible for you to elaborate a bit more?
Thanks so much
Please post sample data, if you'd like sample code 🙂
EDIT: or reference a SASHELP data set that can mimic your data.
SASHELP.CARS and CLASS are popular.
Haha, good one.
Here is very very rough sample (original one is too complicated).
libname ECLIB000 "D:\Reverse mort";
proc sql;
create table ECLIB000.sample
(Home num,
avgclosing num,
variabley num,
variablex num,
variablez num);
insert into ECLIB000.sample
values(600000,4342,100,1200,1400)
values(500000,3980,4000,5000,1000)
values(400000,3551,1231,12312,111)
values(300000,3184,555,6666,8888)
values(200000,2801,1231,3453,23432)
values(700000,3980,4000,5000,1000)
values(200000,3551,1231,12312,111)
values(20000,3184,555,6666,8888)
values(100000,2801,1231,3453,23432);
title 'sample';
select *
from ECLIB000.sample;
proc printto; run;
proc sort ECLIB000.sample;
by variablex;
run;
*/here is when i need to say keep only the 40%-80% of the dataset (for all variables)*/;
You're using percent and percentile interchangeably - they're not. Are you looking for a percent of data or data within a specific percentile based on a single variable?
Reeza, you are right. I am looking into percentiles (all sorted by variable x);
obs 1 (delete)
obs 2 (delete)
obs 3 (delete)
obs 4 (keep)
obs 5 (keep)
obs 6 (keep)
obs 7 (keep)
obs 8 (delete)
obs 9 (delete)
obs10 (delete)
new dataset;
obs 4
obs 5
obs 6
obs 7
And it can't be a random sample. I specifically want 40%-70% percentile (if i used the word correctly).
Thanks so much
Use this example - I don't have time to code something at the moment:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473612.htm
Except use groups=100. Then you'll have your data into percentile groups and can choose the groups you need. If your data has a lot of ties this won't work well...and you should specify how to handle ties anyways.
I'm not sure why you need to sample what appears to be within a specific range of values of your variable X but another option might be to use Proc Survey select. You can specify a sample rate of any specific percentage: Samprate= 0.45 or Samprate=45 and generate a random sample of the specified rate. Or use SAMPSIZE=n to specify a specific number of records.
Here's a program that may be "close enough".
data want;
set have nobs=_nobs_;
if (0.45 * _nobs_ <= _n_ <= 0.70 * _nobs_);
run;
It assumes your data set is sorted, and takes all observations falling in the right range, using the observation count. Some notes:
Is this close enough?
A minor tweak (simple, but improves efficiency), or a major one (more complex but a lot more efficiency in the case of large data sets).
Minor (don't bother reading the last 30% of the file):
data want;
set have nobs=_nobs_;
if _n_ > 0.45*_nobs_;
if _n_ > 0.70 * _nobs_ then stop;
run;
The above actually reads (and throws away) the first unwanted 45% of the file, but at least stops once the 70% obs has been reached.
Major tweak. But if your file is big, you might want to avoid reading the first 45%, just to throw it away. You can use the data set parameter FIRSTOBS to tell sas to skip the first 45% (and the OBS parameter to tell it to stop at 70%). To do that ask dictionary.tables for the number of observations, and use macro expressions to calculate FIRSTOBS and OBS:
proc sql;
select nobs into :nrecs from dictionary.tables where libname='WORK' and memname='HAVE';
quit;
%let fobs=%sysfunc(floor(%sysevalf(0.45*&nrecs)));
%let obs=%sysfunc(ceil(%sysevalf(0.7*&nrecs)));
data want;
set have (firstobs = &fobs obs=&ob );
run;
regards,
Mark
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.