Hi wonderful SAS community,
I cant seem to figure out how to subset my dataset for the middle 45-70%. I have 107 variables and about 2000 rows (observations). I need to sort observations by variable X and select 45%-70% percentile of the dataset (that is only observations 900 to 1400). But i do not want to put observations 900 or 1400 as i may need to play with the percentile thus i need a way to figure out how to constrain dataset by %.
Any help will be apprecaited.
Thanks,
Tita
Here's a program that may be "close enough".
data want;
set have nobs=_nobs_;
if (0.45 * _nobs_ <= _n_ <= 0.70 * _nobs_);
run;
It assumes your data set is sorted, and takes all observations falling in the right range, using the observation count. Some notes:
Is this close enough?
Use PROC RANK to calcuate the percentiles - group=100 and then you can filter from there?
It looks like I cannot choose customized percent?
I need this (very rough)
Data Data1;
set Data0;
sort by Variable X;
keep observations between 0.45*Data0 and 0.7*Data0;
run;
I looked at the proc rank, but couldn't find how it can do this. Is it possible for you to elaborate a bit more?
Thanks so much
Please post sample data, if you'd like sample code 🙂
EDIT: or reference a SASHELP data set that can mimic your data.
SASHELP.CARS and CLASS are popular.
Haha, good one.
Here is very very rough sample (original one is too complicated).
libname ECLIB000 "D:\Reverse mort";
proc sql;
create table ECLIB000.sample
(Home num,
avgclosing num,
variabley num,
variablex num,
variablez num);
insert into ECLIB000.sample
values(600000,4342,100,1200,1400)
values(500000,3980,4000,5000,1000)
values(400000,3551,1231,12312,111)
values(300000,3184,555,6666,8888)
values(200000,2801,1231,3453,23432)
values(700000,3980,4000,5000,1000)
values(200000,3551,1231,12312,111)
values(20000,3184,555,6666,8888)
values(100000,2801,1231,3453,23432);
title 'sample';
select *
from ECLIB000.sample;
proc printto; run;
proc sort ECLIB000.sample;
by variablex;
run;
*/here is when i need to say keep only the 40%-80% of the dataset (for all variables)*/;
You're using percent and percentile interchangeably - they're not. Are you looking for a percent of data or data within a specific percentile based on a single variable?
Reeza, you are right. I am looking into percentiles (all sorted by variable x);
obs 1 (delete)
obs 2 (delete)
obs 3 (delete)
obs 4 (keep)
obs 5 (keep)
obs 6 (keep)
obs 7 (keep)
obs 8 (delete)
obs 9 (delete)
obs10 (delete)
new dataset;
obs 4
obs 5
obs 6
obs 7
And it can't be a random sample. I specifically want 40%-70% percentile (if i used the word correctly).
Thanks so much
Use this example - I don't have time to code something at the moment:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473612.htm
Except use groups=100. Then you'll have your data into percentile groups and can choose the groups you need. If your data has a lot of ties this won't work well...and you should specify how to handle ties anyways.
I'm not sure why you need to sample what appears to be within a specific range of values of your variable X but another option might be to use Proc Survey select. You can specify a sample rate of any specific percentage: Samprate= 0.45 or Samprate=45 and generate a random sample of the specified rate. Or use SAMPSIZE=n to specify a specific number of records.
Here's a program that may be "close enough".
data want;
set have nobs=_nobs_;
if (0.45 * _nobs_ <= _n_ <= 0.70 * _nobs_);
run;
It assumes your data set is sorted, and takes all observations falling in the right range, using the observation count. Some notes:
Is this close enough?
A minor tweak (simple, but improves efficiency), or a major one (more complex but a lot more efficiency in the case of large data sets).
Minor (don't bother reading the last 30% of the file):
data want;
set have nobs=_nobs_;
if _n_ > 0.45*_nobs_;
if _n_ > 0.70 * _nobs_ then stop;
run;
The above actually reads (and throws away) the first unwanted 45% of the file, but at least stops once the 70% obs has been reached.
Major tweak. But if your file is big, you might want to avoid reading the first 45%, just to throw it away. You can use the data set parameter FIRSTOBS to tell sas to skip the first 45% (and the OBS parameter to tell it to stop at 70%). To do that ask dictionary.tables for the number of observations, and use macro expressions to calculate FIRSTOBS and OBS:
proc sql;
select nobs into :nrecs from dictionary.tables where libname='WORK' and memname='HAVE';
quit;
%let fobs=%sysfunc(floor(%sysevalf(0.45*&nrecs)));
%let obs=%sysfunc(ceil(%sysevalf(0.7*&nrecs)));
data want;
set have (firstobs = &fobs obs=&ob );
run;
regards,
Mark
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.