- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi wonderful SAS community,
I cant seem to figure out how to subset my dataset for the middle 45-70%. I have 107 variables and about 2000 rows (observations). I need to sort observations by variable X and select 45%-70% percentile of the dataset (that is only observations 900 to 1400). But i do not want to put observations 900 or 1400 as i may need to play with the percentile thus i need a way to figure out how to constrain dataset by %.
Any help will be apprecaited.
Thanks,
Tita
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here's a program that may be "close enough".
data want;
set have nobs=_nobs_;
if (0.45 * _nobs_ <= _n_ <= 0.70 * _nobs_);
run;
It assumes your data set is sorted, and takes all observations falling in the right range, using the observation count. Some notes:
- This is simple logic. So there may be slightly more complex versions that run a little faster. Given the size of your data set, that didn't seem like a priority.
- Handling ties can be an issue. This program goes by the observation number. There could be some observations that are included, and some that are excluded, even though they have the same value for your variable.
- It should be easy to edit the cutoffs since the code is so short.
Is this close enough?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Use PROC RANK to calcuate the percentiles - group=100 and then you can filter from there?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It looks like I cannot choose customized percent?
I need this (very rough)
Data Data1;
set Data0;
sort by Variable X;
keep observations between 0.45*Data0 and 0.7*Data0;
run;
I looked at the proc rank, but couldn't find how it can do this. Is it possible for you to elaborate a bit more?
Thanks so much
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please post sample data, if you'd like sample code 🙂
EDIT: or reference a SASHELP data set that can mimic your data.
SASHELP.CARS and CLASS are popular.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Haha, good one.
Here is very very rough sample (original one is too complicated).
libname ECLIB000 "D:\Reverse mort";
proc sql;
create table ECLIB000.sample
(Home num,
avgclosing num,
variabley num,
variablex num,
variablez num);
insert into ECLIB000.sample
values(600000,4342,100,1200,1400)
values(500000,3980,4000,5000,1000)
values(400000,3551,1231,12312,111)
values(300000,3184,555,6666,8888)
values(200000,2801,1231,3453,23432)
values(700000,3980,4000,5000,1000)
values(200000,3551,1231,12312,111)
values(20000,3184,555,6666,8888)
values(100000,2801,1231,3453,23432);
title 'sample';
select *
from ECLIB000.sample;
proc printto; run;
proc sort ECLIB000.sample;
by variablex;
run;
*/here is when i need to say keep only the 40%-80% of the dataset (for all variables)*/;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You're using percent and percentile interchangeably - they're not. Are you looking for a percent of data or data within a specific percentile based on a single variable?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Reeza, you are right. I am looking into percentiles (all sorted by variable x);
obs 1 (delete)
obs 2 (delete)
obs 3 (delete)
obs 4 (keep)
obs 5 (keep)
obs 6 (keep)
obs 7 (keep)
obs 8 (delete)
obs 9 (delete)
obs10 (delete)
new dataset;
obs 4
obs 5
obs 6
obs 7
And it can't be a random sample. I specifically want 40%-70% percentile (if i used the word correctly).
Thanks so much
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Use this example - I don't have time to code something at the moment:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473612.htm
Except use groups=100. Then you'll have your data into percentile groups and can choose the groups you need. If your data has a lot of ties this won't work well...and you should specify how to handle ties anyways.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm not sure why you need to sample what appears to be within a specific range of values of your variable X but another option might be to use Proc Survey select. You can specify a sample rate of any specific percentage: Samprate= 0.45 or Samprate=45 and generate a random sample of the specified rate. Or use SAMPSIZE=n to specify a specific number of records.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here's a program that may be "close enough".
data want;
set have nobs=_nobs_;
if (0.45 * _nobs_ <= _n_ <= 0.70 * _nobs_);
run;
It assumes your data set is sorted, and takes all observations falling in the right range, using the observation count. Some notes:
- This is simple logic. So there may be slightly more complex versions that run a little faster. Given the size of your data set, that didn't seem like a priority.
- Handling ties can be an issue. This program goes by the observation number. There could be some observations that are included, and some that are excluded, even though they have the same value for your variable.
- It should be easy to edit the cutoffs since the code is so short.
Is this close enough?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
A minor tweak (simple, but improves efficiency), or a major one (more complex but a lot more efficiency in the case of large data sets).
Minor (don't bother reading the last 30% of the file):
data want;
set have nobs=_nobs_;
if _n_ > 0.45*_nobs_;
if _n_ > 0.70 * _nobs_ then stop;
run;
The above actually reads (and throws away) the first unwanted 45% of the file, but at least stops once the 70% obs has been reached.
Major tweak. But if your file is big, you might want to avoid reading the first 45%, just to throw it away. You can use the data set parameter FIRSTOBS to tell sas to skip the first 45% (and the OBS parameter to tell it to stop at 70%). To do that ask dictionary.tables for the number of observations, and use macro expressions to calculate FIRSTOBS and OBS:
proc sql;
select nobs into :nrecs from dictionary.tables where libname='WORK' and memname='HAVE';
quit;
%let fobs=%sysfunc(floor(%sysevalf(0.45*&nrecs)));
%let obs=%sysfunc(ceil(%sysevalf(0.7*&nrecs)));
data want;
set have (firstobs = &fobs obs=&ob );
run;
regards,
Mark
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------