subset data for 45-70% of the observations

Solved
Occasional Contributor
Posts: 8

subset data for 45-70% of the observations

Hi wonderful SAS community,

I cant seem to figure out how to subset my dataset for the middle 45-70%. I have 107 variables and about 2000 rows (observations). I need to sort observations by variable X and select 45%-70% percentile of the dataset (that is only observations 900 to 1400). But i do not want to put observations 900 or 1400 as i may need to play with the percentile thus i need a way to figure out how to constrain dataset by %.

Any help will be apprecaited.

Thanks,

Tita

Accepted Solutions
Solution
‎03-21-2016 05:16 PM
Super User
Posts: 5,516

Re: subset data for 45-70% of the observations

Here's a program that may be "close enough".

data want;

set have nobs=_nobs_;

if (0.45 * _nobs_ <= _n_ <= 0.70 * _nobs_);

run;

It assumes your data set is sorted, and takes all observations falling in the right range, using the observation count.  Some notes:

• This is simple logic.  So there may be slightly more complex versions that run a little faster.  Given the size of  your data set, that didn't seem like a priority.
• Handling ties can be an issue.  This program goes by the observation number.  There could be some observations that are included, and some that are excluded, even though they have the same value for your variable.
• It should be easy to edit the cutoffs since the code is so short.

Is this close enough?

All Replies
Super User
Posts: 19,867

Re: subset data for 45-70% of the observations

Use PROC RANK to calcuate the percentiles - group=100 and then you can filter from there?

Occasional Contributor
Posts: 8

Re: subset data for 45-70% of the observations

It looks like I cannot choose customized percent?

I need this (very rough)

Data Data1;

set Data0;

sort by Variable X;

keep observations between 0.45*Data0 and 0.7*Data0;

run;

I looked at the proc rank, but couldn't find how it can do this. Is it possible for you to elaborate a bit more?

Thanks so much

Super User
Posts: 19,867

Re: subset data for 45-70% of the observations

[ Edited ]

Please post sample data, if you'd like sample code

EDIT: or reference a SASHELP data set that can mimic your data.

SASHELP.CARS and CLASS are popular.

Occasional Contributor
Posts: 8

Re: subset data for 45-70% of the observations

Haha, good one.

Here is very very rough sample (original one is too complicated).

libname ECLIB000 "D:\Reverse mort";
proc sql;
create table ECLIB000.sample
(Home num,

avgclosing num,

variabley num,
variablex num,
variablez num);
insert into ECLIB000.sample
values(600000,4342,100,1200,1400)
values(500000,3980,4000,5000,1000)
values(400000,3551,1231,12312,111)
values(300000,3184,555,6666,8888)
values(200000,2801,1231,3453,23432)
values(700000,3980,4000,5000,1000)
values(200000,3551,1231,12312,111)
values(20000,3184,555,6666,8888)
values(100000,2801,1231,3453,23432);
title 'sample';
select *
from ECLIB000.sample;
proc printto; run;

proc sort ECLIB000.sample;
by variablex;
run;

*/here is when i need to say keep only the 40%-80% of the dataset (for all variables)*/;

Super User
Posts: 19,867

Re: subset data for 45-70% of the observations

You're using percent and percentile interchangeably - they're not. Are you looking for a percent of data or data within a specific percentile based on a single variable?

Occasional Contributor
Posts: 8

Re: subset data for 45-70% of the observations

Reeza, you are right. I am looking into percentiles (all sorted by variable x);

obs 1       (delete)

obs 2       (delete)

obs 3       (delete)

obs 4       (keep)

obs 5       (keep)

obs 6       (keep)

obs 7       (keep)

obs 8       (delete)

obs 9       (delete)

obs10      (delete)

new dataset;

obs 4

obs 5

obs 6

obs 7

And it can't be a random sample. I specifically want 40%-70% percentile (if i used the word correctly).

Thanks so much

Super User
Posts: 19,867

Re: subset data for 45-70% of the observations

Use this example - I don't have time to code something at the moment:

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473612.htm

Except use groups=100. Then you'll have your data into percentile groups and can choose the groups you need. If your data has a lot of ties this won't work well...and you should specify how to handle ties anyways.

Super User
Posts: 11,343

Re: subset data for 45-70% of the observations

I'm not sure why you need to sample what appears to be within  a specific range of values of your variable X but another option might be to use Proc Survey select. You can specify a sample rate of any specific percentage: Samprate= 0.45 or Samprate=45 and generate a random sample of the specified rate. Or use SAMPSIZE=n to specify a specific number of records.

Occasional Contributor
Posts: 8

Re: subset data for 45-70% of the observations

I tried but this only does random sampling. I am not sampling but constraining the dataset.
Solution
‎03-21-2016 05:16 PM
Super User
Posts: 5,516

Re: subset data for 45-70% of the observations

Here's a program that may be "close enough".

data want;

set have nobs=_nobs_;

if (0.45 * _nobs_ <= _n_ <= 0.70 * _nobs_);

run;

It assumes your data set is sorted, and takes all observations falling in the right range, using the observation count.  Some notes:

• This is simple logic.  So there may be slightly more complex versions that run a little faster.  Given the size of  your data set, that didn't seem like a priority.
• Handling ties can be an issue.  This program goes by the observation number.  There could be some observations that are included, and some that are excluded, even though they have the same value for your variable.
• It should be easy to edit the cutoffs since the code is so short.

Is this close enough?

Occasional Contributor
Posts: 8

Re: subset data for 45-70% of the observations

Perfect! Thanks so much. This is exactly what i was looking for. You saved me many days of experimenting.
Posts: 1,022

Re: subset data for 45-70% of the observations

A minor tweak (simple, but improves efficiency), or  a major one (more complex but a lot more efficiency in the case of large data sets).

Minor (don't bother reading the last 30% of the file):

data want;

set have nobs=_nobs_;
if  _n_ > 0.45*_nobs_;

if _n_ > 0.70 * _nobs_ then stop;

run;

The above actually reads (and throws away) the first unwanted 45% of the file, but at least stops once the 70% obs has been reached.

Major tweak. But if your file is big, you might want to avoid reading the first 45%, just to throw it away.  You can use the data set parameter FIRSTOBS to tell sas to skip the first 45% (and the OBS parameter to tell it to stop at 70%).  To do that ask dictionary.tables for the number of observations, and use macro expressions to calculate FIRSTOBS and OBS:

proc sql;
select nobs into :nrecs from dictionary.tables where libname='WORK' and memname='HAVE';

quit;

%let fobs=%sysfunc(floor(%sysevalf(0.45*&nrecs)));
%let obs=%sysfunc(ceil(%sysevalf(0.7*&nrecs)));

data want;

set have (firstobs = &fobs   obs=&ob );

run;

regards,

Mark

☑ This topic is solved.