Solved: Re: subset data for 45-70% of the observations

Tita · Posted 03-21-2016 04:25 PM

Hi wonderful SAS community,

I cant seem to figure out how to subset my dataset for the middle 45-70%. I have 107 variables and about 2000 rows (observations). I need to sort observations by variable X and select 45%-70% percentile of the dataset (that is only observations 900 to 1400). But i do not want to put observations 900 or 1400 as i may need to play with the percentile thus i need a way to figure out how to constrain dataset by %.

Any help will be apprecaited.

Thanks,

Tita

Astounding · Posted 03-21-2016 05:10 PM

Here's a program that may be "close enough".

data want;

set have nobs=_nobs_;

if (0.45 * _nobs_ <= _n_ <= 0.70 * _nobs_);

run;

It assumes your data set is sorted, and takes all observations falling in the right range, using the observation count. Some notes:

This is simple logic. So there may be slightly more complex versions that run a little faster. Given the size of your data set, that didn't seem like a priority.
Handling ties can be an issue. This program goes by the observation number. There could be some observations that are included, and some that are excluded, even though they have the same value for your variable.
It should be easy to edit the cutoffs since the code is so short.

Is this close enough?

View solution in original post

Reeza · Posted 03-21-2016 04:31 PM

Use PROC RANK to calcuate the percentiles - group=100 and then you can filter from there?

Tita · Posted 03-21-2016 04:38 PM

It looks like I cannot choose customized percent?

I need this (very rough)

Data Data1;

set Data0;

sort by Variable X;

keep observations between 0.45*Data0 and 0.7*Data0;

run;

I looked at the proc rank, but couldn't find how it can do this. Is it possible for you to elaborate a bit more?

Thanks so much

Reeza · Posted 03-21-2016 04:41 PM

Please post sample data, if you'd like sample code 🙂

EDIT: or reference a SASHELP data set that can mimic your data.

SASHELP.CARS and CLASS are popular.

Tita · Posted 03-21-2016 04:55 PM

Haha, good one.

Here is very very rough sample (original one is too complicated).

libname ECLIB000 "D:\Reverse mort";
proc sql;
create table ECLIB000.sample
(Home num,

avgclosing num,

variabley num,
variablex num,
variablez num);
insert into ECLIB000.sample
values(600000,4342,100,1200,1400)
values(500000,3980,4000,5000,1000)
values(400000,3551,1231,12312,111)
values(300000,3184,555,6666,8888)
values(200000,2801,1231,3453,23432)
values(700000,3980,4000,5000,1000)
values(200000,3551,1231,12312,111)
values(20000,3184,555,6666,8888)
values(100000,2801,1231,3453,23432);
title 'sample';
select *
from ECLIB000.sample;
proc printto; run;

proc sort ECLIB000.sample;
by variablex;
run;

*/here is when i need to say keep only the 40%-80% of the dataset (for all variables)*/;

Reeza · Posted 03-21-2016 04:57 PM

You're using percent and percentile interchangeably - they're not. Are you looking for a percent of data or data within a specific percentile based on a single variable?

Tita · Posted 03-21-2016 05:05 PM

Reeza, you are right. I am looking into percentiles (all sorted by variable x);

obs 1 (delete)

obs 2 (delete)

obs 3 (delete)

obs 4 (keep)

obs 5 (keep)

obs 6 (keep)

obs 7 (keep)

obs 8 (delete)

obs 9 (delete)

obs10 (delete)

new dataset;

obs 4

obs 5

obs 6

obs 7

And it can't be a random sample. I specifically want 40%-70% percentile (if i used the word correctly).

Thanks so much

Reeza · Posted 03-21-2016 05:13 PM

Use this example - I don't have time to code something at the moment:

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473612.htm

Except use groups=100. Then you'll have your data into percentile groups and can choose the groups you need. If your data has a lot of ties this won't work well...and you should specify how to handle ties anyways.

ballardw · Posted 03-21-2016 04:52 PM

I'm not sure why you need to sample what appears to be within a specific range of values of your variable X but another option might be to use Proc Survey select. You can specify a sample rate of any specific percentage: Samprate= 0.45 or Samprate=45 and generate a random sample of the specified rate. Or use SAMPSIZE=n to specify a specific number of records.

Tita · Posted 03-21-2016 05:06 PM

I tried but this only does random sampling. I am not sampling but constraining the dataset.

Astounding · Posted 03-21-2016 05:10 PM

Here's a program that may be "close enough".

data want;

set have nobs=_nobs_;

if (0.45 * _nobs_ <= _n_ <= 0.70 * _nobs_);

run;

It assumes your data set is sorted, and takes all observations falling in the right range, using the observation count. Some notes:

This is simple logic. So there may be slightly more complex versions that run a little faster. Given the size of your data set, that didn't seem like a priority.
Handling ties can be an issue. This program goes by the observation number. There could be some observations that are included, and some that are excluded, even though they have the same value for your variable.
It should be easy to edit the cutoffs since the code is so short.

Is this close enough?

Tita · Posted 03-21-2016 05:17 PM

Perfect! Thanks so much. This is exactly what i was looking for. You saved me many days of experimenting.

mkeintz · Posted 03-28-2016 04:41 PM

A minor tweak (simple, but improves efficiency), or a major one (more complex but a lot more efficiency in the case of large data sets).

Minor (don't bother reading the last 30% of the file):

data want;

set have nobs=_nobs_;
if _n_ > 0.45*_nobs_;

if _n_ > 0.70 * _nobs_ then stop;

run;

The above actually reads (and throws away) the first unwanted 45% of the file, but at least stops once the 70% obs has been reached.

Major tweak. But if your file is big, you might want to avoid reading the first 45%, just to throw it away. You can use the data set parameter FIRSTOBS to tell sas to skip the first 45% (and the OBS parameter to tell it to stop at 70%). To do that ask dictionary.tables for the number of observations, and use macro expressions to calculate FIRSTOBS and OBS:

proc sql;
select nobs into :nrecs from dictionary.tables where libname='WORK' and memname='HAVE';

quit;

%let fobs=%sysfunc(floor(%sysevalf(0.45*&nrecs)));
%let obs=%sysfunc(ceil(%sysevalf(0.7*&nrecs)));

data want;

set have (firstobs = &fobs obs=&ob );

run;

regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ready to join fellow brilliant minds for the SAS Hackathon?