BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Tita
Fluorite | Level 6

Hi wonderful SAS community,

 

I cant seem to figure out how to subset my dataset for the middle 45-70%. I have 107 variables and about 2000 rows (observations). I need to sort observations by variable X and select 45%-70% percentile of the dataset (that is only observations 900 to 1400). But i do not want to put observations 900 or 1400 as i may need to play with the percentile thus i need a way to figure out how to constrain dataset by %. 

 

Any help will be apprecaited. 

 

Thanks,

Tita

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Here's a program that may be "close enough".

 

data want;

set have nobs=_nobs_;

if (0.45 * _nobs_ <= _n_ <= 0.70 * _nobs_);

run;

 

It assumes your data set is sorted, and takes all observations falling in the right range, using the observation count.  Some notes:

 

  • This is simple logic.  So there may be slightly more complex versions that run a little faster.  Given the size of  your data set, that didn't seem like a priority.
  • Handling ties can be an issue.  This program goes by the observation number.  There could be some observations that are included, and some that are excluded, even though they have the same value for your variable.
  • It should be easy to edit the cutoffs since the code is so short.

 

Is this close enough? 

View solution in original post

12 REPLIES 12
Reeza
Super User

Use PROC RANK to calcuate the percentiles - group=100 and then you can filter from there?

 

 

Tita
Fluorite | Level 6

It looks like I cannot choose customized percent? 

 

I need this (very rough)

Data Data1;

set Data0;

sort by Variable X;

keep observations between 0.45*Data0 and 0.7*Data0;

run;

 

I looked at the proc rank, but couldn't find how it can do this. Is it possible for you to elaborate a bit more? 

 

Thanks so much

Reeza
Super User

Please post sample data, if you'd like sample code 🙂

EDIT: or reference a SASHELP data set that can mimic your data.

SASHELP.CARS and CLASS are popular.

Tita
Fluorite | Level 6

Haha, good one. 

 

Here is very very rough sample (original one is too complicated). 

 

libname ECLIB000 "D:\Reverse mort";
  proc sql;
    create table ECLIB000.sample
     (Home num,

      avgclosing num,

       variabley num,
       variablex num,
       variablez num);
     insert into ECLIB000.sample
           values(600000,4342,100,1200,1400)
           values(500000,3980,4000,5000,1000)
           values(400000,3551,1231,12312,111)
           values(300000,3184,555,6666,8888)
           values(200000,2801,1231,3453,23432)
           values(700000,3980,4000,5000,1000)
           values(200000,3551,1231,12312,111)
           values(20000,3184,555,6666,8888)
           values(100000,2801,1231,3453,23432);
     title 'sample';
     select *
          from ECLIB000.sample;
   proc printto; run;

 

 

proc sort ECLIB000.sample;
by variablex;
run;

 

*/here is when i need to say keep only the 40%-80% of the dataset (for all variables)*/;

Reeza
Super User

You're using percent and percentile interchangeably - they're not. Are you looking for a percent of data or data within a specific percentile based on a single variable?

Tita
Fluorite | Level 6

Reeza, you are right. I am looking into percentiles (all sorted by variable x);

obs 1       (delete)

obs 2       (delete)

obs 3       (delete)

obs 4       (keep)

obs 5       (keep)

obs 6       (keep)

obs 7       (keep)

obs 8       (delete)

obs 9       (delete)

obs10      (delete)

 

new dataset;

 

 

obs 4

obs 5

obs 6

obs 7

 

And it can't be a random sample. I specifically want 40%-70% percentile (if i used the word correctly).

 

Thanks so much

Reeza
Super User

Use this example - I don't have time to code something at the moment:

 

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473612.htm

 

Except use groups=100. Then you'll have your data into percentile groups and can choose the groups you need. If your data has a lot of ties this won't work well...and you should specify how to handle ties anyways.

 

 

ballardw
Super User

I'm not sure why you need to sample what appears to be within  a specific range of values of your variable X but another option might be to use Proc Survey select. You can specify a sample rate of any specific percentage: Samprate= 0.45 or Samprate=45 and generate a random sample of the specified rate. Or use SAMPSIZE=n to specify a specific number of records.

Tita
Fluorite | Level 6
I tried but this only does random sampling. I am not sampling but constraining the dataset.
Astounding
PROC Star

Here's a program that may be "close enough".

 

data want;

set have nobs=_nobs_;

if (0.45 * _nobs_ <= _n_ <= 0.70 * _nobs_);

run;

 

It assumes your data set is sorted, and takes all observations falling in the right range, using the observation count.  Some notes:

 

  • This is simple logic.  So there may be slightly more complex versions that run a little faster.  Given the size of  your data set, that didn't seem like a priority.
  • Handling ties can be an issue.  This program goes by the observation number.  There could be some observations that are included, and some that are excluded, even though they have the same value for your variable.
  • It should be easy to edit the cutoffs since the code is so short.

 

Is this close enough? 

Tita
Fluorite | Level 6
Perfect! Thanks so much. This is exactly what i was looking for. You saved me many days of experimenting.
mkeintz
PROC Star

A minor tweak (simple, but improves efficiency), or  a major one (more complex but a lot more efficiency in the case of large data sets).

 

 

Minor (don't bother reading the last 30% of the file): 

 

data want;

    set have nobs=_nobs_;
    if  _n_ > 0.45*_nobs_;

    if _n_ > 0.70 * _nobs_ then stop;

run;

 

The above actually reads (and throws away) the first unwanted 45% of the file, but at least stops once the 70% obs has been reached.

 

Major tweak. But if your file is big, you might want to avoid reading the first 45%, just to throw it away.  You can use the data set parameter FIRSTOBS to tell sas to skip the first 45% (and the OBS parameter to tell it to stop at 70%).  To do that ask dictionary.tables for the number of observations, and use macro expressions to calculate FIRSTOBS and OBS:

 

proc sql;
  select nobs into :nrecs from dictionary.tables where libname='WORK' and memname='HAVE';

quit;

%let fobs=%sysfunc(floor(%sysevalf(0.45*&nrecs)));
%let obs=%sysfunc(ceil(%sysevalf(0.7*&nrecs)));

 

data want;

  set have (firstobs = &fobs   obs=&ob );

run;

 

regards,

Mark

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 12 replies
  • 3212 views
  • 5 likes
  • 5 in conversation