BookmarkSubscribeRSS Feed
deleted_user
Not applicable
Hi Everyone,

Is there a way to select on sorted data by type using a specific distribution? this my sound confusing b/c i dont know the exact terminology but here is my problem below:

I have 10k records and I need to select the top 1k records (based on balance). However, each record also has a Type, and to be fair i want to select the same percentage of records from each type. For example, in my data 50% of the records are Type A, 30% are type B, and 20% are type C. So I need to select the top 500 records of type A, top 300 records of type B and top 200 records of type C. All together this equals 1000. In reality my dataset has 6 types, my total dataset is 16M records and i have to select approx. 4.2M from this list.

i know i can calculate these percentages by hand, then write custom select statements in SAS, but I was thinking that this is a common enough problem that there must be a more automated way of doing this. Does anyone have any suggestions?


Dataset:
Id Type Bal
1 A 1000
2 C 90
3 A 1
4 B 203
5 B 980
6 A 89
... ... ...


Thanks in Advance!
6 REPLIES 6
Ksharp
Super User


data temp;



 do type='A','B','C' ;



  do balance=1 to
5000;



   id+1;



   output;



  end;



 end;



run;



 



%let nobs=1000;*the number of top obs you want;



%let p_A=.5;*the percent of A in the top obs;



%let p_B=.3;



%let p_C=.2;



 



%let obs_A=%sysevalf(&nobs*&p_A,integer);



%let obs_B=%sysevalf(&nobs*&p_B,integer);



%let obs_C=%sysevalf(&nobs*&p_C,integer);



%put _user_;



 



 



proc sort data=temp nodupkey;



 by type descending balance;



run;



data want;



 set temp;



 by type;



 if first.type then count=0;



 count+1;



 select;



  when(type='A' and count le &obs_A ) output;



  when(type='B' and count le &obs_B ) output;



  when(type='C' and count le &obs_C ) output;



  otherwise;



 end;



run;






[pre]



[/pre]



Ksharp
Peter_C
Rhodochrosite | Level 12
is this selection process what procedute SURVEYSELECT is for?
(will need SAS/STAT licensed)
deleted_user
Not applicable
Ksharp - although that code will probably work, i was looking for an approach that will calculate the percentages for me.

Peter C - I believe surveyselect only picks random samples from a dataset. It looks like the data can be stratefied (which would take care of the Types) but i want top records not random ones. Also, i need the percentage of each type to be calculated to match the distribution.
art297
Opal | Level 21
Elena,

Couldn't you just precede Ksharp's code with a proc freq and data _null_ step to put the values into the macro variables? For example:

proc freq data=have noprint;
tables type/out=types;
run;
data _null_;
set types;
call symput('p_'||type,percent);
run;

HTH,
Art
-----------
> Ksharp - although that code will probably work, i was
> looking for an approach that will calculate the
> percentages for me.
>
> Peter C - I believe surveyselect only picks random
> samples from a dataset. It looks like the data can be
> stratefied (which would take care of the Types) but i
> want top records not random ones. Also, i need the
> percentage of each type to be calculated to match the
> distribution.
Ksharp
Super User
It is my fault. I do not understand totally.
Under Art.T's suggestion.


[pre]



[/pre]


data temp;



 do type='A','B','C' ;



  do balance=1 to 5000;



   id+1;



   output;



  end;



 end;



run;



 



 



%macro top(obs=1000,dsn=temp);



proc freq
data=&dsn;



 tables type /
out=percent;



run;



data _null_;



 set percent
end=last;



 sum+count;



 percent=percent*.01;



 call symput(cats('type',_n_),type);



 call symput(cats('per',_n_),percent);



 if last
then  call symput(
'count',_n_);



run;



 



%do i=1 %to &count;



  %let obs&i=%sysevalf(&obs*&&per&i,integer);



%end;



%put _user_;



 



proc sort
data=&dsn nodupkey;



 by type
descending balance;



run;



 



data want(drop=
count);



 set &dsn;



 by type;



 if first.type then count=0;



 count+1;



 select;



  %do
j=
1 %to &count;



    when(type="&&type&j" and count le &&obs&j ) output;



  %end;



  otherwise;



 end;



 



run;



%mend;



 



%top(obs=1000,dsn=temp)



 



 



 



 



 



Ksharp



Ksharp
Super User
elena- You do not tell us how to calculate these percent ,it base on freq for each A B C?
Or pre - probability.
I think these percent is setted by yourself.


Ksharp

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1301 views
  • 0 likes
  • 4 in conversation