DATA Step, Macro, Functions and more

How to select top records using a weighted approach

Reply
N/A
Posts: 0

How to select top records using a weighted approach

Hi Everyone,

Is there a way to select on sorted data by type using a specific distribution? this my sound confusing b/c i dont know the exact terminology but here is my problem below:

I have 10k records and I need to select the top 1k records (based on balance). However, each record also has a Type, and to be fair i want to select the same percentage of records from each type. For example, in my data 50% of the records are Type A, 30% are type B, and 20% are type C. So I need to select the top 500 records of type A, top 300 records of type B and top 200 records of type C. All together this equals 1000. In reality my dataset has 6 types, my total dataset is 16M records and i have to select approx. 4.2M from this list.

i know i can calculate these percentages by hand, then write custom select statements in SAS, but I was thinking that this is a common enough problem that there must be a more automated way of doing this. Does anyone have any suggestions?


Dataset:
Id Type Bal
1 A 1000
2 C 90
3 A 1
4 B 203
5 B 980
6 A 89
... ... ...


Thanks in Advance!
Super User
Posts: 10,023

Re: How to select top records using a weighted approach

Posted in reply to deleted_user


data temp;



 do type='A','B','C' ;



  do balance=1 to
5000;



   id+1;



   output;



  end;



 end;



run;



 



%let nobs=1000;*the number of top obs you want;



%let p_A=.5;*the percent of A in the top obs;



%let p_B=.3;



%let p_C=.2;



 



%let obs_A=%sysevalf(&nobs*&p_A,integer);



%let obs_B=%sysevalf(&nobs*&p_B,integer);



%let obs_C=%sysevalf(&nobs*&p_C,integer);



%put _user_;



 



 



proc sort data=temp nodupkey;



 by type descending balance;



run;



data want;



 set temp;



 by type;



 if first.type then count=0;



 count+1;



 select;



  when(type='A' and count le &obs_A ) output;



  when(type='B' and count le &obs_B ) output;



  when(type='C' and count le &obs_C ) output;



  otherwise;



 end;



run;






[pre]



[/pre]



Ksharp
Valued Guide
Posts: 2,177

Re: How to select top records using a weighted approach

Posted in reply to deleted_user
is this selection process what procedute SURVEYSELECT is for?
(will need SAS/STAT licensed)
N/A
Posts: 0

Re: How to select top records using a weighted approach

Ksharp - although that code will probably work, i was looking for an approach that will calculate the percentages for me.

Peter C - I believe surveyselect only picks random samples from a dataset. It looks like the data can be stratefied (which would take care of the Types) but i want top records not random ones. Also, i need the percentage of each type to be calculated to match the distribution.
PROC Star
Posts: 7,468

Re: How to select top records using a weighted approach

Posted in reply to deleted_user
Elena,

Couldn't you just precede Ksharp's code with a proc freq and data _null_ step to put the values into the macro variables? For example:

proc freq data=have noprint;
tables type/out=types;
run;
data _null_;
set types;
call symput('p_'||type,percent);
run;

HTH,
Art
-----------
> Ksharp - although that code will probably work, i was
> looking for an approach that will calculate the
> percentages for me.
>
> Peter C - I believe surveyselect only picks random
> samples from a dataset. It looks like the data can be
> stratefied (which would take care of the Types) but i
> want top records not random ones. Also, i need the
> percentage of each type to be calculated to match the
> distribution.
Super User
Posts: 10,023

Re: How to select top records using a weighted approach

It is my fault. I do not understand totally.
Under Art.T's suggestion.


[pre]



[/pre]


data temp;



 do type='A','B','C' ;



  do balance=1 to 5000;



   id+1;



   output;



  end;



 end;



run;



 



 



%macro top(obs=1000,dsn=temp);



proc freq
data=&dsn;



 tables type /
out=percent;



run;



data _null_;



 set percent
end=last;



 sum+count;



 percent=percent*.01;



 call symput(cats('type',_n_),type);



 call symput(cats('per',_n_),percent);



 if last
then  call symput(
'count',_n_);



run;



 



%do i=1 %to &count;



  %let obs&i=%sysevalf(&obs*&&per&i,integer);



%end;



%put _user_;



 



proc sort
data=&dsn nodupkey;



 by type
descending balance;



run;



 



data want(drop=
count);



 set &dsn;



 by type;



 if first.type then count=0;



 count+1;



 select;



  %do
j=
1 %to &count;



    when(type="&&type&j" and count le &&obs&j ) output;



  %end;



  otherwise;



 end;



 



run;



%mend;



 



%top(obs=1000,dsn=temp)



 



 



 



 



 



Ksharp



Super User
Posts: 10,023

Re: How to select top records using a weighted approach

Posted in reply to deleted_user
elena- You do not tell us how to calculate these percent ,it base on freq for each A B C?
Or pre - probability.
I think these percent is setted by yourself.


Ksharp
Ask a Question
Discussion stats
  • 6 replies
  • 564 views
  • 0 likes
  • 4 in conversation