Solved: Re: Select the top highest values in each observation and output to a ...

tonybesas · Posted 06-17-2022 09:55 AM

I'm working a large datataset of linguistic data for a research project, and I'd like to select the n highest values for each observation, output those, turn the other values to zeros and save the results to a new dataset.

My dataset has over 210,000 observations and 110 variables.

For example:

data have ;
   input id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 ;
   datalines;
1 10  9  8  7  6  5  4  3  2  1 
2  0  0 10  9  8  7  6  5  4  3
3  0  0  0  0  6  5  4  3  0  0
;
run;

If I were to select the 5 highest values for each observation in the 'have' dataset, the new dataset would look like:

1 10 9 8 7 6 0 0 0 0 0 
2 0 0 10 9 8 7 6 0 0 0
3 0 0 0 0 6 5 4 3 0 0

But if I were to select the 3 highest values for each observation, the new dataset would look like:

1 10 9 8 0 0 0 0 0 0 0 
2 0 0 10 9 8 0 0 0 0 0
3 0 0 0 0 6 5 4 0 0 0

Thank you all ahead for your help!

Reeza · Posted 06-17-2022 11:07 AM

data have ;
   input id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 ;
   datalines;
1 10  9  8  7  6  5  4  3  2  1 
2  0  0 10  9  8  7  6  5  4  3
3  0  0  0  0  6  5  4  3  0  0
;
run;

%let top_n = 5;

proc transpose data=have out=long;
by id;
var A1--D2;
run;


proc sort data=long;
by id descending col1;
run;

data long_marked;
set long;
by id;
if first.id then counter=1;
else counter+1;
if counter> &top_n then col1=0;
run;

proc sort data=long_marked;
by id _name_;
run;

proc transpose data=long_marked out=want;
by id;
id _name_;
var col1;
run;

In general, storing your data in a long format may make this easier then.

View solution in original post

Reeza · Posted 06-17-2022 10:10 AM

data want;
set have;

array have(*) A1--D2;
array top(5) top1-top5;

do i=1 to dim(top);
top(i) = largest(i, of have(*));
end;

do i=1 to dim(have);
if i<=dim(top) then have(i) = top(i);
else have(i) = 0;
end;

run;

tonybesas · Posted 06-17-2022 10:36 AM

Hi Reeza, thank you for your reply!

This is giving the following output:

Obs	id	A1	A2	A3	B1	B2	B3	C1	C2	D1	D2
1	1	10	9	8	7	6	0	0	0	0	0
2	2	10	9	8	7	6	0	0	0	0	0
3	3	6	5	4	3	0	0	0	0	0	0

Which is different from the desired output:

1 10  9  8  7  6  0  0  0  0  0 
2  0  0 10  9  8  7  6  0  0  0
3  0  0  0  0  6  5  4  3  0  0

Because the values of vars A1 and A2 in observation 2 should be 0, not 10 and 9, the values of vars A1 and A2 in observation 3 should be both 0, and so on. The top 5 values should be placed back in the same variables as they were in the 'have' dataset.

Reeza · Posted 06-17-2022 10:50 AM

Does your data have ties?

tonybesas · Posted 06-17-2022 10:55 AM

Yes, it does

Reeza · Posted 06-17-2022 11:07 AM

data have ;
   input id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 ;
   datalines;
1 10  9  8  7  6  5  4  3  2  1 
2  0  0 10  9  8  7  6  5  4  3
3  0  0  0  0  6  5  4  3  0  0
;
run;

%let top_n = 5;

proc transpose data=have out=long;
by id;
var A1--D2;
run;


proc sort data=long;
by id descending col1;
run;

data long_marked;
set long;
by id;
if first.id then counter=1;
else counter+1;
if counter> &top_n then col1=0;
run;

proc sort data=long_marked;
by id _name_;
run;

proc transpose data=long_marked out=want;
by id;
id _name_;
var col1;
run;

In general, storing your data in a long format may make this easier then.

tonybesas · Posted 06-17-2022 11:37 AM

Hi Reeza, this is perfect! I was testing it on my full dataset and it worked really fast and really well!! 👏 Thank you very very much!

PeterClemmensen · Posted 06-17-2022 02:54 PM

Here is a single pass solution

data have ;
   input id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 ;
   datalines;
1 10  9  8  7  6  5  4  3  2  1 
2  0  0 10  9  8  7  6  5  4  3
3  0  0  0  0  6  5  4  3  0  0
;
run;

data want(drop = rc i val pos);

   if _N_ = 1 then do;
      dcl hash h(ordered : 'D', multidata : 'Y');
      h.definekey('val');
      h.definedata('val', 'pos');
      h.definedone();
      dcl hiter hi('h');
   end;

   set have;

   array a{*} a1 -- d2;

   do i = 1 to dim(a);
      val = a[i];
      pos = i;
      h.add();
      a[i] = 0;
   end;

   do i = 1 by 1 while (hi.next() = 0);
      a[pos] = val;
      if i = 3 then leave;
	end;

   rc = hi.last();
   rc = hi.next();
   rc = h.clear();

run;

The DATA to DATA Step Macro
Blog: SASnrd

Select the top highest values in each observation and output to a new dataset

Re: Select the top highest values in each observation and output to a new dataset

Re: Select the top highest values in each observation and output to a new dataset

Re: Select the top highest values in each observation and output to a new dataset

Re: Select the top highest values in each observation and output to a new dataset

Re: Select the top highest values in each observation and output to a new dataset

Re: Select the top highest values in each observation and output to a new dataset

Re: Select the top highest values in each observation and output to a new dataset

Re: Select the top highest values in each observation and output to a new dataset

Register Today!

SAS Training: Just a Click Away