I'm working a large datataset of linguistic data for a research project, and I'd like to select the n highest values for each observation, output those, turn the other values to zeros and save the results to a new dataset.
My dataset has over 210,000 observations and 110 variables.
For example:
data have ; input id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 ; datalines; 1 10 9 8 7 6 5 4 3 2 1 2 0 0 10 9 8 7 6 5 4 3 3 0 0 0 0 6 5 4 3 0 0 ; run;
If I were to select the 5 highest values for each observation in the 'have' dataset, the new dataset would look like:
1 10 9 8 7 6 0 0 0 0 0 2 0 0 10 9 8 7 6 0 0 0 3 0 0 0 0 6 5 4 3 0 0
But if I were to select the 3 highest values for each observation, the new dataset would look like:
1 10 9 8 0 0 0 0 0 0 0 2 0 0 10 9 8 0 0 0 0 0 3 0 0 0 0 6 5 4 0 0 0
Thank you all ahead for your help!
data have ;
input id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 ;
datalines;
1 10 9 8 7 6 5 4 3 2 1
2 0 0 10 9 8 7 6 5 4 3
3 0 0 0 0 6 5 4 3 0 0
;
run;
%let top_n = 5;
proc transpose data=have out=long;
by id;
var A1--D2;
run;
proc sort data=long;
by id descending col1;
run;
data long_marked;
set long;
by id;
if first.id then counter=1;
else counter+1;
if counter> &top_n then col1=0;
run;
proc sort data=long_marked;
by id _name_;
run;
proc transpose data=long_marked out=want;
by id;
id _name_;
var col1;
run;
In general, storing your data in a long format may make this easier then.
data want;
set have;
array have(*) A1--D2;
array top(5) top1-top5;
do i=1 to dim(top);
top(i) = largest(i, of have(*));
end;
do i=1 to dim(have);
if i<=dim(top) then have(i) = top(i);
else have(i) = 0;
end;
run;
Hi Reeza, thank you for your reply!
This is giving the following output:
Obs id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 1 1 10 9 8 7 6 0 0 0 0 0 2 2 10 9 8 7 6 0 0 0 0 0 3 3 6 5 4 3 0 0 0 0 0 0
Which is different from the desired output:
1 10 9 8 7 6 0 0 0 0 0 2 0 0 10 9 8 7 6 0 0 0 3 0 0 0 0 6 5 4 3 0 0
Because the values of vars A1 and A2 in observation 2 should be 0, not 10 and 9, the values of vars A1 and A2 in observation 3 should be both 0, and so on. The top 5 values should be placed back in the same variables as they were in the 'have' dataset.
data have ;
input id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 ;
datalines;
1 10 9 8 7 6 5 4 3 2 1
2 0 0 10 9 8 7 6 5 4 3
3 0 0 0 0 6 5 4 3 0 0
;
run;
%let top_n = 5;
proc transpose data=have out=long;
by id;
var A1--D2;
run;
proc sort data=long;
by id descending col1;
run;
data long_marked;
set long;
by id;
if first.id then counter=1;
else counter+1;
if counter> &top_n then col1=0;
run;
proc sort data=long_marked;
by id _name_;
run;
proc transpose data=long_marked out=want;
by id;
id _name_;
var col1;
run;
In general, storing your data in a long format may make this easier then.
Hi Reeza, this is perfect! I was testing it on my full dataset and it worked really fast and really well!! 👏 Thank you very very much!
Here is a single pass solution
data have ;
input id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 ;
datalines;
1 10 9 8 7 6 5 4 3 2 1
2 0 0 10 9 8 7 6 5 4 3
3 0 0 0 0 6 5 4 3 0 0
;
run;
data want(drop = rc i val pos);
if _N_ = 1 then do;
dcl hash h(ordered : 'D', multidata : 'Y');
h.definekey('val');
h.definedata('val', 'pos');
h.definedone();
dcl hiter hi('h');
end;
set have;
array a{*} a1 -- d2;
do i = 1 to dim(a);
val = a[i];
pos = i;
h.add();
a[i] = 0;
end;
do i = 1 by 1 while (hi.next() = 0);
a[pos] = val;
if i = 3 then leave;
end;
rc = hi.last();
rc = hi.next();
rc = h.clear();
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.