I'm working a large datataset of linguistic data for a research project, and I'd like to select the n highest values for each observation, output those, turn the other values to zeros and save the results to a new dataset.
My dataset has over 210,000 observations and 110 variables.
For example:
data have ; input id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 ; datalines; 1 10 9 8 7 6 5 4 3 2 1 2 0 0 10 9 8 7 6 5 4 3 3 0 0 0 0 6 5 4 3 0 0 ; run;
If I were to select the 5 highest values for each observation in the 'have' dataset, the new dataset would look like:
1 10 9 8 7 6 0 0 0 0 0 2 0 0 10 9 8 7 6 0 0 0 3 0 0 0 0 6 5 4 3 0 0
But if I were to select the 3 highest values for each observation, the new dataset would look like:
1 10 9 8 0 0 0 0 0 0 0 2 0 0 10 9 8 0 0 0 0 0 3 0 0 0 0 6 5 4 0 0 0
Thank you all ahead for your help!
data have ;
input id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 ;
datalines;
1 10 9 8 7 6 5 4 3 2 1
2 0 0 10 9 8 7 6 5 4 3
3 0 0 0 0 6 5 4 3 0 0
;
run;
%let top_n = 5;
proc transpose data=have out=long;
by id;
var A1--D2;
run;
proc sort data=long;
by id descending col1;
run;
data long_marked;
set long;
by id;
if first.id then counter=1;
else counter+1;
if counter> &top_n then col1=0;
run;
proc sort data=long_marked;
by id _name_;
run;
proc transpose data=long_marked out=want;
by id;
id _name_;
var col1;
run;
In general, storing your data in a long format may make this easier then.
data want;
set have;
array have(*) A1--D2;
array top(5) top1-top5;
do i=1 to dim(top);
top(i) = largest(i, of have(*));
end;
do i=1 to dim(have);
if i<=dim(top) then have(i) = top(i);
else have(i) = 0;
end;
run;
Hi Reeza, thank you for your reply!
This is giving the following output:
Obs id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 1 1 10 9 8 7 6 0 0 0 0 0 2 2 10 9 8 7 6 0 0 0 0 0 3 3 6 5 4 3 0 0 0 0 0 0
Which is different from the desired output:
1 10 9 8 7 6 0 0 0 0 0 2 0 0 10 9 8 7 6 0 0 0 3 0 0 0 0 6 5 4 3 0 0
Because the values of vars A1 and A2 in observation 2 should be 0, not 10 and 9, the values of vars A1 and A2 in observation 3 should be both 0, and so on. The top 5 values should be placed back in the same variables as they were in the 'have' dataset.
data have ;
input id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 ;
datalines;
1 10 9 8 7 6 5 4 3 2 1
2 0 0 10 9 8 7 6 5 4 3
3 0 0 0 0 6 5 4 3 0 0
;
run;
%let top_n = 5;
proc transpose data=have out=long;
by id;
var A1--D2;
run;
proc sort data=long;
by id descending col1;
run;
data long_marked;
set long;
by id;
if first.id then counter=1;
else counter+1;
if counter> &top_n then col1=0;
run;
proc sort data=long_marked;
by id _name_;
run;
proc transpose data=long_marked out=want;
by id;
id _name_;
var col1;
run;
In general, storing your data in a long format may make this easier then.
Hi Reeza, this is perfect! I was testing it on my full dataset and it worked really fast and really well!! 👏 Thank you very very much!
Here is a single pass solution
data have ;
input id A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 ;
datalines;
1 10 9 8 7 6 5 4 3 2 1
2 0 0 10 9 8 7 6 5 4 3
3 0 0 0 0 6 5 4 3 0 0
;
run;
data want(drop = rc i val pos);
if _N_ = 1 then do;
dcl hash h(ordered : 'D', multidata : 'Y');
h.definekey('val');
h.definedata('val', 'pos');
h.definedone();
dcl hiter hi('h');
end;
set have;
array a{*} a1 -- d2;
do i = 1 to dim(a);
val = a[i];
pos = i;
h.add();
a[i] = 0;
end;
do i = 1 by 1 while (hi.next() = 0);
a[pos] = val;
if i = 3 then leave;
end;
rc = hi.last();
rc = hi.next();
rc = h.clear();
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.