Solved: Editing rows for 80 different variables

sasbeginner1222 · Posted 04-12-2023 11:49 PM

Hi, I've got a dataset where each row is a unique identifer (ID), and 80 numerical variables containing percentages (a1 -- a80).

I want to be able to loop through each column, set the highest X of rows (numerically) as 1 and the rest as 0.

The issue is that X is different for each column, i.e for a1, x = 21%, a2, x = 50% ... a80, x = 14% etc

I was thinking of using temporary arrays to store the value for the Xs but not sure how to entirely approach this.

Thanks.

Input dataset

id	a1	a2	a3
1	0.1	0.95	0.54
2	0.15	0.86	0.55
3	0.02	0.91	0.50
4	0.13	0.87	0.49
5	0.12	0.88	0.48

Expected output (e.g turn the largest 3,2,4 (I used integers instead of percentage here but either is fine) rows for a1, a2, a3 respectively into "1" and "0" else)

id	a1	a2	a3
1	0	1	1
2	1	0	1
3	0	1	1
4	1	0	1
5	1	0	0

This is what I thought of but doing it for 80 different variables would be quite tedious.

proc sql inobs = 3; create table t_a1 as
select id, a1 from input
order by a1 DESC; quit;

proc sql inobs = 2; create table t_a2 as
select id, a2 from input
order by a2 DESC; quit;

proc sql inobs = 4; create table t_a3 as
select id, a3 from input
order by a3 DESC; quit;

proc sql; create table merged as
select * from t_a1 as a
left join t_a2 as b
on a.id = b.id;
left join t_a3 as c
on a.id = c.id;
quit;

ballardw · Posted 04-13-2023 01:09 PM

@sasbeginner1222 wrote:
Hi, thanks for the help I have added an example hoping it will make my question somewhat clearer.

You need to walk us through exactly how you "turn the largest 3,2,4 (I used integers instead of percentage here but either is fine) rows for a1, a2, a3 respectively into "1" and "0" else)".

Such as Largest 3 means what? WHICH specific values were determined to be the largest 3. Then which cells used that information and how to assign the 0 and 1.

Since we need to come up with something that works for 80 variables we really do need some nitty gritty detailed description.

This is what I think you want. Please note the data step to provide example data:

data have;
  input id	a1	a2	a3;
datalines;
1	0.1	0.95	0.54
2	0.15	0.86	0.55
3	0.02	0.91	0.50
4	0.13	0.87	0.49
5	0.12	0.88	0.4
;

proc rank data=have out=ranked descending;
   var a1 a2 a3;
   ranks ra1 ra2 ra3;
run;

data want;
   set ranked;
   array r(*) ra1-ra3;
   /* array L is for the LIMIT number, with the DESCENDING
      option in proc rank the ranks of the largest values are 1, 2, 3
      so if a limit of 2 is selected then 1 and 2 will be indicated
   */
   array l(3) _temporary_ (3,2,4);
   do i=1 to dim(r);
      r[i]= (r[i] le l[i]);
   end;
   drop i;
run;

Proc Rank provides a NEW variable with the rank position (there are some other options).

Then a data step to change those ranks into the indicator variable.

The Array L in the last data step must be defined to have as many elements as the ranked values array (R). Likely 80 in your description. Then you need a number in each position .

Note that you could merge a data set with a single observation that has all 80 limit values and place the names of those variables in the array L definition.

The code might look like:

data limits;
   input limit1 limit2 limit3;
datalines;
3 2 4
;

data want;
   merge ranked limits;
   array r(*) ra1-ra3;
   array l(3) _temporary_;
   array z(*) limit1-limit3;
   If _n_=1 then do i=1 to dim(l);
      l[i]=z[i];
   end;   
   do i=1 to dim(r);
      r[i]= (r[i] le l[i]);
   end;
   drop i limit:;
run;

The Merge will only have the Limit variable values on the first observation. The temporary array once loaded persists across iterations of the data step. So we can load the values from the first observation into the array so they are available through the execution of the data step.

Caution: Your example data did not provide any tied values for the variables. If you have such then you may need to get into how you want ties treated as if they occur in the group that you are calling "highest" you may have too many or too few actual values. There are several rules available for tie breaking and would apply to ALL the variables in a single call to Proc Rank. So if those get complex this problem gets more complex.

View solution in original post

andreas_lds · Posted 04-13-2023 02:26 AM

Posting data in usable form makes it so much easier for us to provide useful answers ...

sasbeginner1222 · Posted 04-13-2023 11:09 AM

Hi, thanks for the help, I have added an example hoping it will make my question somewhat clearer.

s_lassen · Posted 04-13-2023 03:54 AM

I guess something like this may work:

data want;
  set have;
  array values(*) a1-a80;
  array temp(80) 8 _temporary_;
  do _N_=1 to dim(values);
    temp(_N_)=values(_N_);
    end;
  call sortn(of temp(*));
  _n=n(of temp(*));
  if _n then do;
    /* find the first valid value */
    _first=dim(temp)-_n+1;
    /* find 20th percentile */
    _20pct=temp(_first+floor(_n/5));
    do _N_= 1 to dim(values);
      if values(_N_)>_20pct then
        values(_N_)=1;
      else
        values(_N_)=0;
    end;
  drop _:;
run;

ballardw · Posted 04-13-2023 10:54 AM

Suggestion: For example purposes reduce the size of your problem by providing data with the ID variable and 3 of the other variables and maybe 5 or 6 rows of data.

This should be small enough that you can work the result from the example data by hand and show what the expected result is.

I have a couple concerns because I am afraid that your definition of "loop through each column, sort it from highest percentage" could mean several different things and some are quite ugly and I am not sure what you actually want.

Similar with "set the highest X% of rows as 1 and the rest as 0". STATE what the x% is for your worked example. If you need to change the x% of rows it may help to provide stuff.

I suspect you may be looking for Proc Rank for one class of solutions but the other interpretations I can imagine are quite ugly.

sasbeginner1222 · Posted 04-13-2023 11:08 AM

Hi, thanks for the help I have added an example hoping it will make my question somewhat clearer.

ballardw · Posted 04-13-2023 01:09 PM

@sasbeginner1222 wrote:
Hi, thanks for the help I have added an example hoping it will make my question somewhat clearer.

You need to walk us through exactly how you "turn the largest 3,2,4 (I used integers instead of percentage here but either is fine) rows for a1, a2, a3 respectively into "1" and "0" else)".

Such as Largest 3 means what? WHICH specific values were determined to be the largest 3. Then which cells used that information and how to assign the 0 and 1.

Since we need to come up with something that works for 80 variables we really do need some nitty gritty detailed description.

This is what I think you want. Please note the data step to provide example data:

data have;
  input id	a1	a2	a3;
datalines;
1	0.1	0.95	0.54
2	0.15	0.86	0.55
3	0.02	0.91	0.50
4	0.13	0.87	0.49
5	0.12	0.88	0.4
;

proc rank data=have out=ranked descending;
   var a1 a2 a3;
   ranks ra1 ra2 ra3;
run;

data want;
   set ranked;
   array r(*) ra1-ra3;
   /* array L is for the LIMIT number, with the DESCENDING
      option in proc rank the ranks of the largest values are 1, 2, 3
      so if a limit of 2 is selected then 1 and 2 will be indicated
   */
   array l(3) _temporary_ (3,2,4);
   do i=1 to dim(r);
      r[i]= (r[i] le l[i]);
   end;
   drop i;
run;

Proc Rank provides a NEW variable with the rank position (there are some other options).

Then a data step to change those ranks into the indicator variable.

The Array L in the last data step must be defined to have as many elements as the ranked values array (R). Likely 80 in your description. Then you need a number in each position .

Note that you could merge a data set with a single observation that has all 80 limit values and place the names of those variables in the array L definition.

The code might look like:

data limits;
   input limit1 limit2 limit3;
datalines;
3 2 4
;

data want;
   merge ranked limits;
   array r(*) ra1-ra3;
   array l(3) _temporary_;
   array z(*) limit1-limit3;
   If _n_=1 then do i=1 to dim(l);
      l[i]=z[i];
   end;   
   do i=1 to dim(r);
      r[i]= (r[i] le l[i]);
   end;
   drop i limit:;
run;

The Merge will only have the Limit variable values on the first observation. The temporary array once loaded persists across iterations of the data step. So we can load the values from the first observation into the array so they are available through the execution of the data step.

Caution: Your example data did not provide any tied values for the variables. If you have such then you may need to get into how you want ties treated as if they occur in the group that you are calling "highest" you may have too many or too few actual values. There are several rules available for tie breaking and would apply to ALL the variables in a single call to Proc Rank. So if those get complex this problem gets more complex.

Tom · Posted 04-13-2023 01:47 PM

I cannot tell exactly what you are asking. If you just want to make binary flags that indicate if the value is larger than some cutoff that is simple enough.

data want;
  set have;
  array x a1-a3;
  array cutoff [3] _temporary_ (0.104 0.894 0.490);
  do index=1 to dim(x);
    x[index] = cutoff[index] <= x[index] ;
  end;
  drop index;
run;

Editing rows for 80 different variables

Re: Editing rows for 80 different variables

Re: Editing rows for 80 different variables

Re: Editing rows for 80 different variables

Re: Editing rows for 80 different variables

Re: Editing rows for 80 different variables

Re: Editing rows for 80 different variables

Re: Editing rows for 80 different variables

Re: Editing rows for 80 different variables

Register Today!

SAS Training: Just a Click Away