Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- Programming
- /
- Editing rows for 80 different variables

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

☑ This topic is **solved**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 04-12-2023 11:49 PM
(231 views)

Hi, I've got a dataset where each row is a unique identifer (ID), and 80 numerical variables containing percentages (a1 -- a80).

I want to be able to loop through each column, set the highest X of rows (numerically) as 1 and the rest as 0.

The issue is that X is different for each column, i.e for a1, x = 21%, a2, x = 50% ... a80, x = 14% etc

I was thinking of using temporary arrays to store the value for the Xs but not sure how to entirely approach this.

Thanks.

Input dataset

id | a1 | a2 | a3 |

1 | 0.1 | 0.95 | 0.54 |

2 | 0.15 | 0.86 | 0.55 |

3 | 0.02 | 0.91 | 0.50 |

4 | 0.13 | 0.87 | 0.49 |

5 | 0.12 | 0.88 | 0.48 |

Expected output (e.g turn the largest 3,2,4 (I used integers instead of percentage here but either is fine) rows for a1, a2, a3 respectively into "1" and "0" else)

id | a1 | a2 | a3 |

1 | 0 | 1 | 1 |

2 | 1 | 0 | 1 |

3 | 0 | 1 | 1 |

4 | 1 | 0 | 1 |

5 | 1 | 0 | 0 |

This is what I thought of but doing it for 80 different variables would be quite tedious.

`proc sql inobs = 3; create table t_a1 as`

select id, a1 from input

order by a1 DESC; quit;

proc sql inobs = 2; create table t_a2 as

select id, a2 from input

order by a2 DESC; quit;

proc sql inobs = 4; create table t_a3 as

select id, a3 from input

order by a3 DESC; quit;

proc sql; create table merged as

select * from t_a1 as a

left join t_a2 as b

on a.id = b.id;

left join t_a3 as c

on a.id = c.id;

quit;

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@sasbeginner1222 wrote:

Hi, thanks for the help I have added an example hoping it will make my question somewhat clearer.

You need to walk us through exactly how you "turn the largest 3,2,4 (I used integers instead of percentage here but either is fine) rows for a1, a2, a3 respectively into "1" and "0" else)".

Such as Largest 3 means what? WHICH specific values were determined to be the largest 3. Then which cells used that information and how to assign the 0 and 1.

Since we need to come up with something that works for 80 variables we really do need some nitty gritty detailed description.

This is what I think you want. Please note the data step to provide example data:

data have; input id a1 a2 a3; datalines; 1 0.1 0.95 0.54 2 0.15 0.86 0.55 3 0.02 0.91 0.50 4 0.13 0.87 0.49 5 0.12 0.88 0.4 ; proc rank data=have out=ranked descending; var a1 a2 a3; ranks ra1 ra2 ra3; run; data want; set ranked; array r(*) ra1-ra3; /* array L is for the LIMIT number, with the DESCENDING option in proc rank the ranks of the largest values are 1, 2, 3 so if a limit of 2 is selected then 1 and 2 will be indicated */ array l(3) _temporary_ (3,2,4); do i=1 to dim(r); r[i]= (r[i] le l[i]); end; drop i; run;

Proc Rank provides a NEW variable with the rank position (there are some other options).

Then a data step to change those ranks into the indicator variable.

The Array L in the last data step must be defined to have as many elements as the ranked values array (R). Likely 80 in your description. Then you need a number in each position .

Note that you could merge a data set with a single observation that has all 80 limit values and place the names of those variables in the array L definition.

The code might look like:

data limits; input limit1 limit2 limit3; datalines; 3 2 4 ; data want; merge ranked limits; array r(*) ra1-ra3; array l(3) _temporary_; array z(*) limit1-limit3; If _n_=1 then do i=1 to dim(l); l[i]=z[i]; end; do i=1 to dim(r); r[i]= (r[i] le l[i]); end; drop i limit:; run;

The Merge will only have the Limit variable values on the first observation. The temporary array once loaded persists across iterations of the data step. So we can load the values from the first observation into the array so they are available through the execution of the data step.

**Caution:** Your example data did not provide any tied values for the variables. If you have such then you may need to get into how you want ties treated as if they occur in the group that you are calling "highest" you may have too many or too few actual values. There are several rules available for tie breaking and would apply to ALL the variables in a single call to Proc Rank. So if those get complex this problem gets more complex.

7 REPLIES 7

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posting data in usable form makes it so much easier for us to provide useful answers ...

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi, thanks for the help, I have added an example hoping it will make my question somewhat clearer.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I guess something like this may work:

```
data want;
set have;
array values(*) a1-a80;
array temp(80) 8 _temporary_;
do _N_=1 to dim(values);
temp(_N_)=values(_N_);
end;
call sortn(of temp(*));
_n=n(of temp(*));
if _n then do;
/* find the first valid value */
_first=dim(temp)-_n+1;
/* find 20th percentile */
_20pct=temp(_first+floor(_n/5));
do _N_= 1 to dim(values);
if values(_N_)>_20pct then
values(_N_)=1;
else
values(_N_)=0;
end;
drop _:;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Suggestion: For example purposes reduce the size of your problem by providing data with the ID variable and 3 of the other variables and maybe 5 or 6 rows of data.

This should be small enough that you can work the result from the example data by hand and show what the expected result is.

I have a couple concerns because I am afraid that your definition of "loop through each column, sort it from highest percentage" could mean several different things and some are quite ugly and I am not sure what you actually want.

Similar with "set the highest X% of rows as 1 and the rest as 0". STATE what the x% is for your worked example. If you need to change the x% of rows it may help to provide stuff.

I suspect you may be looking for Proc Rank for one class of solutions but the other interpretations I can imagine are quite ugly.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi, thanks for the help I have added an example hoping it will make my question somewhat clearer.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@sasbeginner1222 wrote:

Hi, thanks for the help I have added an example hoping it will make my question somewhat clearer.

You need to walk us through exactly how you "turn the largest 3,2,4 (I used integers instead of percentage here but either is fine) rows for a1, a2, a3 respectively into "1" and "0" else)".

Such as Largest 3 means what? WHICH specific values were determined to be the largest 3. Then which cells used that information and how to assign the 0 and 1.

Since we need to come up with something that works for 80 variables we really do need some nitty gritty detailed description.

This is what I think you want. Please note the data step to provide example data:

data have; input id a1 a2 a3; datalines; 1 0.1 0.95 0.54 2 0.15 0.86 0.55 3 0.02 0.91 0.50 4 0.13 0.87 0.49 5 0.12 0.88 0.4 ; proc rank data=have out=ranked descending; var a1 a2 a3; ranks ra1 ra2 ra3; run; data want; set ranked; array r(*) ra1-ra3; /* array L is for the LIMIT number, with the DESCENDING option in proc rank the ranks of the largest values are 1, 2, 3 so if a limit of 2 is selected then 1 and 2 will be indicated */ array l(3) _temporary_ (3,2,4); do i=1 to dim(r); r[i]= (r[i] le l[i]); end; drop i; run;

Proc Rank provides a NEW variable with the rank position (there are some other options).

Then a data step to change those ranks into the indicator variable.

The Array L in the last data step must be defined to have as many elements as the ranked values array (R). Likely 80 in your description. Then you need a number in each position .

Note that you could merge a data set with a single observation that has all 80 limit values and place the names of those variables in the array L definition.

The code might look like:

data limits; input limit1 limit2 limit3; datalines; 3 2 4 ; data want; merge ranked limits; array r(*) ra1-ra3; array l(3) _temporary_; array z(*) limit1-limit3; If _n_=1 then do i=1 to dim(l); l[i]=z[i]; end; do i=1 to dim(r); r[i]= (r[i] le l[i]); end; drop i limit:; run;

The Merge will only have the Limit variable values on the first observation. The temporary array once loaded persists across iterations of the data step. So we can load the values from the first observation into the array so they are available through the execution of the data step.

**Caution:** Your example data did not provide any tied values for the variables. If you have such then you may need to get into how you want ties treated as if they occur in the group that you are calling "highest" you may have too many or too few actual values. There are several rules available for tie breaking and would apply to ALL the variables in a single call to Proc Rank. So if those get complex this problem gets more complex.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I cannot tell exactly what you are asking. If you just want to make binary flags that indicate if the value is larger than some cutoff that is simple enough.

```
data want;
set have;
array x a1-a3;
array cutoff [3] _temporary_ (0.104 0.894 0.490);
do index=1 to dim(x);
x[index] = cutoff[index] <= x[index] ;
end;
drop index;
run;
```

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

**If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. **

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.