BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dammie_101
Obsidian | Level 7

Good day everyone.

Please i need some advice, as well as help with something. 

 

I want to create a new variable by asking SAS to multiply the values in the cells of input columns, and repeat this process for all observations. As it in the code below...I want to ask SAS to start with obs 1 (row 1) and multiply the values of grp1 - grp5 to create new variable "rowproduct" with the result in that cell. Then it should go on to obs 2 (row 2)...etc until the end.

Will the IML procedure be better suited for this or will it more appropriate to use ARRAY?

Example code to solve this will be highly appreciated.

 

Thank you very much.

Dami.

data test;
input TargetID	grp1	grp2	grp3	grp4	grp5;
datalines;
7314326	88103	102695	118879	82913	95039
10031881	253	255	74	36	442
7213321	3431	5344	1134	637	4054
7213330	2765	4184	1143	635	2722
7214940	447	269	11328	9417	145
;
run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User
data want;
set test;

array myarr {*} grp:;
/* defines an array over all variables the name of which starts with grp */
/* the dimension (*) is dynamic, dependent on the number of variables found */
/* the colon can be used as wildcard character in SAS code for variables or dataset names */
/* but can only be used at the end */

rowproduct = 1;
/* initializes the new variable, as a simple calculation on missing values would result in missing values */

do i = 1 to dim(myarr); /* upper bound depends on number of variables found */
  rowproduct = rowproduct * coalesce(myarr{i},1);
  /* coalesce() returns the first non-missing argument */
  /* prevents missing values to destroy our result */

end;
drop i;
run;

View solution in original post

11 REPLIES 11
dammie_101
Obsidian | Level 7

This works perfectly!

I am grateful.

Thank you so much!

dammie_101
Obsidian | Level 7

i think i get the logic:

 

SAS scans and calls in all variables from the input dataset into array procedure;

 

set column 'rowproduct' to an initial of 1;

 

initiate array procedure from column 1 until last element in the array dimension;

 

rowproduct => (a) call initial value of rowproduct...multiply this with (b) call cell values in array dimension, multiply values) until last entry;

 

 

 

 

 

 

dammie_101
Obsidian | Level 7

I am open to learning more. I absolutely would not mind additional explanation.


@Kurt_Bremser wrote:

@dammie_101 wrote:

This works perfectly!

I am grateful.

Thank you so much!


Fine! Is the code clear to you, or do you need additional explanation?


 

Kurt_Bremser
Super User
data want;
set test;

array myarr {*} grp:;
/* defines an array over all variables the name of which starts with grp */
/* the dimension (*) is dynamic, dependent on the number of variables found */
/* the colon can be used as wildcard character in SAS code for variables or dataset names */
/* but can only be used at the end */

rowproduct = 1;
/* initializes the new variable, as a simple calculation on missing values would result in missing values */

do i = 1 to dim(myarr); /* upper bound depends on number of variables found */
  rowproduct = rowproduct * coalesce(myarr{i},1);
  /* coalesce() returns the first non-missing argument */
  /* prevents missing values to destroy our result */

end;
drop i;
run;
dammie_101
Obsidian | Level 7

Thank you so very much!

I am deeply grateful.

FreelanceReinh
Jade | Level 19

Or keep it simple:

data want;
set test;
rowproduct=grp1*grp2*grp3*grp4*grp5;
run;

A shorter alternative (especially if the product involved more than only 5 "GRP" variables) would be something like

rowproduct=geomean(of grp:)**5;

where all variables would be multiplied whose names start with "GRP" (use grp1-grp5 if grp: was too comprehensive) and the exponent (5) is the number of these variables. Restriction: The GEOMEAN function requires non-negative arguments. However, if missing values could occur, the exponent would need to be the number of non-missing values:

rowproduct=geomean(of grp:)**n(of grp:);

(The GEOMEAN function ignores missing values unlike the explicit product).

dammie_101
Obsidian | Level 7

how about in a situation where the column variables have different names? As an example instead of grp1, grp2, grp3 etc the variable names are liverKO kidneyKO bladderKO heartKO lungsKO. (I am going to get to that point in the next couple of weeks in the course of my experiments, hence why i am asking).

 

what would the line statement

 

rowproduct = geomean(of grp:)**5

look like?

 

Thanks in advance.

FreelanceReinh
Jade | Level 19

@dammie_101 wrote:

how about in a situation where the column variables have different names? As an example instead of grp1, grp2, grp3 etc the variable names are liverKO kidneyKO bladderKO heartKO lungsKO.

 

what would the line statement

 

rowproduct = geomean(of grp:)**5

look like?

 

Thanks in advance.


If those five variables are adjacent in dataset TEST, i.e., if their variable numbers in PROC CONTENTS output of TEST (see column "#"; ideally use the VARNUM option) are consecutive and, let's say, liverKO is the first and lungsKO the last variable in that order, you can write

rowproduct=geomean(of liverKO--lungsKO)**5;

(assuming there are no missing values, as explained earlier; otherwise replace "5" by "n(of liverKO--lungsKO)" if you want to obtain the product of the non-missing values in this case).

 

A similar notation (liverKO-numeric-lungsKO) would apply if the sequence of the five variables was only interrupted by character variables.

 

Alternatively, you can define an array (see the ARRAY statement in KurtBremser's post, where "grp:" would be replaced either by the list of all five variable names or, if applicable, by the variable list in the notation I've just described). Then the formula would read

rowproduct=geomean(of myarr{*})**5;

(or, again, with n(of myarr{*}) as the exponent).

 

That said, if the number of factors is as small as five, I would prefer the simple definition

rowproduct=liverKO*kidneyKO*bladderKO*heartKO*lungsKO;

which is much easier to read, even if the COALESCE function (see KurtBremser's solution) was applied to each factor in order to ignore missing values.

 

 

 

dammie_101
Obsidian | Level 7

understood.

I am sincerely grateful for taking time to explain to that depth of detail.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 11 replies
  • 3823 views
  • 1 like
  • 3 in conversation