- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Good day everyone.
Please i need some advice, as well as help with something.
I want to create a new variable by asking SAS to multiply the values in the cells of input columns, and repeat this process for all observations. As it in the code below...I want to ask SAS to start with obs 1 (row 1) and multiply the values of grp1 - grp5 to create new variable "rowproduct" with the result in that cell. Then it should go on to obs 2 (row 2)...etc until the end.
Will the IML procedure be better suited for this or will it more appropriate to use ARRAY?
Example code to solve this will be highly appreciated.
Thank you very much.
Dami.
data test;
input TargetID grp1 grp2 grp3 grp4 grp5;
datalines;
7314326 88103 102695 118879 82913 95039
10031881 253 255 74 36 442
7213321 3431 5344 1134 637 4054
7213330 2765 4184 1143 635 2722
7214940 447 269 11328 9417 145
;
run;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data want;
set test;
array myarr {*} grp:;
/* defines an array over all variables the name of which starts with grp */
/* the dimension (*) is dynamic, dependent on the number of variables found */
/* the colon can be used as wildcard character in SAS code for variables or dataset names */
/* but can only be used at the end */
rowproduct = 1;
/* initializes the new variable, as a simple calculation on missing values would result in missing values */
do i = 1 to dim(myarr); /* upper bound depends on number of variables found */
rowproduct = rowproduct * coalesce(myarr{i},1);
/* coalesce() returns the first non-missing argument */
/* prevents missing values to destroy our result */
end;
drop i;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Use an array:
data want;
set test;
array myarr {*} grp:;
rowproduct = 1;
do i = 1 to dim(myarr);
rowproduct = rowproduct * coalesce(myarr{i},1);
end;
drop i;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This works perfectly!
I am grateful.
Thank you so much!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@dammie_101 wrote:
This works perfectly!
I am grateful.
Thank you so much!
Fine! Is the code clear to you, or do you need additional explanation?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
i think i get the logic:
SAS scans and calls in all variables from the input dataset into array procedure;
set column 'rowproduct' to an initial of 1;
initiate array procedure from column 1 until last element in the array dimension;
rowproduct => (a) call initial value of rowproduct...multiply this with (b) call cell values in array dimension, multiply values) until last entry;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I am open to learning more. I absolutely would not mind additional explanation.
@Kurt_Bremser wrote:
@dammie_101 wrote:This works perfectly!
I am grateful.
Thank you so much!
Fine! Is the code clear to you, or do you need additional explanation?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data want;
set test;
array myarr {*} grp:;
/* defines an array over all variables the name of which starts with grp */
/* the dimension (*) is dynamic, dependent on the number of variables found */
/* the colon can be used as wildcard character in SAS code for variables or dataset names */
/* but can only be used at the end */
rowproduct = 1;
/* initializes the new variable, as a simple calculation on missing values would result in missing values */
do i = 1 to dim(myarr); /* upper bound depends on number of variables found */
rowproduct = rowproduct * coalesce(myarr{i},1);
/* coalesce() returns the first non-missing argument */
/* prevents missing values to destroy our result */
end;
drop i;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you so very much!
I am deeply grateful.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Or keep it simple:
data want;
set test;
rowproduct=grp1*grp2*grp3*grp4*grp5;
run;
A shorter alternative (especially if the product involved more than only 5 "GRP" variables) would be something like
rowproduct=geomean(of grp:)**5;
where all variables would be multiplied whose names start with "GRP" (use grp1-grp5 if grp: was too comprehensive) and the exponent (5) is the number of these variables. Restriction: The GEOMEAN function requires non-negative arguments. However, if missing values could occur, the exponent would need to be the number of non-missing values:
rowproduct=geomean(of grp:)**n(of grp:);
(The GEOMEAN function ignores missing values unlike the explicit product).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
how about in a situation where the column variables have different names? As an example instead of grp1, grp2, grp3 etc the variable names are liverKO kidneyKO bladderKO heartKO lungsKO. (I am going to get to that point in the next couple of weeks in the course of my experiments, hence why i am asking).
what would the line statement
rowproduct = geomean(of grp:)**5
look like?
Thanks in advance.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@dammie_101 wrote:
how about in a situation where the column variables have different names? As an example instead of grp1, grp2, grp3 etc the variable names are liverKO kidneyKO bladderKO heartKO lungsKO.
what would the line statement
rowproduct = geomean(of grp:)**5
look like?
Thanks in advance.
If those five variables are adjacent in dataset TEST, i.e., if their variable numbers in PROC CONTENTS output of TEST (see column "#"; ideally use the VARNUM option) are consecutive and, let's say, liverKO is the first and lungsKO the last variable in that order, you can write
rowproduct=geomean(of liverKO--lungsKO)**5;
(assuming there are no missing values, as explained earlier; otherwise replace "5" by "n(of liverKO--lungsKO)" if you want to obtain the product of the non-missing values in this case).
A similar notation (liverKO-numeric-lungsKO) would apply if the sequence of the five variables was only interrupted by character variables.
Alternatively, you can define an array (see the ARRAY statement in KurtBremser's post, where "grp:" would be replaced either by the list of all five variable names or, if applicable, by the variable list in the notation I've just described). Then the formula would read
rowproduct=geomean(of myarr{*})**5;
(or, again, with n(of myarr{*}) as the exponent).
That said, if the number of factors is as small as five, I would prefer the simple definition
rowproduct=liverKO*kidneyKO*bladderKO*heartKO*lungsKO;
which is much easier to read, even if the COALESCE function (see KurtBremser's solution) was applied to each factor in order to ignore missing values.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
understood.
I am sincerely grateful for taking time to explain to that depth of detail.