How to get sub-sums in a matrix, not dataset?

StatOnGlobe · Posted 01-23-2015 01:46 AM

Hi~~

I'm a stat-centric risk manager working on lots of financial models. Gladly IML community caught my eyes because I've got many questions with respect to IML usages.

I'm dealing with a very big matrix whose size is 1,000,000-by-20 (row-by-column). For ease of explanation, the matrix name is XXX. 1st column is number of defaults, 2nd column is grade(1,2,3~,20), 3rd column is region number(100, 200, 300, 400), 4th column is corporate size(1: large company, 2:Small and Medium, 3: Small etc)...

The matrix format in display is like this;

8 2 300 2 ....

15 1 100 3 ....

3 4 100 1 ....

2 1 200 1 ....

....

I'd like to get sums of defaults by each of grade, region number, corporate size, which is usually done by proc means, proc summary procedures with "data sets". For example,

proc summary data=XXX;

var defaults

class grade region_number corporate_size;

output out=getsum sum=;

run;

Then I get many sub-sums for various combination of class variables. I found an iml equivalent "summary" statement which functions almost same as proc summary.

But the problem is that in order to use "summary" statement, I need to make the matrix XXX a data set and then I can use the "summary" statement.

My question is, Is there any other way that without making the matrix XXX a data set, I can get sub-sums of first column by each combination of 2, 3, 4 column values. The point is that huge number of temporary big matrices are generated and sub-sumbs are extactred and big matrices are discarded. In the meantime, big time loss occurs during making data sets. I could tolerate if the number of those operations small, but the number could be up to tens of thousand, which is considered to be no solution at all.

I've been trying to find possible solutions for it, searching through internet and in vain....

Thank you in advance~

Rick_SAS · Posted 01-23-2015 09:15 AM

You are asking for BY-group processing. There are several ways to do this. For a large number of rows I recommend the UNIQUEBY technique, which is described in this blog post: An efficient alternative to the UNIQUE-LOC technique - The DO Loop

For 1million obs, this approach takes a few seconds. The blog article explains the details, so I'll just give the solution here:

/* create sample data for testing */
data Have;
call streaminit(123);
do i = 1 to 1e6; /* use 50 for debugging */
   Defaults = rand("Poisson", 5);
   Grade = ceil(5*rand("Uniform"));
   Region = 100 * ceil(4*rand("Uniform"));
   Size = ceil(2*rand("Uniform"));
   output;
end;
run;

/* Use PROC MEANS to compute the correct answer */
proc means data=Have sum;
class Grade Region Size;
var Defaults;
run;

/* obtain the same answer in PROC IML by using the UNIQUEBY technique */
proc iml;
varName = {"Defaults" "Grade" "Region" "Size"};
use Have;
read all var varName into XXX;
close Have;

call sort(XXX, 2:4); /* 1. Sort or use PROC SORT in Base SAS */
by = uniqueby(XXX, 2:4); /* 2. Rows for 1st obs in each level. */

sum = j(nrow(by),1);          /* 3. Allocate vector to hold results */
b = by // (nrow(XXX)+1);      /* Trick: append (n+1) to end of by */
do i = 1 to nrow(b)-1;        /* 4. For each level... */
   idx = b:(b[i+1]-1);     /* 5. Find observations in level */
   sum = sum(XXX[idx, 1]); /* 6. Compute statistic on those values */
end;
*lbl = rowcat( char(XXX[by,2:4]) ); /* Opitonal: create labels for BY groups */
*print sum[rowname=lbl];

data_null__ · Posted 01-23-2015 09:48 AM

If sounds like you need NWAY option. I think you are saying you want summary of each level of grade*region_number*corporate_size but not the extra "ways" that PROC SUMMARY provides by default. That can be done by specifying NWAY on the PROC SUMMARY statement.

Ksharp · Posted 01-24-2015 05:28 AM

Maybe you mean this :

proc iml;
use sashelp.class;
summary class{age} var{weight} stat{mean std var} opt{noprint save};
x=age||weight;
create want from x[c={'age' 'weight_mean' 'weight_std' 'weight_var'}] ;
append from x;
close;
quit;

Xia Keshan

How to get sub-sums in a matrix, not dataset?

Re: How to get sub-sums in a matrix, not dataset?

Re: How to get sub-sums in a matrix, not dataset?

Re: How to get sub-sums in a matrix, not dataset?

How to get sub-sums in a matrix, not dataset?

Re: How to get sub-sums in a matrix, not dataset?

Re: How to get sub-sums in a matrix, not dataset?

Re: How to get sub-sums in a matrix, not dataset?

Registration is open