BookmarkSubscribeRSS Feed
StatOnGlobe
Calcite | Level 5

Hi~~

I'm a stat-centric risk manager working on lots of financial models. Gladly IML community caught my eyes because I've got many questions with respect to IML usages.

I'm dealing with a very big matrix whose size is 1,000,000-by-20 (row-by-column). For ease of explanation, the matrix name is XXX. 1st column is number of defaults, 2nd column is grade(1,2,3~,20), 3rd column is region number(100, 200, 300, 400), 4th column is corporate size(1: large company, 2:Small and Medium, 3: Small etc)...

The matrix format in display is like this;

8          2     300     2     ....

15        1     100     3     ....

3          4     100     1     ....

2          1     200     1     ....

....

I'd like to get sums of defaults by each of grade, region number, corporate size, which is usually done by proc means, proc summary procedures with "data sets". For example,

proc summary data=XXX;

     var defaults

     class grade region_number corporate_size;

     output out=getsum sum=;

run;

Then I get many sub-sums for various combination of class variables. I found an iml equivalent "summary" statement which functions almost same as proc summary.

But the problem is that in order to use "summary" statement, I need to make the matrix XXX a data set and then I can use the "summary" statement.

My question is, Is there any other way that without making the matrix XXX a data set, I can get sub-sums of first column by each combination of 2, 3, 4 column values. The point is that huge number of temporary big matrices are generated and sub-sumbs are extactred and big matrices are discarded. In the meantime, big time loss occurs during making data sets. I could tolerate if the number of those operations small, but the number could be up to tens of thousand, which is considered to be no solution at all.

I've been trying to find possible solutions for it, searching through internet and in vain....

Thank you in advance~

3 REPLIES 3
Rick_SAS
SAS Super FREQ

You are asking for BY-group processing. There are several ways to do this. For a large number of rows I recommend the UNIQUEBY technique, which is described in this blog post: An efficient alternative to the UNIQUE-LOC technique - The DO Loop

For 1million obs, this approach takes a few seconds. The blog article explains the details, so I'll just give the solution here:

/* create sample data for testing */
data Have;
call streaminit(123);
do i = 1 to 1e6;  /* use 50 for debugging */
   Defaults = rand("Poisson", 5);
   Grade = ceil(5*rand("Uniform"));
   Region = 100 * ceil(4*rand("Uniform"));
   Size = ceil(2*rand("Uniform"));
   output;
end;
run;

/* Use PROC MEANS to compute the correct answer */
proc means data=Have sum;
class Grade Region Size;
var Defaults;
run;

/* obtain the same answer in PROC IML by using the UNIQUEBY technique */
proc iml;
varName = {"Defaults" "Grade" "Region" "Size"};
use Have;
read all var varName into XXX;
close Have;

call sort(XXX, 2:4);          /* 1. Sort or use PROC SORT in Base SAS */
by = uniqueby(XXX, 2:4);      /* 2. Rows for 1st obs in each level. */

sum = j(nrow(by),1);          /* 3. Allocate vector to hold results */
b = by // (nrow(XXX)+1);      /* Trick: append (n+1) to end of by */
do i = 1 to nrow(b)-1;        /* 4. For each level... */
   idx = b:(b[i+1]-1);     /* 5. Find observations in level */
   sum = sum(XXX[idx, 1]); /* 6. Compute statistic on those values */
end;
*lbl = rowcat( char(XXX[by,2:4]) ); /* Opitonal: create labels for BY groups */
*print sum[rowname=lbl];

data_null__
Jade | Level 19

If sounds like you need NWAY option.  I think you are saying you want summary of each level of grade*region_number*corporate_size  but not the extra "ways" that PROC SUMMARY provides by default.  That can be done by specifying NWAY on the PROC SUMMARY statement.

Ksharp
Super User

Maybe you mean this :

proc iml;
use sashelp.class;
summary class{age} var{weight} stat{mean std var} opt{noprint save};
x=age||weight;
create want from x[c={'age' 'weight_mean' 'weight_std' 'weight_var'}] ;
append from x;
close;
quit;

Xia Keshan

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.

From The DO Loop
Want more? Visit our blog for more articles like these.
Discussion stats
  • 3 replies
  • 873 views
  • 1 like
  • 4 in conversation