DATA Step, Macro, Functions and more

sum of columns in data set multiplied by indexed macro variable

Reply
Contributor
Posts: 60

sum of columns in data set multiplied by indexed macro variable

I need to calculate sum of columns multiplied by proper macro variable.

columns are a, x, b, m etc (names can be different and the number of columns may change)

each of columns have corresponding macro value &m1, &m2, &m3  .... &mn (n- number of columns = number of macro vairables)

i need to calculate sum for each row: a*&m1 + x*&m2 + b*&m3 + m*&m4 + .....  and write results to result table

Can anybody help.

Frequent Contributor
Posts: 95

Re: sum of columns in data set multiplied by indexed macro variable

One way would be by using arrays.

Zafer

data new;

    a = 1;

    b = 2;

    c = 3;

    d = 4;

    e = 5;

    output;

    a = 2;

    b = 3;

    c = 5;

    d = 6;

    e = 1;

    output;

run;

%let M1=1;

%let M2=2;

%let M3=3;

%let M4=4;

%let M5=5;

%let vars = a b c d e;

data x;

    set new;

    array _a_

  • &vars;
  •     dim = DIM(_a_);

        do i=1 to dim;

            Result = SUM(Result, _a_*SYMGET(CATS('M',i)));

        end;

        drop dim i;

    run;

    Respected Advisor
    Posts: 3,777

    Re: sum of columns in data set multiplied by indexed macro variable

    This is what PROC SCORE can do.  This examples uses the data macro vars from Alpay's example.

    data coef;

       if 0 then set new(keep=&vars);

       array _n

  • &vars;
  •    length _TYPE_ _NAME_ $8;

       retain _TYPE_ 'SCORE' _NAME_ 'RESULT';

       do i = 1 to dim(_n);

          _n = symgetN(cats('M',i));

          end;

       output;

       stop;

       drop i;

       run;

    Proc print;

       run;

    Proc score data=new score=coef out=scored;

       var &vars;

       run;

    proc print;

       run;

    Super User
    Posts: 9,687

    Re: sum of columns in data set multiplied by indexed macro variable

    NULL, symgetN() is very low efficient , that means your code is very slow . I prefer to make a marco varible to hold such expression .

    data _null_;
     set sashelp.vcolumn(keep=libname name memname type where=(libname='SASHELP' and memname='CLASS' and type='num')) end=last; 
     length list $ 4000;
     retain list;
     list=catx(',',list,cats(name,'*','&m ',_n_));
     if last then call symputx('list',list);
    run;
    
    %put %superq(list);
    
    %let M1=1;
    %let M2=2;
    %let M3=3;
    data want;
     set sashelp.class;
     sum=sum(&list );
    run;
    
    

    Ksharp

    Respected Advisor
    Posts: 3,777

    Re: sum of columns in data set multiplied by indexed macro variable

    Ksharp wrote:

    NULL, symgetN() is very low efficient , that means your code is very slow . I prefer to make a marco varible to hold such expression .

    The argument that my program is inefficient appears to be based on your misunderstanding of the data step to create COEF.  I believe if you study it more closely you will see SYMGETN would need to be very slow indeed to have any significant impact on performance.

    Super User
    Posts: 9,687

    Re: sum of columns in data set multiplied by indexed macro variable

    NULL,

    No, I am not misunderstanding . I know how slow the symgetn() would be at any situation .

    You put it on every single variable and every single observation which make your code very slow .

    Ksharp

    Respected Advisor
    Posts: 3,777

    Re: sum of columns in data set multiplied by indexed macro variable

    Yes it is obvious you don't understand because you don't know what you're talking about.

    How many observations do you think COEF has?

    Super Contributor
    Posts: 376

    Re: sum of columns in data set multiplied by indexed macro variable

    Excerpts from above:

    %let M1=1;

    %let M2=2;

    %let M3=3;

    %let M4=4;

    %let M5=5;

    %let vars = a b c d e;

    This is what PROC SCORE can do.  This examples uses the data macro vars from Alpay's example.

    data coef;

       if 0 then set new(keep=&vars);

       array _n

  • &vars;
  •    length _TYPE_ _NAME_ $8;

       retain _TYPE_ 'SCORE' _NAME_ 'RESULT';

       do i = 1 to dim(_n);

          _n = symgetN(cats('M',i));

          end;

       output;

       stop;

       drop i;

       run;

    So, reviewing the data coef datastep:

    1) No records are read from set new.  It's only used to set the PDV variable attributes for &vars (a b c d e).

    2) Create an array for a b c d e.

    3) Create additional variables _TYPE_='SCORE' and _NAME_='RESULT'

    4) Loop over the 5 element array, setting a b c d e to &m1, &m2, ..., &m5 respectively.

    5) Output the record

    6) Stop the data step (actually not needed but makes the code crystal clear).

    So, there is only one loop through the data step, and symgetn is called 5 times.  No records are read from an input dataset.

    I gotta agree with Data:  where do you think the poor performance would lie?  Your statement that symgetn is slow is in fact correct, in principle, but not in this instance due to the construction of the data step.

    On the other hand, your code:

    data _null_;

    set sashelp.vcolumn(keep=libname name memname type where=(libname='SASHELP' and memname='CLASS' and type='num')) end=last;

    length list $ 4000;

    retain list;

    list=catx(',',list,cats(name,'*','&m ',_n_));

    if last then call symputx('list',list);

    run;

    will need to process *every column* from *every dataset* in *every allocated library*!  And if some of those libraries are RDBMS libraries it will perform even more poorly.  Even though you're applying a where clause, the sashelp view will not use indexes, and in general will perform badly.  Honestly, allocate 20 libraries or so, with say 100 datasets per library, with say an average of 20 variables per dataset, and see how sashelp.vcolumn will perform.

    If you're going to use your approach, at least consider using PROC CONTENTS against your desired dataset, i.e. proc contents data=sashelp.class out=columns, then use your approach.

    But of the two approaches shown, yours and Data's, I believe Data's will perform much faster.

    Regards,

    Scott

    Ask a Question
    Discussion stats
    • 7 replies
    • 693 views
    • 0 likes
    • 5 in conversation