Solved: Re: Adding up variables in one dataset based on a second dataset

tonybesas · Posted 11-12-2019 08:30 AM

I have this primary dataset that looks like this, with thousands of lines and hundreds of columns:

file   baby back ball bear beat begin call
text1  3    4    0    1    0    1     0
text2  2    0    0    3    3    1     2
text3  0    0    0    1    0    5     1
text4  1    1    2    1    1    4     1

And I also have this secondary dataset that looks like this, which spells out the variables in the primary dataset that should be added up:

word pole
baby f3neg
back f1pos
ball f1pos
bear f2pos
beat f2neg
begin f2neg
call f3pos

What I'd like to do is, for each value of 'file' in the primary database, to sum up the values for the words in the primary database according to the distribution of the words in the secondary database, like this:

F1=(sum of all f1pos words if any) - (sum of all f1neg words if any)
F2=(sum of all f2pos words if any) - (sum of all f2neg words if any)
F3=(sum of all f3pos words if any) - (sum of all f3neg words if any)

For example:

F1 for text1 = ( 4 for back + 0 for ball ) - ( 0 because no words are marked as f1neg in the secondary database ) = 4
F1 for text2 = ( 0 for back + 0 for ball ) - ( 0 because no words are marked as f1neg in the secondary database ) = 0
F1 for text3 = ( 0 for back + 0 for ball ) - ( 0 because no words are marked as f1neg in the secondary database ) = 0
F1 for text4 = ( 1 for back + 2 for ball ) - ( 0 because no words are marked as f1neg in the secondary database ) = 3

F2 for text1 = ( 1 for bear ) - ( 0 for beat + 1 for begin ) = 0
F2 for text2 = ( 3 for bear ) - ( 3 for beat + 1 for begin ) = -1
F2 for text3 = ( 1 for bear ) - ( 0 for beat + 5 for begin ) = -4
F2 for text4 = ( 1 for bear ) - ( 1 for beat + 4 for begin ) = -4

and so on for F3.

and then add this back to the primary database:

file   baby back ball bear beat begin call  F1   F2   F3
text1  3    4    0    1    0    1     0     4     0   -3
text2  2    0    0    3    3    1     2     0    -1    0 
text3  0    0    0    1    0    5     1     0    -4    1
text4  1    1    2    1    1    4     1     3    -4    0

How could this be done in SAS?

thank you ahead

Tom · Posted 11-12-2019 11:04 AM

If you have SAS/STAT licensed then use PROC SCORE. Your metadata needs to be in the format of a scoring dataset.

Obs    _type_    _name_    back    ball    bear    beat    begin    baby    call

 1     SCORE       f1        1       1       .       .        .       .       .
 2     SCORE       f2        .       .       1      -1       -1       .       .
 3     SCORE       f3        .       .       .       .        .      -1       1

Then you can use it with PROC SCORE.

proc score data=one score=score out=want;
run;

You can convert your existing layout with just a little coding.

data two;
  input source :$32. target :$32. factor ;
datalines;
baby f3 -1
back f1 1
ball f1 1
bear f2 1
beat f2 -1
begin f2 -1
call f3 1
;

proc sort data=two;
  by target ;
run;
proc transpose data=two out=score ;
  by target ;
  id source;
  var factor;
run;
data score;
  _type_='SCORE';
  set score;
  drop _name_;
  rename target=_name_;
run;

View solution in original post

PeterClemmensen · Posted 11-12-2019 08:52 AM

@tonybesas Hi and welcome to the SAS Community 🙂

Here is one way

data one;
input file $ baby back ball bear beat begin call;
datalines;
text1 3 4 0 1 0 1 0
text2 2 0 0 3 3 1 2
text3 0 0 0 1 0 5 1
text4 1 1 2 1 1 4 1
;

data two;
input word $ pole $;
datalines;
baby f3neg
back f1pos
ball f1pos
bear f2pos
beat f2neg
begin f2neg
call f3pos
;

data want(drop=i v word pole);
    if _N_=1 then do;
        declare hash h(dataset:'two');
        h.definekey('word', 'pole');
        h.definedone();
    end;

    set one;
    if 0 then set two;
    array _ {*} _numeric_;
    
    do i=1 to dim(_);
        v=vname(_[i]);
        if h.check(key:v, key:'f1pos')=0 then F1=sum(F1,    _[i]);
        if h.check(key:v, key:'f1neg')=0 then F1=sum(F1, -1*_[i]);
        if h.check(key:v, key:'f2pos')=0 then F2=sum(F2,    _[i]);
        if h.check(key:v, key:'f2neg')=0 then F2=sum(F2, -1*_[i]);
        if h.check(key:v, key:'f3pos')=0 then F3=sum(F3,    _[i]);
        if h.check(key:v, key:'f3neg')=0 then F3=sum(F3, -1*_[i]);
    end;
run;

Result

file  baby back ball bear beat begin call F1  F2  F3
text1 3    4    0    1    0    1     0    4   0   -3
text2 2    0    0    3    3    1     2    0   -1  0
text3 0    0    0    1    0    5     1    0   -4  1
text4 1    1    2    1    1    4     1    3   -4  0

The DATA to DATA Step Macro
Blog: SASnrd

tonybesas · Posted 11-12-2019 11:53 PM

This works great, really fast! Thank you for taking the time to do this, draycut!

PaigeMiller · Posted 11-12-2019 08:59 AM

UNTESTED CODE

%macro dothis;
proc sql noprint;
    select distinct word into :f1pos separated by ',' from secondary_data_set
        where lowcase(pole)='f1pos';
    select distinct word into :f1neg separated by ',' from secondary_data_set
        where lowcase(pole)='f1neg';
    select distinct word into :f2pos separated by ',' from secondary_data_set
        where lowcase(pole)='f2pos';
    select distinct word into :f2neg separated by ',' from secondary_data_set
        where lowcase(pole)='f2neg';
    select distinct word into :f3pos separated by ',' from secondary_data_set
        where lowcase(pole)='f3pos';
    select distinct word into :f3neg separated by ',' from secondary_data_set
        where lowcase(pole)='f3neg';
quit;

data want;
    set have;
    f1=%if &f1pos^= %then sum(&f1pos); -
       %if &f1neg^= %then sum(&f1neg); ;
    /* you do the typing for F2 and F3, I'm lazy */
run;
%mend;

%dothis

--
Paige Miller

tonybesas · Posted 11-13-2019 12:03 AM

Thank you for contributing this code! it works really well!

novinosrin · Posted 11-12-2019 10:12 AM

Hi @tonybesas First off, very interesting question from the perspective of

1. How to build an algorithm that will scale

2. How to structure the flow

3. How to look up, rotate, look up and again rotate or in other words going from vertical to horizontal and vice versa.

The following is simple data-driven and yet will scale. The positive and negatives are the only categorical component that matters in a series and the series can be any number.


data one;
input file $ baby back ball bear beat begin call;
datalines;
text1 3 4 0 1 0 1 0
text2 2 0 0 3 3 1 2
text3 0 0 0 1 0 5 1
text4 1 1 2 1 1 4 1
;

data two;
input word $ pole $;
datalines;
baby f3neg
back f1pos
ball f1pos
bear f2pos
beat f2neg
begin f2neg
call f3pos
;
proc transpose data=one out=t;
by file;
var baby--call;
run;
proc sql;
create table t2 as
select file,substr(pole,1,2) as series, substr(pole,3) as sign,sum(col1) as s
from t a left join two b
on a._name_=b.word
group by file,series,sign
order by file,series,sign;
quit;

data t3;
set t2;
by file series;
dif=dif(s);
if first.series then dif=s;
if last.series;
run;

proc transpose data=t3(keep=file series dif) out=want(drop=_name_);
by file ;
id series;
var dif;
run;

data final_want;
merge one want;
by file;
run;

novinosrin · Posted 11-12-2019 11:49 AM


data one;
input file $ baby back ball bear beat begin call;
datalines;
text1 3 4 0 1 0 1 0
text2 2 0 0 3 3 1 2
text3 0 0 0 1 0 5 1
text4 1 1 2 1 1 4 1
;

data two;
input word $ pole $;
datalines;
baby f3neg
back f1pos
ball f1pos
bear f2pos
beat f2neg
begin f2neg
call f3pos
;

data temp;
 if _n_=1 then do;
  if 0 then set one two;
  dcl hash H (dataset:'two');
  h.definekey  ("word");
  h.definedata ("pole");
  h.definedone ();
  dcl hash H1 (ordered:'y');
  h1.definekey  ("pole");
  h1.definedata ("pole","s");
  h1.definedone ();
  dcl hiter hi('h1');
 end;
 set one;
 array t baby--call;
 do over t;
  rc=h.find(key:vname(t));
  if h1.find() ne 0 then s=t;
  else s=sum(s,t);
  h1.replace();
 end;
 do while(hi.next()=0);
  series=substr(pole,1,2);
  output;
 end;
 h1.clear();
 keep file series  s;
run;

proc sql noprint;
select distinct series into :series separated by ' '
from w1;
quit;

data want;
 if _n_=1 then do;
 if 0 then set one ;
   dcl hash H (dataset:'one') ;
   h.definekey  ("file") ;
   h.definedata (all:'y') ;
   h.definedone () ; 
   array t(*) &series;
 end;
 do _n_=1 by 1 until(last.file);
  do until(last.series);
   set temp;
   by file series;
   dif=dif(s);
   if first.series then dif=s;
  end;
  h.find();
  t(_n_)=dif;
 end;
 drop series dif s;
run;

tonybesas · Posted 11-13-2019 12:23 AM

Great code, took care of the job! thank you, novinosrin, for taking the time to contribute!

tonybesas · Posted 11-13-2019 12:38 AM

Sorry, I meant to reply to your first code, novinosrin, but somehow this ended up beneath your second code. It turns out, the second code generates an error message, pointing to the ampersand:

array t(*) &series;

ERROR 22-322: Syntax error, expecting one of the following: a name, a numeric constant, $, (, ;, VARCHAR, _ALL_, _CHARACTER_,

_CHAR_, _NUMERIC_, _TEMPORARY_.

tonybesas · Posted 11-13-2019 12:33 AM

Great code, took care of the job! thank you, novinosrin, for taking the time to contribute!

Tom · Posted 11-12-2019 10:31 AM

I would format you metadata table to be easier to use for generating the code.

data two;
  input source :$32. target :$32. sign :$1.;
datalines;
baby f3 -
back f1 +
ball f1 +
bear f2 +
beat f2 -
begin f2 -
call f3 +
;

proc sort data=two;
  by target source ;
run;

Then you can use that metadata to generate the statements you need.

filename code temp;
data _null_;
  file code ;
  set two ;
  by target ;
  if first.target then put target '=sum(' @;
  else put ',' @;
  put sign source @ ;
  if last.target then put ');' ;
run;

Resulting text file:

f1 =sum(+ back ,+ ball );
f2 =sum(+ bear ,- beat ,- begin );
f3 =sum(- baby ,+ call );

You can then use that code against your input data.

38   data want;
39     set one ;
40     %include code / source2;
NOTE: %INCLUDE (level 1) file CODE is file /Volumes/scratch/SAS_work3B6800005EBE_amrdrml464.pfizer.com/#LN00020.
41  +f1 =sum(+ back ,+ ball );
42  +f2 =sum(+ bear ,- beat ,- begin );
43  +f3 =sum(- baby ,+ call );
NOTE: %INCLUDE (level 1) ending.
44   run;

Result:

Obs    file     baby    back    ball    bear    beat    begin    call    f1    f2    f3

 1     text1      3       4       0       1       0       1        0      4     0    -3
 2     text2      2       0       0       3       3       1        2      0    -1     0
 3     text3      0       0       0       1       0       5        1      0    -4     1
 4     text4      1       1       2       1       1       4        1      3    -4     0

tonybesas · Posted 11-13-2019 06:35 AM

thank you, Tom, this worked really well!

Tom · Posted 11-12-2019 11:04 AM

If you have SAS/STAT licensed then use PROC SCORE. Your metadata needs to be in the format of a scoring dataset.

Obs    _type_    _name_    back    ball    bear    beat    begin    baby    call

 1     SCORE       f1        1       1       .       .        .       .       .
 2     SCORE       f2        .       .       1      -1       -1       .       .
 3     SCORE       f3        .       .       .       .        .      -1       1

Then you can use it with PROC SCORE.

proc score data=one score=score out=want;
run;

You can convert your existing layout with just a little coding.

data two;
  input source :$32. target :$32. factor ;
datalines;
baby f3 -1
back f1 1
ball f1 1
bear f2 1
beat f2 -1
begin f2 -1
call f3 1
;

proc sort data=two;
  by target ;
run;
proc transpose data=two out=score ;
  by target ;
  id source;
  var factor;
run;
data score;
  _type_='SCORE';
  set score;
  drop _name_;
  rename target=_name_;
run;

tonybesas · Posted 11-13-2019 01:28 AM

Yes, I have PROC SCORE in my installation, and this works really well! Thank you, Tom, for your contribution!

hashman · Posted 11-12-2019 06:26 PM

@tonybesas:

Something like below should work - under the assumption that the order of WORD values in TWO matches that of the like-named variables in ONE:

data one ;                                                                                                                                                                                                                                                      
  input file $ baby back ball bear beat begin call ;                                                                                                                                                                                                            
  lines ;                                                                                                                                                                                                                                                       
text1 3 4 0 1 0 1 0                                                                                                                                                                                                                                             
text2 2 0 0 3 3 1 2                                                                                                                                                                                                                                             
text3 0 0 0 1 0 5 1                                                                                                                                                                                                                                             
text4 1 1 2 1 1 4 1                                                                                                                                                                                                                                             
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
data two ;                                                                                                                                                                                                                                                      
  input word $ pole $ ;                                                                                                                                                                                                                                         
  lines ;                                                                                                                                                                                                                                                       
baby  f3neg                                                                                                                                                                                                                                                     
back  f1pos                                                                                                                                                                                                                                                     
ball  f1pos                                                                                                                                                                                                                                                     
bear  f2pos                                                                                                                                                                                                                                                     
beat  f2neg                                                                                                                                                                                                                                                     
begin f2neg                                                                                                                                                                                                                                                     
call  f3pos                                                                                                                                                                                                                                                     
;                                                                                                                                                                                                                                                               
run ;                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                
data want ;                                                                                                                                                                                                                                                     
  array x [2,7] _temporary_ ;                                                                                                                                                                                                                                   
  if _n_ = 1 then do _i_ = 1 by 1 until (z) ;                                                                                                                                                                                                                   
    set two (keep = pole) end = z ;                                                                                                                                                                                                                             
    x[1,_i_] = input (char (pole, 2), 1.) ;                                                                                                                                                                                                                     
    x[2,_i_] = choosen (findc ("np", char (pole,3)), -1, 1) ;                                                                                                                                                                                                   
  end ;                                                                                                                                                                                                                                                         
  set one ;                                                                                                                                                                                                                                                     
  array v baby--call ;                                                                                                                                                                                                                                          
  array F [3] ;                                                                                                                                                                                                                                                 
  do over v ;                                                                                                                                                                                                                                                   
    _n_ = x[1,_i_] ;                                                                                                                                                                                                                                            
    f[_n_] = sum (f[_n_], v * x[2,_i_]) ;                                                                                                                                                                                                                       
  end ;                                                                                                                                                                                                                                                         
run ;

Or, alternatively, the array X can be replaced by a hash table, which has the advantage of not making the assumption mentioned above, as the attributes for each variable (which F and which sign) are searched by the variable name in the table X:

data want (drop = _:) ;                                                                                                                                                                                                                                         
  if _n_ = 1 then do ;                                                                                                                                                                                                                                          
    dcl hash x () ;                                                                                                                                                                                                                                             
    x.definekey ("_w") ;                                                                                                                                                                                                                                        
    x.definedata ("_n_", "_iorc_") ;                                                                                                                                                                                                                            
    x.definedone () ;                                                                                                                                                                                                                                           
    do _n_ = 1 by 1 until (z) ;                                                                                                                                                                                                                                 
      set two (rename=word=_w) end = z ;                                                                                                                                                                                                                        
      x.add (key:_w, data:input(char(pole,2),1.), data:choosen(findc("np",char (pole,3)),-1,1)) ;                                                                                                                                                               
    end ;                                                                                                                                                                                                                                                       
  end ;                                                                                                                                                                                                                                                         
  set one ;                                                                                                                                                                                                                                                     
  array v baby--call ;                                                                                                                                                                                                                                          
  array F [3] ;                                                                                                                                                                                                                                                 
  do over v ;                                                                                                                                                                                                                                                   
    x.find (key:vname(v)) ;                                                                                                                                                                                                                                     
    f[_n_] = sum (f[_n_], v * _iorc_) ;                                                                                                                                                                                                                         
  end ;                                                                                                                                                                                                                                                         
run ;

Kind regards

Paul D.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away