I have this primary dataset that looks like this, with thousands of lines and hundreds of columns:
file baby back ball bear beat begin call
text1 3 4 0 1 0 1 0
text2 2 0 0 3 3 1 2
text3 0 0 0 1 0 5 1
text4 1 1 2 1 1 4 1
And I also have this secondary dataset that looks like this, which spells out the variables in the primary dataset that should be added up:
word pole
baby f3neg
back f1pos
ball f1pos
bear f2pos
beat f2neg
begin f2neg
call f3pos
What I'd like to do is, for each value of 'file' in the primary database, to sum up the values for the words in the primary database according to the distribution of the words in the secondary database, like this:
F1=(sum of all f1pos words if any) - (sum of all f1neg words if any)
F2=(sum of all f2pos words if any) - (sum of all f2neg words if any)
F3=(sum of all f3pos words if any) - (sum of all f3neg words if any)
For example:
F1 for text1 = ( 4 for back + 0 for ball ) - ( 0 because no words are marked as f1neg in the secondary database ) = 4
F1 for text2 = ( 0 for back + 0 for ball ) - ( 0 because no words are marked as f1neg in the secondary database ) = 0
F1 for text3 = ( 0 for back + 0 for ball ) - ( 0 because no words are marked as f1neg in the secondary database ) = 0
F1 for text4 = ( 1 for back + 2 for ball ) - ( 0 because no words are marked as f1neg in the secondary database ) = 3
F2 for text1 = ( 1 for bear ) - ( 0 for beat + 1 for begin ) = 0
F2 for text2 = ( 3 for bear ) - ( 3 for beat + 1 for begin ) = -1
F2 for text3 = ( 1 for bear ) - ( 0 for beat + 5 for begin ) = -4
F2 for text4 = ( 1 for bear ) - ( 1 for beat + 4 for begin ) = -4
and so on for F3.
and then add this back to the primary database:
file baby back ball bear beat begin call F1 F2 F3
text1 3 4 0 1 0 1 0 4 0 -3
text2 2 0 0 3 3 1 2 0 -1 0
text3 0 0 0 1 0 5 1 0 -4 1
text4 1 1 2 1 1 4 1 3 -4 0
How could this be done in SAS?
thank you ahead
If you have SAS/STAT licensed then use PROC SCORE. Your metadata needs to be in the format of a scoring dataset.
Obs _type_ _name_ back ball bear beat begin baby call
1 SCORE f1 1 1 . . . . .
2 SCORE f2 . . 1 -1 -1 . .
3 SCORE f3 . . . . . -1 1
Then you can use it with PROC SCORE.
proc score data=one score=score out=want;
run;
You can convert your existing layout with just a little coding.
data two;
input source :$32. target :$32. factor ;
datalines;
baby f3 -1
back f1 1
ball f1 1
bear f2 1
beat f2 -1
begin f2 -1
call f3 1
;
proc sort data=two;
by target ;
run;
proc transpose data=two out=score ;
by target ;
id source;
var factor;
run;
data score;
_type_='SCORE';
set score;
drop _name_;
rename target=_name_;
run;
@tonybesas Hi and welcome to the SAS Community 🙂
Here is one way
data one;
input file $ baby back ball bear beat begin call;
datalines;
text1 3 4 0 1 0 1 0
text2 2 0 0 3 3 1 2
text3 0 0 0 1 0 5 1
text4 1 1 2 1 1 4 1
;
data two;
input word $ pole $;
datalines;
baby f3neg
back f1pos
ball f1pos
bear f2pos
beat f2neg
begin f2neg
call f3pos
;
data want(drop=i v word pole);
if _N_=1 then do;
declare hash h(dataset:'two');
h.definekey('word', 'pole');
h.definedone();
end;
set one;
if 0 then set two;
array _ {*} _numeric_;
do i=1 to dim(_);
v=vname(_[i]);
if h.check(key:v, key:'f1pos')=0 then F1=sum(F1, _[i]);
if h.check(key:v, key:'f1neg')=0 then F1=sum(F1, -1*_[i]);
if h.check(key:v, key:'f2pos')=0 then F2=sum(F2, _[i]);
if h.check(key:v, key:'f2neg')=0 then F2=sum(F2, -1*_[i]);
if h.check(key:v, key:'f3pos')=0 then F3=sum(F3, _[i]);
if h.check(key:v, key:'f3neg')=0 then F3=sum(F3, -1*_[i]);
end;
run;
Result
file baby back ball bear beat begin call F1 F2 F3 text1 3 4 0 1 0 1 0 4 0 -3 text2 2 0 0 3 3 1 2 0 -1 0 text3 0 0 0 1 0 5 1 0 -4 1 text4 1 1 2 1 1 4 1 3 -4 0
This works great, really fast! Thank you for taking the time to do this, draycut!
UNTESTED CODE
%macro dothis;
proc sql noprint;
select distinct word into :f1pos separated by ',' from secondary_data_set
where lowcase(pole)='f1pos';
select distinct word into :f1neg separated by ',' from secondary_data_set
where lowcase(pole)='f1neg';
select distinct word into :f2pos separated by ',' from secondary_data_set
where lowcase(pole)='f2pos';
select distinct word into :f2neg separated by ',' from secondary_data_set
where lowcase(pole)='f2neg';
select distinct word into :f3pos separated by ',' from secondary_data_set
where lowcase(pole)='f3pos';
select distinct word into :f3neg separated by ',' from secondary_data_set
where lowcase(pole)='f3neg';
quit;
data want;
set have;
f1=%if &f1pos^= %then sum(&f1pos); -
%if &f1neg^= %then sum(&f1neg); ;
/* you do the typing for F2 and F3, I'm lazy */
run;
%mend;
%dothis
Thank you for contributing this code! it works really well!
Hi @tonybesas First off, very interesting question from the perspective of
1. How to build an algorithm that will scale
2. How to structure the flow
3. How to look up, rotate, look up and again rotate or in other words going from vertical to horizontal and vice versa.
The following is simple data-driven and yet will scale. The positive and negatives are the only categorical component that matters in a series and the series can be any number.
data one;
input file $ baby back ball bear beat begin call;
datalines;
text1 3 4 0 1 0 1 0
text2 2 0 0 3 3 1 2
text3 0 0 0 1 0 5 1
text4 1 1 2 1 1 4 1
;
data two;
input word $ pole $;
datalines;
baby f3neg
back f1pos
ball f1pos
bear f2pos
beat f2neg
begin f2neg
call f3pos
;
proc transpose data=one out=t;
by file;
var baby--call;
run;
proc sql;
create table t2 as
select file,substr(pole,1,2) as series, substr(pole,3) as sign,sum(col1) as s
from t a left join two b
on a._name_=b.word
group by file,series,sign
order by file,series,sign;
quit;
data t3;
set t2;
by file series;
dif=dif(s);
if first.series then dif=s;
if last.series;
run;
proc transpose data=t3(keep=file series dif) out=want(drop=_name_);
by file ;
id series;
var dif;
run;
data final_want;
merge one want;
by file;
run;
data one;
input file $ baby back ball bear beat begin call;
datalines;
text1 3 4 0 1 0 1 0
text2 2 0 0 3 3 1 2
text3 0 0 0 1 0 5 1
text4 1 1 2 1 1 4 1
;
data two;
input word $ pole $;
datalines;
baby f3neg
back f1pos
ball f1pos
bear f2pos
beat f2neg
begin f2neg
call f3pos
;
data temp;
if _n_=1 then do;
if 0 then set one two;
dcl hash H (dataset:'two');
h.definekey ("word");
h.definedata ("pole");
h.definedone ();
dcl hash H1 (ordered:'y');
h1.definekey ("pole");
h1.definedata ("pole","s");
h1.definedone ();
dcl hiter hi('h1');
end;
set one;
array t baby--call;
do over t;
rc=h.find(key:vname(t));
if h1.find() ne 0 then s=t;
else s=sum(s,t);
h1.replace();
end;
do while(hi.next()=0);
series=substr(pole,1,2);
output;
end;
h1.clear();
keep file series s;
run;
proc sql noprint;
select distinct series into :series separated by ' '
from w1;
quit;
data want;
if _n_=1 then do;
if 0 then set one ;
dcl hash H (dataset:'one') ;
h.definekey ("file") ;
h.definedata (all:'y') ;
h.definedone () ;
array t(*) &series;
end;
do _n_=1 by 1 until(last.file);
do until(last.series);
set temp;
by file series;
dif=dif(s);
if first.series then dif=s;
end;
h.find();
t(_n_)=dif;
end;
drop series dif s;
run;
Great code, took care of the job! thank you, novinosrin, for taking the time to contribute!
Sorry, I meant to reply to your first code, novinosrin, but somehow this ended up beneath your second code. It turns out, the second code generates an error message, pointing to the ampersand:
Great code, took care of the job! thank you, novinosrin, for taking the time to contribute!
I would format you metadata table to be easier to use for generating the code.
data two;
input source :$32. target :$32. sign :$1.;
datalines;
baby f3 -
back f1 +
ball f1 +
bear f2 +
beat f2 -
begin f2 -
call f3 +
;
proc sort data=two;
by target source ;
run;
Then you can use that metadata to generate the statements you need.
filename code temp;
data _null_;
file code ;
set two ;
by target ;
if first.target then put target '=sum(' @;
else put ',' @;
put sign source @ ;
if last.target then put ');' ;
run;
Resulting text file:
f1 =sum(+ back ,+ ball ); f2 =sum(+ bear ,- beat ,- begin ); f3 =sum(- baby ,+ call );
You can then use that code against your input data.
38 data want; 39 set one ; 40 %include code / source2; NOTE: %INCLUDE (level 1) file CODE is file /Volumes/scratch/SAS_work3B6800005EBE_amrdrml464.pfizer.com/#LN00020. 41 +f1 =sum(+ back ,+ ball ); 42 +f2 =sum(+ bear ,- beat ,- begin ); 43 +f3 =sum(- baby ,+ call ); NOTE: %INCLUDE (level 1) ending. 44 run;
Result:
Obs file baby back ball bear beat begin call f1 f2 f3 1 text1 3 4 0 1 0 1 0 4 0 -3 2 text2 2 0 0 3 3 1 2 0 -1 0 3 text3 0 0 0 1 0 5 1 0 -4 1 4 text4 1 1 2 1 1 4 1 3 -4 0
thank you, Tom, this worked really well!
If you have SAS/STAT licensed then use PROC SCORE. Your metadata needs to be in the format of a scoring dataset.
Obs _type_ _name_ back ball bear beat begin baby call
1 SCORE f1 1 1 . . . . .
2 SCORE f2 . . 1 -1 -1 . .
3 SCORE f3 . . . . . -1 1
Then you can use it with PROC SCORE.
proc score data=one score=score out=want;
run;
You can convert your existing layout with just a little coding.
data two;
input source :$32. target :$32. factor ;
datalines;
baby f3 -1
back f1 1
ball f1 1
bear f2 1
beat f2 -1
begin f2 -1
call f3 1
;
proc sort data=two;
by target ;
run;
proc transpose data=two out=score ;
by target ;
id source;
var factor;
run;
data score;
_type_='SCORE';
set score;
drop _name_;
rename target=_name_;
run;
Yes, I have PROC SCORE in my installation, and this works really well! Thank you, Tom, for your contribution!
Something like below should work - under the assumption that the order of WORD values in TWO matches that of the like-named variables in ONE:
data one ;
input file $ baby back ball bear beat begin call ;
lines ;
text1 3 4 0 1 0 1 0
text2 2 0 0 3 3 1 2
text3 0 0 0 1 0 5 1
text4 1 1 2 1 1 4 1
;
run ;
data two ;
input word $ pole $ ;
lines ;
baby f3neg
back f1pos
ball f1pos
bear f2pos
beat f2neg
begin f2neg
call f3pos
;
run ;
data want ;
array x [2,7] _temporary_ ;
if _n_ = 1 then do _i_ = 1 by 1 until (z) ;
set two (keep = pole) end = z ;
x[1,_i_] = input (char (pole, 2), 1.) ;
x[2,_i_] = choosen (findc ("np", char (pole,3)), -1, 1) ;
end ;
set one ;
array v baby--call ;
array F [3] ;
do over v ;
_n_ = x[1,_i_] ;
f[_n_] = sum (f[_n_], v * x[2,_i_]) ;
end ;
run ;
Or, alternatively, the array X can be replaced by a hash table, which has the advantage of not making the assumption mentioned above, as the attributes for each variable (which F and which sign) are searched by the variable name in the table X:
data want (drop = _:) ;
if _n_ = 1 then do ;
dcl hash x () ;
x.definekey ("_w") ;
x.definedata ("_n_", "_iorc_") ;
x.definedone () ;
do _n_ = 1 by 1 until (z) ;
set two (rename=word=_w) end = z ;
x.add (key:_w, data:input(char(pole,2),1.), data:choosen(findc("np",char (pole,3)),-1,1)) ;
end ;
end ;
set one ;
array v baby--call ;
array F [3] ;
do over v ;
x.find (key:vname(v)) ;
f[_n_] = sum (f[_n_], v * _iorc_) ;
end ;
run ;
Kind regards
Paul D.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.