Hi, first ever question so sorry if I'm posting on the wrong place or something like that. Please let me know so I can re-post accordingly or edit in any info I've missed.
Version: SAS STUDIO | Release: 3.7 (Enterprise Edition)
So, I've got a sort of meta dataset that looks something like this:
data meta_ds; input varname:$32. id missvalue cap99; datalines; var1 1 0 10 var1 2 3 24 var1 3 5 50 var2 1 1 5 var2 2 10 40 var2 3 4 20 ; run;
Using this meta_ds I need to apply some transformations to another dataset depending on the id variable. Using the example meta_ds, these transformations would look something like this:
if id = 1 then do; if var1 = . then var1 = 0; if var1 > 10 then var1 = 10; if var2 = . then var2 = 1; . . end; else if id = 2 then do; if var1 = . then var1 = 3; . . end; else if id = 3 then do; . . end;
So basically:
if var = . then var = missvalue; if var > cap99 then var > cap99;
The dataset which I need to apply the transformations to is fairly big (around 40million rows) and making this process as fast and efficient as possible is one of the major concerns.
My question is:
Is hardcoding all the if statements the best approach? I know how to generate the code using something like proc sql, which could then be executed with %include for example... but what if I didn't want to have to write all those lines of code (a bit above 100 variables, 3 unique ids, 2 if statements for each variable and id combination, so a bit over 1000 lines of code).
One approach I tried was using the hash object:
data want; if _n_=0 then set meta_ds; DECLARE hash lookUpTable(dataset:'meta_ds'); lookUpTable.DEFINEKEY ('varName','id'); lookUpTable.DEFINEDATA('missValue','cap99'); lookUpTable.DEFINEDONE(); array numvar &varlistContinuas.; do until (eof); set have end=eof; do over numvar; call vname(numvar,varName); varName = upcase(varName); rc = lookUpTable.FIND(); if rc = 0 then do; if numvar = . then numvar = missValue; if numvar > cap99 then numvar = cap99; end; else putlog "ERROR:LookUpTable" varname; end; output; end; drop varName missValue cap99 rc; stop; run;
but this takes quite a bit longer to run (30 min vs <10 min).
Is there a better approach?
Added dummy have data:
data have; input id var1 var2 varn; datalines; 1 3 4 15 1 . 15 . 2 . . 10 3 60 24 9 2 25 50 . 3 . . 2 ; run;
I wouldn't give up on the hash object usage. I suspect that the call vname routine is taking up a lot of time. So instead of call vname (executed for every incoming observation) set up a temporary array of the varnames (no call vname routine):
data meta_ds;
input varname:$32. id missvalue cap99;
datalines;
var1 1 0 10
var1 2 3 24
var1 3 5 50
var2 1 1 5
var2 2 10 40
var2 3 4 20
;
run;
data have;
do id=1 to 3;
do var1=.,2,99;
do var2=.,2,99;
output;
end;
end;
end;
id=4; output;
run;
data want;
set have;
array vnames {2} $32 _temporary_ ('var1','var2');
array values {2} var1 var2;
if _n_=1 then do;
if 0 then set meta_ds;
declare hash h (dataset:'meta_ds');
h.definekey('id','varname');
h.definedata('missvalue','cap99');
h.definedone();
end;
do i=1 to dim(values);
rc=h.find(key:id,key:vnames{i});
if rc^=0 then put 'Lookup Error: ' id= 'Varname=' vnames{i};
else if values{i}=. then values{i}=missvalue;
else if values{i}>cap99 then values{i}=cap99;
end;
drop i varname missvalue cap99 rc;
run;
Now if you don't what to type in the initial values of the vnames array, you could do the call varname subroutine for the first obs only, and the resulting names will be retained through the data step, because that is one attribute of _TEMPORARY_ arrays:
data want;
set have;
array vnames {20} $32 _temporary_ ;
array values {*} var1 var2;
if _n_=1 then do;
do i=1 to dim(values);
call vname(values{i},vnames{i});
end;
if 0 then set meta_ds;
declare hash h (dataset:'meta_ds');
h.definekey('id','varname');
h.definedata('missvalue','cap99');
h.definedone();
end;
do i=1 to dim(values);
rc=h.find(key:id,key:vnames{i});
if rc^=0 then put 'Lookup Error: ' id= 'Varname=' vnames{i};
else if values{i}=. then values{i}=missvalue;
else if values{i}>cap99 then values{i}=cap99;
end;
drop i varname missvalue cap99 rc;
run;
Hi and welcome to the forum. Can you post an example of your have data set?
Also, what is the point of this statement?
varName = upcase(varName);
Have you looked into creating custom formats instead? Given the rules you're applying a format should work fine.
If you need to explicitly recode them, you could probably do it with an array.
@pepegalleta wrote:
Hi, first ever question so sorry if I'm posting on the wrong place or something like that. Please let me know so I can re-post accordingly or edit in any info I've missed.
Version: SAS STUDIO | Release: 3.7 (Enterprise Edition)
So, I've got a sort of meta dataset that looks something like this:
data meta_ds; input varname:$32. id missvalue cap99; datalines; var1 1 0 10 var1 2 3 24 var1 3 5 50 var2 1 1 5 var2 2 10 40 var2 3 4 20 ; run;
Using this meta_ds I need to apply some transformations to another dataset depending on the id variable. Using the example meta_ds, these transformations would look something like this:
if id = 1 then do; if var1 = . then var1 = 0; if var1 > 10 then var1 = 10; if var2 = . then var2 = 1; . . end; else if id = 2 then do; if var1 = . then var1 = 3; . . end; else if id = 3 then do; . . end;
So basically:
if var = . then var = missvalue; if var > cap99 then var > cap99;
The dataset which I need to apply the transformations to is fairly big (around 40million rows) and making this process as fast and efficient as possible is one of the major concerns.
My question is:
Is hardcoding all the if statements the best approach? I know how to generate the code using something like proc sql, which could then be executed with %include for example... but what if I didn't want to have to write all those lines of code (a bit above 100 variables, 3 unique ids, 2 if statements for each variable and id combination, so a bit over 1000 lines of code).
One approach I tried was using the hash object:
data want; if _n_=0 then set meta_ds; DECLARE hash lookUpTable(dataset:'meta_ds'); lookUpTable.DEFINEKEY ('varName','id'); lookUpTable.DEFINEDATA('missValue','cap99'); lookUpTable.DEFINEDONE(); array numvar &varlistContinuas.; do until (eof); set have end=eof; do over numvar; call vname(numvar,varName); varName = upcase(varName); rc = lookUpTable.FIND(); if rc = 0 then do; if numvar = . then numvar = missValue; if numvar > cap99 then numvar = cap99; end; else putlog "ERROR:LookUpTable" varname; end; output; end; drop varName missValue cap99 rc; stop; run;
but this takes quite a bit longer to run (30 min vs <10 min).
Is there a better approach?
I wouldn't give up on the hash object usage. I suspect that the call vname routine is taking up a lot of time. So instead of call vname (executed for every incoming observation) set up a temporary array of the varnames (no call vname routine):
data meta_ds;
input varname:$32. id missvalue cap99;
datalines;
var1 1 0 10
var1 2 3 24
var1 3 5 50
var2 1 1 5
var2 2 10 40
var2 3 4 20
;
run;
data have;
do id=1 to 3;
do var1=.,2,99;
do var2=.,2,99;
output;
end;
end;
end;
id=4; output;
run;
data want;
set have;
array vnames {2} $32 _temporary_ ('var1','var2');
array values {2} var1 var2;
if _n_=1 then do;
if 0 then set meta_ds;
declare hash h (dataset:'meta_ds');
h.definekey('id','varname');
h.definedata('missvalue','cap99');
h.definedone();
end;
do i=1 to dim(values);
rc=h.find(key:id,key:vnames{i});
if rc^=0 then put 'Lookup Error: ' id= 'Varname=' vnames{i};
else if values{i}=. then values{i}=missvalue;
else if values{i}>cap99 then values{i}=cap99;
end;
drop i varname missvalue cap99 rc;
run;
Now if you don't what to type in the initial values of the vnames array, you could do the call varname subroutine for the first obs only, and the resulting names will be retained through the data step, because that is one attribute of _TEMPORARY_ arrays:
data want;
set have;
array vnames {20} $32 _temporary_ ;
array values {*} var1 var2;
if _n_=1 then do;
do i=1 to dim(values);
call vname(values{i},vnames{i});
end;
if 0 then set meta_ds;
declare hash h (dataset:'meta_ds');
h.definekey('id','varname');
h.definedata('missvalue','cap99');
h.definedone();
end;
do i=1 to dim(values);
rc=h.find(key:id,key:vnames{i});
if rc^=0 then put 'Lookup Error: ' id= 'Varname=' vnames{i};
else if values{i}=. then values{i}=missvalue;
else if values{i}>cap99 then values{i}=cap99;
end;
drop i varname missvalue cap99 rc;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.