BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
pepegalleta
Fluorite | Level 6

Hi, first ever question so sorry if I'm posting on the wrong place or something like that. Please let me know so I can re-post accordingly or edit in any info I've missed.

 

Version: SAS STUDIO | Release: 3.7 (Enterprise Edition)

 

So, I've got a sort of meta dataset that looks something like this:

 

data meta_ds;
  input varname:$32. id missvalue cap99;
  datalines;
  var1 1 0 10
  var1 2 3 24
  var1 3 5 50
  var2 1 1 5
  var2 2 10 40
  var2 3 4 20
  ;
run;

 

 

Using this meta_ds I need to apply some transformations to another dataset depending on the id variable. Using the example meta_ds, these transformations would look something like this:

 

if id = 1 then do;
  if var1 = . then var1 = 0;
  if var1 > 10 then var1 = 10;
  if var2 = . then var2 = 1;
  .
  .
end;

else if id = 2 then do;
  if var1 = . then var1 = 3;
  .
  .
end;

else if id = 3 then do;
  .
  .
end;

 

So basically:

if var = . then var = missvalue;
if var > cap99 then var > cap99;

 

The dataset which I need to apply the transformations to is fairly big (around 40million rows) and making this process as fast and efficient as possible is one of the major concerns.

 

My question is:

 

Is hardcoding all the if statements the best approach? I know how to generate the code using something like proc sql, which could then be executed with %include for example... but what if I didn't want to have to write all those lines of code (a bit above 100 variables, 3 unique ids, 2 if statements for each variable and id combination, so a bit over 1000 lines of code).

 

One approach I tried was using the hash object:

 

data want;
  if _n_=0 then set meta_ds;
  
  DECLARE hash lookUpTable(dataset:'meta_ds');
    lookUpTable.DEFINEKEY ('varName','id');     
    lookUpTable.DEFINEDATA('missValue','cap99');      
    lookUpTable.DEFINEDONE();

  array numvar &varlistContinuas.;
  
  do until (eof);
    set have end=eof;
    
    do over numvar;
      call vname(numvar,varName);
      varName = upcase(varName);
      rc = lookUpTable.FIND();
      if rc = 0 then do;
        if numvar = .     then numvar = missValue;
        if numvar > cap99 then numvar = cap99;
      end;
      else putlog "ERROR:LookUpTable" varname;
    end;
    output;
  end;
  
  drop varName missValue cap99 rc;  
  stop;
run;

 

but this takes quite a bit longer to run (30 min vs <10 min).

 

Is there a better approach?

 

 Added dummy have data:

data have;
  input id var1 var2 varn;
  datalines;
  1 3 4 15
  1 . 15 .
  2 . . 10
  3 60 24 9
  2 25 50 .
  3 . . 2
  ;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

I wouldn't give up on the hash object usage.  I suspect that the call vname routine is taking up a lot of time.  So instead of call vname (executed for every incoming observation) set up a temporary array of the varnames (no call vname routine):

 

 

data meta_ds;
  input varname:$32. id missvalue cap99;
  datalines;
var1 1 0 10
var1 2 3 24
var1 3 5 50
var2 1 1 5
var2 2 10 40
var2 3 4 20
;
run;

data have;
  do id=1 to 3;
    do var1=.,2,99;
      do var2=.,2,99;
        output;
      end;
    end;
  end;
  id=4; output;
run;

data want;
  set have;
  array vnames {2} $32 _temporary_ ('var1','var2');
  array values {2} var1 var2;

  if _n_=1 then do;
    if 0 then set meta_ds;
    declare hash h (dataset:'meta_ds');
	  h.definekey('id','varname');
	  h.definedata('missvalue','cap99');
	  h.definedone();
  end;

  do i=1 to dim(values);
    rc=h.find(key:id,key:vnames{i});
	if rc^=0 then put 'Lookup Error: ' id= 'Varname=' vnames{i};
	else if values{i}=. then values{i}=missvalue;
	else if values{i}>cap99 then values{i}=cap99;
  end;
  drop i varname missvalue cap99 rc;
run;

 

Now if you don't what to type in the initial values of the vnames array, you could do the call varname subroutine for the first obs only, and the resulting names will be retained through the data step, because that is one attribute of _TEMPORARY_ arrays:

 

data want;
  set have;
  array vnames {20} $32 _temporary_ ;
  array values {*} var1 var2;

  if _n_=1 then do;
    do i=1 to dim(values);
	   call vname(values{i},vnames{i});
	end;

    if 0 then set meta_ds;
    declare hash h (dataset:'meta_ds');
	  h.definekey('id','varname');
	  h.definedata('missvalue','cap99');
	  h.definedone();
  end;

  do i=1 to dim(values);
    rc=h.find(key:id,key:vnames{i});
	if rc^=0 then put 'Lookup Error: ' id= 'Varname=' vnames{i};
	else if values{i}=. then values{i}=missvalue;
	else if values{i}>cap99 then values{i}=cap99;
  end;
  drop i varname missvalue cap99 rc;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

5 REPLIES 5
PeterClemmensen
Tourmaline | Level 20

Hi and welcome to the forum. Can you post an example of your have data set?

 

Also, what is the point of this statement?

 

varName = upcase(varName);
pepegalleta
Fluorite | Level 6
Added example of have data to original post.
The upcase function was there because in meta_ds varName is all caps and I just wanted to make sure the output of the call vname routine was also all caps so the lookUp wouldn't fail.
My bad on using lowercase for the example
Reeza
Super User

Have you looked into creating custom formats instead? Given the rules you're applying a format should work fine.

If you need to explicitly recode them, you could probably do it with an array. 

 


@pepegalleta wrote:

Hi, first ever question so sorry if I'm posting on the wrong place or something like that. Please let me know so I can re-post accordingly or edit in any info I've missed.

 

Version: SAS STUDIO | Release: 3.7 (Enterprise Edition)

 

So, I've got a sort of meta dataset that looks something like this:

 

data meta_ds;
  input varname:$32. id missvalue cap99;
  datalines;
  var1 1 0 10
  var1 2 3 24
  var1 3 5 50
  var2 1 1 5
  var2 2 10 40
  var2 3 4 20
  ;
run;

 

 

Using this meta_ds I need to apply some transformations to another dataset depending on the id variable. Using the example meta_ds, these transformations would look something like this:

 

if id = 1 then do;
  if var1 = . then var1 = 0;
  if var1 > 10 then var1 = 10;
  if var2 = . then var2 = 1;
  .
  .
end;

else if id = 2 then do;
  if var1 = . then var1 = 3;
  .
  .
end;

else if id = 3 then do;
  .
  .
end;

 

So basically:

if var = . then var = missvalue;
if var > cap99 then var > cap99;

 

The dataset which I need to apply the transformations to is fairly big (around 40million rows) and making this process as fast and efficient as possible is one of the major concerns.

 

My question is:

 

Is hardcoding all the if statements the best approach? I know how to generate the code using something like proc sql, which could then be executed with %include for example... but what if I didn't want to have to write all those lines of code (a bit above 100 variables, 3 unique ids, 2 if statements for each variable and id combination, so a bit over 1000 lines of code).

 

One approach I tried was using the hash object:

 

data want;
  if _n_=0 then set meta_ds;
  
  DECLARE hash lookUpTable(dataset:'meta_ds');
    lookUpTable.DEFINEKEY ('varName','id');     
    lookUpTable.DEFINEDATA('missValue','cap99');      
    lookUpTable.DEFINEDONE();

  array numvar &varlistContinuas.;
  
  do until (eof);
    set have end=eof;
    
    do over numvar;
      call vname(numvar,varName);
      varName = upcase(varName);
      rc = lookUpTable.FIND();
      if rc = 0 then do;
        if numvar = .     then numvar = missValue;
        if numvar > cap99 then numvar = cap99;
      end;
      else putlog "ERROR:LookUpTable" varname;
    end;
    output;
  end;
  
  drop varName missValue cap99 rc;  
  stop;
run;

 

but this takes quite a bit longer to run (30 min vs <10 min).

 

Is there a better approach?

 

 


 

 

pepegalleta
Fluorite | Level 6
I actually went the format route when dealing with string variables.
The problem here, as I understand it, is that I would need to create 3 formats for each variable, one for every id; so I'd need to add like a suffix to each fmat name to identify them.
The thing is some of these variable names have already been truncated to fit sas' 32 character limit, and i run the risk of having some naming conflicts if I further truncate the var names, which I'm trying to avoid right now
mkeintz
PROC Star

I wouldn't give up on the hash object usage.  I suspect that the call vname routine is taking up a lot of time.  So instead of call vname (executed for every incoming observation) set up a temporary array of the varnames (no call vname routine):

 

 

data meta_ds;
  input varname:$32. id missvalue cap99;
  datalines;
var1 1 0 10
var1 2 3 24
var1 3 5 50
var2 1 1 5
var2 2 10 40
var2 3 4 20
;
run;

data have;
  do id=1 to 3;
    do var1=.,2,99;
      do var2=.,2,99;
        output;
      end;
    end;
  end;
  id=4; output;
run;

data want;
  set have;
  array vnames {2} $32 _temporary_ ('var1','var2');
  array values {2} var1 var2;

  if _n_=1 then do;
    if 0 then set meta_ds;
    declare hash h (dataset:'meta_ds');
	  h.definekey('id','varname');
	  h.definedata('missvalue','cap99');
	  h.definedone();
  end;

  do i=1 to dim(values);
    rc=h.find(key:id,key:vnames{i});
	if rc^=0 then put 'Lookup Error: ' id= 'Varname=' vnames{i};
	else if values{i}=. then values{i}=missvalue;
	else if values{i}>cap99 then values{i}=cap99;
  end;
  drop i varname missvalue cap99 rc;
run;

 

Now if you don't what to type in the initial values of the vnames array, you could do the call varname subroutine for the first obs only, and the resulting names will be retained through the data step, because that is one attribute of _TEMPORARY_ arrays:

 

data want;
  set have;
  array vnames {20} $32 _temporary_ ;
  array values {*} var1 var2;

  if _n_=1 then do;
    do i=1 to dim(values);
	   call vname(values{i},vnames{i});
	end;

    if 0 then set meta_ds;
    declare hash h (dataset:'meta_ds');
	  h.definekey('id','varname');
	  h.definedata('missvalue','cap99');
	  h.definedone();
  end;

  do i=1 to dim(values);
    rc=h.find(key:id,key:vnames{i});
	if rc^=0 then put 'Lookup Error: ' id= 'Varname=' vnames{i};
	else if values{i}=. then values{i}=missvalue;
	else if values{i}>cap99 then values{i}=cap99;
  end;
  drop i varname missvalue cap99 rc;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1617 views
  • 1 like
  • 4 in conversation