BookmarkSubscribeRSS Feed
Azeem112
Quartz | Level 8

Hi , I have a table as below

data have;
length ID $10. score 8. formula $200.;
infile datalines dsd missover  ;
input ID $ score formula $;
datalines;
KRI1,,KRI2/KRI3
KRI2,1,
KRI3,10,
KRI4,,(KRI5-KRI6)*100
KRI5,13,
KRI6,16,
KRI7,,(KRI8+KRI9)/(KRI10-KRI11)
KRI8,13,
KRI9,16,
KRI10,1,
KRI11,10,
;
run;

I want to replace the formula ID's with actual values. i can filter out the formula data by using where the formula is not empty but my actual table has +250k rows and i can't do a standard loop and replace method.   

I want my want table as

data want;
length ID $10. formula $200. score 8. ;
infile datalines dsd missover  ;
input ID $ formula $ score;
datalines;
KRI1,1/10,0.1
KRI4,(13-16)*100,-300
KRI7,(13+16)/(1-10),-3.22
;
run;

for a smaller data I am using loop and tranwrd function and then use eval method to calculate the final value.

How can i do the same for  a large dataset?

5 REPLIES 5
PeterClemmensen
Tourmaline | Level 20

Try this. Should perform reasonably. 

 

Feel free to ask 🙂

 

data temp;

   format ID formula score;

   merge have
         have (firstobs = 2 keep = score rename = score = s);

   if score = . then do;
      f = formula;
      i = ID;
   end;

   if score then f = tranwrd(f, strip(ID), put(strip(score), 8.));

   if s = . then do;
      formula = compress(f);
      ID = i;
      score = input(resolve('%sysevalf('||formula||')'), best8.);
      output;
   end;

   retain f i;
   keep ID formula score;
   format score2 8.2;
run;

 

Result:

 

ID    formula         score
KRI1  1/10            0.1
KRI4  (13-16)*100     -300
KRI7  (13+16)/(1-10)  -3.22222

 

AhmedAl_Attar
Ammonite | Level 13

Hi @Azeem112 

 

Stealing some code from @PeterClemmensen response 😉, here is another approach

data have;
length ID $10. score 8. formula $200.;
infile datalines dsd missover  ;
input ID $ score formula $;
datalines;
KRI1,,KRI2/KRI3
KRI2,1,
KRI3,10,
KRI4,,(KRI5-KRI6)*100
KRI5,13,
KRI6,16,
KRI7,,(KRI8+KRI9)/(KRI10-KRI11)
KRI8,13,
KRI9,16,
KRI10,1,
KRI11,10,
;
run;

/* Create custom informat that can be used to convert ID to matching numeric value */
data infmt;
  retain fmtname 'ref' type 'I';
  /* Exclude formula records */
  set have(where=(missing(formula)) rename=(id=start score=label)); 
run;
proc format cntlin=work.infmt; run;

/* Use the custom Informat with Regular Expression */
data want(KEEP=ID formula score);
  set have(where=(strip(formula) ne ''));
  length formula2 $400;
  formula2 = cats('%sysevalf(',prxchange('s/KRI(\d+)/%sysfunc(inputn(KRI$1,ref.))/',-1,formula),')');
  score = input(resolve(formula2),best.);
run;

Hope this helps

Ksharp
Super User
data have;
length ID $10. score 8. formula $200.;
infile datalines dsd missover  ;
input ID $ score formula $;
datalines;
KRI1,,KRI2/KRI3
KRI2,1,
KRI3,10,
KRI4,,(KRI5-KRI6)*100
KRI5,13,
KRI6,16,
KRI7,,(KRI8+KRI9)/(KRI10-KRI11)
KRI8,13,
KRI9,16,
KRI10,1,
KRI11,10,
;
run;
proc transpose data=have(where=(score is not missing)) out=temp(drop=_name_) prefix=_;
id id;
var score;
run;
data want;
 if _n_=1 then set temp;
 set have(where=(formula is not missing));
 array x{*} _:;
 do i=1 to dim(x);
  formula=tranwrd(upcase(formula),upcase(substr(vname(x{i}),2)),strip(x{i}) );
 end;
 score=input(resolve(catt('%sysevalf(',formula,')')),best.);
 drop i _:;
 run;
mkeintz
PROC Star

If your data are always structured such that a formula observation is immediately followed by all of its components, then the solution offered by @PeterClemmensen seems to be the most straightforward.

 

But if the components can be anywhere in the dataset, then you need to have access to all the known score values in the dataset before you can reliably interpret every formula.  @Ksharp 's solution does that.  The only downside I see to the latter is that, for each formula, it loops through all known scores in the dataset ("do i=1 to dim(_x)").  It could be a big loop.  You said you have 250,000 obs, so maybe you have 50,000 formulas and 200,000 known scores to loop through for each formula, yes?.

 

Here's a way to have every known score available for each formula, but loop only through those scores actually in the formula:

 

data have;
  length ID $10. score 8. formula $200.;
  infile datalines dsd missover  ;
  input ID $ score formula $;
datalines;
KRI1,,KRI2/KRI3
KRI2,1,
KRI3,10,
KRI4,,(KRI5-KRI6)*100
KRI5,13,
KRI6,16,
KRI7,,(KRI8+KRI9)/(KRI10-KRI11)
KRI8,13,
KRI9,16,
KRI10,1,
KRI11,10,
run;

data want (drop=_: i);
  set have (where=(formula is not missing));
  if _n_=1 then do;
    declare hash sc (dataset:'have (keep=id score  where=(score is not missing))');
      sc.definekey('id');
      sc.definedata('score');
      sc.definedone();
  end;

  resolve_text=compress(formula);

  do i=1 to countw(formula,'-+*/() ');  /* For every term in the formula */
    _term=scan(formula,i,'-+*/() ');    /* Get the term identity, then insert its score */
    if sc.find(key:_term)=0 then resolve_text=tranwrd(resolve_text,trim(_term),trim(cats(score)));
  end;
  score=input(resolve(cats('%sysevalf(',resolve_text,')')),32.);
run;

Note that the sc.find(key:_term) method will retrieve the score from the SC hash, but not the ID.  That's because the ID is only in the .definekey method, but not in the .definedata method.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tom
Super User Tom
Super User

Are your ID values valid SAS names?

Transpose your non formula values to a single observation.  First see if PROC TRANSPOSE can handle it.

proc transpose
  data=have(where=(missing(formula)))
  out=wide(drop=_name_ )
;
  id id;
  var score;
run;

If it cannot you can generate CODE to make the WIDE dataset instead.

filename code temp;
data _null_;
  file code;
  set have;
  where not missing(score);
  put id '=' score ';' ;
run;
data wide;
  %include code ;
run;

Now use the other observations to generate code to set the SCORE (and ID and FORMUAL variables). And include the generated code to calculate the score and write the observations.

filename code temp;
data _null_;
  file code;
  set have;
  where not missing(formula);
  put id=:$quote. ';' formula= :$quote. ';score=' formula ';output;' ;
run;

Now read in the WIDE dataset to get the SCORE values for the IDs in question.  Also read in the HAVE dataset (but not any of the data) to set the type/length/format of the variables.

data want; 
  set have(obs=0) wide;
  %inc code ;
  keep id score formula; 
run;

Tom_0-1733021705962.png

You could generate bot code files in a single pass through the HAVE dataset.  So two passes total.  See below.  Remove the WHERE statement if you want the result to have all of the original observations.

 

Spoiler
filename code1 temp;
filename code2 temp;
data _null_;
  set have;
  if missing(formula) then do;
     file code1;
     put id '=' score ';' ;
  end;
  else do;
     file code2 ;
     put 'when (' id :$quote. ') score=' formula ';' ;
  end;
run;

data wide ;
  %include code1 ;
run;

data want; 
  if _n_=1 then set wide ;
  set have;
  where not missing(formula);
  select (id);
    %inc code2 ;
     otherwise ;
  end;
  keep id score formula ;
run;

 

 

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 499 views
  • 5 likes
  • 6 in conversation