BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
bkq32
Quartz | Level 8

Hello,

 

I have:

data have;
 input id$ match1$ match2$ match3$;
cards;
1 1 1 1
2 1 0 1
3 1 1 1
;
run;

and want to create the variable SCORE. I wrote:

data want;
 set have;
 if match1="1" and match2="1" and match3="1" then score="1";
 if match1="0" and match2="1" and match3="1" then score="2";
 if match1="1" and match2="0" and match3="1" then score="3";
run;

In my code for the actual data set, SCORE increases by 1 in each subsequent IF THEN statement. Is there a way to simplify this? I was looking for threads on arrays but am still a little confused how to apply that here.

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
data have;
 input id$ match1$ match2$ match3$;
cards;
1 1 1 1
2 1 0 1
3 1 1 1
;
run;
data want;
 set have;
 temp=cats(of match:);
 if temp='111' then score=1;
  else if temp='101' then score=2;
run;

View solution in original post

7 REPLIES 7
Astounding
PROC Star

Technically, you are creating SCORE as a character variable.  So it's not really increasing by 1 each time, it's just a different character.  If you were to make SCORE a numeric variable instead, by removing the quotes around its assigned value, there might be other possibilities.  (Important note:  are match1 through match3 really character variables as well?)

 

This isn't necessarily simpler ... it's in the mind of the beholder:

 

data want;
 set have;
  if match3="1" then do;
    if match1="1" and match2="1" then score="1";
    if match1="0" and match2="1" then score="2";
    if match1="1" and match2="0" then score="3";
  end;
run;
andreas_lds
Jade | Level 19

Is match3 = "1" true for all obs in your real data? If not, score will be missing for those obs. If yes: you can remove it from the if-statements.

bkq32
Quartz | Level 8

All the match variables are boolean, so I guess I don't need to make them character variables. I had sent Company A a file with demographic information, and they gave back a data set with these match variables which indicate that Company A also had a record in their database for a particular person (ID) that matched on a specific demographic (e.g. If match1 = 1, then the last name matched; If match2 = 1, then first name matched; If match3 = 1, then social security number matched, etc.)

 

I now have to create a match score (variable SCORE) for each record that indicates how likely the record in my data set represents the same person in Company A's data set. I arbitrarily determined that a score of 1 indicates the highest likelihood that it's a true match when all demographic variables matched (=1). I'm then identifying the next slightly less ideal permutation (e.g. last name did not match) and giving it a score of 2, and so on, so all match variables will at some point take on 0s and 1s.

 

I'm not opposed to writing out all the IF THEN statements, but was just wondering if there might be a more efficient way to do this.

FreelanceReinh
Jade | Level 19

Hello @bkq32,

 

I think ideally there would be a formula to compute SCORE from the MATCHx variables (in which case all these variables should be numeric). For example, a reasonable SCORE might be a linear combination of MATCH1, MATCH2, ..., i.e., SCORE=w1*MATCH1+w2*MATCH2+... where w1, w2, ...>0 are weights so that important matches like "last name" contribute more to the score than, say, "house number supplement." With a suitable formula this would mean: "the higher the score, the better the match" (contrary to your example scores). In particular, the perfect match (MATCH1=MATCH2=...MATCHn=1) would automatically have the highest possible score.

 

If it's too difficult to develop such a formula and the scores are simply ("arbitrarily") assigned using a lookup table, you could store this table either

  • in a SAS dataset and then perform the assignment using a common lookup technique, e.g., by means of a hash object in a DATA step, which could look like this:
    rc=h.find(key:cats(of match:));
  • or in a numeric informat, say, mscore and then use the INPUT function to assign the score, e.g.:
    score=input(cats(of match:),mscore.);
Astounding
PROC Star

Here's a simplistic, but perfectly valid way to do this.

 

Your MATCH variables have to be defined in order (match1=most important, match3 = least important).

 

SCORE will be numeric, with a higher score being a better match.

 

In a single statement:

 

score = input(cats(of match1-match3), binary3.));

 

Ksharp
Super User
data have;
 input id$ match1$ match2$ match3$;
cards;
1 1 1 1
2 1 0 1
3 1 1 1
;
run;
data want;
 set have;
 temp=cats(of match:);
 if temp='111' then score=1;
  else if temp='101' then score=2;
run;
bkq32
Quartz | Level 8

Thanks everyone! These were really helpful.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 826 views
  • 1 like
  • 5 in conversation