Hello all!
In a pickle here, I have these SQL queries that I run to pull users that are exactly the same or similar.
Suppose the table name: usertable
SQL:
Combination 1
Name1, name2, compared(name1,name2) from usertable a join usertable b where a.passport=b.passport
Combination 2
Name1, name2, compared(name1,name2) from usertable a join usertable b where a.email=b.email
Results: name1 name2 0
name1 name2 200
name1 name2 50
Merge combination 1 & combination 2
This creates a dataset with field id, name1, name2, passport, email. Here all the names will be there with matching docs or not (true positive or false positive).
From here how do I make sure that only true positive ids stay? i.e those with similar name or same name but the documents match exactly.
Thanks!
Hey!
Yeah I see the point, I was having hard time explaining it myself, figured explaining the background of the task itself would have taken some time.
But I feel like there a logic in works now which I feel would do a good job for this. For that reason I would close this post so that more people don't get confused:P
But yeah next questions the question gonna be more clear:P
Thank you tho!
The code below:
Suppose there's this combination here giving us duplicate users using the DOB, Email & Visa matching
SELECT "C9" AS COMB, A. NEW _CIF_NO, A. CUST_NAME, B. NEW CIF_ NO AS NEW CIF NO 1, B. CUST_ NAME AS CUST_FULL NAME__1, COMPGED (A. CUST_NAME, B. CUST_NAME) AS SCORE FROM ALL_ENT_ CUST_IND BASE A JOIN ALI_ENT_ CUST_ INDV_BASE B ON COMPGED (A. CUST_NAME, B. CUST_NAME) <= 1000 AND A. CUST DOB = B. CUST DOB AND A. MOBILE NUM = B. MOBILE NUM AND A. DOC VISA = B. DOC VISA AND A. CUST_ NAME IS NOT NULL
gives us the result below,
CIF Name Cif_new Name2 Score EIB~72—775 EASO PHILIP K A EASO ENB~48--88 EASO PHILIP 330 EIB~72—445 JUDE GRAP ENB~48--98 JUDE GRAP 0 EIB~71—775 EASO PHILIP EASO K A ENB~43--88 PHILIP EASO 500
Score 0 is exact name match along with the doc matching, the higher the score the less the name match.
After a bunch of combinations they all get merged, and a final dataset is created with all the user names, documents like this
CIF Name Cif_new Doc DOB Email EIB~72—775 EASO PHILIP K A EASO ENB~48--88 A32742 12-JUL-20 AHSDA@GMAIL.COM EIB~72—445 JUDE GRAP ENB~48---98 B42823 13-AUG-20 GFA@GMAIL.COM EIB~72—775 EASO PHILIP K A EASO ENB~436-88 C91742 17-FEB-21 UADFN@GMAIL.COM
This dataset here contains score 0, score 330 & score 500 etc all together.
What Im trying to do is find a way to create a subset-dataset which would have only those users that even if the names don't match exactly but their documents match. Basically no false positive (like names that match but no matching documents) but just true positive.
Let me know if more clarification needed!
I'm lost. The query you showed will only return records that match exactly for DOB, MobileNum, and DocVisa, because of your ON clause:
ON COMPGED (A. CUST_NAME, B. CUST_NAME) <= 1000
AND A. CUST DOB = B. CUST DOB
AND A. MOBILE NUM = B. MOBILE NUM
AND A. DOC VISA = B. DOC VISA
AND A. CUST_ NAME IS NOT NULL
When you say "after a bunch of combinations they all get merged" it's not clear what you mean, because you only show one query, I assume one "combination".
It's also hard to follow because the variable names in your code have spaces in them, and don't match to the variable names you show in the output.
I think to make this more clear, it would help if you could show a full small example. You could show a DATA step with the CARDS statement to make work.ALL_ENT_CUST_IND_BASE with 5-10 records, and another step to make work.ALI_ENT_CUST_INDV_BASE with 5-10 records.
Then show queries to make two of your "combination" datasets. And show the combination datasets (and explain if you like them). Then show code to merge the combination datasets, and the show the results. And then describe why you don't like the results, and show the results you want.
The key point is for you to post working code to create the input data, and working code for your queries to create the combinations, and merge the combinations, so that people who want to help can replicate your scenario.
@Quentin wrote:
I'm lost. The query you showed will only return records that match exactly for DOB, MobileNum, and DocVisa, because of your ON clause:
... stuff deleted ...
The key point is for you to post working code to create the input data, and working code for your queries to create the combinations, and merge the combinations, so that people who want to help can replicate your scenario.
I couldn't agree more. I don't think I understand the problem any better than when I first looked at it. I'm sure a good sample of input data, and the desired output would be very helpful.
Hmm, okay I see. I guess let me rectify this and make it more clear. It's a bunch of lot of steps being involved to create the final table.
I will update the explanation to make it more clear!
@darklord wrote:It's a bunch of lot of steps being involved to create the final table.
If there are, for example, 6 steps then do some debugging yourself on steps 1, 2, 3, ..., 6. Show us just the steps where you can't figure out why you are getting incorrect results.
Hello!
Yeah a lot of steps ended up being involved to make a logic for this. I think I have a clear idea now, I been working on creating flags for different scenarios. It would be save to say that this question can be closed for now. As again coming up with the current logic itself took some time (extra 4 datasets) 😛
Thanks again tho! I mean I was having trouble myself trying to phase this and explain it.
@darklord wrote:
Hmm, okay I see. I guess let me rectify this and make it more clear. It's a bunch of lot of steps being involved to create the final table.
I will update the explanation to make it more clear!
@darklord What often creates a lot of clarity is representative Have sample data (representative = mimiks all your different scenarios) with the data create via a working SAS data step, desired Want result and an explanation of the logic that gets us from Have to Want.
Sharing on top of above your current code even if not working is of course also very valuable because it shows your current thinking which in turn makes it easier for us to provide suitable advice.
Hey!
Yeah I see the point, I was having hard time explaining it myself, figured explaining the background of the task itself would have taken some time.
But I feel like there a logic in works now which I feel would do a good job for this. For that reason I would close this post so that more people don't get confused:P
But yeah next questions the question gonna be more clear:P
Thank you tho!
@darklord wrote:
Hey!
Yeah I see the point, I was having hard time explaining it myself, figured explaining the background of the task itself would have taken some time.
But I feel like there a logic in works now which I feel would do a good job for this. For that reason I would close this post so that more people don't get confused:P
But yeah next questions the question gonna be more clear:P
Thank you tho!
It's ok to mark your own response above as the "solution" to this thread. That will help those who scan threads looking for unsolved (or "unclosed" in your case) topics.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.