Solved: Duplicates in the table, but keep them based on other conditions

darklord · Posted 03-15-2024 01:38 PM

Hello all!

In a pickle here, I have these SQL queries that I run to pull users that are exactly the same or similar.

Suppose the table name: usertable

SQL:

Combination 1

Name1, name2, compared(name1,name2) from usertable a join usertable b where a.passport=b.passport

Combination 2

Name1, name2, compared(name1,name2) from usertable a join usertable b where a.email=b.email

Results: name1 name2 0

name1 name2 200

name1 name2 50

Merge combination 1 & combination 2

This creates a dataset with field id, name1, name2, passport, email. Here all the names will be there with matching docs or not (true positive or false positive).

From here how do I make sure that only true positive ids stay? i.e those with similar name or same name but the documents match exactly.

Thanks!

darklord · Posted 03-19-2024 10:19 AM

Hey!

Yeah I see the point, I was having hard time explaining it myself, figured explaining the background of the task itself would have taken some time.

But I feel like there a logic in works now which I feel would do a good job for this. For that reason I would close this post so that more people don't get confused:P

But yeah next questions the question gonna be more clear:P

Thank you tho!

View solution in original post

Mazi · Posted 03-15-2024 03:48 PM

Can you share some sample data and what you’ve tried.

If you can also show us what you want that would be great.

darklord · Posted 03-15-2024 03:58 PM

working on that, will upload some soon!

darklord · Posted 03-16-2024 11:17 AM

The code below:

Suppose there's this combination here giving us duplicate users using the DOB, Email & Visa matching

SELECT "C9" AS COMB,
A. NEW _CIF_NO,
A. CUST_NAME,
B. NEW CIF_ NO AS NEW CIF NO 1,
B. CUST_ NAME AS CUST_FULL NAME__1,
COMPGED (A. CUST_NAME, B. CUST_NAME) AS SCORE
FROM ALL_ENT_ CUST_IND BASE A 
JOIN ALI_ENT_ CUST_ INDV_BASE B
ON COMPGED (A. CUST_NAME, B. CUST_NAME) <= 1000
AND A. CUST DOB = B. CUST DOB
AND A. MOBILE NUM = B. MOBILE NUM
AND A. DOC VISA = B. DOC VISA
AND A. CUST_ NAME IS NOT NULL

gives us the result below,

CIF	             Name	          Cif_new	   Name2	  Score
EIB~72—775     EASO PHILIP K A EASO	ENB~48--88	 EASO PHILIP	   330
EIB~72—445	 JUDE GRAP	        ENB~48--98	  JUDE GRAP	    0
EIB~71—775     EASO PHILIP EASO K A 	ENB~43--88       PHILIP EASO	   500

Score 0 is exact name match along with the doc matching, the higher the score the less the name match.

After a bunch of combinations they all get merged, and a final dataset is created with all the user names, documents like this

CIF	                               Name	                 Cif_new          Doc        DOB          Email
EIB~72—775                   EASO PHILIP K A EASO	       ENB~48--88         A32742     12-JUL-20   AHSDA@GMAIL.COM
EIB~72—445	                  JUDE GRAP	               ENB~48---98        B42823     13-AUG-20   GFA@GMAIL.COM
EIB~72—775	              EASO PHILIP K A EASO	       ENB~436-88         C91742     17-FEB-21   UADFN@GMAIL.COM

This dataset here contains score 0, score 330 & score 500 etc all together.

What Im trying to do is find a way to create a subset-dataset which would have only those users that even if the names don't match exactly but their documents match. Basically no false positive (like names that match but no matching documents) but just true positive.

Let me know if more clarification needed!

Quentin · Posted 03-16-2024 04:52 PM

I'm lost. The query you showed will only return records that match exactly for DOB, MobileNum, and DocVisa, because of your ON clause:

ON COMPGED (A. CUST_NAME, B. CUST_NAME) <= 1000
AND A. CUST DOB = B. CUST DOB
AND A. MOBILE NUM = B. MOBILE NUM
AND A. DOC VISA = B. DOC VISA
AND A. CUST_ NAME IS NOT NULL

When you say "after a bunch of combinations they all get merged" it's not clear what you mean, because you only show one query, I assume one "combination".

It's also hard to follow because the variable names in your code have spaces in them, and don't match to the variable names you show in the output.

I think to make this more clear, it would help if you could show a full small example. You could show a DATA step with the CARDS statement to make work.ALL_ENT_CUST_IND_BASE with 5-10 records, and another step to make work.ALI_ENT_CUST_INDV_BASE with 5-10 records.

Then show queries to make two of your "combination" datasets. And show the combination datasets (and explain if you like them). Then show code to merge the combination datasets, and the show the results. And then describe why you don't like the results, and show the results you want.

The key point is for you to post working code to create the input data, and working code for your queries to create the combinations, and merge the combinations, so that people who want to help can replicate your scenario.

mkeintz · Posted 03-16-2024 06:43 PM

@Quentin wrote:

I'm lost. The query you showed will only return records that match exactly for DOB, MobileNum, and DocVisa, because of your ON clause:

... stuff deleted ...

The key point is for you to post working code to create the input data, and working code for your queries to create the combinations, and merge the combinations, so that people who want to help can replicate your scenario.

I couldn't agree more. I don't think I understand the problem any better than when I first looked at it. I'm sure a good sample of input data, and the desired output would be very helpful.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

darklord · Posted 03-17-2024 09:49 AM

Hmm, okay I see. I guess let me rectify this and make it more clear. It's a bunch of lot of steps being involved to create the final table.

I will update the explanation to make it more clear!

PaigeMiller · Posted 03-17-2024 01:12 PM

@darklord wrote:

It's a bunch of lot of steps being involved to create the final table.

If there are, for example, 6 steps then do some debugging yourself on steps 1, 2, 3, ..., 6. Show us just the steps where you can't figure out why you are getting incorrect results.

--
Paige Miller

darklord · Posted 03-19-2024 10:15 AM

Hello!

Yeah a lot of steps ended up being involved to make a logic for this. I think I have a clear idea now, I been working on creating flags for different scenarios. It would be save to say that this question can be closed for now. As again coming up with the current logic itself took some time (extra 4 datasets) 😛

Thanks again tho! I mean I was having trouble myself trying to phase this and explain it.

Patrick · Posted 03-17-2024 08:09 PM

@darklord wrote:

Hmm, okay I see. I guess let me rectify this and make it more clear. It's a bunch of lot of steps being involved to create the final table.

I will update the explanation to make it more clear!

@darklord What often creates a lot of clarity is representative Have sample data (representative = mimiks all your different scenarios) with the data create via a working SAS data step, desired Want result and an explanation of the logic that gets us from Have to Want.

Sharing on top of above your current code even if not working is of course also very valuable because it shows your current thinking which in turn makes it easier for us to provide suitable advice.

darklord · Posted 03-19-2024 10:19 AM

Hey!

Yeah I see the point, I was having hard time explaining it myself, figured explaining the background of the task itself would have taken some time.

But I feel like there a logic in works now which I feel would do a good job for this. For that reason I would close this post so that more people don't get confused:P

But yeah next questions the question gonna be more clear:P

Thank you tho!

mkeintz · Posted 03-19-2024 11:04 AM

@darklord wrote:

Hey!

Yeah I see the point, I was having hard time explaining it myself, figured explaining the background of the task itself would have taken some time.

But I feel like there a logic in works now which I feel would do a good job for this. For that reason I would close this post so that more people don't get confused:P

But yeah next questions the question gonna be more clear:P

Thank you tho!

It's ok to mark your own response above as the "solution" to this thread. That will help those who scan threads looking for unsolved (or "unclosed" in your case) topics.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Duplicates in the table, but keep them based on other conditions

Re: Duplicates in the table, but keep them based on other conditions

Re: Duplicates in the table, but keep them based on other conditions

Re: Duplicates in the table, but keep them based on other conditions

Re: Duplicates in the table, but keep them based on other conditions

Re: Duplicates in the table, but keep them based on other conditions

Re: Duplicates in the table, but keep them based on other conditions

Re: Duplicates in the table, but keep them based on other conditions

Re: Duplicates in the table, but keep them based on other conditions

Re: Duplicates in the table, but keep them based on other conditions

Re: Duplicates in the table, but keep them based on other conditions

Re: Duplicates in the table, but keep them based on other conditions

Re: Duplicates in the table, but keep them based on other conditions

Registration is open

SAS Training: Just a Click Away