BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
darklord
Obsidian | Level 7

Hello all!

In a pickle here, I have these SQL queries that I run to pull users that are exactly the same or similar.

Suppose the table name: usertable

SQL:

Combination 1

Name1, name2, compared(name1,name2) from usertable a join usertable b where a.passport=b.passport

Combination 2

Name1, name2, compared(name1,name2) from usertable a join usertable b where a.email=b.email

 

Results: name1 name2 0

                name1 name2 200

                name1 name2 50

Merge combination 1 & combination 2

This creates a dataset with field id, name1, name2, passport, email. Here all the names will be there with matching docs or not (true  positive or false positive).

 

From here how do I make sure that only true positive ids stay? i.e those with similar name or same name but the documents match exactly.

 

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
darklord
Obsidian | Level 7

Hey! 

Yeah I see the point, I was having hard time explaining it myself, figured explaining the background of the task itself would have taken some time.

 

But I feel like there a logic in works now which I feel would do a good job for this. For that reason I would close this post so that more people don't get confused:P

 

But yeah next questions the question gonna be more clear:P

 

Thank you tho!

View solution in original post

11 REPLIES 11
Mazi
Quartz | Level 8
Can you share some sample data and what you’ve tried.

If you can also show us what you want that would be great.
darklord
Obsidian | Level 7
working on that, will upload some soon!
darklord
Obsidian | Level 7

The code below:

Suppose there's this combination here giving us duplicate users using the DOB, Email & Visa matching

SELECT "C9" AS COMB,
A. NEW _CIF_NO,
A. CUST_NAME,
B. NEW CIF_ NO AS NEW CIF NO 1,
B. CUST_ NAME AS CUST_FULL NAME__1,
COMPGED (A. CUST_NAME, B. CUST_NAME) AS SCORE
FROM ALL_ENT_ CUST_IND BASE A 
JOIN ALI_ENT_ CUST_ INDV_BASE B
ON COMPGED (A. CUST_NAME, B. CUST_NAME) <= 1000
AND A. CUST DOB = B. CUST DOB
AND A. MOBILE NUM = B. MOBILE NUM
AND A. DOC VISA = B. DOC VISA
AND A. CUST_ NAME IS NOT NULL

gives us the result below,

CIF	             Name	          Cif_new	   Name2	  Score
EIB~72—775     EASO PHILIP K A EASO	ENB~48--88	 EASO PHILIP	   330
EIB~72—445	 JUDE GRAP	        ENB~48--98	  JUDE GRAP	    0
EIB~71—775     EASO PHILIP EASO K A 	ENB~43--88       PHILIP EASO	   500

Score 0 is exact name match along with the doc matching, the higher the score the less the name match.

 

After a bunch of combinations they all get merged, and a final dataset is created with all the user names, documents like this 

CIF	                               Name	                 Cif_new          Doc        DOB          Email
EIB~72—775                   EASO PHILIP K A EASO	       ENB~48--88         A32742     12-JUL-20   AHSDA@GMAIL.COM
EIB~72—445	                  JUDE GRAP	               ENB~48---98        B42823     13-AUG-20   GFA@GMAIL.COM
EIB~72—775	              EASO PHILIP K A EASO	       ENB~436-88         C91742     17-FEB-21   UADFN@GMAIL.COM

This dataset here contains score 0, score 330 & score 500 etc all together.

 

What Im trying to do is find a way to create a subset-dataset which would have only those users that even if the names don't match exactly but their documents match. Basically no false positive (like names that match but no matching documents)  but just true positive.

 

Let me know if more clarification needed!

Quentin
Super User

I'm lost.  The query you showed will only return records that match exactly for DOB, MobileNum,  and DocVisa, because of your ON clause:

 

ON COMPGED (A. CUST_NAME, B. CUST_NAME) <= 1000
AND A. CUST DOB = B. CUST DOB
AND A. MOBILE NUM = B. MOBILE NUM
AND A. DOC VISA = B. DOC VISA
AND A. CUST_ NAME IS NOT NULL

When you say "after a bunch of combinations they all get merged" it's not clear what you mean, because you only show one query, I assume one "combination".

 

It's also hard to follow because the variable names in your code have spaces in them, and don't match to the variable names you show in the output. 

 

I think to make this more clear, it would help if you could show a full small example.  You could show a DATA step with the CARDS statement to make work.ALL_ENT_CUST_IND_BASE with 5-10 records, and another step to make work.ALI_ENT_CUST_INDV_BASE with 5-10 records.  

 

Then show queries to make two of your "combination" datasets.   And show the combination datasets (and explain if you like them).  Then show code to merge the combination datasets, and the show the results.  And then describe why you don't like the results, and show the results you want.

 

The key point is for you to post working code to create the input data, and working code for your queries to create the combinations, and merge the combinations, so that people who want to help can replicate your scenario.

BASUG is hosting free webinars Next up: Mark Keintz presenting History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies on May 8. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
mkeintz
PROC Star

@Quentin wrote:

I'm lost.  The query you showed will only return records that match exactly for DOB, MobileNum,  and DocVisa, because of your ON clause:

... stuff deleted ...

The key point is for you to post working code to create the input data, and working code for your queries to create the combinations, and merge the combinations, so that people who want to help can replicate your scenario.


I couldn't agree more.  I don't think I understand the problem any better than when I first looked at it.  I'm sure a good sample of input data, and the desired output would be very helpful.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
darklord
Obsidian | Level 7

Hmm, okay I see. I guess let me rectify this and make it more clear. It's a bunch of lot of steps being involved to create the final table.

 

I will update the explanation to make it more clear!

PaigeMiller
Diamond | Level 26

@darklord wrote:

 

It's a bunch of lot of steps being involved to create the final table.

 


If there are, for example, 6 steps then do some debugging yourself on steps 1, 2, 3, ..., 6. Show us just the steps where you can't figure out why you are getting incorrect results.

--
Paige Miller
darklord
Obsidian | Level 7

Hello!

Yeah a lot of steps ended up being involved to make a logic for this. I think I have a clear idea now, I been working on creating flags for different scenarios. It would be save to say that this question can be closed for now. As again coming up with the current logic itself took some time  (extra 4 datasets) 😛

 

Thanks again tho! I mean I was having trouble myself trying to phase this and explain it.

Patrick
Opal | Level 21

@darklord wrote:

Hmm, okay I see. I guess let me rectify this and make it more clear. It's a bunch of lot of steps being involved to create the final table.

 

I will update the explanation to make it more clear!


@darklord What often creates a lot of clarity is representative Have sample data (representative = mimiks all your different scenarios) with the data create via a working SAS data step, desired Want result and an explanation of the logic that gets us from Have to Want.

Sharing on top of above your current code even if not working is of course also very valuable because it shows your current thinking which in turn makes it easier for us to provide suitable advice. 

darklord
Obsidian | Level 7

Hey! 

Yeah I see the point, I was having hard time explaining it myself, figured explaining the background of the task itself would have taken some time.

 

But I feel like there a logic in works now which I feel would do a good job for this. For that reason I would close this post so that more people don't get confused:P

 

But yeah next questions the question gonna be more clear:P

 

Thank you tho!

mkeintz
PROC Star

@darklord wrote:

Hey! 

Yeah I see the point, I was having hard time explaining it myself, figured explaining the background of the task itself would have taken some time.

 

But I feel like there a logic in works now which I feel would do a good job for this. For that reason I would close this post so that more people don't get confused:P

 

But yeah next questions the question gonna be more clear:P

 

Thank you tho!


It's ok to mark your own response above as the "solution" to this thread.  That will help those who scan threads looking for unsolved (or "unclosed" in your case) topics.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 671 views
  • 6 likes
  • 6 in conversation