☑ This topic is solved.
Stanley3
Fluorite | Level 6

Hi,

I'm currently merging multiple datasets and consistently using the following SQL code to verify there are no duplicate IDs at each stage of the process:

 

proc sql;
  select count(distinct ID) as UniqueIDs,
         count(*) as NObs
  from dataset;
quit;
 
This approach effectively identifies duplicates when datasets have multiple timepoints, and I make sure to only select one timepoint per ID. Throughout the merging process, my checks confirm that all IDs remain unique (e.g., 1500 observations and 1500 unique IDs).
 

However, after the final merge, the same SQL query unexpectedly shows 1500 observations but only 1300 unique IDs. Manual verification confirms that duplicates are present, despite prior checks showing no such issues. I'm looking for insights into why these duplicates weren’t detected sooner by the SQL query, or if there's a specific merging condition I might have overlooked.
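A count comparison only tells you how many duplicates exist, not which IDs are involved. A small variation on the same PROC SQL check (a sketch, using the same placeholder dataset name) lists the offending IDs directly:

```sas
/* List each ID that appears more than once, with its record count */
proc sql;
  select ID, count(*) as NRecords
  from dataset
  group by ID
  having count(*) > 1;
quit;
```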

 

Edited to add:

The SQL code above is what I use to check each dataset for duplicates (which it seems to find reliably), and I run it again after each merge.

Merge code (throughout and also the last one): 

data merged_dataset; 
      merge dataset1 dataset2;
      by ID;
run;
Attaching the LOG for the last merge + SQL check. The result for this was 1181 UniqueIDs and 1229 NObs, while previously I was getting 1229 UniqueIDs and 1229 NObs.
LOG for the last merge + SQL check
1 ACCEPTED SOLUTION

Accepted Solutions
Quentin
Super User

Are you intending to do a one-to-one merge, i.e. have unique values for ID in both datasets?  If so, the easiest way to check for duplicates is to add a logical assertion to your MERGE step, e.g.:

 

data merged_dataset; 
      merge dataset1 dataset2;
      by ID;
      if NOT (first.ID and last.ID) then put "ERROR: found a duplicate! " ID= ;
run;

Of course if you have a lot of duplicates, that will flood your log with errors.  I wrote a little %ASSERT macro that has a limit on the number of errors it throws.  You would use it like:

 

data merged_dataset; 
      merge dataset1 dataset2;
      by ID;
      %assert(first.ID and last.ID) 
run;
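If you'd rather not pull in the macro, a similar cap on log noise can be hand-coded directly in the DATA step (a sketch; the counter name _dup_count and the limit of 10 are arbitrary choices, not from the paper):

```sas
data merged_dataset;
  merge dataset1 dataset2;
  by ID;
  if not (first.ID and last.ID) then do;
    _dup_count + 1;  /* retained counter of duplicate records seen */
    if _dup_count <= 10 then put "ERROR: found a duplicate! " ID= ;
  end;
  drop _dup_count;
run;
```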

%Assert is in this paper: https://www.lexjansen.com/nesug/nesug12/cc/cc31.pdf, and there is a %DupCk macro in a related paper (https://www.lexjansen.com/sesug/2022/SESUG2022_Paper_187_Final_PDF.pdf) which I use for checking a dataset for duplicates.  So if I do a one-to-many merge, I will often check that the first dataset really is unique, and would code it like:

%dupck(dataset1, by=ID)

data merged_dataset; 
      merge dataset1 dataset2;
      by ID;
run;


7 REPLIES
ballardw
Super User

Show the code for the "final merge" and we might have some ideas.

 

Checking on "counts" does not in any way guarantee that the same values are present in each step.

Stanley3
Fluorite | Level 6
Thanks! Sorry, do you mean the merging code or the SQL code?
ballardw
Super User

Probably both would be a good idea.

If you have a LOG from running the code that would be a bit better as well.

Stanley3
Fluorite | Level 6
Thanks! I edited to add both
ballardw
Super User

@Stanley3 wrote:
Thanks! I edited to add both

I don't see any result from the Proc sql that indicates there are or are not duplicates.

 

 

I also don't understand the comment about "no variables overwritten". Given the code shown and the earlier summary of the Maestra_antro3 data set (same number of observations and variables), the only way nothing gets overwritten is one of these:

1) Data set Crp has only Folio as a variable

2) All the variables in Crp other than Folio duplicate the values already in Maestra_antro3 for that value of Folio, so you just can't see the overwriting that did happen.

 

 

Suggestion:

Stop using code like this until you identify the problem step and cause (and especially do not do this with MERGE):

data maestro_antro3;
   set maestro_antro3;
run;

That completely replaces the source data set, so you can't be sure what actually did happen.

If you make new data sets you can COMPARE them with Proc Compare or other code.

If you routinely use the same set as the source and the output set you have no idea where errors, like duplicates, creep in.
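For example (a sketch using the dataset names mentioned in the log; Maestra_antro4 is a made-up name for the new output), writing each step to a fresh dataset lets PROC COMPARE show exactly what changed:

```sas
/* Write the merge result to a NEW dataset instead of overwriting the source */
data maestra_antro4;
  merge maestra_antro3 crp;
  by Folio;
run;

/* Compare old vs. new to see which variables and observations actually changed */
proc compare base=maestra_antro3 compare=maestra_antro4;
  id Folio;
run;
```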

 

Patrick
Opal | Level 21

You must already have duplicate IDs in at least one of your source tables. You could use the code below to investigate.

data dataset1_dups;
  set dataset1;
  by id;
  if not first.id;
run;

data dataset2_dups;
  set dataset2;
  by id;
  if not first.id;
run;

data merged_dataset merged_dups; 
  merge dataset1 dataset2;
  by ID;
  if first.ID and last.ID then output merged_dataset;
  else output merged_dups;
run;
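Note that the BY statements above assume each input dataset is already sorted by id; if that's not guaranteed, sort first (a sketch):

```sas
proc sort data=dataset1; by id; run;
proc sort data=dataset2; by id; run;
```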

 

