☑ This topic is solved.
Stanley3
Fluorite | Level 6

Hi,

I'm currently merging multiple datasets and consistently using the following SQL code to verify there are no duplicate IDs at each stage of the process:

 

proc sql;
  select count(distinct ID) as UniqueIDs,
         count(*) as NObs
  from dataset;
quit;
 
This approach effectively identifies duplicates when datasets have multiple timepoints, and I make sure to only select one timepoint per ID. Throughout the merging process, my checks confirm that all IDs remain unique (e.g., 1500 observations and 1500 unique IDs).
 

However, after the final merge, the same SQL query unexpectedly shows 1500 observations but only 1300 unique IDs. Manual verification confirms that duplicates are present, despite prior checks showing no such issues. I'm looking for insights into why these duplicates weren’t detected sooner by the SQL query, or if there's a specific merging condition I might have overlooked.
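A count comparison only tells you how many duplicates exist, not which IDs are involved. A small variation on the same PROC SQL check (a sketch, using the same placeholder dataset name) lists the offending IDs directly:

```sas
/* List each ID that appears more than once, with its record count */
proc sql;
  select ID, count(*) as NRecords
  from dataset
  group by ID
  having count(*) > 1;
quit;
```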

 

Edited to add:

The SQL code above is what I use to check each dataset for duplicates (which it seems to find reliably), and I run it again after each merge.

Merge code (throughout and also the last one): 

data merged_dataset; 
      merge dataset1 dataset2;
      by ID;
run;
Attaching the LOG for the last merge + SQL check. The result for this was 1181 UniqueIDs and 1229 NObs, while previously I was getting 1229 UniqueIDs and 1229 NObs.
LOG for the last merge + SQL check
1 ACCEPTED SOLUTION

Accepted Solutions
Quentin
Super User

Are you intending to do a one-to-one merge, i.e. have unique values for ID in both datasets?  If so, the easiest way to check for duplicates is to add a logical assertion to your MERGE step, e.g.:

 

data merged_dataset; 
      merge dataset1 dataset2;
      by ID;
      if NOT (first.ID and last.ID) then put "ERROR: found a duplicate! " ID= ;
run;

Of course if you have a lot of duplicates, that will flood your log with errors.  I wrote a little %ASSERT macro that has a limit on the number of errors it throws.  You would use it like:

 

data merged_dataset; 
      merge dataset1 dataset2;
      by ID;
      %assert(first.ID and last.ID) 
run;
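If you'd rather not pull in the macro, a similar cap on log noise can be hand-coded directly in the DATA step (a sketch; the counter name _dup_count and the limit of 10 are arbitrary choices, not from the paper):

```sas
data merged_dataset;
  merge dataset1 dataset2;
  by ID;
  if not (first.ID and last.ID) then do;
    _dup_count + 1;  /* retained counter of duplicate records seen */
    if _dup_count <= 10 then put "ERROR: found a duplicate! " ID= ;
  end;
  drop _dup_count;
run;
```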

%Assert is in this paper: https://www.lexjansen.com/nesug/nesug12/cc/cc31.pdf, and there is a %DupCk macro in a related paper (https://www.lexjansen.com/sesug/2022/SESUG2022_Paper_187_Final_PDF.pdf) which I use for checking a dataset for duplicates.  So if I do a one-to-many merge, I will often check that the first dataset really is unique, and would code it like:

%dupck(dataset1, by=ID)

data merged_dataset; 
      merge dataset1 dataset2;
      by ID;
run;


7 REPLIES
ballardw
Super User

Show the code for the "final merge" and we might have some ideas.

 

Checking on "counts" does not in any way guarantee that the same values are present in each step.

Stanley3
Fluorite | Level 6
Thanks! Sorry, do you mean the merging code or the SQL code?
ballardw
Super User

Probably both would be a good idea.

If you have a LOG from running the code that would be a bit better as well.

Stanley3
Fluorite | Level 6
Thanks! I edited to add both
ballardw
Super User

@Stanley3 wrote:
Thanks! I edited to add both

I don't see any result from the Proc sql that indicates there are or are not duplicates.

 

 

I also don't understand the comment about "no variables overwritten". Given the code shown and the earlier summary of the Maestra_antro3 data set (same number of observations and variables), the only way nothing gets overwritten is one of these:

1) Data set Crp has only Folio as a variable

2) All the variables in Crp other than Folio duplicate the values already in Maestra_antro3 for that value of Folio, so you just can't see the overwriting that did happen.

 

 

Suggestion:

Stop using code like this until you identify the problem step and cause (and especially do not do this with MERGE):

data maestro_antro3;
   set maestro_antro3;
run;

That completely replaces the source data set, so you can't be sure what actually did happen.

If you make new data sets you can COMPARE them with Proc Compare or other code.

If you routinely use the same set as the source and the output set you have no idea where errors, like duplicates, creep in.
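For example (a sketch using the dataset names mentioned in the log; Maestra_antro4 is a made-up name for the new output), writing each step to a fresh dataset lets PROC COMPARE show exactly what changed:

```sas
/* Write the merge result to a NEW dataset instead of overwriting the source */
data maestra_antro4;
  merge maestra_antro3 crp;
  by Folio;
run;

/* Compare old vs. new to see which variables and observations actually changed */
proc compare base=maestra_antro3 compare=maestra_antro4;
  id Folio;
run;
```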

 

Patrick
Opal | Level 21

You must already have duplicate IDs in at least one of your source tables. You could use the code below to investigate.

data dataset1_dups;
  set dataset1;
  by id;
  if not first.id;
run;

data dataset2_dups;
  set dataset2;
  by id;
  if not first.id;
run;

data merged_dataset merged_dups; 
  merge dataset1 dataset2;
  by ID;
  if first.ID and last.ID then output merged_dataset;
  else output merged_dups;
run;
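Note that the BY statements above assume each input dataset is already sorted by id; if that's not guaranteed, sort first (a sketch):

```sas
proc sort data=dataset1; by id; run;
proc sort data=dataset2; by id; run;
```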

 

