Hi, I'm merging multiple datasets, and at each stage of the process I run the following SQL step to verify that there are no duplicate IDs:

proc sql;
    select count(distinct ID) as UniqueIDs,
           count(*) as NObs
    from dataset;
quit;

This check reliably flags duplicates when a dataset has multiple timepoints, and I make sure to keep only one timepoint per ID. Throughout the merging process, the check confirms that all IDs remain unique (e.g., 1500 observations and 1500 unique IDs). However, after the final merge, the same query unexpectedly shows 1500 observations but only 1300 unique IDs. Manual inspection confirms that duplicates are indeed present, even though the earlier checks showed no such issues. I'm looking for insight into why these duplicates weren't detected sooner by the SQL query, or whether there's a specific merging condition I might have overlooked.

Edited to add: The SQL code above is what I use to check each input dataset for duplicates (which it seems to find reliably) and again after each merge. The merge code (used throughout, including the last merge) is:

data merged_dataset;
    merge dataset1 dataset2;
    by ID;
run;

I'm attaching the log for the last merge plus the SQL check. That run returned 1181 UniqueIDs and 1229 NObs, whereas previously I was getting 1229 UniqueIDs and 1229 NObs.

LOG for the last merge + SQL check
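In case it helps others diagnose this, here is a sketch of a query to list exactly which IDs are duplicated after the merge (assuming the output dataset is named merged_dataset, as in the merge step above):

```
proc sql;
    /* List each ID that appears more than once in the merged data,
       with the number of rows it contributes */
    select ID, count(*) as NRows
    from merged_dataset
    group by ID
    having count(*) > 1
    order by NRows desc;
quit;
```

Also worth noting: if duplicate BY values exist in more than one of the datasets on the MERGE statement (a many-to-many situation), the SAS log prints "NOTE: MERGE statement has more than one data set with repeats of BY values", so the attached log may already point at the offending step.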