Solved: Re: Select distinct selecting duplicate records

Babloo · Posted 02-28-2023 09:41 PM

I've the following code to select distinct records from the database..

Unfortunately, I spot-checked some data in the resultant table and discovered that there were duplicates. But it would be ideal to have no duplicates. The duplicates that I spotted were where the one of the field (class) appears to be blank. I'm querying in Snowflake from SAS.

 create table lib.&table.as
        select * from connection to database (
          select distinct &variables
          from &source_table
        )
      ;

The total row count for the table using the `proc sql select distinct` method is 31091. I ran the `proc summary` code for that specific type, and it had a total row count of 31083.

All 8 of those ‘extra’ rows in the `proc sql` table are the duplicate records where the class field appears to be blank.
I suspect there are some embedded control characters or something that are making the `select distinct` query identify records as unique.

Any help to resolve the issue to create the table without any duplicates?

Patrick · Posted 03-02-2023 06:30 PM

@Babloo SAS doesn't have a concept of NULL but just of missing. Column values in NULL state or only storing blanks on the database side get all converted to missing values on the SAS side and that's why the DISTINCT on the SAS side then de-dups the rows as you want it.

View solution in original post

Babloo · Posted 02-28-2023 11:09 PM

I have tried to use compress function in the field where there is blank in select distinct but still the field is being selected. What would be the likely cause for the issue and any help to resolve the issue?

create table lib.&table.as
        select * from connection to database (
          select distinct compress(&variables)
          from &source_table
        )
      ;

Patrick · Posted 02-28-2023 11:31 PM

@Babloo Many databases make a distinction between NULL and a BLANK, or even between strings containing a different number of blanks.

Your idea to somehow "compress" the variables heads into the right direction but won't work this way starting with compress() being a SAS function that might not exist in database native SQL or have a different meaning and syntax.

Which database are you connecting to?

Babloo · Posted 02-28-2023 11:34 PM

@Patrick I'm querying in Snowflake from SAS. I'm not certain how to resolve this issue. Any help?

Patrick · Posted 02-28-2023 11:42 PM

@Babloo wrote:

@Patrick I'm querying in Snowflake from SAS. I'm not certain how to resolve this issue. Any help?

I would have to Google Snowflake sites to try and figure out how that's done in this database.

If the distinct on the database side reduces volumes a lot and the resulting volumes are not that massive then the quickest thing to de-dup would be to add another Distinct on the SAS side.

create table lib.&table.as
        select DISTINCT * from connection to database (
          select distinct &variables
          from &source_table
        )
      ;

....and just as a thought which I don't know if it would work... With many databases SAS/Access pushes a DISTINCT to the database if used in implicit SQL. You could try to define a libname to Snowflake and then just write a SAS SQL with a distinct in it and test if SAS handles the challenge for you.

Babloo · Posted 03-01-2023 02:49 AM

I'm interested to know if we have any other alternative method to remove duplicates

Patrick · Posted 03-01-2023 02:58 AM

Why?

With SAS: proc sort data=table nodupkey; by _all_; run;

or getting the rows from the DB with an order by clause where you sort by all variables and then a SAS data step with if first.<variable> and the variable you use is the very last one in your sort order.

And just did a brief search for Snowflake:

Guess what "one NULL is not equal to another NULL" means is that all rows with a least one column in NULL state will be kept if using DISTINCT.

Babloo · Posted 03-01-2023 08:29 PM

Adding one more 'distinct' as Patrick mentioned has resolved the issue. Is there a way to understand what is causing the duplicate records to be kept in that (Snowflake) table?

Patrick · Posted 03-01-2023 09:25 PM

@Babloo wrote:
Adding one more 'distinct' as Patrick mentioned has resolved the issue. Is there a way to understand what is causing the duplicate records to be kept in that (Snowflake) table?

That's already answered in my post with the screenshot and highlighted portions.

Babloo · Posted 03-02-2023 12:21 PM

@Patrick So "one NULL is not equal to another NULL" in Snowflake whereas in SAS all NULLs are same and equal? I could not locate the good document to understand about the NULLs in SAS.

Tom · Posted 03-02-2023 12:29 PM

@Babloo wrote:

@Patrick So "one NULL is not equal to another NULL" in Snowflake whereas in SAS all NULLs are same and equal? I could not locate the good document to understand about the NULLs in SAS.

If the null values are the issue (you could do some experiments to see) then it is because SNOWFLAKE (and most database systems) use TRI-level logic instead of BINARY logic.

In these tri-level logic system the answer to a boolean question, such as does A equal B, can have THREE answers instead of just TWO. When either of the values is NULL then the expression is neither TRUE nor FALSE.

But in SAS the answer is always either TRUE or FALSE. Two strings that only contain spaces are equal. Two number that have the same missing value (SAS has 27 different missing numeric values) are equal. A missing value does not equal an actual value.

If the issue is with strings with spaces then the issue is that SAS uses only FIXED length strings. So when it compares strings is ignores the trailing spaces. So 'A' is equal to 'A '. But in databases that allow VARIABLE length strings those are two different vlaues.

Patrick · Posted 03-02-2023 06:30 PM

@Babloo SAS doesn't have a concept of NULL but just of missing. Column values in NULL state or only storing blanks on the database side get all converted to missing values on the SAS side and that's why the DISTINCT on the SAS side then de-dups the rows as you want it.

Classroom Training Available!