BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Babloo
Rhodochrosite | Level 12

I've the following code to select distinct records from the database.. 

Unfortunately, I spot-checked some data in the resultant table and discovered that there were duplicates. But it would be ideal to have no duplicates. The duplicates that I spotted were where the one of the field (class) appears to be blank. I'm querying in Snowflake from SAS.

 

 create table lib.&table.as
        select * from connection to database (
          select distinct &variables
          from &source_table
        )
      ;
      

The total row count for the table using the `proc sql select distinct` method is 31091. I ran the `proc summary` code for that specific type, and it had a total row count of 31083.


All 8 of those ‘extra’ rows in the `proc sql` table are the duplicate records where the class field appears to be blank.
I suspect there are some embedded control characters or something that are making the `select distinct` query identify records as unique.


Any help to resolve the issue to create the table without any duplicates?

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@Babloo SAS doesn't have a concept of NULL but just of missing. Column values in NULL state or only storing blanks on the database side get all converted to missing values on the SAS side and that's why the DISTINCT on the SAS side then de-dups the rows as you want it.

View solution in original post

11 REPLIES 11
Babloo
Rhodochrosite | Level 12

I have tried to use compress function in the field where there is blank in select distinct but still the field is being selected. What would be the likely cause for the issue and any help to resolve the issue?

 

create table lib.&table.as
        select * from connection to database (
          select distinct compress(&variables)
          from &source_table
        )
      ;
      
Patrick
Opal | Level 21

@Babloo Many databases make a distinction between NULL and a BLANK, or even between strings containing a different number of blanks.

Your idea to somehow "compress" the variables heads into the right direction but won't work this way starting with compress() being a SAS function that might not exist in database native SQL or have a different meaning and syntax.

 

Which database are you connecting to?

Babloo
Rhodochrosite | Level 12

@Patrick I'm querying in Snowflake from SAS. I'm not certain how to resolve this issue. Any help?

Patrick
Opal | Level 21

@Babloo wrote:

@Patrick I'm querying in Snowflake from SAS. I'm not certain how to resolve this issue. Any help?


I would have to Google Snowflake sites to try and figure out how that's done in this database. 

If the distinct on the database side reduces volumes a lot and the resulting volumes are not that massive then the quickest thing to de-dup would be to add another Distinct on the SAS side.

create table lib.&table.as
        select DISTINCT * from connection to database (
          select distinct &variables
          from &source_table
        )
      ;

 ....and just as a thought which I don't know if it would work... With many databases SAS/Access pushes a DISTINCT to the database if used in implicit SQL. You could try to define a libname to Snowflake and then just write a SAS SQL with a distinct in it and test if SAS handles the challenge for you.

Babloo
Rhodochrosite | Level 12
I'm interested to know if we have any other alternative method to remove duplicates
Patrick
Opal | Level 21

Why?

With SAS: proc sort data=table nodupkey; by _all_; run;

or getting the rows from the DB with an order by clause where you sort by all variables and then a SAS data step with if first.<variable> and the variable you use is the very last one in your sort order.

 

And just did a brief search for Snowflake:

Patrick_0-1677657748116.png

Guess what "one NULL is not equal to another NULL" means is that all rows with a least one column in NULL state will be kept if using DISTINCT.

 

Babloo
Rhodochrosite | Level 12
Adding one more 'distinct' as Patrick mentioned has resolved the issue. Is there a way to understand what is causing the duplicate records to be kept in that (Snowflake) table?
Patrick
Opal | Level 21

@Babloo wrote:
Adding one more 'distinct' as Patrick mentioned has resolved the issue. Is there a way to understand what is causing the duplicate records to be kept in that (Snowflake) table?

That's already answered in my post with the screenshot and highlighted portions.

Babloo
Rhodochrosite | Level 12

@Patrick  So "one NULL is not equal to another NULL" in Snowflake whereas in SAS all NULLs are same and equal? I could not locate the good document to understand about the NULLs in SAS. 

Tom
Super User Tom
Super User

@Babloo wrote:

@Patrick  So "one NULL is not equal to another NULL" in Snowflake whereas in SAS all NULLs are same and equal? I could not locate the good document to understand about the NULLs in SAS. 


If the null values are the issue (you could do some experiments to see) then it is because SNOWFLAKE (and most database systems) use TRI-level logic instead of BINARY logic.

 

In these tri-level logic system the answer to a boolean question, such as does A equal B, can have THREE answers instead of just TWO.  When either of the values is NULL then the expression is neither TRUE nor FALSE.

 

But in SAS the answer is always either TRUE or FALSE.  Two strings that only contain spaces are equal.  Two number that have the same missing value (SAS has 27 different missing numeric values) are equal.  A missing value does not equal an actual value.

 

If the issue is with strings with spaces then the issue is that SAS uses only FIXED length strings.  So when it compares strings is ignores the trailing spaces.  So 'A' is equal to 'A    '.  But in databases that allow VARIABLE length strings those are two different vlaues.

Patrick
Opal | Level 21

@Babloo SAS doesn't have a concept of NULL but just of missing. Column values in NULL state or only storing blanks on the database side get all converted to missing values on the SAS side and that's why the DISTINCT on the SAS side then de-dups the rows as you want it.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1939 views
  • 7 likes
  • 3 in conversation