BookmarkSubscribeRSS Feed
jeremy4
Quartz | Level 8

Hi,

 

I have three variables in my dataset:

Account_id

Month

Counted (I used the retain statement, so that the Counted variable in the 'test' dataset has a value of 'Yes' for all of the observations in the dataset).

 

If I only want distinct values of account_id and month only (i.e. keep only one of the observations when there is BOTH the same account_ID AND month) but exclude the Counted variable from being used for distinct (as all of the observations will have a value of 'Yes' in the dataset), is the code below correct? Thanks!

 

proc sql;
   create table unique_accounts as
   select distinct account_id
            ,month

            ,Counted
   from test;

quit;

 

Partial example 

Account ID      Month      Counted

1                      201801       Yes

2                      201807       Yes

3                      201804       Yes

4                      201809       Yes

2                      201807       Yes

 

Code should only keep one observation (account ID = 2, month=201807, Counted = Yes). 

Account ID      Month      Counted

1                      201801       Yes

2                      201807       Yes

3                      201804       Yes

4                      201809       Yes

2                      201807       Yes

 

I used the following code below (which works), but I also want the 'unique_accounts' table to include the variable 'Counted' (though 'Counted' should not be part of the distinct').

proc sql;
   create table unique_accounts as
   select distinct account_id
            ,month
   from test;

quit;

 

 

 

 

 

 

 

 

10 REPLIES 10
PaigeMiller
Diamond | Level 26

@jeremy4 wrote:

I have three variables in my dataset:

Account_id

Month

Counted (I used the retain statement, so that the Counted variable in the 'test' dataset has a value of 'Yes' for all of the observations in the dataset).

 

If I only want distinct values of account_id and month only (i.e. keep only one of the observations when there is BOTH the same account_ID AND month) but exclude the Counted variable from being used for distinct (as all of the observations will have a value of 'Yes' in the dataset), can someone please correct the code below (I have got an error message)? Thanks!

 

I used the following code below (which works), but I also want the 'unique_accounts' table to include the variable 'Counted' (though 'Counted' should not be part of the distinct').

proc sql;
   create table unique_accounts as
   select distinct account_id
            ,month
   from test;

quit;

 

 


Distinct applies to all variables in the SELECT statement. So it sounds like you need to take a different approach. If you KNOW (which you do) that COUNTED is YES for every row, then the solution is simple.

 

proc sql;
   create table unique_accounts as
   select distinct account_id
        ,month
        ,"YES" as counted
   from test;
quit;

 

--
Paige Miller
jeremy4
Quartz | Level 8

Thanks for your reply. Think about it again, would it matter whether I include 'counted' for distinct variables, because my understanding is that as counted = 'Yes' for every observation, it would not affect the final result as the account_id and month variables are effectively where any potential duplicates would be identified (i.e. if two observations have the same account_id and month, the 'counted' variable would not make a difference as all of the observations are 'Yes' anyway)?

 

proc sql;
   create table unique_accounts as
   select distinct account_id
            ,month

            ,counted
   from test;

quit;

PaigeMiller
Diamond | Level 26

@jeremy4 wrote:

... would it matter whether I include 'counted' for distinct variables, because my understanding is that as counted = 'Yes' for every observation, it would not affect the final result as the account_id and month variables are effectively where any potential duplicates would be identified (i.e. if two observations have the same account_id and month, the 'counted' variable would not make a difference as all of the observations are 'Yes' anyway)?

 

proc sql;
   create table unique_accounts as
   select distinct account_id
            ,month

            ,counted
   from test;

quit;


The easiest answer is to run the code and find out.

--
Paige Miller
novinosrin
Tourmaline | Level 20

Hi @jeremy4   Unique counts??

 

Can you post a better representative sample plz? Your current sample is confusing

jeremy4
Quartz | Level 8

Hi,

 

There are 200,000 accounts in my dataset and I have been told that there are duplicates in the original dataset (only containing account_ID and month). I used a retain statement, so that the updated dataset now contains three variables (account_ID, month and Counted). Counted has a value of 'Yes' for all 200,000 observations.

 

As there are 200,000 accounts in my dataset, I was wondering how to use 'distinct' in proc sql, so that where observations match, only one is kept.

 

Partial example 

Account ID      Month      Counted

1                      201801       Yes

2                      201807       Yes

3                      201804       Yes

4                      201809       Yes

2                      201807       Yes

5                      201810       Yes

3                      201804       Yes

 

Desired proc sql table output (duplicates removed). Pulls in all three variables (account_id, month and counted) but when there are duplicate observations, only one is kept. Effectively, this means only looking at distinct values of account_ID and month, as if they are different or duplicated, the counted variable will not matter as all observations have a value of 'Yes' anyway.

Account ID      Month      Counted

1                      201801       Yes

2                      201807       Yes

3                      201804       Yes

4                      201809       Yes

2                      201807       Yes

5                      201810       Yes

3                      201804       Yes

 

Can someone explain the difference in outcome if I used the two versions of code, and which one would be best to created the output required above?

 

Version 1

proc sql;
   create table unique_accounts as
   select distinct account_id
            ,month

            ,counted
   from test;

quit;

 

Version 2

proc sql;
   create table unique_accounts as
   select distinct account_id
        ,month
        ,"YES" as counted
   from test;
quit;

 

 

 

gamotte
Rhodochrosite | Level 12
I am missing something. Since Counted is the same for all observations, what is the problem with "select distinct AccountId, Month, Counted" ?
jeremy4
Quartz | Level 8
That's my question as Counted is the same for all observations, so will it identify distinct observations and produce the required output based on version 1, or does version 2 have to be used (as suggested in a reply)?
PaigeMiller
Diamond | Level 26

@jeremy4 wrote:
That's my question as Counted is the same for all observations, so will it identify distinct observations and produce the required output based on version 1, or does version 2 have to be used (as suggested in a reply)?

Why don't you try them and find out? You'll have your answer in about 10 seconds.

--
Paige Miller
novinosrin
Tourmaline | Level 20

Assuming you have a variable Counted in the input, dataset, your version is rather straight forward. What select distinct does is

1. sort

2. eliminate

based upon values in respective position

 

Version 2:

No counted variable in input dataset

So the process is

1. Assignment statement "yes" as counted will execute first

2. Sort

3. Eliminate

 

Now @jeremy4  you can choose

PhilC
Rhodochrosite | Level 12

Perhaps Proc SQL does not have this functionality, I suggest you read this article:

Difference between NOdup and NoDupkey..??

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 13710 views
  • 0 likes
  • 5 in conversation