BookmarkSubscribeRSS Feed
jeremy4
Quartz | Level 8

Hi,

 

I have three variables in my dataset:

Account_id

Month

Counted (I used the retain statement, so that the Counted variable in the 'test' dataset has a value of 'Yes' for all of the observations in the dataset).

 

If I only want distinct values of account_id and month only (i.e. keep only one of the observations when there is BOTH the same account_ID AND month) but exclude the Counted variable from being used for distinct (as all of the observations will have a value of 'Yes' in the dataset), is the code below correct? Thanks!

 

proc sql;
   create table unique_accounts as
   select distinct account_id
            ,month

            ,Counted
   from test;

quit;

 

Partial example 

Account ID      Month      Counted

1                      201801       Yes

2                      201807       Yes

3                      201804       Yes

4                      201809       Yes

2                      201807       Yes

 

Code should only keep one observation (account ID = 2, month=201807, Counted = Yes). 

Account ID      Month      Counted

1                      201801       Yes

2                      201807       Yes

3                      201804       Yes

4                      201809       Yes

2                      201807       Yes

 

I used the following code below (which works), but I also want the 'unique_accounts' table to include the variable 'Counted' (though 'Counted' should not be part of the distinct').

proc sql;
   create table unique_accounts as
   select distinct account_id
            ,month
   from test;

quit;

 

 

 

 

 

 

 

 

10 REPLIES 10
PaigeMiller
Diamond | Level 26

@jeremy4 wrote:

I have three variables in my dataset:

Account_id

Month

Counted (I used the retain statement, so that the Counted variable in the 'test' dataset has a value of 'Yes' for all of the observations in the dataset).

 

If I only want distinct values of account_id and month only (i.e. keep only one of the observations when there is BOTH the same account_ID AND month) but exclude the Counted variable from being used for distinct (as all of the observations will have a value of 'Yes' in the dataset), can someone please correct the code below (I have got an error message)? Thanks!

 

I used the following code below (which works), but I also want the 'unique_accounts' table to include the variable 'Counted' (though 'Counted' should not be part of the distinct').

proc sql;
   create table unique_accounts as
   select distinct account_id
            ,month
   from test;

quit;

 

 


Distinct applies to all variables in the SELECT statement. So it sounds like you need to take a different approach. If you KNOW (which you do) that COUNTED is YES for every row, then the solution is simple.

 

proc sql;
   create table unique_accounts as
   select distinct account_id
        ,month
        ,"YES" as counted
   from test;
quit;

 

--
Paige Miller
jeremy4
Quartz | Level 8

Thanks for your reply. Think about it again, would it matter whether I include 'counted' for distinct variables, because my understanding is that as counted = 'Yes' for every observation, it would not affect the final result as the account_id and month variables are effectively where any potential duplicates would be identified (i.e. if two observations have the same account_id and month, the 'counted' variable would not make a difference as all of the observations are 'Yes' anyway)?

 

proc sql;
   create table unique_accounts as
   select distinct account_id
            ,month

            ,counted
   from test;

quit;

PaigeMiller
Diamond | Level 26

@jeremy4 wrote:

... would it matter whether I include 'counted' for distinct variables, because my understanding is that as counted = 'Yes' for every observation, it would not affect the final result as the account_id and month variables are effectively where any potential duplicates would be identified (i.e. if two observations have the same account_id and month, the 'counted' variable would not make a difference as all of the observations are 'Yes' anyway)?

 

proc sql;
   create table unique_accounts as
   select distinct account_id
            ,month

            ,counted
   from test;

quit;


The easiest answer is to run the code and find out.

--
Paige Miller
novinosrin
Tourmaline | Level 20

Hi @jeremy4   Unique counts??

 

Can you post a better representative sample plz? Your current sample is confusing

jeremy4
Quartz | Level 8

Hi,

 

There are 200,000 accounts in my dataset and I have been told that there are duplicates in the original dataset (only containing account_ID and month). I used a retain statement, so that the updated dataset now contains three variables (account_ID, month and Counted). Counted has a value of 'Yes' for all 200,000 observations.

 

As there are 200,000 accounts in my dataset, I was wondering how to use 'distinct' in proc sql, so that where observations match, only one is kept.

 

Partial example 

Account ID      Month      Counted

1                      201801       Yes

2                      201807       Yes

3                      201804       Yes

4                      201809       Yes

2                      201807       Yes

5                      201810       Yes

3                      201804       Yes

 

Desired proc sql table output (duplicates removed). Pulls in all three variables (account_id, month and counted) but when there are duplicate observations, only one is kept. Effectively, this means only looking at distinct values of account_ID and month, as if they are different or duplicated, the counted variable will not matter as all observations have a value of 'Yes' anyway.

Account ID      Month      Counted

1                      201801       Yes

2                      201807       Yes

3                      201804       Yes

4                      201809       Yes

2                      201807       Yes

5                      201810       Yes

3                      201804       Yes

 

Can someone explain the difference in outcome if I used the two versions of code, and which one would be best to created the output required above?

 

Version 1

proc sql;
   create table unique_accounts as
   select distinct account_id
            ,month

            ,counted
   from test;

quit;

 

Version 2

proc sql;
   create table unique_accounts as
   select distinct account_id
        ,month
        ,"YES" as counted
   from test;
quit;

 

 

 

gamotte
Rhodochrosite | Level 12
I am missing something. Since Counted is the same for all observations, what is the problem with "select distinct AccountId, Month, Counted" ?
jeremy4
Quartz | Level 8
That's my question as Counted is the same for all observations, so will it identify distinct observations and produce the required output based on version 1, or does version 2 have to be used (as suggested in a reply)?
PaigeMiller
Diamond | Level 26

@jeremy4 wrote:
That's my question as Counted is the same for all observations, so will it identify distinct observations and produce the required output based on version 1, or does version 2 have to be used (as suggested in a reply)?

Why don't you try them and find out? You'll have your answer in about 10 seconds.

--
Paige Miller
novinosrin
Tourmaline | Level 20

Assuming you have a variable Counted in the input, dataset, your version is rather straight forward. What select distinct does is

1. sort

2. eliminate

based upon values in respective position

 

Version 2:

No counted variable in input dataset

So the process is

1. Assignment statement "yes" as counted will execute first

2. Sort

3. Eliminate

 

Now @jeremy4  you can choose

PhilC
Rhodochrosite | Level 12

Perhaps Proc SQL does not have this functionality, I suggest you read this article:

Difference between NOdup and NoDupkey..??

 

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 13725 views
  • 0 likes
  • 5 in conversation