proc sql select certain distinct variables only

jeremy4 · Posted 09-20-2019 07:15 AM

Hi,

I have three variables in my dataset:

Account_id

Month

Counted (I used the retain statement, so that the Counted variable in the 'test' dataset has a value of 'Yes' for all of the observations in the dataset).

If I only want distinct values of account_id and month only (i.e. keep only one of the observations when there is BOTH the same account_ID AND month) but exclude the Counted variable from being used for distinct (as all of the observations will have a value of 'Yes' in the dataset), is the code below correct? Thanks!

proc sql;
create table unique_accounts as
select distinct account_id
,month

,Counted
from test;

quit;

Partial example

Account ID Month Counted

1 201801 Yes

2 201807 Yes

3 201804 Yes

4 201809 Yes

2 201807 Yes

Code should only keep one observation (account ID = 2, month=201807, Counted = Yes).

Account ID Month Counted

1 201801 Yes

2 201807 Yes

3 201804 Yes

4 201809 Yes

~~2 201807 Yes~~

I used the following code below (which works), but I also want the 'unique_accounts' table to include the variable 'Counted' (though 'Counted' should not be part of the distinct').

proc sql;
create table unique_accounts as
select distinct account_id
,month
from test;

quit;

PaigeMiller · Posted 09-20-2019 07:31 AM

@jeremy4 wrote:

I have three variables in my dataset:

Account_id

Month

Counted (I used the retain statement, so that the Counted variable in the 'test' dataset has a value of 'Yes' for all of the observations in the dataset).

If I only want distinct values of account_id and month only (i.e. keep only one of the observations when there is BOTH the same account_ID AND month) but exclude the Counted variable from being used for distinct (as all of the observations will have a value of 'Yes' in the dataset), can someone please correct the code below (I have got an error message)? Thanks!

I used the following code below (which works), but I also want the 'unique_accounts' table to include the variable 'Counted' (though 'Counted' should not be part of the distinct').

proc sql;
create table unique_accounts as
select distinct account_id
,month
from test;

quit;

Distinct applies to all variables in the SELECT statement. So it sounds like you need to take a different approach. If you KNOW (which you do) that COUNTED is YES for every row, then the solution is simple.

proc sql;
   create table unique_accounts as
   select distinct account_id
        ,month
        ,"YES" as counted
   from test;
quit;

--
Paige Miller

jeremy4 · Posted 09-20-2019 07:57 AM

Thanks for your reply. Think about it again, would it matter whether I include 'counted' for distinct variables, because my understanding is that as counted = 'Yes' for every observation, it would not affect the final result as the account_id and month variables are effectively where any potential duplicates would be identified (i.e. if two observations have the same account_id and month, the 'counted' variable would not make a difference as all of the observations are 'Yes' anyway)?

proc sql;
create table unique_accounts as
select distinct account_id
,month

,counted
from test;

quit;

PaigeMiller · Posted 09-20-2019 08:11 AM

@jeremy4 wrote:

... would it matter whether I include 'counted' for distinct variables, because my understanding is that as counted = 'Yes' for every observation, it would not affect the final result as the account_id and month variables are effectively where any potential duplicates would be identified (i.e. if two observations have the same account_id and month, the 'counted' variable would not make a difference as all of the observations are 'Yes' anyway)?

proc sql;
create table unique_accounts as
select distinct account_id
,month

,counted
from test;

quit;

The easiest answer is to run the code and find out.

--
Paige Miller

novinosrin · Posted 09-20-2019 07:54 AM

Hi @jeremy4 Unique counts??

Can you post a better representative sample plz? Your current sample is confusing

jeremy4 · Posted 09-20-2019 08:12 AM

Hi,

There are 200,000 accounts in my dataset and I have been told that there are duplicates in the original dataset (only containing account_ID and month). I used a retain statement, so that the updated dataset now contains three variables (account_ID, month and Counted). Counted has a value of 'Yes' for all 200,000 observations.

As there are 200,000 accounts in my dataset, I was wondering how to use 'distinct' in proc sql, so that where observations match, only one is kept.

Partial example

Account ID Month Counted

1 201801 Yes

2 201807 Yes

3 201804 Yes

4 201809 Yes

2 201807 Yes

5 201810 Yes

3 201804 Yes

Desired proc sql table output (duplicates removed). Pulls in all three variables (account_id, month and counted) but when there are duplicate observations, only one is kept. Effectively, this means only looking at distinct values of account_ID and month, as if they are different or duplicated, the counted variable will not matter as all observations have a value of 'Yes' anyway.

Account ID Month Counted

1 201801 Yes

2 201807 Yes

3 201804 Yes

4 201809 Yes

~~2 201807 Yes~~

5 201810 Yes

~~3 201804 Yes~~

Can someone explain the difference in outcome if I used the two versions of code, and which one would be best to created the output required above?

Version 1

proc sql;
create table unique_accounts as
select distinct account_id
,month

,counted
from test;

quit;

Version 2

proc sql;
   create table unique_accounts as
   select distinct account_id
        ,month
        ,"YES" as counted
   from test;
quit;

gamotte · Posted 09-20-2019 08:19 AM

I am missing something. Since Counted is the same for all observations, what is the problem with "select distinct AccountId, Month, Counted" ?

jeremy4 · Posted 09-20-2019 08:25 AM

That's my question as Counted is the same for all observations, so will it identify distinct observations and produce the required output based on version 1, or does version 2 have to be used (as suggested in a reply)?

PaigeMiller · Posted 09-20-2019 08:29 AM

@jeremy4 wrote:
That's my question as Counted is the same for all observations, so will it identify distinct observations and produce the required output based on version 1, or does version 2 have to be used (as suggested in a reply)?

Why don't you try them and find out? You'll have your answer in about 10 seconds.

--
Paige Miller

novinosrin · Posted 09-20-2019 08:26 AM

Assuming you have a variable Counted in the input, dataset, your version is rather straight forward. What select distinct does is

1. sort

2. eliminate

based upon values in respective position

Version 2:

No counted variable in input dataset

So the process is

1. Assignment statement "yes" as counted will execute first

2. Sort

3. Eliminate

Now @jeremy4 you can choose

PhilC · Posted 09-20-2019 08:27 AM

Perhaps Proc SQL does not have this functionality, I suggest you read this article:

Difference between NOdup and NoDupkey..??

proc sql select certain distinct variables only

Re: proc sql select certain distinct variables only

Re: proc sql select certain distinct variables only

Re: proc sql select certain distinct variables only

Re: proc sql select certain distinct variables only

Re: proc sql select certain distinct variables only

Re: proc sql select certain distinct variables only

Re: proc sql select certain distinct variables only

Re: proc sql select certain distinct variables only

Re: proc sql select certain distinct variables only

Re: proc sql select certain distinct variables only

Registration is open

Call for Content EXTENDED

Registration is open

Call for Content EXTENDED

SAS Training: Just a Click Away