Hi,
I have three variables in my dataset:
Account_id
Month
Counted (I used the retain statement, so that the Counted variable in the 'test' dataset has a value of 'Yes' for all of the observations in the dataset).
If I only want distinct values of account_id and month only (i.e. keep only one of the observations when there is BOTH the same account_ID AND month) but exclude the Counted variable from being used for distinct (as all of the observations will have a value of 'Yes' in the dataset), is the code below correct? Thanks!
proc sql;
create table unique_accounts as
select distinct account_id
,month
,Counted
from test;
quit;
Partial example
Account ID Month Counted
1 201801 Yes
2 201807 Yes
3 201804 Yes
4 201809 Yes
2 201807 Yes
Code should only keep one observation (account ID = 2, month=201807, Counted = Yes).
Account ID Month Counted
1 201801 Yes
2 201807 Yes
3 201804 Yes
4 201809 Yes
2 201807 Yes
I used the following code below (which works), but I also want the 'unique_accounts' table to include the variable 'Counted' (though 'Counted' should not be part of the distinct').
proc sql;
create table unique_accounts as
select distinct account_id
,month
from test;
quit;
@jeremy4 wrote:
I have three variables in my dataset:
Account_id
Month
Counted (I used the retain statement, so that the Counted variable in the 'test' dataset has a value of 'Yes' for all of the observations in the dataset).
If I only want distinct values of account_id and month only (i.e. keep only one of the observations when there is BOTH the same account_ID AND month) but exclude the Counted variable from being used for distinct (as all of the observations will have a value of 'Yes' in the dataset), can someone please correct the code below (I have got an error message)? Thanks!
I used the following code below (which works), but I also want the 'unique_accounts' table to include the variable 'Counted' (though 'Counted' should not be part of the distinct').
proc sql;
create table unique_accounts as
select distinct account_id
,month
from test;quit;
Distinct applies to all variables in the SELECT statement. So it sounds like you need to take a different approach. If you KNOW (which you do) that COUNTED is YES for every row, then the solution is simple.
proc sql;
create table unique_accounts as
select distinct account_id
,month
,"YES" as counted
from test;
quit;
Thanks for your reply. Think about it again, would it matter whether I include 'counted' for distinct variables, because my understanding is that as counted = 'Yes' for every observation, it would not affect the final result as the account_id and month variables are effectively where any potential duplicates would be identified (i.e. if two observations have the same account_id and month, the 'counted' variable would not make a difference as all of the observations are 'Yes' anyway)?
proc sql;
create table unique_accounts as
select distinct account_id
,month
,counted
from test;
quit;
@jeremy4 wrote:
... would it matter whether I include 'counted' for distinct variables, because my understanding is that as counted = 'Yes' for every observation, it would not affect the final result as the account_id and month variables are effectively where any potential duplicates would be identified (i.e. if two observations have the same account_id and month, the 'counted' variable would not make a difference as all of the observations are 'Yes' anyway)?
proc sql;
create table unique_accounts as
select distinct account_id
,month,counted
from test;quit;
The easiest answer is to run the code and find out.
Hi @jeremy4 Unique counts??
Can you post a better representative sample plz? Your current sample is confusing
Hi,
There are 200,000 accounts in my dataset and I have been told that there are duplicates in the original dataset (only containing account_ID and month). I used a retain statement, so that the updated dataset now contains three variables (account_ID, month and Counted). Counted has a value of 'Yes' for all 200,000 observations.
As there are 200,000 accounts in my dataset, I was wondering how to use 'distinct' in proc sql, so that where observations match, only one is kept.
Partial example
Account ID Month Counted
1 201801 Yes
2 201807 Yes
3 201804 Yes
4 201809 Yes
2 201807 Yes
5 201810 Yes
3 201804 Yes
Desired proc sql table output (duplicates removed). Pulls in all three variables (account_id, month and counted) but when there are duplicate observations, only one is kept. Effectively, this means only looking at distinct values of account_ID and month, as if they are different or duplicated, the counted variable will not matter as all observations have a value of 'Yes' anyway.
Account ID Month Counted
1 201801 Yes
2 201807 Yes
3 201804 Yes
4 201809 Yes
2 201807 Yes
5 201810 Yes
3 201804 Yes
Can someone explain the difference in outcome if I used the two versions of code, and which one would be best to created the output required above?
Version 1
proc sql;
create table unique_accounts as
select distinct account_id
,month
,counted
from test;
quit;
Version 2
proc sql;
create table unique_accounts as
select distinct account_id
,month
,"YES" as counted
from test;
quit;
@jeremy4 wrote:
That's my question as Counted is the same for all observations, so will it identify distinct observations and produce the required output based on version 1, or does version 2 have to be used (as suggested in a reply)?
Why don't you try them and find out? You'll have your answer in about 10 seconds.
Assuming you have a variable Counted in the input, dataset, your version is rather straight forward. What select distinct does is
1. sort
2. eliminate
based upon values in respective position
Version 2:
No counted variable in input dataset
So the process is
1. Assignment statement "yes" as counted will execute first
2. Sort
3. Eliminate
Now @jeremy4 you can choose
Perhaps Proc SQL does not have this functionality, I suggest you read this article:
Difference between NOdup and NoDupkey..??
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.