I performed a UNION on two data sets using Proc SQL. Instead of removing duplicates, it performed a UNION ALL and kept duplicates. The Proc SQL documentation states that UNION should only return distinct values. Why isn't it working as documented?
proc sql;
create table full_customers as
select * from spring_customers
UNION
select * from fall_customers;
QUIT;
spring_customers and fall_customers are just tables I got from previous data steps.
data spring_customers;
set spring;
run;
data fall_customers;
set fall;
run;
Are there variables in the data sets that would have different values, such as may be a date of service? That would make the records different. You may need to subset variables from each table depending on what you want.
Consider:
data work.example1; id = 3; date='01Jan2019'd; run; data work.example2; id = 3; date='01Aug2019'd; run; proc sql; create table work.full as select * from work.example1 union select * from work.example2 ; quit; proc sql; create table work.id as select id from work.example1 union select id from work.example2 ; quit;
The second proc sql only returns one record because the only variables associated with the Union are the ID values, which are duplicated. The first returns both records because the value of the DATE variable varies between the two records and using the Select * says to consider all the variables.
You may also have to consider the CORR option if the order of variables varies between the two:
data work.example3; date='01Aug2019'd; id = 3; run; proc sql; create table work.full2 as select * from work.example2 union select * from work.example3 ; quit; proc sql; create table work.full3 as select * from work.example2 union corr select * from work.example3 ; quit;
Are there variables in the data sets that would have different values, such as may be a date of service? That would make the records different. You may need to subset variables from each table depending on what you want.
Consider:
data work.example1; id = 3; date='01Jan2019'd; run; data work.example2; id = 3; date='01Aug2019'd; run; proc sql; create table work.full as select * from work.example1 union select * from work.example2 ; quit; proc sql; create table work.id as select id from work.example1 union select id from work.example2 ; quit;
The second proc sql only returns one record because the only variables associated with the Union are the ID values, which are duplicated. The first returns both records because the value of the DATE variable varies between the two records and using the Select * says to consider all the variables.
You may also have to consider the CORR option if the order of variables varies between the two:
data work.example3; date='01Aug2019'd; id = 3; run; proc sql; create table work.full2 as select * from work.example2 union select * from work.example3 ; quit; proc sql; create table work.full3 as select * from work.example2 union corr select * from work.example3 ; quit;
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.