DATA Step, Macro, Functions and more

how to dedup data in PROC SQL

Accepted Solution Solved
Reply
Contributor
Posts: 31
Accepted Solution

how to dedup data in PROC SQL

Hello, 

 

Could you please help me to use PROC SQL create a below data TEST_NODUP?  Thank you!

 

 

*--- TEST data;
data TEST;
  input id type cd $2. date sub_id type2 cnt;
  informat date yymmddn10.;
datalines;
574 1 6Z 1/10/2008 689 2 1
574 1 6Z 2/11/2008 616 2 1
574 1 7E 12/19/2008 8117 2 1
574 1 7E 12/19/2008 8917 2 1
574 1 7E 12/19/2008 8927 2 1
574 1 7E 12/19/2008 8422 2 1
574 1 7E 12/19/2008 8864 2 1
;
run;
proc sort data = TEST;
  by id type cd date sub_id type2;
run;

*--- final table;
data TEST_NODUP;
  set TEST;
  by id type cd date sub_id type2;
  if last.date;
  format yymmddn10.; 
run;

TEST_NODUP
574 1 6Z 1/10/2008 689 2 1
574 1 6Z 2/11/2008 616 2 1
574 1 7E 12/19/2008 8927 2 1 

 


Accepted Solutions
Solution
‎02-15-2016 11:43 AM
Super User
Posts: 10,046

Re: how to dedup data in PROC SQL

data TEST;
  input id type cd $ date : mmddyy10. sub_id type2 cnt;
  format date yymmdd10.;
datalines;
574 1 6Z 1/10/2008 689 2 1
574 1 6Z 2/11/2008 616 2 1
574 1 7E 12/19/2008 8117 2 1
574 1 7E 12/19/2008 8917 2 1
574 1 7E 12/19/2008 8927 2 1
574 1 7E 12/19/2008 8422 2 1
574 1 7E 12/19/2008 8864 2 1
;
run;
proc sql;
select distinct *
 from (select distinct * from TEST
        group by id ,type ,cd ,date
         having sub_id=max(sub_id) )
     group by  id ,type ,cd ,date
      having type2=max(type2);
quit;

View solution in original post


All Replies
Super User
Posts: 11,343

Re: how to dedup data in PROC SQL

Proc sql;

   create table test_nodup as

   select distinct *

   from test;

quit;

 

Should generally work.

Contributor
Posts: 31

Re: how to dedup data in PROC SQL

Thank you! This would not work for my case.
Super User
Posts: 3,261

Re: how to dedup data in PROC SQL

[ Edited ]

While your subject says dedup, that's not what your program does.

 

It is actually selecting the most recent record for an ID TYPE CD combination and where there are multiple records for the same date then choosing the highest SUB_ID and TYPE2.

 

In SQL you could do this as a GROUP BY sub-query, but I think the DATA step way is easier.

proc sql;
  create table test_latest as
  select a.*
  from test as a
  inner join
  (select id
             ,type
             ,cd
             ,max(date) as date_max
             ,max(sub_id) as sub_id_max
             ,max(type2) as type2_max
   from test
   group by id
           ,type
           ,cd
   )  as b
   on a.id = b.id
   and a.type = b.type
   and a.cd = b.cd
   and a.date = b.date_max
   and a.sub_id = b.sub_id_max
   and a.type2 = b.type2_max
  ;
quit;  
   

 

Trusted Advisor
Posts: 1,118

Re: how to dedup data in PROC SQL

Hello @Ying,

 

I think, there is no general equivalent to BY-group processing with last.variable in PROC SQL (leaving monotonic() aside).

 

However, if in your real data (like in dataset TEST) variable SUB_ID does not have duplicate values within a ID-TYPE-CD-DATE BY group, you could use the following technique:

proc sql;
create table test_nodup as
select *
from test
group by id, type, cd, date
having sub_id=max(sub_id)
order by id, type, cd, date;
quit;

(The ORDER BY clause should actually be redundant. The SELECT statement could be modified to specify a format for variable DATE, if needed.)

 

If there are duplicates of SUB_ID within some BY groups, but TYPE2 and CNT can serve as "tie-breakers" (i.e., the combination SUB_ID, TYPE2, CNT is unique within each ID-TYPE-CD-DATE BY group), one could think about defining an expression combining values of SUB_ID, TYPE2 and CNT to replace SUB_ID in the HAVING clause. (The details would depend on the possible values of those three variables.)

 

By the way, my SAS 9.4 does not like your INFORMAT and FORMAT statements (for good reasons).

 

Contributor
Posts: 31

Re: how to dedup data in PROC SQL

Posted in reply to FreelanceReinhard

Thank you so much!  

 

The TEST was created by ID-TYPE-CD-DATE (see below), so the way you describe should work.  I was given the code with PROC SQL and data steps and asked to replace the data steps with PROC SQL.  

proc sql;
    create table TEST as
        select id, type, cd, date, sub_id, type2, count(*) as cnt
        from    indsn
        group by 1,2,3,4,5,6          
        order by 1,2,3,4,5,6   ;
quit;

 

Another question, how to just simply rewrite the PROC SORT NODUPKEY data=... to PROC SQL to remove the dup?  

proc sql;
    create table TEST as
        select id, type, cd, date, sub_id, type2, status, dollar, amt
        from    indsn a1,
                   indsn2 a2
        where a1.id = a2.id and
                   a1.type = a2.type and
                   a1.cd = a2.;
quit;

proc sort data=TEST NODUPKEY; by id type cd; run;
Trusted Advisor
Posts: 1,118

Re: how to dedup data in PROC SQL

Your first PROC SQL step shows that the combination of SUB_ID and TYPE2 is unique within the ID-TYPE-CD-DATE BY groups. So, CNT could be ignored in the "combined" expression I mentioned.

 

The PROC SORT step retains the first observation for each ID-TYPE-CD BY group. In PROC SQL, however, we would need to define which observation should be regarded as the "first" in a BY group. If TEST was sorted by DATE, SUB_ID etc. within each BY group, we could use a similar GROUP BY/HAVING approach as for your original question. Otherwise, we might have to resort to the undocumented MONOTONIC function, but I would prefer to avoid that:

proc sql undo_policy=none;
create table test as
select * from test
group by id, type, cd
having monotonic()=min(monotonic());
quit;

(I've learned about the option undo_policy=none from this older thread, but I would recommend to avoid reading and overwriting TEST with the same PROC SQL step.)

 

What is assumed about the values of DATE, SUB_ID, etc. after the PROC SORT step?

Solution
‎02-15-2016 11:43 AM
Super User
Posts: 10,046

Re: how to dedup data in PROC SQL

data TEST;
  input id type cd $ date : mmddyy10. sub_id type2 cnt;
  format date yymmdd10.;
datalines;
574 1 6Z 1/10/2008 689 2 1
574 1 6Z 2/11/2008 616 2 1
574 1 7E 12/19/2008 8117 2 1
574 1 7E 12/19/2008 8917 2 1
574 1 7E 12/19/2008 8927 2 1
574 1 7E 12/19/2008 8422 2 1
574 1 7E 12/19/2008 8864 2 1
;
run;
proc sql;
select distinct *
 from (select distinct * from TEST
        group by id ,type ,cd ,date
         having sub_id=max(sub_id) )
     group by  id ,type ,cd ,date
      having type2=max(type2);
quit;
Contributor
Posts: 31

Re: how to dedup data in PROC SQL

Thank you so much for helping me!!
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 1076 views
  • 2 likes
  • 5 in conversation