BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Ying
Fluorite | Level 6

Hello, 

 

Could you please help me to use PROC SQL create a below data TEST_NODUP?  Thank you!

 

 

*--- TEST data;
data TEST;
  input id type cd $2. date sub_id type2 cnt;
  informat date yymmddn10.;
datalines;
574 1 6Z 1/10/2008 689 2 1
574 1 6Z 2/11/2008 616 2 1
574 1 7E 12/19/2008 8117 2 1
574 1 7E 12/19/2008 8917 2 1
574 1 7E 12/19/2008 8927 2 1
574 1 7E 12/19/2008 8422 2 1
574 1 7E 12/19/2008 8864 2 1
;
run;
proc sort data = TEST;
  by id type cd date sub_id type2;
run;

*--- final table;
data TEST_NODUP;
  set TEST;
  by id type cd date sub_id type2;
  if last.date;
  format yymmddn10.; 
run;

TEST_NODUP
574 1 6Z 1/10/2008 689 2 1
574 1 6Z 2/11/2008 616 2 1
574 1 7E 12/19/2008 8927 2 1 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
data TEST;
  input id type cd $ date : mmddyy10. sub_id type2 cnt;
  format date yymmdd10.;
datalines;
574 1 6Z 1/10/2008 689 2 1
574 1 6Z 2/11/2008 616 2 1
574 1 7E 12/19/2008 8117 2 1
574 1 7E 12/19/2008 8917 2 1
574 1 7E 12/19/2008 8927 2 1
574 1 7E 12/19/2008 8422 2 1
574 1 7E 12/19/2008 8864 2 1
;
run;
proc sql;
select distinct *
 from (select distinct * from TEST
        group by id ,type ,cd ,date
         having sub_id=max(sub_id) )
     group by  id ,type ,cd ,date
      having type2=max(type2);
quit;

View solution in original post

8 REPLIES 8
ballardw
Super User

Proc sql;

   create table test_nodup as

   select distinct *

   from test;

quit;

 

Should generally work.

Ying
Fluorite | Level 6
Thank you! This would not work for my case.
SASKiwi
PROC Star

While your subject says dedup, that's not what your program does.

 

It is actually selecting the most recent record for an ID TYPE CD combination and where there are multiple records for the same date then choosing the highest SUB_ID and TYPE2.

 

In SQL you could do this as a GROUP BY sub-query, but I think the DATA step way is easier.

proc sql;
  create table test_latest as
  select a.*
  from test as a
  inner join
  (select id
             ,type
             ,cd
             ,max(date) as date_max
             ,max(sub_id) as sub_id_max
             ,max(type2) as type2_max
   from test
   group by id
           ,type
           ,cd
   )  as b
   on a.id = b.id
   and a.type = b.type
   and a.cd = b.cd
   and a.date = b.date_max
   and a.sub_id = b.sub_id_max
   and a.type2 = b.type2_max
  ;
quit;  
   

 

FreelanceReinh
Jade | Level 19

Hello @Ying,

 

I think, there is no general equivalent to BY-group processing with last.variable in PROC SQL (leaving monotonic() aside).

 

However, if in your real data (like in dataset TEST) variable SUB_ID does not have duplicate values within a ID-TYPE-CD-DATE BY group, you could use the following technique:

proc sql;
create table test_nodup as
select *
from test
group by id, type, cd, date
having sub_id=max(sub_id)
order by id, type, cd, date;
quit;

(The ORDER BY clause should actually be redundant. The SELECT statement could be modified to specify a format for variable DATE, if needed.)

 

If there are duplicates of SUB_ID within some BY groups, but TYPE2 and CNT can serve as "tie-breakers" (i.e., the combination SUB_ID, TYPE2, CNT is unique within each ID-TYPE-CD-DATE BY group), one could think about defining an expression combining values of SUB_ID, TYPE2 and CNT to replace SUB_ID in the HAVING clause. (The details would depend on the possible values of those three variables.)

 

By the way, my SAS 9.4 does not like your INFORMAT and FORMAT statements (for good reasons).

 

Ying
Fluorite | Level 6

Thank you so much!  

 

The TEST was created by ID-TYPE-CD-DATE (see below), so the way you describe should work.  I was given the code with PROC SQL and data steps and asked to replace the data steps with PROC SQL.  

proc sql;
    create table TEST as
        select id, type, cd, date, sub_id, type2, count(*) as cnt
        from    indsn
        group by 1,2,3,4,5,6          
        order by 1,2,3,4,5,6   ;
quit;

 

Another question, how to just simply rewrite the PROC SORT NODUPKEY data=... to PROC SQL to remove the dup?  

proc sql;
    create table TEST as
        select id, type, cd, date, sub_id, type2, status, dollar, amt
        from    indsn a1,
                   indsn2 a2
        where a1.id = a2.id and
                   a1.type = a2.type and
                   a1.cd = a2.;
quit;

proc sort data=TEST NODUPKEY; by id type cd; run;
FreelanceReinh
Jade | Level 19

Your first PROC SQL step shows that the combination of SUB_ID and TYPE2 is unique within the ID-TYPE-CD-DATE BY groups. So, CNT could be ignored in the "combined" expression I mentioned.

 

The PROC SORT step retains the first observation for each ID-TYPE-CD BY group. In PROC SQL, however, we would need to define which observation should be regarded as the "first" in a BY group. If TEST was sorted by DATE, SUB_ID etc. within each BY group, we could use a similar GROUP BY/HAVING approach as for your original question. Otherwise, we might have to resort to the undocumented MONOTONIC function, but I would prefer to avoid that:

proc sql undo_policy=none;
create table test as
select * from test
group by id, type, cd
having monotonic()=min(monotonic());
quit;

(I've learned about the option undo_policy=none from this older thread, but I would recommend to avoid reading and overwriting TEST with the same PROC SQL step.)

 

What is assumed about the values of DATE, SUB_ID, etc. after the PROC SORT step?

Ksharp
Super User
data TEST;
  input id type cd $ date : mmddyy10. sub_id type2 cnt;
  format date yymmdd10.;
datalines;
574 1 6Z 1/10/2008 689 2 1
574 1 6Z 2/11/2008 616 2 1
574 1 7E 12/19/2008 8117 2 1
574 1 7E 12/19/2008 8917 2 1
574 1 7E 12/19/2008 8927 2 1
574 1 7E 12/19/2008 8422 2 1
574 1 7E 12/19/2008 8864 2 1
;
run;
proc sql;
select distinct *
 from (select distinct * from TEST
        group by id ,type ,cd ,date
         having sub_id=max(sub_id) )
     group by  id ,type ,cd ,date
      having type2=max(type2);
quit;
Ying
Fluorite | Level 6
Thank you so much for helping me!!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 9240 views
  • 2 likes
  • 5 in conversation