DATA Step, Macro, Functions and more

Eliminate duplicate records.

Accepted Solution Solved
Reply
Contributor
Posts: 25
Accepted Solution

Eliminate duplicate records.

Hi all,

I have fields like lnno, cycdt, corrnind and dttm. If there are duplicate records based on lnno & cycdt then i want to sort based on corrnind and dttm and corrnind (usually space or Y) if there are any then that record(corrind record because it takes precedence over dt_tm) should be in want records if it not there then the latest dt_tm should be in the want records. Other records should be eliminated and should go in duplicate dataset

data have:

@001 lnno     $09.
@010 cycno    yymmdd6.
@017 corrind  $1.
@018 dt_tm    $10.

367940841170415Y1704181430
373065728170415 1704181430
175597502170415Y1704181431
175597502170415Y1704181430
199994811170415Y1704181430
019354533170415Y1704181430
180878441170415	1704181430
180878441170515	1704181431
158072200170415	1704181430
158072200170515Y1704181430
158072200170515	1704181430
122492196170415	1704181430
261454099170415	1704181430
261454099170415Y1704181430
143798189170415	1704181430
143798189170415	1704181430
367940841170415	1704181430
373065728170415Y1704181430
430912595170415Y1704181430
000808229170415	1704181430
430912595170415	1704181430
199994811170415	1704181430
199994811170415	1704181430
019354533170415	1704181430
019354533170415Y1704181430
029267455170415Y1704181430
122381807170415	1704181430
029267455170415	1704172150
122382609170415Y1704180100
122382609170415	0000000000


data want:


367940841170415Y1704181430
373065728170415Y1704181430
175597502170415Y1704181431
199994811170415Y1704181430
019354533170415Y1704181430
180878441170515	1704181431
158072200170515Y1704181430
122492196170415	1704181430
261454099170415Y1704181430
143798189170415	1704181430
000808229170415	1704181430
430912595170415Y1704181430
029267455170415Y1704181430
122382609170415Y1704180100


data dup:

367940841170415	1704181430
373065728170415 1704181430
175597502170415Y1704181430
199994811170415	1704181430
199994811170415	1704181430
019354533170415Y1704181430
180878441170415	1704181430
158072200170415	1704181430
158072200170515	1704181430
261454099170415	1704181430
143798189170415	1704181430
430912595170415	1704181430
019354533170415	1704181430
029267455170415	1704172150
122382609170415	0000000000

I may miss a record in the dup but the no of records are not matching.

I really appreciate your help on this.

Thanks
Neal.

Accepted Solutions
Solution
‎07-10-2017 07:37 PM
PROC Star
Posts: 7,363

Re: Eliminate duplicate records.

First, the answer to your question is 'Yes". Secondly, it's more like COBOL than you think, but with some extras.

 

To answer your questions, though, think about my suggested code:

proc sort data=have out=want;
  by lnno  cycno  descending corrind descending dt_tm;
run;

That first sort puts the records in order of inno, cycno, descending corrind (thus Y comes before missing) and finally, the most recent dt_tm records first.

 

The second sort:

proc sort data=want dupout=dups nodupkey;
  by lnno  cycno;
run;

doesn't change the order of the first sort, but it's only looking at inno and cycno, accepts the first record as a non-duplicate, then all ofthers as duplicates.

 

Art, CEO, AnalystFinder.com

View solution in original post


All Replies
Contributor
Posts: 21

Re: Eliminate duplicate records.

proc sql;

 

create table foobar as

select distinct

 lnno,cycno,corrind,dt_tm

from table1;

quit;

Trusted Advisor
Posts: 1,615

Re: Eliminate duplicate records.

PROC SORT with the NODUPKEY option

Trusted Advisor
Posts: 1,378

Re: Eliminate duplicate records.

data have;
input
@001 lnno     $09.
@010 cycno    yymmdd6.
@017 corrind  $1.
@018 dt_tm    $10.
;
datalines;
367940841170415Y1704181430
373065728170415 1704181430
175597502170415Y1704181431
175597502170415Y1704181430
199994811170415Y1704181430
019354533170415Y1704181430
180878441170415	1704181430
180878441170515	1704181431
158072200170415	1704181430
158072200170515Y1704181430
158072200170515	1704181430
122492196170415	1704181430
261454099170415	1704181430
261454099170415Y1704181430
143798189170415	1704181430
143798189170415	1704181430
367940841170415	1704181430
373065728170415Y1704181430
430912595170415Y1704181430
000808229170415	1704181430
430912595170415	1704181430
199994811170415	1704181430
199994811170415	1704181430
019354533170415	1704181430
019354533170415Y1704181430
029267455170415Y1704181430
122381807170415	1704181430
029267455170415	1704172150
122382609170415Y1704180100
122382609170415	0000000000
; run;

proc sort data=have NODUPRECS
         out=distinct     dupout=dups;
  by lnno;
run;

LOG shows: 30 obs in HAVE, 10 duplicates deleted and saved in DUPS dataset, the rest 20 DISTINCT obs in out.

PROC Star
Posts: 7,363

Re: Eliminate duplicate records.

@Shmuel: Are you SURE you want to recommend using NODUPRECS in the proc sort statement? In the last 43 years that I've used SAS I've NEVER found a case where that was a useful option to use. It is quite different than NODUPKEY.

 

Art, CEO, AnalystFinder.com

 

Trusted Advisor
Posts: 1,378

Re: Eliminate duplicate records.


art297 wrote:

@Shmuel: Are you SURE you want to recommend using NODUPRECS in the proc sort statement? In the last 43 years that I've used SAS I've NEVER found a case where that was a useful option to use. It is quite different than NODUPKEY.

 

Art, CEO, AnalystFinder.com

 


@art297, pay attention - @saslovethemost talked about DUPLICATE RECORDS, without mentioning any keys or ids.

Sometimes I had to use NUDUPRECS instead NODUPKEY.

In this case, as keys were not defined, I used sort by just one variable but differed duplicates by all of them.

Contributor
Posts: 25

Re: Eliminate duplicate records.

@Shmuel, the key is lnno and cycdt, if there are any two records with this key then further look for corrnind and dt_tm. If any record has corrind as 'Y' then that should go into want dataset and remaining should go into dup dataset. Further, if there are two records with corrnind as 'Y' then the latest dt_tm should be the one should go to want dataset and remaining should go to dup datset. Hope I have explained it right.

PROC Star
Posts: 7,363

Re: Eliminate duplicate records.

First, some problems with your data and how you read it. I think you want to use:

data have;
input @001 lnno     $09.
@010 cycno    yymmdd6.
@016 corrind  $1.
@017 dt_tm    $10.;

The way to get what you want is to do two sorts: one to order the records, the other the create the two files (want and dups). e.g.:

proc sort data=have out=want;
  by lnno  cycno  descending corrind descending dt_tm;
run;

proc sort data=want dupout=dups nodupkey;
  by lnno  cycno;
run;

However, your desired output doesn't reflect your data. Some examples are:

lnno: 180878441 where it has two different cycno values

lnno: 158072200 where it has two different cycno values

lnno: 122381807 which you don't include in your want file

 

Art, CEO, AnalystFinder.com

 

 

Frequent Contributor
Posts: 83

Re: Eliminate duplicate records.

Neal 

 


data one; input

@001 lnno $09.
@010 cycno yymmdd6.
@016 corrind $1.
@017 dt_tm $10.;

cards;
367940841170415Y1704181430
373065728170415 1704181430
175597502170415Y1704181431
175597502170415Y1704181430
199994811170415Y1704181430
019354533170415Y1704181430
180878441170415 1704181430
180878441170515 1704181431
158072200170415 1704181430
158072200170515Y1704181430
158072200170515 1704181430
122492196170415 1704181430
261454099170415 1704181430
261454099170415Y1704181430
143798189170415 1704181430
143798189170415 1704181430
367940841170415 1704181430
373065728170415Y1704181430
430912595170415Y1704181430
000808229170415 1704181430
430912595170415 1704181430
199994811170415 1704181430
199994811170415 1704181430
019354533170415 1704181430
019354533170415Y1704181430
029267455170415Y1704181430
122381807170415 1704181430
029267455170415 1704172150
122382609170415Y1704180100
122382609170415 0000000000
;
proc sort out=two ; by lnno cycno dt_tm;


proc print; title "two"; run;

 

data want; set two; by lnno cycno dt_tm;
if first.dt_tm and last.dt_tm then output;

proc print; title "unique";

 

data dups; set two; by lnno cycno dt_tm;
if first.dt_tm and not last.dt_tm;

proc print; title "dups"; run;

Frequent Contributor
Posts: 83

Re: Eliminate duplicate records.

further thinking

data dups; set two; by lnno cycno dt_tm;
unique=first.dt_tm and last.dt_tm;
if not unique then output
Contributor
Posts: 25

Re: Eliminate duplicate records.

Art, thanks for the code and looks great.

 

I have a question though, will the code take corrnind into account? If the inpttxn is having corrnind as 'Y' even though that wont have the latest dt_tm the record with corrnind with 'Y' should take precedence and if two records(lnno and cycno) are having corrnind s and dt_tm s then the one with the latest dt_tm should take precedence. I dont see in the code you mentoning something like if corrnind = 'Y' etc..SAS is totally different than legacy COBOL i think.

 

What you said about the follwoing is correct, so it should be in the want dataset if it is a different cycno. If the lnno and cycno is same then only will further processed to look for corrnind and dt_tm.

 

lnno: 180878441 where it has two different cycno values

lnno: 158072200 where it has two different cycno values

lnno: 122381807 which you don't include in your want file

 

I really appreciate your time on this. Cheers.

Solution
‎07-10-2017 07:37 PM
PROC Star
Posts: 7,363

Re: Eliminate duplicate records.

First, the answer to your question is 'Yes". Secondly, it's more like COBOL than you think, but with some extras.

 

To answer your questions, though, think about my suggested code:

proc sort data=have out=want;
  by lnno  cycno  descending corrind descending dt_tm;
run;

That first sort puts the records in order of inno, cycno, descending corrind (thus Y comes before missing) and finally, the most recent dt_tm records first.

 

The second sort:

proc sort data=want dupout=dups nodupkey;
  by lnno  cycno;
run;

doesn't change the order of the first sort, but it's only looking at inno and cycno, accepts the first record as a non-duplicate, then all ofthers as duplicates.

 

Art, CEO, AnalystFinder.com

Contributor
Posts: 25

Re: Eliminate duplicate records.

Thanks for the explanation. If I want to safely come out without executing further sas statements if there are no records in the inpttxn(have) then can I have the following code at the beginning of the program after I read the input(have) file( I think I got this from one of your(Art's) posts).
data _null_;
set have nobs=recs;
if recs = 0 then call execute('endsas;');
run;
Respected Advisor
Posts: 3,892

Re: Eliminate duplicate records.

[ Edited ]

@saslovethemost

The SAS data step will stop iteration immediately at the line where the SET statement tries to read from an empty SAS data set. Therefore the SET statement needs to come after your test IF RECS=0 as else this check will never get executed in case of an empty Have table.

I'd also add a STOP statement so that in case there is data in your HAVE table you're not iterating through the whole dataset for no reason. 

Below code should do the job:

data have;
  stop;
  set sashelp.class;
run;

data _null_;
  if recs=0 then abort;
  stop;
  set have nobs=recs;
run;

 

Instead of STOP; you could also use DS option OBS=0 to not read actual data.

You can also create a specific return code in case you abort your program which in a production environment can support job monitoring (the return code can be used to tell you why the job has been aborted).

/* case: Have with 0 records */
data have;
  stop;
  set sashelp.class;
run;

data _null_;
  if recs=0 then abort return 10;
  set have(obs=0) nobs=recs;
run;
Contributor
Posts: 25

Re: Eliminate duplicate records.

In either case, I am getting the follwoing error with STAT as U999

50          data _null_;
51            if recs=0 then abort return 10;
52            set have(obs=0) nobs=recs;
53          run;

ERROR: Execution terminated by an ABORT statement at line 51 column 19, it specified the RETURN option.
NOTE: Limit set by ERRORS= option reached.  Further errors of this type will not be printed.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: The DATA statement used 0.01 CPU seconds and 13062K.
ERROR: SAS ended due to errors.
       You specified: OPTIONS ERRORABEND;.
ERROR: Errors printed on page 3.
when I used 'call execute ('endsas;'); then I get the following SYSOUT messages.

50          data _null_;
51            if recs=0 then call execute('endsas;');
52            set have(obs=0) nobs=recs;
53          run;

NOTE: There were 0 observations read from the data set WORK.HAVE.
NOTE: The DATA statement used 0.01 CPU seconds and 13206K.

NOTE: The address space has used a maximum of 876K below the line and 17084K above the line.

NOTE: CALL EXECUTE generated line.
53       !
1         + endsas;
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 275 views
  • 3 likes
  • 7 in conversation