SFDonovan
Calcite | Level 5

I've got SAS code that extracts data from a PeopleSoft environment (Oracle). It was written years ago, and I have to decipher it without any documentation and move it into a PeopleCode Application Engine using SQL steps.

One piece of code clearly shows the developer finding duplicates using the following key.

STDNT_KEY = EMPLID || STRM || CLASS_NBR;

DATA DUPS_CDUP_ENRL;

SET CLS_CDUP_ENRL;

BY STDNT_KEY;

IF (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1);

   KEEP OWNER EMPLID STRM SESSION_CODE SUBJECT CATALOG_NBR CLASS_SECTION DUP_CODE;

RUN;

PROC EXPORT DATA=DUPS_CDUP_ENRL

            OUTFILE= "C:\Processes\C_DUP\for_1058\DUPLICATE_CDUP_ENTRIES.xls"

            DBMS=EXCEL2000 REPLACE;

RUN;

In my understanding if FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 1  then he would not have any duplicates, so he appears to only look for any row that is a duplicate.  True?

Again...I have no business rules or documentation so I have to try and understand the intent of the developer via the code.

Later I see the following code...

IF (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) THEN DELETE;

OUTPUT CDUP_TRANS;

RUN;

PROC EXPORT DATA=CDUP_TRANS

            OUTFILE= "C:\Processes\C_DUP\for_1058\CDUP_TRANS.xls"

            DBMS=EXCEL2000 REPLACE;

RUN;

I don't know what CDUP_TRANS output is supposed to represent and don't know  if the developer was only looking for dups STDNT_KEY = 1 for both first and last.  If so why did they exclude (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) from the CDUP_TRANS code? 

Is

(FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR

(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1)

redundant?

1 ACCEPTED SOLUTION

mohamed_zaki
Barite | Level 11

In my understanding if FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 1  then he would not have any duplicates, so he appears to only look for any row that is a duplicate.  True?

That is true. He is keeping the rows for every STDNT_KEY that has more than one entry.


I don't know what CDUP_TRANS output is supposed to represent and don't know  if the developer was only looking for dups STDNT_KEY = 1 for both first and last.  If so why did they exclude (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) from the CDUP_TRANS code?

In this code he reduces each key that has more than one entry to its first entry only, probably for summarizing or reporting.

But the first part of that code is missing from your post, so it could be that he is reporting every STDNT_KEY in the table without duplicates: a key with more than one entry is represented by its first record, and a key with a single entry is kept as well.

Is

(FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR

(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1)

redundant?

No,

(FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... means the first entry in a group that has more than one entry

(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... means the last entry in a group that has more than one entry

*****

For example:

STDNT_KEY

1

2

3

1

1

3

After the data is sorted by STDNT_KEY:

STDNT_KEY

1

1

1

2

3

3

The temporary SAS variables FIRST.STDNT_KEY and LAST.STDNT_KEY are set for each row of the sorted data, where all the entries that share one STDNT_KEY value form a group in itself:

STDNT_KEY

1       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 1's group

1       (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) ... not the first or the last

1       (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... the last entry in the 1's group

2       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 1) ... the first entry in the 2's group and the last

3       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 3's group

3       (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... the last entry in the 3's group
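To see these flags for yourself, a minimal sketch along the following lines may help. The data set names SAMPLE and FLAGS and the variables IS_FIRST and IS_LAST are made up for illustration, and the key is numeric here only to match the example values above. Because FIRST. and LAST. variables are not written to the output data set, the step copies them into ordinary variables.

```sas
/* Hypothetical sample data matching the example keys above */
data sample;
  input stdnt_key;
  datalines;
1
2
3
1
1
3
;
run;

/* BY-group processing expects the data to be sorted by the key */
proc sort data=sample;
  by stdnt_key;
run;

/* FIRST. and LAST. are temporary, so copy them into regular
   variables (hypothetical names) in order to inspect them */
data flags;
  set sample;
  by stdnt_key;
  is_first = first.stdnt_key;
  is_last  = last.stdnt_key;
run;

proc print data=flags noobs;
run;
```

The printed IS_FIRST and IS_LAST columns should line up with the annotations in the list above.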

*********

So

IF (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) THEN OUTPUT;

gives you only the rows whose key occurs more than once (the single entry for key 2 is dropped):

STDNT_KEY

1       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 1's group

1       (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) ... not the first or the last

1       (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... the last entry in the 1's group

3       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 3's group

3       (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... the last entry in the 3's group
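Since the three-way condition above is just the complement of "first and last at the same time", it can probably be written more compactly. The sketch below assumes CLS_CDUP_ENRL is already sorted by STDNT_KEY and leaves out the original KEEP statement. The PROC SQL variant and its output name DUPS_SQL are an illustration only, not something from the original program; it may be a closer starting point for a SQL step, although the syntax would need adapting for Oracle.

```sas
/* Assumes CLS_CDUP_ENRL is already sorted by STDNT_KEY */
data dups_cdup_enrl;
  set cls_cdup_enrl;
  by stdnt_key;
  /* Keep every row whose key occurs more than once:
     drop only rows that are both first and last in their group */
  if not (first.stdnt_key and last.stdnt_key);
run;

/* A PROC SQL sketch of the same idea (hypothetical output name).
   SAS remerges the group count back onto the detail rows and
   writes a note about that to the log. */
proc sql;
  create table dups_sql as
  select *
  from cls_cdup_enrl
  group by stdnt_key
  having count(*) > 1;
quit;
```

Both steps should select the same rows, though PROC SQL does not guarantee the same row order as the DATA step.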


*******

And

IF (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) THEN DELETE;

gives you only the first entry of each group (every row with FIRST.STDNT_KEY = 0 is deleted):

STDNT_KEY

1       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 1's group

2       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 1) ... the first entry in the 2's group and the last

3       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 3's group
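Both deleted cases have FIRST.STDNT_KEY = 0, so the DELETE condition above amounts to keeping the first row of each key. A shorter way to express that might look like the sketch below. The input data set for this later step is not shown in the question, so using CLS_CDUP_ENRL here is an assumption, as is the output name CDUP_TRANS_NODUP.

```sas
/* Assumes the input is sorted by STDNT_KEY; the input data set
   name is a guess, since the original step is not fully shown */
data cdup_trans;
  set cls_cdup_enrl;
  by stdnt_key;
  /* Keep only the first row of each key group */
  if first.stdnt_key;
run;

/* A similar result in one step: NODUPKEY keeps one row per key
   (hypothetical output data set name) */
proc sort data=cls_cdup_enrl out=cdup_trans_nodup nodupkey;
  by stdnt_key;
run;
```

Which row counts as "first" within a key depends on the order of the incoming data, so the two outputs match only if the rows arrive in the same order in both steps.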

3 REPLIES 3
OS2Rules
Obsidian | Level 7

Hi.

How about you try this:

IF FIRST.STDNT_KEY = 1 and LAST.STDNT_KEY = 1 then delete;

     else output;

This way, records that are unique (FIRST and LAST both = 1) are deleted from the data set and only the duplicate records are kept.

ballardw
Super User

Being a bit pedantic, I would say records where the key is duplicated, not duplicate records.
