SFDonovan
Calcite | Level 5

I've got SAS code that extracts data from a PeopleSoft environment (Oracle). It was written years ago, and I have to decipher it without any documentation and move it into a PeopleCode Application Engine program using SQL steps.

One piece of code clearly shows the developer finding duplicates using the following key.

STDNT_KEY = EMPLID || STRM || CLASS_NBR;

DATA DUPS_CDUP_ENRL;

SET CLS_CDUP_ENRL;

BY STDNT_KEY;

IF (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1);

   KEEP OWNER EMPLID STRM SESSION_CODE SUBJECT CATALOG_NBR CLASS_SECTION DUP_CODE;

RUN;

PROC EXPORT DATA=DUPS_CDUP_ENRL

            OUTFILE= "C:\Processes\C_DUP\for_1058\DUPLICATE_CDUP_ENTRIES.xls"

            DBMS=EXCEL2000 REPLACE;

RUN;

In my understanding if FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 1  then he would not have any duplicates, so he appears to only look for any row that is a duplicate.  True?

Again...I have no business rules or documentation so I have to try and understand the intent of the developer via the code.

Later I see the following code...

IF (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) THEN DELETE;

OUTPUT CDUP_TRANS;

RUN;

PROC EXPORT DATA=CDUP_TRANS

            OUTFILE= "C:\Processes\C_DUP\for_1058\CDUP_TRANS.xls"

            DBMS=EXCEL2000 REPLACE;

RUN;

I don't know what the CDUP_TRANS output is supposed to represent, and I don't know whether the developer was only looking for rows where FIRST.STDNT_KEY and LAST.STDNT_KEY are both 1. If so, why did they leave (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) out of the CDUP_TRANS code?

Is

(FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR

(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1)

redundant?

1 ACCEPTED SOLUTION
mohamed_zaki
Barite | Level 11

In my understanding if FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 1  then he would not have any duplicates, so he appears to only look for any row that is a duplicate.  True?

That is true: he is keeping the rows for every STDNT_KEY that has more than one entry.


I don't know what CDUP_TRANS output is supposed to represent and don't know  if the developer was only looking for dups STDNT_KEY = 1 for both first and last.  If so why did they exclude (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) from the CDUP_TRANS code?

In this code he summarizes the students who have more than one entry, representing each of them by the first entry only, for summary or reporting purposes.

However, the first part of that step is not shown in your post. It could be that he is reporting every student in the table without duplicates: students with more than one entry are represented by their first record, and students with a single entry are also included.

Is

(FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR

(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1)

redundant?

No,

(FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... means the first entry in a group with more than one entry

(FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... means the last entry in a group with more than one entry

*****

For example:

STDNT_KEY

1

2

3

1

1

3

After the data sorted by STDNT_KEY:

STDNT_KEY

1

1

1

2

3

3

Using the temporary FIRST. and LAST. variables that SAS creates on the sorted data, and treating each unique student key as a group in itself:

STDNT_KEY

1       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 1's group

1       (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) ... not the first or the last

1       (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... the last entry in the 1's group

2       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 1) ... the first entry in the 2's group and the last

3       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 3's group

3       (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... the last entry in the 3's group
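
The flags in the table above can be reproduced with a small self-contained step. This is only a sketch: the data set names SAMPLE and FLAGS and the variables F and L are made up for illustration, and the six keys are the ones from the example. FIRST. and LAST. variables are not written to the output data set, so the step copies them into ordinary variables.

```sas
/* Hypothetical sample data: the six keys from the example above */
DATA SAMPLE;
   INPUT STDNT_KEY $;
   DATALINES;
1
2
3
1
1
3
;
RUN;

/* BY-group processing needs the input sorted (or indexed) by the key */
PROC SORT DATA=SAMPLE;
   BY STDNT_KEY;
RUN;

DATA FLAGS;
   SET SAMPLE;
   BY STDNT_KEY;
   /* FIRST. and LAST. are temporary, so copy them to keep them */
   F = FIRST.STDNT_KEY;
   L = LAST.STDNT_KEY;
RUN;

PROC PRINT DATA=FLAGS;
RUN;
```

Printing FLAGS should show one row per input record with its F and L values, which makes it easy to check any of the conditions discussed here against real data.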

*********

So

IF (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) THEN OUTPUT;

gives you every row except the one for key 2, which is both the first and the last entry in its group and so matches none of the three conditions:

STDNT_KEY

1       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 1's group

1       (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0)

1       (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... the last entry in the 1's group

3       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 3's group

3       (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) ... the last entry in the 3's group
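
The three OR'ed conditions cover every combination of the two flags except FIRST = 1 and LAST = 1, so the same filter can be written more compactly. A minimal sketch using the data set names from the question, assuming CLS_CDUP_ENRL is already sorted by STDNT_KEY (the KEEP statement from the original step is left out here for brevity):

```sas
DATA DUPS_CDUP_ENRL;
   SET CLS_CDUP_ENRL;
   BY STDNT_KEY;
   /* Subsetting IF: keep a row unless it is the only row for its key */
   IF NOT (FIRST.STDNT_KEY AND LAST.STDNT_KEY);
RUN;
```

Reading it this way may make the intent easier to carry over to SQL: keep all rows whose key occurs more than once.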


*******

And

IF (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 0) OR

   (FIRST.STDNT_KEY = 0 & LAST.STDNT_KEY = 1) THEN DELETE;

gives you one row per key: every row with FIRST.STDNT_KEY = 0 is deleted, so only the first entry of each group remains, including the single entry for key 2:

STDNT_KEY

1       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 1's group

2       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 1) ... the first entry in the 2's group and the last

3       (FIRST.STDNT_KEY = 1 & LAST.STDNT_KEY = 0) ... the first entry in the 3's group
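
The DELETE condition above is true exactly when FIRST.STDNT_KEY = 0, whatever the value of LAST.STDNT_KEY, so the step amounts to keeping the first row of each key. A hedged sketch of that equivalent form follows; the question does not show the DATA and SET statements of this step, so the input data set name here is an assumption.

```sas
DATA CDUP_TRANS;
   /* Assumption: the input data set is not shown in the thread */
   SET CLS_CDUP_ENRL;
   BY STDNT_KEY;
   /* Subsetting IF: keep only the first row of each key */
   IF FIRST.STDNT_KEY;
RUN;
```

Which row counts as "first" depends on the order of the rows within each key, so any secondary sort applied before this step matters when translating it.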


3 REPLIES 3
OS2Rules
Obsidian | Level 7

Hi.

How about you try this:

IF FIRST.STDNT_KEY = 1 and LAST.STDNT_KEY = 1 then delete;

     else output;

This way, records whose key is unique (FIRST and LAST both = 1) are deleted from the data table, and only the records with a duplicated key are kept.

ballardw
Super User

Being a bit pedantic I would say records where the key is duplicated, not duplicate records.
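
Since the goal is to move the logic into SQL steps, the "key is duplicated" test can also be written with GROUP BY and HAVING. A rough sketch in PROC SQL, assuming CLS_CDUP_ENRL still carries the three columns that make up STDNT_KEY in the question (EMPLID, STRM, CLASS_NBR); the output table name DUP_KEYS and the column N_ROWS are hypothetical.

```sas
PROC SQL;
   /* One row per key combination that occurs more than once */
   CREATE TABLE DUP_KEYS AS
   SELECT EMPLID, STRM, CLASS_NBR, COUNT(*) AS N_ROWS
   FROM CLS_CDUP_ENRL
   GROUP BY EMPLID, STRM, CLASS_NBR
   HAVING COUNT(*) > 1;
QUIT;
```

This lists the duplicated keys rather than the rows themselves; joining DUP_KEYS back to the source table on the three columns would return the rows that the first DATA step keeps.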

