Re: Duplicate field in EG

sasuser_sk · Posted 07-27-2021 05:03 PM

Hi Everyone. I have this data below. How can I create Duplicate field in EG based on ID. This is how I did in program code. Thank you.

data DUPS;
set WORK.ACTIVE_RECORDS ;
BY ID;
format DUPLICATE $CHAR30.;
IF (FIRST.ID) THEN DUPLICATE='';
ELSE DUPLICATE = 'DUPS';
RUN;

ID CATEGORY SUB_ID DUPLICATE
5692 C1 7000001
5986 C2 7000002
6050 C1 7000003
6080 C2 7000004
7789 C4 7000005
7789 C4 7000006 DUPS
7790 C1 7000007
7791 C1 7000008
8695 C1 7000009
8695 C2 7000010 DUPS
9526 C1 7000011
9528 C1 7000012
9598 C1 7000013
9598 C2 7000014 DUPS

Reeza · Posted 07-27-2021 05:08 PM

Are you asking how to create it using the GUI then? Or are you asking how to make your code more efficient or alternative code solutions?

@sasuser_sk wrote:

Hi Everyone. I have this data below. How can I create Duplicate field in EG based on ID. This is how I did in program code. Thank you.

data DUPS;
set WORK.ACTIVE_RECORDS ;
BY ID;
format DUPLICATE $CHAR30.;
IF (FIRST.ID) THEN DUPLICATE='';
ELSE DUPLICATE = 'DUPS';
RUN;

ID CATEGORY SUB_ID DUPLICATE
5692 C1 7000001
5986 C2 7000002
6050 C1 7000003
6080 C2 7000004
7789 C4 7000005
7789 C4 7000006 DUPS
7790 C1 7000007
7791 C1 7000008
8695 C1 7000009
8695 C2 7000010 DUPS
9526 C1 7000011
9528 C1 7000012
9598 C1 7000013
9598 C2 7000014 DUPS

SASKiwi · Posted 07-27-2021 05:08 PM

Your program will work fine in EG so I don't understand your problem.

sasuser_sk · Posted 07-27-2021 05:27 PM

I am looking for a way to do this in query builder instead in EG as I do not want to create a program code for this, which is an extra step always. If there is a way to build same logic in query builder I would love to learn that.

SASKiwi · Posted 07-27-2021 05:31 PM

The Query Builder only uses SQL and it isn't as easy that way. You can add your program as a Code Node in your EG project though.

sasuser_sk · Posted 07-27-2021 05:37 PM

If by "adding program as a Code Node" you mean something like in pic below then yes that is what I always do but due to the way my projects are I prefer to learn build similar concept in query builder. I make duplicate field very often when I make joins. I am willing to learn a harder way if it helps me keep my project clearer by building messy formulas in query builder.

Reeza · Posted 07-27-2021 06:08 PM

What are you ultimately doing with the duplicates? If you're excluding them the SORT task may be an option. Or you can exclude/identify them via SORT and then merge that result with Query Builder to identify duplicates.

Do you have other fields that help define the order of your rows, such as a date time or order variable? If so, you can often build something in query building by identifying everything that is the same as a maximum per group via a CASE statement.

You could use the RANK task first to rank within each ID, it will enumerate the variable so that any thing with a value over 1 will be a duplicate that you could then identify via Query Builder.

sasuser_sk · Posted 07-27-2021 06:23 PM

Thanks Reeza. In the last I would be removing the duplicates. But after glancing over the data with dups I choose whether to keep the dup

value or exclude the other dup value that is blank.

I like both options :

maximum per group via a CASE statement, and

ranking

I have tried one or the other depending on the needs of the project and find them useful but just to avoid a few extra steps I was looking to create something quicker for DUPS in query builder.

But if no such concept (program Code) could be done in query builder then I'll make my peace with program code.

Thank you for suggestions.

Reeza · Posted 07-27-2021 06:29 PM

I think you should relook at the features in the SORT task/proc. If you want to retain values based on another column that's doable within dual proc sorts or a single depending on the logic. Code is below, I'll leave it as an exercise for you to figure out how to do this in the SORT task.

IE sort variable you want to not be blank

*puts the missing values for the otherVariable to the bottom of the list;
proc sort data=have;
by ID descending otherVariable;
run;

proc sort data=have nodupkey;
by ID;
run;

NataljaK · Posted 08-03-2021 03:10 AM

Hi,

Firstly rank by ID ( MENU TASKS/DATA / RANK) . Secondly create file DUPS with new field Duplicate and put ’DUPS’ where rank_sub_id=2.
PROC RANK DATA = active_records
TIES=MEAN
OUT=WORK.RANKRanked(LABEL="Rank Analysis for WORK.active_records");
BY ID;
VAR SUB_ID;
RANKS rank_SUB_ID ;

PROC SQL;
CREATE TABLE WORK.DUPS AS
SELECT t1.ID,
t1.Category,
t1.SUB_ID,
/* DUPLICATE */
(CASE when t1.rank_SUB_ID=2 then 'DUPS' end) AS DUPLICATE
FROM WORK.RANKRanked t1;
QUIT;
Kind regards
Natalja

Classroom Training Available!