BookmarkSubscribeRSS Feed
sasuser_sk
Quartz | Level 8

Hi Everyone. I have this data below. How can I create Duplicate field in EG based on ID. This is how I did in program code. Thank you.

data DUPS;
set WORK.ACTIVE_RECORDS ;
BY ID;
format DUPLICATE $CHAR30.;
IF (FIRST.ID) THEN DUPLICATE='';
ELSE DUPLICATE = 'DUPS';
RUN;

 

ID CATEGORY SUB_ID DUPLICATE
5692 C1 7000001
5986 C2 7000002
6050 C1 7000003
6080 C2 7000004
7789 C4 7000005
7789 C4 7000006 DUPS
7790 C1 7000007
7791 C1 7000008
8695 C1 7000009
8695 C2 7000010 DUPS
9526 C1 7000011
9528 C1 7000012
9598 C1 7000013
9598 C2 7000014 DUPS

 

9 REPLIES 9
Reeza
Super User

Are you asking how to create it using the GUI then? Or are you asking how to make your code more efficient or alternative code solutions?

 


@sasuser_sk wrote:

Hi Everyone. I have this data below. How can I create Duplicate field in EG based on ID. This is how I did in program code. Thank you.

data DUPS;
set WORK.ACTIVE_RECORDS ;
BY ID;
format DUPLICATE $CHAR30.;
IF (FIRST.ID) THEN DUPLICATE='';
ELSE DUPLICATE = 'DUPS';
RUN;

 

ID CATEGORY SUB_ID DUPLICATE
5692 C1 7000001
5986 C2 7000002
6050 C1 7000003
6080 C2 7000004
7789 C4 7000005
7789 C4 7000006 DUPS
7790 C1 7000007
7791 C1 7000008
8695 C1 7000009
8695 C2 7000010 DUPS
9526 C1 7000011
9528 C1 7000012
9598 C1 7000013
9598 C2 7000014 DUPS

 


 

SASKiwi
PROC Star

Your program will work fine in EG so I don't understand your problem.

sasuser_sk
Quartz | Level 8

I am looking for a way to do this in query builder instead in EG as I do not want to create a program code for this, which is an extra step always. If there is a way to build same logic in query builder I would love to learn that.

SASKiwi
PROC Star

The Query Builder only uses SQL and it isn't as easy that way. You can add your program as a Code Node in your EG project though.

sasuser_sk
Quartz | Level 8

If by "adding program as a Code Node" you mean something like in pic below then yes that is what I always do but due to the way my projects are I prefer to learn build similar concept in query builder. I make duplicate field very often when I make joins. I am willing to learn a harder way if it helps me keep my project clearer by building messy formulas in query builder.

 

sasuser_sk_0-1627421671217.png

 

Reeza
Super User
What are you ultimately doing with the duplicates? If you're excluding them the SORT task may be an option. Or you can exclude/identify them via SORT and then merge that result with Query Builder to identify duplicates.

Do you have other fields that help define the order of your rows, such as a date time or order variable? If so, you can often build something in query building by identifying everything that is the same as a maximum per group via a CASE statement.

You could use the RANK task first to rank within each ID, it will enumerate the variable so that any thing with a value over 1 will be a duplicate that you could then identify via Query Builder.
sasuser_sk
Quartz | Level 8

Thanks Reeza. In the last I would be removing the duplicates. But after glancing over the data with dups I choose whether to keep the dup

value or exclude the other dup value that is blank.

 

I like both options :

maximum per group via a CASE statement, and

ranking

 

I have tried one or the other depending on the needs of the project and find them useful but just to avoid a few extra steps I was looking to create something quicker for DUPS in query builder.

 

But if no such concept (program Code) could be done in query builder then I'll make my peace with program code.

 

Thank you for suggestions.

Reeza
Super User
I think you should relook at the features in the SORT task/proc. If you want to retain values based on another column that's doable within dual proc sorts or a single depending on the logic. Code is below, I'll leave it as an exercise for you to figure out how to do this in the SORT task.

IE sort variable you want to not be blank

*puts the missing values for the otherVariable to the bottom of the list;
proc sort data=have;
by ID descending otherVariable;
run;

proc sort data=have nodupkey;
by ID;
run;
NataljaK
Obsidian | Level 7
Hi,

Firstly rank by ID ( MENU TASKS/DATA / RANK) . Secondly create file DUPS with new field Duplicate and put ’DUPS’ where rank_sub_id=2.
PROC RANK DATA = active_records
TIES=MEAN
OUT=WORK.RANKRanked(LABEL="Rank Analysis for WORK.active_records");
BY ID;
VAR SUB_ID;
RANKS rank_SUB_ID ;

PROC SQL;
CREATE TABLE WORK.DUPS AS
SELECT t1.ID,
t1.Category,
t1.SUB_ID,
/* DUPLICATE */
(CASE when t1.rank_SUB_ID=2 then 'DUPS' end) AS DUPLICATE
FROM WORK.RANKRanked t1;
QUIT;
Kind regards
Natalja

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 724 views
  • 4 likes
  • 4 in conversation