BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
ralizadeh
Obsidian | Level 7

I want to remove the duplicated records which have similar entries across all variables except 'age'. I don't care which row to keep once a duplicated row found. Also, I don't want to see 'age' column in my report either.

Any help would be appreciated.

 

 

Here is what I have:

ralizadeh_0-1682967084996.png

 

Here is what I want:

 

ralizadeh_1-1682967178287.png

 

Thanks in advance. 

 

1 ACCEPTED SOLUTION

Accepted Solutions
yabwon
Amethyst | Level 16

First of all you need to understand that EG won't execute any thing for you becayse EG is just a "nice looking" interface to SAS computing engine, basically there is always SAS code at the end, it's just the way it works.

So the fastest, the more efficient, and foremost reproducible way would be to add program block to your EG flow with the following code:

proc sort data=have(keep=CIN YYYY MM) out=want nodupkey;
  by _all_;
run;

But as I wrote EG is "nice looking" interface so you can also do it this way:

 

1) I assume that in the WORK library there is your dataset named HAVE

2) Drag and drop the dataset to the process flow:

yabwon_0-1683053938362.png

3) double click dataset icon and open it and select tasks (of course my example has different data)

yabwon_1-1683054119124.png

4) In open task window navigate to data -> sort data:

yabwon_2-1683054204308.png

5) in the Data tab drag all 3 variables to Sort by role, and Age to Drop list:

yabwon_3-1683054357460.png

6) In the Options tab select middle "dot" for duplicates:

yabwon_4-1683054435624.png

7) click Save and in the next window click "Running Man":

yabwon_5-1683054530223.png

😎 enjoy de-duplicated data:

yabwon_6-1683054601682.png

And after 8 steps you are done.

 

 

But now if you decide to add program node to your code and paste code I shared, it will take only 2 steps.

 

Bart

 

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



View solution in original post

9 REPLIES 9
ballardw
Super User
Proc sql;
   select distinct CIN, YYYY, MM
   from yourdatasetname
   ;
quit;

One way. Proc SQL will order the values differently than your source data.

 


@ralizadeh wrote:

I want to remove the duplicated records which have similar entries across all variables except 'age'. I don't care which row to keep once a duplicated row found. Also, I don't want to see 'age' column in my report either.

Any help would be appreciated.

 

 

Here is what I have:

ralizadeh_0-1682967084996.png

 

Here is what I want:

 

ralizadeh_1-1682967178287.png

 

Thanks in advance. 

 


 

ralizadeh
Obsidian | Level 7

Thanks @ballardw

If I decide to keep the age column can I just write?

Proc sql;
   select age distinct CIN, YYYY, MM
   from yourdatasetname
   ;
quit;

 

 

 

 

ralizadeh
Obsidian | Level 7

The code below works with no issue. But, I keep getting an error when I use "distinct" in the SELECT clause. Any idea?

 

PROC SQL;
	CREATE TABLE SASUSER.DE_test AS (
		SELECT
			CLAIMS_HDR.AKA_CIN,
			CLAIMS_HDR.SVC_FROM_DT_YYYY,
			CLAIMS_HDR.SVC_FROM_DT_MM,
			CLAIMS_HDR.Age	
		FROM
             mytable
		WHERE
			ELIGIBILITY.MC_STAT_A NOT IN (' ','0','9') OR ELIGIBILITY.MC_STAT_B NOT IN (' ','0','9') OR ELIGIBILITY.MC_STAT_D NOT IN (' ','0','9')
		GROUP BY
			CLAIMS_HDR.AKA_CIN,
			CLAIMS_HDR.SVC_FROM_DT_YYYY,
			CLAIMS_HDR.SVC_FROM_DT_MM,
			CLAIMS_HDR.Age);
QUIT;

The output contains some rows with the same values for all variables except 'age,' and these are the rows from which I want to keep only one row (like the examples in the pictures).

 

 

ballardw
Super User

You really need to pick a rule for which age you want.

 

I really doubt that code selects anything that makes sense if at all. You are using conditions in a WHERE clause using a data set alias of Eligibility without defining it any where as well as selecting variables from a set alias of Claims_HDR which is also not defined.

 

I might pick either  the minimum or maximum age and then DROP age from the group by as you don't want to use all of the Age values to group by.

 

PROC SQL;
	CREATE TABLE SASUSER.DE_test AS (
		SELECT
			CLAIMS_HDR.AKA_CIN,
			CLAIMS_HDR.SVC_FROM_DT_YYYY,
			CLAIMS_HDR.SVC_FROM_DT_MM,
			max(CLAIMS_HDR.Age) as Age	
		FROM
           <random nonsense deleted>
		GROUP BY
			CLAIMS_HDR.AKA_CIN,
			CLAIMS_HDR.SVC_FROM_DT_YYYY,
			CLAIMS_HDR.SVC_FROM_DT_MM
			);
QUIT;

Note: Include LOGS of code that do not create desired output. Include ALL of the code and all the messages generated from that code. If your data sets and variable names are too sensitive share either create temporary data sets with less sensitive names or quite naming them with sensitive values.  Copy the text from the log and paste all of it into a Text box.

ralizadeh
Obsidian | Level 7

I am not looking for a SAS code. Any SAS EG method that could help with my original question would be appreciated. 

I want to remove the duplicated records which have similar entries across all variables except 'age'. I don't care which row to keep once a duplicated row found. Also, I don't want to see 'age' column in my report either.

 

Here is what I have:

ralizadeh_0-1683052294703.png

 

 

Here is what I want:

 

ralizadeh_1-1683052294773.png

 

 

Thanks in advance. 

yabwon
Amethyst | Level 16

First of all you need to understand that EG won't execute any thing for you becayse EG is just a "nice looking" interface to SAS computing engine, basically there is always SAS code at the end, it's just the way it works.

So the fastest, the more efficient, and foremost reproducible way would be to add program block to your EG flow with the following code:

proc sort data=have(keep=CIN YYYY MM) out=want nodupkey;
  by _all_;
run;

But as I wrote EG is "nice looking" interface so you can also do it this way:

 

1) I assume that in the WORK library there is your dataset named HAVE

2) Drag and drop the dataset to the process flow:

yabwon_0-1683053938362.png

3) double click dataset icon and open it and select tasks (of course my example has different data)

yabwon_1-1683054119124.png

4) In open task window navigate to data -> sort data:

yabwon_2-1683054204308.png

5) in the Data tab drag all 3 variables to Sort by role, and Age to Drop list:

yabwon_3-1683054357460.png

6) In the Options tab select middle "dot" for duplicates:

yabwon_4-1683054435624.png

7) click Save and in the next window click "Running Man":

yabwon_5-1683054530223.png

😎 enjoy de-duplicated data:

yabwon_6-1683054601682.png

And after 8 steps you are done.

 

 

But now if you decide to add program node to your code and paste code I shared, it will take only 2 steps.

 

Bart

 

 

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



ralizadeh
Obsidian | Level 7

@yabwon Does this pick the one (first) row of each distinct combination of CIN, YYYY, MM?

 

Thanks

yabwon
Amethyst | Level 16

I should ask: did you use Maxim 4?

 

But it will be faster this way:

In point 6) mark:

yabwon_0-1683093922076.png

 

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



ralizadeh
Obsidian | Level 7

@yabwonI am not sure what version of SAS you are using. Mine is SAS EG 7.12. I understood your point and it worked for me. Thank you really lot.

BTW, I was able to achieve the same thing using Query Builder, only there was no way to have a distinct combination of CIN, YYYY, MM, and preserve the AGE in the Column Name. I had to get rid of the AGE column. I could, however, eliminate or maintain the AGE column using Sort methods, you suggested.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Creating Custom Steps in SAS Studio

Check out this tutorial series to learn how to build your own steps in SAS Studio.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 19989 views
  • 8 likes
  • 3 in conversation