Solved: Re: How to remove duplicate rows based on some columns in SAS Enterprise Guide

ralizadeh · Posted 05-01-2023 02:56 PM

I want to remove the duplicated records that have identical entries across all variables except 'age'. I don't care which row is kept once a duplicate is found. Also, I don't want to see the 'age' column in my report.

Any help would be appreciated.

Here is what I have:

Here is what I want:

Thanks in advance.

yabwon · Posted 05-02-2023 03:12 PM

First of all, you need to understand that EG won't execute anything on its own, because EG is just a "nice looking" interface to the SAS computing engine. There is always SAS code underneath; that's just the way it works.

So the fastest, most efficient, and above all most reproducible way would be to add a program block to your EG flow with the following code:

proc sort data=have(keep=CIN YYYY MM) out=want nodupkey;
  by _all_;
run;
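For anyone who wants to try this without the original data, here is a minimal, self-contained sketch. The sample rows are invented (the real HAVE data set was only shown as a screenshot), and treating CIN as character and the other columns as numeric is an assumption:

```sas
/* Hypothetical sample data standing in for the poster's HAVE data set. */
data have;
  input CIN $ YYYY MM Age;
  datalines;
A001 2022 1 34
A001 2022 1 35
A001 2022 2 35
B002 2022 1 50
B002 2022 1 50
;
run;

/* KEEP= drops Age on the way in, so BY _ALL_ covers only CIN, YYYY, MM. */
/* NODUPKEY then removes rows that repeat across all of those columns.   */
proc sort data=have(keep=CIN YYYY MM) out=want nodupkey;
  by _all_;
run;

/* Print the de-duplicated result. */
proc print data=want noobs;
run;
```

With these sample rows, WANT should contain three observations, one per distinct combination of CIN, YYYY, and MM.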

But as I wrote, EG is a "nice looking" interface, so you can also do it this way:

1) I assume that your dataset, named HAVE, is in the WORK library

2) Drag and drop the dataset onto the process flow:

3) Double-click the dataset icon to open it, then select Tasks (of course, my example has different data)

4) In the open task window, navigate to Data -> Sort Data:

5) In the Data tab, drag all 3 variables to the Sort by role, and Age to the Drop list:

6) In the Options tab, select the middle "dot" for duplicates:

7) Click Save and, in the next window, click the "Running Man":

8) Enjoy de-duplicated data:

And after 8 steps you are done.

But if you decide to add a program node to your process flow and paste in the code I shared, it will take only 2 steps.

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

ballardw · Posted 05-01-2023 03:22 PM

Proc sql;
   select distinct CIN, YYYY, MM
   from yourdatasetname
   ;
quit;

One way. Proc SQL will order the values differently than your source data.
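As a sketch of how the query above could be stored rather than just printed, the version below wraps it in CREATE TABLE. The data set names have and want and the sample rows are assumptions made up for illustration, not the poster's data:

```sas
/* Hypothetical sample data; replace with your own data set. */
data have;
  input CIN $ YYYY MM Age;
  datalines;
A001 2022 1 34
A001 2022 1 35
A001 2022 2 35
B002 2022 1 50
B002 2022 1 50
;
run;

proc sql;
  /* DISTINCT applies to the whole select list, so leaving Age out */
  /* collapses rows that differ only in Age.                       */
  create table want as
  select distinct CIN, YYYY, MM
  from have;
quit;
```

Note that DISTINCT goes directly after SELECT and applies to every selected column, so adding Age to the select list would bring back the rows that differ only in Age.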

ralizadeh · Posted 05-01-2023 07:38 PM

Thanks @ballardw

If I decide to keep the age column, can I just write this?

Proc sql;
   select age distinct CIN, YYYY, MM
   from yourdatasetname
   ;
quit;

ralizadeh · Posted 05-01-2023 10:07 PM

The code below works with no issue. But, I keep getting an error when I use "distinct" in the SELECT clause. Any idea?

PROC SQL;
	CREATE TABLE SASUSER.DE_test AS (
		SELECT
			CLAIMS_HDR.AKA_CIN,
			CLAIMS_HDR.SVC_FROM_DT_YYYY,
			CLAIMS_HDR.SVC_FROM_DT_MM,
			CLAIMS_HDR.Age	
		FROM
             mytable
		WHERE
			ELIGIBILITY.MC_STAT_A NOT IN (' ','0','9') OR ELIGIBILITY.MC_STAT_B NOT IN (' ','0','9') OR ELIGIBILITY.MC_STAT_D NOT IN (' ','0','9')
		GROUP BY
			CLAIMS_HDR.AKA_CIN,
			CLAIMS_HDR.SVC_FROM_DT_YYYY,
			CLAIMS_HDR.SVC_FROM_DT_MM,
			CLAIMS_HDR.Age);
QUIT;

The output contains some rows with the same values for all variables except 'age,' and these are the rows from which I want to keep only one row (like the examples in the pictures).

ballardw · Posted 05-02-2023 01:43 AM

You really need to pick a rule for which age you want.

I really doubt that code selects anything that makes sense, if it runs at all. The WHERE clause uses conditions on a data set alias, Eligibility, that is not defined anywhere, and the SELECT pulls variables from another alias, Claims_HDR, that is also not defined.

I might pick either the minimum or maximum age and then drop Age from the GROUP BY, since you don't want every distinct Age value to form its own group.

PROC SQL;
	CREATE TABLE SASUSER.DE_test AS (
		SELECT
			CLAIMS_HDR.AKA_CIN,
			CLAIMS_HDR.SVC_FROM_DT_YYYY,
			CLAIMS_HDR.SVC_FROM_DT_MM,
			max(CLAIMS_HDR.Age) as Age	
		FROM
           <random nonsense deleted>
		GROUP BY
			CLAIMS_HDR.AKA_CIN,
			CLAIMS_HDR.SVC_FROM_DT_YYYY,
			CLAIMS_HDR.SVC_FROM_DT_MM
			);
QUIT;
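Since the FROM clause above is elided, here is a runnable sketch of the same idea on a single made-up table. The data set name have, the shortened column names, and the sample rows are all assumptions for illustration:

```sas
/* Hypothetical sample data with two ages for one CIN/YYYY/MM combination. */
data have;
  input CIN $ YYYY MM Age;
  datalines;
A001 2022 1 34
A001 2022 1 35
A001 2022 2 35
B002 2022 1 50
B002 2022 1 50
;
run;

proc sql;
  /* Group by the three key columns only; MAX(Age) is the rule that */
  /* decides which age is reported for each group.                  */
  create table want as
  select CIN, YYYY, MM, max(Age) as Age
  from have
  group by CIN, YYYY, MM;
quit;
```

With these rows, the A001 / 2022 / 1 group should come out as a single row with Age 35. Swapping max for min would apply the other rule mentioned above.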

Note: Include the LOGS for code that does not create the desired output. Include ALL of the code and all the messages generated from that code. If your data set and variable names are too sensitive to share, either create temporary data sets with less sensitive names or quit naming them with sensitive values. Copy the text from the log and paste all of it into a text box.

ralizadeh · Posted 05-02-2023 02:32 PM

I am not looking for SAS code. Any SAS EG method that could help with my original question would be appreciated.

I want to remove the duplicated records that have identical entries across all variables except 'age'. I don't care which row is kept once a duplicate is found. Also, I don't want to see the 'age' column in my report.

Here is what I have:

Here is what I want:

Thanks in advance.

ralizadeh · Posted 05-02-2023 05:36 PM

@yabwon Does this pick the one (first) row of each distinct combination of CIN, YYYY, MM?

Thanks

yabwon · Posted 05-03-2023 02:06 AM

I should ask: did you use Maxim 4?

But it will be faster this way:

In point 6) mark:

Bart

ralizadeh · Posted 05-03-2023 11:00 AM

@yabwon I am not sure what version of SAS you are using. Mine is SAS EG 7.12. I understood your point and it worked for me. Thank you very much.

BTW, I was able to achieve the same thing using Query Builder, except that there was no way to get a distinct combination of CIN, YYYY, and MM while preserving the AGE column. I had to get rid of the AGE column. With the Sort method you suggested, however, I could either drop or keep the AGE column.
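For readers who prefer code, the "keep Age but de-duplicate on the three keys" variant of the sort could look roughly like the sketch below. The sample data is invented, and the explicit EQUALS option is added here as an assumption about how to make the retained row predictable; it is not something shown in the thread:

```sas
/* Hypothetical sample data; Age differs within one key combination. */
data have;
  input CIN $ YYYY MM Age;
  datalines;
A001 2022 1 34
A001 2022 1 35
A001 2022 2 35
B002 2022 1 50
B002 2022 1 50
;
run;

/* Sort by the key columns only and keep Age in the output.            */
/* NODUPKEY drops rows whose BY values match a row already written;    */
/* EQUALS asks PROC SORT to preserve input order within each BY group. */
proc sort data=have out=want nodupkey equals;
  by CIN YYYY MM;
run;

proc print data=want noobs;
run;
```

With these rows, the A001 / 2022 / 1 combination should be represented by the first input row (Age 34). A quick test run on your own data is the safest way to confirm which row is kept.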
