Solved: Re: Can't remove duplicate rows after merging two datasets

uncharted9 · Posted 01-18-2023 01:31 PM

I've two datasets with following column properties with identical column names:

Data1 is imported from an excel sheet using import wizard in SAS Enterprise Guide, while Data2 is generated using sas code.

Data1:

Data2:

I want to append Data2 after Data1 and remove the duplicate entries from the resultant dataset.

I have tried 3 methods and none of them worked out for me. I should be able to eliminate 58 duplicate rows but only 4 are eliminated.

Method 1:

data data3;
set data1
set data2
run;

proc sort data=data3
nodupkey
out=data4;
by=_all_;
run;

Method 2:

proc sql;
create table data4 as
select distinct *
from data3;
quit;

Method 3:

proc sql;
create table data4 as
select * from data1
union all
(select * from data2
except all
select * from data1)
;
quit;

Any solution will be appreciated. Thanks.

Quentin · Posted 01-18-2023 02:16 PM

Using the 20.2 format will round the values when they are displayed (and perhaps when exported to CSV), but it does not changed the values stored in the SAS dataset. PROC SORT is sorting by the numeric values in the dataset, it doesn't know about formats.

You can round the values in the DATA step, e.g.:

data data3rounded;
  set data3;
  MyNum=round(MyNum, .01) ;
run;

After rounding the values, try PROC SORT NODUPKEY again. It should work.

View solution in original post

Tom · Posted 01-18-2023 01:52 PM

When you combine the data make sure to fix the formats attached to variables to make sure you are not missing the details that are causing the observations that look to you to be the same but to SAS are not the same.

FORMATS can have an impact on how the values are printed and display in browser tools.

For example in one of the datasets the character variables has $CHAR format attached to them and on the other they have the normal $ format attached. The difference is that the $CHAR format will display leading spaces and the $ format will not.

Do this by adding a FORMAT statement to the data step.

Since none of your variables look like they really need formats attached (none appear to be date, time or datetime values) why not just remove them all?

data both;
  set data1 data2 ;
  format _all_ ;
run;

Also in one of the datasets you appear to have mistakenly attached the 20.2 INFORMAT to some of the variables. It probably will not impact you since informats are only used when converting text into values. It is not normal to specify decimal places on an INFORMAT. If you do then that means you want to force a decimal place into string that do not have one. So a string like '12345' will be come 123.45 if read using 20.2 informat since it will imply two decimal places.

uncharted9 · Posted 01-18-2023 02:31 PM

I've changed few things in the data step of data1 to make all the formats align exactly like data2. But this is not able dedupe the records. Also, added the format=_all_ line from your response.

For more context. In data1, there is a numeric column, say col1, that has enteries like 56.530000008.

The same numeric column was generated in data2 using proc sql statement.

```
create table data2 as
select sum(col1_a) FORMAT=20.2 as col1
from some_data
```
I hope this gives more clue.

Quentin · Posted 01-18-2023 02:35 PM

Please see my answer above about rounding the values . The formats do not round the value that is stored by SAS in the data.

Quentin · Posted 01-18-2023 01:56 PM

The problem is probably your data.

Particularly with numeric data, even if tow records look like they have the same value, there can be numeric precision issues where they differ by .000000000001 or some such.

To start, would remove all the formats, and look at the data. Just to make sure they look like duplicates.

If they look like duplicates, I would then round all the values to .01, or whatever, and then try removing duplicates. If duplicates are removed after rounding, then you know you have a numeric precision issue.

But generally, you shouldn't need to resort to deduplicating by _all_ variables. Life works much better when your data has a unique key / ID.

uncharted9 · Posted 01-18-2023 02:08 PM

I don't have the liberty to have unique key for either of the datasets and hence I've to make comparisons based on all the columns.

Out of all the numeric columns I only have 3 of them containing decimal values and hence the reason I'm rounding both of them to 20.2 format. When I exported data1 and data2 as csv. Both of them had records with exact same values to 2 decimal places. Numerically they have identical records, but on merging and deduplication using the methods I showed, they aren't removed.

Quentin · Posted 01-18-2023 02:16 PM

Using the 20.2 format will round the values when they are displayed (and perhaps when exported to CSV), but it does not changed the values stored in the SAS dataset. PROC SORT is sorting by the numeric values in the dataset, it doesn't know about formats.

You can round the values in the DATA step, e.g.:

data data3rounded;
  set data3;
  MyNum=round(MyNum, .01) ;
run;

After rounding the values, try PROC SORT NODUPKEY again. It should work.

uncharted9 · Posted 01-18-2023 02:56 PM

Hi Quentin,

I first have made the data1 in the same format as the data2. I don't want any new surprises.

As you can see in the question's screenshot, there are 3 columns with 20.2 format. Those are the columns that have decimal values. I rounded off those 3 columns and tried proc sort, but still the same problem of presence of duplicate entries.

I am really confused by this issue now. On visual comparison the duplicate values look exactly the same in both the tables and now I have even rounded off, I wasn't ready for the gotchas

Reeza · Posted 01-18-2023 03:00 PM

Compare the hex values of the numbers not the raw data numbers. They're likely different due to numerical precision issues.

Otherwise, you need to show an example of data that appears to not follow the de-duplication process.

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lepg/p0dv87zb3bnse6n1mqo360be70qr.htm

uncharted9 · Posted 01-18-2023 03:29 PM

I think the issue was because of confusion between how SAS internally stores the data and how it shows it to user (both as Output Data and Exported csv/xlsx data). I removed all the Formats and Informats from the data1 import process after editing the copy of the import task source code. Removed all the `FORMAT 20.2` references from data2 proc sql code and combined both of the datasets and then rounded off all the columns having decimal values. After that proc sort nodupkeys by = _all_ method worked flawlessly and removed all the duplicates.

Quentin · Posted 01-18-2023 01:59 PM

This syntax is wrong:

proc sort data=data3
nodupkey
out=data4
by=_all_;
run;

BY is a statement, not an option. You want:

proc sort data=data3 nodupkey out=data4 ;
  by _all_ ;
run ;

uncharted9 · Posted 01-18-2023 02:20 PM

My mistake in writing code in the community here. I had written it correctly in the IDE, but the problem still persists.

SAS Training: Just a Click Away