I've two datasets with following column properties with identical column names:
Data1 is imported from an excel sheet using import wizard in SAS Enterprise Guide, while Data2 is generated using sas code.
Data1:
Data2:
I want to append Data2 after Data1 and remove the duplicate entries from the resultant dataset.
I have tried 3 methods and none of them worked out for me. I should be able to eliminate 58 duplicate rows but only 4 are eliminated.
Method 1:
data data3;
set data1
set data2
run;
proc sort data=data3
nodupkey
out=data4;
by=_all_;
run;
Method 2:
proc sql;
create table data4 as
select distinct *
from data3;
quit;
Method 3:
proc sql;
create table data4 as
select * from data1
union all
(select * from data2
except all
select * from data1)
;
quit;
Any solution will be appreciated. Thanks.
Using the 20.2 format will round the values when they are displayed (and perhaps when exported to CSV), but it does not changed the values stored in the SAS dataset. PROC SORT is sorting by the numeric values in the dataset, it doesn't know about formats.
You can round the values in the DATA step, e.g.:
data data3rounded;
set data3;
MyNum=round(MyNum, .01) ;
run;
After rounding the values, try PROC SORT NODUPKEY again. It should work.
When you combine the data make sure to fix the formats attached to variables to make sure you are not missing the details that are causing the observations that look to you to be the same but to SAS are not the same.
FORMATS can have an impact on how the values are printed and display in browser tools.
For example in one of the datasets the character variables has $CHAR format attached to them and on the other they have the normal $ format attached. The difference is that the $CHAR format will display leading spaces and the $ format will not.
Do this by adding a FORMAT statement to the data step.
Since none of your variables look like they really need formats attached (none appear to be date, time or datetime values) why not just remove them all?
data both;
set data1 data2 ;
format _all_ ;
run;
Also in one of the datasets you appear to have mistakenly attached the 20.2 INFORMAT to some of the variables. It probably will not impact you since informats are only used when converting text into values. It is not normal to specify decimal places on an INFORMAT. If you do then that means you want to force a decimal place into string that do not have one. So a string like '12345' will be come 123.45 if read using 20.2 informat since it will imply two decimal places.
Please see my answer above about rounding the values . The formats do not round the value that is stored by SAS in the data.
The problem is probably your data.
Particularly with numeric data, even if tow records look like they have the same value, there can be numeric precision issues where they differ by .000000000001 or some such.
To start, would remove all the formats, and look at the data. Just to make sure they look like duplicates.
If they look like duplicates, I would then round all the values to .01, or whatever, and then try removing duplicates. If duplicates are removed after rounding, then you know you have a numeric precision issue.
But generally, you shouldn't need to resort to deduplicating by _all_ variables. Life works much better when your data has a unique key / ID.
Using the 20.2 format will round the values when they are displayed (and perhaps when exported to CSV), but it does not changed the values stored in the SAS dataset. PROC SORT is sorting by the numeric values in the dataset, it doesn't know about formats.
You can round the values in the DATA step, e.g.:
data data3rounded;
set data3;
MyNum=round(MyNum, .01) ;
run;
After rounding the values, try PROC SORT NODUPKEY again. It should work.
Hi Quentin,
I first have made the data1 in the same format as the data2. I don't want any new surprises.
As you can see in the question's screenshot, there are 3 columns with 20.2 format. Those are the columns that have decimal values. I rounded off those 3 columns and tried proc sort, but still the same problem of presence of duplicate entries.
I am really confused by this issue now. On visual comparison the duplicate values look exactly the same in both the tables and now I have even rounded off, I wasn't ready for the gotchas
This syntax is wrong:
proc sort data=data3
nodupkey
out=data4
by=_all_;
run;
BY is a statement, not an option. You want:
proc sort data=data3 nodupkey out=data4 ;
by _all_ ;
run ;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.