- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I've two datasets with following column properties with identical column names:
Data1 is imported from an excel sheet using import wizard in SAS Enterprise Guide, while Data2 is generated using sas code.
Data1:
Data2:
I want to append Data2 after Data1 and remove the duplicate entries from the resultant dataset.
I have tried 3 methods and none of them worked out for me. I should be able to eliminate 58 duplicate rows but only 4 are eliminated.
Method 1:
data data3;
set data1
set data2
run;
proc sort data=data3
nodupkey
out=data4;
by=_all_;
run;
Method 2:
proc sql;
create table data4 as
select distinct *
from data3;
quit;
Method 3:
proc sql;
create table data4 as
select * from data1
union all
(select * from data2
except all
select * from data1)
;
quit;
Any solution will be appreciated. Thanks.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Using the 20.2 format will round the values when they are displayed (and perhaps when exported to CSV), but it does not changed the values stored in the SAS dataset. PROC SORT is sorting by the numeric values in the dataset, it doesn't know about formats.
You can round the values in the DATA step, e.g.:
data data3rounded;
set data3;
MyNum=round(MyNum, .01) ;
run;
After rounding the values, try PROC SORT NODUPKEY again. It should work.
Next up: Troy Martin Hughes presents Calling Open-Source Python Functions within SAS PROC FCMP: A Google Maps API Geocoding Adventure on Wednesday April 23.
Register now at https://www.basug.org/events.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
When you combine the data make sure to fix the formats attached to variables to make sure you are not missing the details that are causing the observations that look to you to be the same but to SAS are not the same.
FORMATS can have an impact on how the values are printed and display in browser tools.
For example in one of the datasets the character variables has $CHAR format attached to them and on the other they have the normal $ format attached. The difference is that the $CHAR format will display leading spaces and the $ format will not.
Do this by adding a FORMAT statement to the data step.
Since none of your variables look like they really need formats attached (none appear to be date, time or datetime values) why not just remove them all?
data both;
set data1 data2 ;
format _all_ ;
run;
Also in one of the datasets you appear to have mistakenly attached the 20.2 INFORMAT to some of the variables. It probably will not impact you since informats are only used when converting text into values. It is not normal to specify decimal places on an INFORMAT. If you do then that means you want to force a decimal place into string that do not have one. So a string like '12345' will be come 123.45 if read using 20.2 informat since it will imply two decimal places.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
For more context. In data1, there is a numeric column, say col1, that has enteries like 56.530000008.
The same numeric column was generated in data2 using proc sql statement.
```
create table data2 as
select sum(col1_a) FORMAT=20.2 as col1
from some_data
```
I hope this gives more clue.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Please see my answer above about rounding the values . The formats do not round the value that is stored by SAS in the data.
Next up: Troy Martin Hughes presents Calling Open-Source Python Functions within SAS PROC FCMP: A Google Maps API Geocoding Adventure on Wednesday April 23.
Register now at https://www.basug.org/events.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The problem is probably your data.
Particularly with numeric data, even if tow records look like they have the same value, there can be numeric precision issues where they differ by .000000000001 or some such.
To start, would remove all the formats, and look at the data. Just to make sure they look like duplicates.
If they look like duplicates, I would then round all the values to .01, or whatever, and then try removing duplicates. If duplicates are removed after rounding, then you know you have a numeric precision issue.
But generally, you shouldn't need to resort to deduplicating by _all_ variables. Life works much better when your data has a unique key / ID.
Next up: Troy Martin Hughes presents Calling Open-Source Python Functions within SAS PROC FCMP: A Google Maps API Geocoding Adventure on Wednesday April 23.
Register now at https://www.basug.org/events.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Out of all the numeric columns I only have 3 of them containing decimal values and hence the reason I'm rounding both of them to 20.2 format. When I exported data1 and data2 as csv. Both of them had records with exact same values to 2 decimal places. Numerically they have identical records, but on merging and deduplication using the methods I showed, they aren't removed.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Using the 20.2 format will round the values when they are displayed (and perhaps when exported to CSV), but it does not changed the values stored in the SAS dataset. PROC SORT is sorting by the numeric values in the dataset, it doesn't know about formats.
You can round the values in the DATA step, e.g.:
data data3rounded;
set data3;
MyNum=round(MyNum, .01) ;
run;
After rounding the values, try PROC SORT NODUPKEY again. It should work.
Next up: Troy Martin Hughes presents Calling Open-Source Python Functions within SAS PROC FCMP: A Google Maps API Geocoding Adventure on Wednesday April 23.
Register now at https://www.basug.org/events.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Quentin,
I first have made the data1 in the same format as the data2. I don't want any new surprises.
As you can see in the question's screenshot, there are 3 columns with 20.2 format. Those are the columns that have decimal values. I rounded off those 3 columns and tried proc sort, but still the same problem of presence of duplicate entries.
I am really confused by this issue now. On visual comparison the duplicate values look exactly the same in both the tables and now I have even rounded off, I wasn't ready for the gotchas
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Otherwise, you need to show an example of data that appears to not follow the de-duplication process.
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lepg/p0dv87zb3bnse6n1mqo360be70qr.htm
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This syntax is wrong:
proc sort data=data3
nodupkey
out=data4
by=_all_;
run;
BY is a statement, not an option. You want:
proc sort data=data3 nodupkey out=data4 ;
by _all_ ;
run ;
Next up: Troy Martin Hughes presents Calling Open-Source Python Functions within SAS PROC FCMP: A Google Maps API Geocoding Adventure on Wednesday April 23.
Register now at https://www.basug.org/events.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content