I have combined a few data sets successfully and now I am trying to removing duplicated rows. I have tried accomplishing this the following two ways:
proc sql;
create table sales.NODuplicates as
select distinct *
from sales.combined;
quit;
proc sort data = sales.Combined
NODUPRECS;
By _all_;
run;
However, neither appear to be working since my data set is still appearing with the duplicated rows. Any ideas as to why this may be happening?
data bmt.Approval;
length country $15;
set BMT.Approval;
Don't code like that, it makes a massive pain in the ass to determine where the issue and fix is. Makes sure to give each data set a unique name to start with and go from there.
Try that and post back. It should work so if it's not something else is the issue.
You can try the following as well, but I suspect the above fix will work. If this doesn't work, please post the log from the following and the proc freq output.
data bmt.Approval2;
length country $15;
set BMT.Approval;
if trim(lowcase(Country)) = "united kingdo"
then Country = "United Kingdom";
format Country $15.;
run;
*check results;
proc freq data=bmt.approval2;
table country;
run;
This means that although you view them as duplicates SAS doesn't for some reason.
Some of the reasons:
EDIT: a better fix is to determine why you have duplicates in the first place and remove them in the preceding queries.
@lindseyn wrote:
I have combined a few data sets successfully and now I am trying to removing duplicated rows. I have tried accomplishing this the following two ways:
proc sql; create table sales.NODuplicates as select distinct * from sales.combined; quit; proc sort data = sales.Combined NODUPRECS; By _all_; run;
However, neither appear to be working since my data set is still appearing with the duplicated rows. Any ideas as to why this may be happening?
Thank you! This is why. It looks like it has to do with spacing and some spelling errors. How would I account for something like this for a data set with approx 1500 rows?
One common misspelling that occurs is below. I tried this code and it did not remove it, There are no leading spaces yet I am not sure why this is not working.
data bmt.want;
set BMT.old;
if Country = "United Kingdo"
then Country = "United Kingdom";
run;
Right now it is set to 13. However, I tried changing the length and still saw no changes for that value.
This is what I tried:
data bmt.Approval;
length country $15;
set BMT.Approval;
if Country = "United Kingdo"
then Country = "United Kingdom";
run;
Yet the column still appeared as this and when checking proc contents length is 15.
United States
Ireland
France
France
Japan
United Kingdo
United Kingdo
... and so on
data bmt.Approval;
length country $15;
set BMT.Approval;
Don't code like that, it makes a massive pain in the ass to determine where the issue and fix is. Makes sure to give each data set a unique name to start with and go from there.
Try that and post back. It should work so if it's not something else is the issue.
You can try the following as well, but I suspect the above fix will work. If this doesn't work, please post the log from the following and the proc freq output.
data bmt.Approval2;
length country $15;
set BMT.Approval;
if trim(lowcase(Country)) = "united kingdo"
then Country = "United Kingdom";
format Country $15.;
run;
*check results;
proc freq data=bmt.approval2;
table country;
run;
@Reeza Brilliant reasoning!!!
@lindseyn is the combined one 1500 records? Or each one is approx 1500 records?
If that is so small, you could trick with HASH. You need two temp arrays though. 1 for all chars and 2 for all nums
1. first copy the incoming var values to the 2 arrays
2. While copying i.e assigning compress and upcase the chars
3. Num values should be okay unless you have rounding discrepancies with floats.
4. Load all the temp array values as keys in definekey, the data portion would be your corresponding original variables/its values
5. Use default Hash options, meaning DO NOT ALLOW DUPS with MULTIDATA:"Y". This will make sure you have
6. Load the contents into the Hash memory resident table
7. When END=Last records indicator =1 , write the contents of HASH to a SAS dataset. Done deal!!!!!!!
Hi Again @lindseyn The following is the demo of the idea I suggested.
/*creating duplicate sample*/
data have;
set sashelp.class;
if mod(_n_,2)=0 then do _n_=1 to 5;
output;
end;
else output;
run;
data _null_;
if _n_=1 then do;
dcl hash H () ;
if 0 then set have;
array c _char_;
array num _numeric_;
array ct(2)$32;
array nt(3);
do _iorc_=1 to dim(ct);
h.definekey(vname(ct(_iorc_)));
h.definedata(vname(c(_iorc_)));
end;
do _iorc_=1 to dim(nt);
h.definekey(vname(nt(_iorc_)));
h.definedata(vname(num(_iorc_)));
end;
h.definedone();
end;
set have end=z;
do _iorc_=1 to dim(ct);
ct(_iorc_)=upcase(compress(c(_iorc_)));
end;
do _iorc_=1 to dim(nt);
nt(_iorc_)=num(_iorc_);
end;
rc=h.add();
if z;
h.output(dataset:'want');
run;
Reference idea and code aka Plagiarized from
@lindseyn wrote:
Thank you! This is why. It looks like it has to do with spacing and some spelling errors. How would I account for something like this for a data set with approx 1500 rows?
Spelling there isn't much that can be done unless you have patterns of errors.
If by "spacing" you mean leading spaces then the STRIP function will remove leading spaces. In a data step: var = strip(var);
If by "spacing" you mean that some values have more than a single space between elements then the COMPBL function which removes multiple blanks from a string may help:
var = compbl(var);
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.