BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lindseyn
Obsidian | Level 7

I have combined a few data sets successfully and now I am trying to removing duplicated rows.  I have tried accomplishing this the following two ways:

 

proc sql;
	create table sales.NODuplicates as 
	select distinct * 
	from sales.combined;
quit;


proc sort data = sales.Combined
	NODUPRECS;
By _all_;
run;

However, neither appear to be working since my data set is still appearing with the duplicated rows. Any ideas as to why this may be happening?

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User
data bmt.Approval;
	length country $15;
	set BMT.Approval;

Don't code like that, it makes a massive pain in the ass to determine where the issue and fix is. Makes sure to give each data set a unique name to start with and go from there. 

 

Try that and post back. It should work so if it's not something else is the issue. 

 

You can try the following as well, but I suspect the above fix will work. If this doesn't work, please post the log from the following and the proc freq output.

data bmt.Approval2;
	length country $15;
	set BMT.Approval;
	if trim(lowcase(Country)) = "united kingdo"
	then Country = "United Kingdom";
        format Country $15.;
run;

*check results;
proc freq data=bmt.approval2;
table country;
run;

 

View solution in original post

12 REPLIES 12
Reeza
Super User

This means that although you view them as duplicates SAS doesn't for some reason.

Some of the reasons:

  • All fields are not identical (e.g. an amount)
  • Some fields have different cases, e.g Jason does not equal JASON
  • Some fields have invisible white spaces that make it hard to find.

    In cases like this you'll need to provide some sample data and example duplicates otherwise we can't help you identify where the issue may be.

EDIT: a better fix is to determine why you have duplicates in the first place and remove them in the preceding queries. 


@lindseyn wrote:

I have combined a few data sets successfully and now I am trying to removing duplicated rows.  I have tried accomplishing this the following two ways:

 

proc sql;
	create table sales.NODuplicates as 
	select distinct * 
	from sales.combined;
quit;


proc sort data = sales.Combined
	NODUPRECS;
By _all_;
run;

However, neither appear to be working since my data set is still appearing with the duplicated rows. Any ideas as to why this may be happening?


 

lindseyn
Obsidian | Level 7

Thank you! This is why. It looks like it has to do with spacing and some spelling errors. How would I account for something like this for a data set with approx 1500 rows?

Reeza
Super User
Fix the spacing and spelling errors manually and then remove duplicates. I would probably recommend fixing them as part of your cleaning process. Order should be 1) Import data, 2) clean data 3)merge data 4) analyze

If you find issues in any step, I would go back and fix it at the appropriate step, in this case I would clean it manually using IF/THEN statements in my cleaning processing, and then re-run from there.

lindseyn
Obsidian | Level 7

One common misspelling that occurs is below. I tried this code and it did not remove it, There are no leading spaces yet I am not sure why this is not working.

 

data bmt.want;
	set BMT.old;
	if Country = "United Kingdo"
	then Country = "United Kingdom";
run;
Reeza
Super User
What is the length set to on your Country variable?
lindseyn
Obsidian | Level 7

Right now it is set to 13. However, I tried changing the length and still saw no changes for that value. 

Reeza
Super User
13 is why you're getting the issue. Did you change the length of the variable before the SET statement? Otherwise it won't work for increasing the length.
lindseyn
Obsidian | Level 7

This is what I tried:

data bmt.Approval;
	length country $15;
	set BMT.Approval;
	if Country = "United Kingdo"
	then Country = "United Kingdom";
run;

Yet the column still appeared as this and when checking proc contents length is 15. 

United States

Ireland

France

France

Japan

United Kingdo

United Kingdo

... and so on

 

Reeza
Super User
data bmt.Approval;
	length country $15;
	set BMT.Approval;

Don't code like that, it makes a massive pain in the ass to determine where the issue and fix is. Makes sure to give each data set a unique name to start with and go from there. 

 

Try that and post back. It should work so if it's not something else is the issue. 

 

You can try the following as well, but I suspect the above fix will work. If this doesn't work, please post the log from the following and the proc freq output.

data bmt.Approval2;
	length country $15;
	set BMT.Approval;
	if trim(lowcase(Country)) = "united kingdo"
	then Country = "United Kingdom";
        format Country $15.;
run;

*check results;
proc freq data=bmt.approval2;
table country;
run;

 

novinosrin
Tourmaline | Level 20

@Reeza  Brilliant reasoning!!!

 

@lindseyn   is the combined one 1500 records? Or each one is approx 1500 records?

 

If that is so small, you could trick with HASH. You need two temp arrays though.  1 for all chars and 2 for all nums

 

1. first copy the incoming var values to the 2 arrays

2. While copying i.e assigning compress and upcase the chars

3. Num values should be okay unless you have rounding discrepancies with floats.

4. Load all the temp array values as keys in definekey, the data portion would be your corresponding original variables/its values

5. Use default Hash options, meaning  DO NOT ALLOW DUPS with MULTIDATA:"Y". This will make sure you have 

6. Load the contents into the Hash memory resident table

7. When END=Last records indicator =1 , write the contents of HASH to a SAS dataset. Done deal!!!!!!!

 

novinosrin
Tourmaline | Level 20

Hi Again @lindseyn  The following is the demo of the idea I suggested. 

 


/*creating duplicate sample*/
data have;
set sashelp.class;
if mod(_n_,2)=0 then do _n_=1 to 5;
output;
end;
else output;
run;


data _null_;
 if _n_=1 then do;
  dcl hash H () ;
  if 0 then set have;
  array c _char_;
  array num _numeric_;
  array ct(2)$32;
  array nt(3);
  do _iorc_=1 to dim(ct);
   h.definekey(vname(ct(_iorc_)));
   h.definedata(vname(c(_iorc_)));
  end;
  do _iorc_=1 to dim(nt);
    h.definekey(vname(nt(_iorc_)));
	h.definedata(vname(num(_iorc_)));
  end;
  h.definedone();
 end;
 set have end=z;
  do _iorc_=1 to dim(ct);
    ct(_iorc_)=upcase(compress(c(_iorc_)));
  end;
  do _iorc_=1 to dim(nt);
   nt(_iorc_)=num(_iorc_);
 end;
  rc=h.add();
 if z;
 h.output(dataset:'want');
run;

 

Reference idea and code aka Plagiarized from 

@hashman  and @DonH  's Book

 

Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study 

By Paul Dorfman and Don Henderson

 

All the Best!!!

ballardw
Super User

@lindseyn wrote:

Thank you! This is why. It looks like it has to do with spacing and some spelling errors. How would I account for something like this for a data set with approx 1500 rows?


Spelling there isn't much that can be done unless you have patterns of errors.

If by "spacing" you mean leading spaces then the STRIP function will remove leading spaces. In a data step:  var = strip(var);

If by "spacing" you mean that some values have more than a single space between elements then the COMPBL function which removes multiple blanks from a string may help:

 

var = compbl(var);

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 1598 views
  • 4 likes
  • 4 in conversation