Solved: Re: Removing Duplicated Rows

lindseyn · Posted 12-03-2019 01:46 PM

I have combined a few data sets successfully and now I am trying to removing duplicated rows. I have tried accomplishing this the following two ways:

proc sql;
	create table sales.NODuplicates as 
	select distinct * 
	from sales.combined;
quit;


proc sort data = sales.Combined
	NODUPRECS;
By _all_;
run;

However, neither appear to be working since my data set is still appearing with the duplicated rows. Any ideas as to why this may be happening?

Reeza · Posted 12-04-2019 04:23 PM

data bmt.Approval;
	length country $15;
	set BMT.Approval;

Don't code like that, it makes a massive pain in the ass to determine where the issue and fix is. Makes sure to give each data set a unique name to start with and go from there.

Try that and post back. It should work so if it's not something else is the issue.

You can try the following as well, but I suspect the above fix will work. If this doesn't work, please post the log from the following and the proc freq output.

data bmt.Approval2;
	length country $15;
	set BMT.Approval;
	if trim(lowcase(Country)) = "united kingdo"
	then Country = "United Kingdom";
        format Country $15.;
run;

*check results;
proc freq data=bmt.approval2;
table country;
run;

View solution in original post

Reeza · Posted 12-03-2019 02:18 PM

This means that although you view them as duplicates SAS doesn't for some reason.

Some of the reasons:

All fields are not identical (e.g. an amount)
Some fields have different cases, e.g Jason does not equal JASON
Some fields have invisible white spaces that make it hard to find.

In cases like this you'll need to provide some sample data and example duplicates otherwise we can't help you identify where the issue may be.

EDIT: a better fix is to determine why you have duplicates in the first place and remove them in the preceding queries.

@lindseyn wrote:

I have combined a few data sets successfully and now I am trying to removing duplicated rows. I have tried accomplishing this the following two ways:
proc sql;
	create table sales.NODuplicates as 
	select distinct * 
	from sales.combined;
quit;


proc sort data = sales.Combined
	NODUPRECS;
By _all_;
run;
However, neither appear to be working since my data set is still appearing with the duplicated rows. Any ideas as to why this may be happening?

lindseyn · Posted 12-03-2019 02:24 PM

Thank you! This is why. It looks like it has to do with spacing and some spelling errors. How would I account for something like this for a data set with approx 1500 rows?

Reeza · Posted 12-03-2019 02:27 PM

Fix the spacing and spelling errors manually and then remove duplicates. I would probably recommend fixing them as part of your cleaning process. Order should be 1) Import data, 2) clean data 3)merge data 4) analyze

If you find issues in any step, I would go back and fix it at the appropriate step, in this case I would clean it manually using IF/THEN statements in my cleaning processing, and then re-run from there.

lindseyn · Posted 12-04-2019 03:42 PM

One common misspelling that occurs is below. I tried this code and it did not remove it, There are no leading spaces yet I am not sure why this is not working.

data bmt.want;
	set BMT.old;
	if Country = "United Kingdo"
	then Country = "United Kingdom";
run;

Reeza · Posted 12-04-2019 04:01 PM

What is the length set to on your Country variable?

lindseyn · Posted 12-04-2019 04:10 PM

Right now it is set to 13. However, I tried changing the length and still saw no changes for that value.

Reeza · Posted 12-04-2019 04:12 PM

13 is why you're getting the issue. Did you change the length of the variable before the SET statement? Otherwise it won't work for increasing the length.

lindseyn · Posted 12-04-2019 04:20 PM

This is what I tried:

data bmt.Approval;
	length country $15;
	set BMT.Approval;
	if Country = "United Kingdo"
	then Country = "United Kingdom";
run;

Yet the column still appeared as this and when checking proc contents length is 15.

United States

Ireland

France

Japan

United Kingdo

... and so on

Reeza · Posted 12-04-2019 04:23 PM

data bmt.Approval;
	length country $15;
	set BMT.Approval;

Don't code like that, it makes a massive pain in the ass to determine where the issue and fix is. Makes sure to give each data set a unique name to start with and go from there.

Try that and post back. It should work so if it's not something else is the issue.

You can try the following as well, but I suspect the above fix will work. If this doesn't work, please post the log from the following and the proc freq output.

data bmt.Approval2;
	length country $15;
	set BMT.Approval;
	if trim(lowcase(Country)) = "united kingdo"
	then Country = "United Kingdom";
        format Country $15.;
run;

*check results;
proc freq data=bmt.approval2;
table country;
run;

novinosrin · Posted 12-03-2019 02:39 PM

@Reeza Brilliant reasoning!!!

@lindseyn is the combined one 1500 records? Or each one is approx 1500 records?

If that is so small, you could trick with HASH. You need two temp arrays though. 1 for all chars and 2 for all nums

1. first copy the incoming var values to the 2 arrays

2. While copying i.e assigning compress and upcase the chars

3. Num values should be okay unless you have rounding discrepancies with floats.

4. Load all the temp array values as keys in definekey, the data portion would be your corresponding original variables/its values

5. Use default Hash options, meaning DO NOT ALLOW DUPS with MULTIDATA:"Y". This will make sure you have

6. Load the contents into the Hash memory resident table

7. When END=Last records indicator =1 , write the contents of HASH to a SAS dataset. Done deal!!!!!!!

novinosrin · Posted 12-03-2019 03:14 PM

Hi Again @lindseyn The following is the demo of the idea I suggested.


/*creating duplicate sample*/
data have;
set sashelp.class;
if mod(_n_,2)=0 then do _n_=1 to 5;
output;
end;
else output;
run;


data _null_;
 if _n_=1 then do;
  dcl hash H () ;
  if 0 then set have;
  array c _char_;
  array num _numeric_;
  array ct(2)$32;
  array nt(3);
  do _iorc_=1 to dim(ct);
   h.definekey(vname(ct(_iorc_)));
   h.definedata(vname(c(_iorc_)));
  end;
  do _iorc_=1 to dim(nt);
    h.definekey(vname(nt(_iorc_)));
	h.definedata(vname(num(_iorc_)));
  end;
  h.definedone();
 end;
 set have end=z;
  do _iorc_=1 to dim(ct);
    ct(_iorc_)=upcase(compress(c(_iorc_)));
  end;
  do _iorc_=1 to dim(nt);
   nt(_iorc_)=num(_iorc_);
 end;
  rc=h.add();
 if z;
 h.output(dataset:'want');
run;

Reference idea and code aka Plagiarized from

@hashman and @DonH 's Book

Data Management Solutions Using SAS^® Hash Table Operations: A Business Intelligence Case Study

By Paul Dorfman and Don Henderson

All the Best!!!

ballardw · Posted 12-03-2019 07:15 PM

@lindseyn wrote:

Thank you! This is why. It looks like it has to do with spacing and some spelling errors. How would I account for something like this for a data set with approx 1500 rows?

Spelling there isn't much that can be done unless you have patterns of errors.

If by "spacing" you mean leading spaces then the STRIP function will remove leading spaces. In a data step: var = strip(var);

If by "spacing" you mean that some values have more than a single space between elements then the COMPBL function which removes multiple blanks from a string may help:

var = compbl(var);

Data Management Solutions Using SAS® Hash Table Operations: A Business Intelligence Case Study

Registration is open

SAS Training: Just a Click Away

Data Management Solutions Using SAS^® Hash Table Operations: A Business Intelligence Case Study