Solved: Re: Using a Hiter Object to Eliminate Duplicates

InfoAlisaA · Posted 04-23-2012 04:07 PM

Hello Everyone,

So I am stuck on a problem on how to use a hiter object to eliminate duplicates from a data set in order to create a new one.

My example data set is orion.order_fact:

Order_

Obs Customer_ID Type

1 4 3

2 4 1

3 4 3

4 4 3

5 4 3

6 4 1

7 4 1

8 4 3

9 4 1

10 5 2

11 5 1

12 5 2

13 5 2

14 5 2

15 5 1

16 5 2

17 5 2

18 5 2

19 5 1

20 5 1

What I need to do is to eliminate the duplicate values of Customer_ID and Order_Type so only one record would appear like

Customer_ID Order_Type

4 1

4 3

5 1

5 2

So far my code attempt is pretty pathetic since I really don't know what I am doing:

data different1;

length Customer_ID 8 Order_Type 8.;

if _N_=1 then do;

if 0 then set orion.order_fact(keep=Customer_ID Order_Type);/*Using this method instead of the call missing to initialize the variables*/

declare hash test(dataset:'orion.order_fact',ordered='A',duplicate:'r');

test.definekey('Customer_ID','Order_Type');

test.definedata('Customer_ID','Order_Type');

test.definedone();

declare hiter t('test');

/*call missing('Customer_ID','Order_Type');*/

end;

t.removeDup();

if rc=0 then output different1;

stop;

run;

proc print data=different1 (obs=10);

run;

If anyone out there has any suggestions on what I should do to remedy this problem, please let me know.

I know there are other ways I could go about this without using a hiter object, but this is for a homework assignment and I need to use a hiter object.

Thanks,

Alisa

Ksharp · Posted 04-24-2012 12:14 AM

You don't need Hiter Object. Since you didn't use Hash Table parameter multidata:'Y' .

  
data order_fact;
input Obs     Customer_ID     Order_Type ;
cards;
1               4                 3
2               4                 1
3               4                 3
15               5       1
16               5       2
17               5       2
;
run;
data different1;
   if _N_=1 then do;
   if 0 then set order_fact(keep=Customer_ID Order_Type);/*Using this method instead of the call missing to initialize the variables*/
   declare hash test(dataset:'order_fact',ordered:'A',duplicate:'r');
   test.definekey('Customer_ID','Order_Type');
   test.definedata('Customer_ID','Order_Type');
   test.definedone();
end;
test.output(dataset:'want');
stop;
run;

Ksharp

View solution in original post

Ksharp · Posted 04-24-2012 12:14 AM

You don't need Hiter Object. Since you didn't use Hash Table parameter multidata:'Y' .

  
data order_fact;
input Obs     Customer_ID     Order_Type ;
cards;
1               4                 3
2               4                 1
3               4                 3
15               5       1
16               5       2
17               5       2
;
run;
data different1;
   if _N_=1 then do;
   if 0 then set order_fact(keep=Customer_ID Order_Type);/*Using this method instead of the call missing to initialize the variables*/
   declare hash test(dataset:'order_fact',ordered:'A',duplicate:'r');
   test.definekey('Customer_ID','Order_Type');
   test.definedata('Customer_ID','Order_Type');
   test.definedone();
end;
test.output(dataset:'want');
stop;
run;

Ksharp

InfoAlisaA · Posted 04-24-2012 03:21 PM

Hi Ksharp,

This worked out perfectly!

Thanks for your help!

Alisa

Haikuo · Posted 04-24-2012 03:33 PM

Hi Alisa,

As far as I know, there is NO way that you can use hash hiter to achieve this task, as eliminating duplicates implies needed keys, while hiter does not use keys ( not that type of keys anyway). As you can see from Ksharp's post, hash() has been built in this capacity to do the job. And there are also ways to do it more strenuously using removedup(), but I can't hardly see any roles that hiter can play.

Since you mentioned using hiter was part of your assignment requirement, I am more than interested to know the answers after they are released by your instructor. Please report it back. Thanks in advance!

Haikuo

InfoAlisaA · Posted 04-24-2012 08:19 PM

Hi Hai.kuo,

Here is the answer that my professor gave:

data different;

drop rc;

if _N_=1 then do;

if 0 then set orion.order_fact(keep=Customer_ID

Order_Type);

declare hash orders(dataset: 'orion.order_fact',

ordered: 'yes');

declare hiter O_F('orders');

orders.defineKey('Customer_ID', 'Order_Type');

orders.defineData('Customer_ID', 'Order_Type');

orders.defineDone();

end;

rc=O_F.first();

do while (rc=0);

output;

rc=O_F.next();

end;

stop;

run;

proc print data=different(obs=10);

title "No Duplicates";

run;

Haikuo · Posted 04-24-2012 08:56 PM

Alisa,

So as you can see, hiter does not remove duplicates. Hiter here just simply reads in every record in the hash object that is already been processed in term of 'eliminating duplicates'. Hash will load unique records only when not specifying 'multidata' as option.

BTW, IMHO, comparing to Ksharp's approach of output(), using hiter here is kinda clumsy.

Thanks,

Haikuo

jorge_sanchez · Posted 05-07-2012 06:29 PM

Hai:

May I ask a question?

I have many tables with hundreds of columns and thousands of rows.

I need to eliminate the dupplicated rows in each of these tables.

Do you have a generic script that I can run on each ot the tables that delete the dupplicated records and leave the records that do not have dupplicates?

Thanks a lot

Jorge sanchez.

FriedEgg · Posted 05-07-2012 07:20 PM

data order_fact;

input customer_id order_type amt;

cards;

4 3 4.99

4 1 1.00

4 3 0.99

5 1 0.02

5 2 9.99

5 2 5.47

;

run;

data some_other_table;

input foo bar;

cards;

1 1

1 2

1 3

1 1

4 1

5 1

4 1

5 2

5 3

5 1

;

run;

%macro hash_noduprec(dataset);

data _null_;

if 0 then set &dataset;

dcl hash i(dataset:"&dataset",ordered:'a'); *order will be based on variable order in dataset;

i.definekey(all:'y');

i.definedata(all:'y');

i.definedone();

i.output(dataset:"&dataset._ndr");

stop;

run;

%mend;

%hash_noduprec(order_fact);

%hash_noduprec(some_other_table);

*I don't see why not to just use proc sort for this task;

proc sort data=order_fact nodupkey; by _all_; run;

Message was edited by: Matthew Kastin

Patrick · Posted 05-14-2012 09:12 AM

Hi FriedEgg

When using noduprec you want to use all variables for sorting. What can happen else is that you get multiple obs in a by group and as Proc Sort Noduprec only compares consecutive observations you still might end up with duplicates (Noduprec is kind of deceptive - may be not one of the best things SAS invented).

proc sort data=order_fact noduprec; by _all_; run

Ksharp · Posted 05-14-2012 10:35 PM

Patrick.

Agree! option noduprec is almost useless .

Your code is exactly the same with FriedEgg .

Ksharp

Linlin · Posted 05-07-2012 10:45 PM

Hi,

for dataset have:

data have;

input id age;

cards;

1 20

2 30

3 40

;

do you want

1 20

2 30

3 40

or

2 30

3 40

?

Patrick · Posted 04-25-2012 01:06 AM

Alisa,

Agree with Haikuo. Your teacher should amend the code. It's most definitely clumsy.

Thanks

Patrick

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away