Removing Duplicates of variable A with hierarchy of variable B

CharlesFowler72 · Posted 07-17-2018 10:48 AM

Hi guys,

What I’m looking to do is to removes all duplicates of variable A so each variable A is unique to the output. However, a second variable (let’s call variable B) is a classification which has a desired hierarchy of the output.

If the hierarchy goes Class A, Class B, Class C then Class D then the below table

Variable A	Variable B
Company 1	Class A
Company 2	Class A
Company 1	Class B
Company 3	Class B
Company 2	Class C
Company 4	Class D

Would go into

Variable A	Variable B
Company 1	Class A
Company 2	Class A
Company 3	Class B
Company 4	Class D

Any idea?

Thanks

Astounding · Posted 07-17-2018 10:51 AM

Assuming that the results mistakenly omitted the combination of Company 2 with Class C, you could use:

proc sort data=have out=want nodupkey;

by variable_a variable_b;

run;

If the posted results are actually correct, however, you might want:

proc sort data=have;

by variable_a variable_b;

run;

data want;

set have;

by variable_a;

if first.variable_a;

run;

The more I think about it, the more I think the bottom program is the right one and the question you posted got it right.

CharlesFowler72 · Posted 07-17-2018 10:55 AM

Hi Astounding,

No that was omitted on purpose. Variable A has to be unique and as Class A is higher up in the hierarchy than Class C, the line Company 2 Class C is removed.

Astounding · Posted 07-17-2018 10:58 AM

OK, then go with the bottom program that I posted. There is a requirement, however. The sorted order must match the hierarchy. In other words, within a company, the program selects the "smallest" value of variable_b. So the "highest" priority in the hierarchy has to be the "smallest" value when sorting.

CharlesFowler72 · Posted 07-17-2018 11:37 AM

That's close but the next issue is that Variable B is not alphabetical. It is a list. Sorry that's my fault for not thinking and just putting A, B and C. If I use your second solution the order is not correct for the data step.

novinosrin · Posted 07-17-2018 11:01 AM

data have;
input VariableA	 $ &15. VariableB $ &15.;
cards;
Company 1	Class A
Company 2	Class A
Company 1	Class B
Company 3	Class B
Company 2	Class C
Company 4	Class D
;
data _null_;
if 0 then set have;
dcl hash H (dataset:'have',ordered: "A") ;
   h.definekey  ("variablea") ;
   h.definedata ("variablea", "variableb") ;
   h.definedone () ;
h.output(dataset:'want');
stop;
run;

EEng · Posted 07-17-2018 11:50 AM

You just have to have 2 sort procedures.

data have;
	 length var_A $9;
	 input var_A $ var_B $;
	 datalines;
Company_1 Class_A
Company_2 Class_A
Company_1 Class_B
Company_3 Class_B
Company_2 Class_C
Company_4 Class_D
;
proc print data=have;
run;
proc sort data=have;
	 by var_A var_B;
run;
proc sort nodupkey data=have out=want;
	 by var_A;
run;
proc print data=want;
run;

Output

Obs var_A var_B

1	Company_1	Class_A
2	Company_2	Class_A
3	Company_3	Class_B
4	Company_4	Class_D

CharlesFowler72 · Posted 07-18-2018 03:58 AM

Hi EENG,

do you know how I would do that but using a custom sort for variable B? In reality the order is not alphabetical.

EEng · Posted 07-18-2018 09:29 AM

But there is an order to B, right? Can't you create a new field like, If var_B = 'Tree' then var_B_id=1; else if var_B='Plant' then var_B_id=2; Then sort by var_A var_B_id; Then nodupkey on var_A. This way you do create an order to the field.

CharlesFowler72 · Posted 07-18-2018 10:02 AM

that was one of the first things I've tried. The issue with using the order like that is that I want to order first by variable A then A but remove duplicates for A not B. Sorry, I'm probably not explaining myself well. Might help if I switch variables around. If you see table below, the order is by Var_b Var_a but then removing duplicates is Var_A from top to bottom in order Var_b Var_a. So lines in red would be removed.

Variable B	Variable A
1_VB	Company 2
1_VB	Company 1
2_VB	Company 2
2_VB	Company 3
3_VB	Company 3
3_VB	Company 1

this makes using proc sort hard because the order has to be the same for removing duplicates. I've also tried labelling using First.var_A but it comes up with the same issue that the order have to be just Var_a.

CharlesFowler72 · Posted 07-18-2018 10:27 AM

Posted alternative questions asking for a counter that cannot be sorted. here: https://communities.sas.com/t5/SAS-Procedures/Counter-Variable-without-sorting/m-p/479082#M71438

now ill just limit for count =1 and that's done.

Thanks for the help.

EEng · Posted 07-18-2018 10:57 AM

Yeah, maybe I was not following. But looking at the table you had, maybe I need to see the data. I thought I was getting what you needed.

data have;
	 length var_A $9;
	 input var_B $ var_A $;
	 datalines;
1_VB Company_2
1_VB Company_1
2_VB Company_2
2_VB Company_3
3_VB Company_3
3_VB Company_1
;
proc print data=have noobs;
	 var var_B var_A;
run;
proc sort data=have;
	 by var_B var_A;
run;
proc print data=have noobs;
	 var var_B var_A;
run;
proc sort nodupkey data=have out=want;
	 by var_A;
run;
proc print data=want noobs;
	 var var_B var_A;
run;

The input data...sorted...results

var_B var_A

1_VB	Company_2
1_VB	Company_1
2_VB	Company_2
2_VB	Company_3
3_VB	Company_3
3_VB	Company_1

var_B var_A

1_VB	Company_1
1_VB	Company_2
2_VB	Company_2
2_VB	Company_3
3_VB	Company_1
3_VB	Company_3

var_B var_A

1_VB	Company_1
1_VB	Company_2
2_VB	Company_3

Removing Duplicates of variable A with hierarchy of variable B

Re: Removing Duplicates of variable A with hierarchy of variable B

Re: Removing Duplicates of variable A with hierarchy of variable B

Re: Removing Duplicates of variable A with hierarchy of variable B

Re: Removing Duplicates of variable A with hierarchy of variable B

Re: Removing Duplicates of variable A with hierarchy of variable B

Re: Removing Duplicates of variable A with hierarchy of variable B

Re: Removing Duplicates of variable A with hierarchy of variable B

Re: Removing Duplicates of variable A with hierarchy of variable B

Re: Removing Duplicates of variable A with hierarchy of variable B

Re: Removing Duplicates of variable A with hierarchy of variable B

Re: Removing Duplicates of variable A with hierarchy of variable B

Registration is open

Registration is open

SAS Training: Just a Click Away