BookmarkSubscribeRSS Feed
CharlesFowler72
Obsidian | Level 7

Hi guys,

What I’m looking to do is to removes all duplicates of variable A so each variable A is unique to the output. However, a second variable (let’s call variable B) is a classification which has a desired hierarchy of the output.

If the hierarchy goes Class A, Class B, Class C then Class D then the below table

Variable A

Variable B

Company 1

Class A

Company 2

Class A

Company 1

Class B

Company 3

Class B

Company 2

Class C

Company 4

Class D

 

Would go into

Variable A

Variable B

Company 1

Class A

Company 2

Class A

Company 3

Class B

Company 4

Class D

 

Any idea?

Thanks

11 REPLIES 11
Astounding
PROC Star

Assuming that the results mistakenly omitted the combination of Company 2 with Class C, you could use:

 

proc sort data=have out=want nodupkey;

  by variable_a variable_b;

run;

 

If the posted results are actually correct, however, you might want:

 

proc sort data=have;

   by variable_a variable_b;

run;

 

data want;

set have;

by variable_a;

if first.variable_a;

run;

 

The more I think about it, the more I think the bottom program is the right one and the question you posted got it right.

CharlesFowler72
Obsidian | Level 7

Hi Astounding,

No that was omitted on purpose. Variable A has to be unique and as Class A is higher up in the hierarchy than Class C, the line Company 2 Class C is removed.

Astounding
PROC Star

OK, then go with the bottom program that I posted.  There is a requirement, however.  The sorted order must match the hierarchy.  In other words, within a company, the program selects the "smallest" value of variable_b.  So the "highest" priority in the hierarchy has to be the "smallest" value when sorting.

CharlesFowler72
Obsidian | Level 7
That's close but the next issue is that Variable B is not alphabetical. It is a list. Sorry that's my fault for not thinking and just putting A, B and C. If I use your second solution the order is not correct for the data step.
novinosrin
Tourmaline | Level 20
data have;
input VariableA	 $ &15. VariableB $ &15.;
cards;
Company 1	Class A
Company 2	Class A
Company 1	Class B
Company 3	Class B
Company 2	Class C
Company 4	Class D
;
data _null_;
if 0 then set have;
dcl hash H (dataset:'have',ordered: "A") ;
   h.definekey  ("variablea") ;
   h.definedata ("variablea", "variableb") ;
   h.definedone () ;
h.output(dataset:'want');
stop;
run;
EEng
Obsidian | Level 7

You just have to have 2 sort procedures.

 

data have;
	 length var_A $9;
	 input var_A $ var_B $;
	 datalines;
Company_1 Class_A
Company_2 Class_A
Company_1 Class_B
Company_3 Class_B
Company_2 Class_C
Company_4 Class_D
;
proc print data=have;
run;
proc sort data=have;
	 by var_A var_B;
run;
proc sort nodupkey data=have out=want;
	 by var_A;
run;
proc print data=want;
run;

Output

 

Obs var_A var_B
1Company_1Class_A
2Company_2Class_A
3Company_3Class_B
4Company_4Class_D
CharlesFowler72
Obsidian | Level 7

Hi EENG,

 

do you know how I would do that but using a custom sort for variable B? In reality the order is not alphabetical.

EEng
Obsidian | Level 7

But there is an order to B, right? Can't you create a new field like, If var_B = 'Tree' then var_B_id=1; else if var_B='Plant' then var_B_id=2; Then sort by var_A var_B_id; Then nodupkey on var_A. This way you do create an order to the field.

CharlesFowler72
Obsidian | Level 7

that was one of the first things I've tried. The issue with using the order like that is that I want to order first by variable A then A but remove duplicates for A not B. Sorry, I'm probably not explaining myself well. Might help if I switch variables around. If you see table below, the order is by Var_b Var_a but then removing duplicates is Var_A from top to bottom in order Var_b Var_a. So lines in red would be removed.

 

Variable B

Variable A

1_VB

Company 2

1_VB

Company 1

2_VB

Company 2

2_VB

Company 3

3_VB

Company 3

3_VB

Company 1

 

this makes using proc sort hard because the order has to be the same for removing duplicates. I've also tried labelling using First.var_A but it comes up with the same issue that the order have to be just Var_a.

CharlesFowler72
Obsidian | Level 7

Posted alternative questions asking for a counter that cannot be sorted. here: https://communities.sas.com/t5/SAS-Procedures/Counter-Variable-without-sorting/m-p/479082#M71438

now ill just limit for count =1 and that's done.

 

Thanks for the help.

EEng
Obsidian | Level 7

Yeah, maybe I was not following. But looking at the table you had, maybe I need to see the data. I thought I was getting what you needed.

 

data have;
	 length var_A $9;
	 input var_B $ var_A $;
	 datalines;
1_VB Company_2
1_VB Company_1
2_VB Company_2
2_VB Company_3
3_VB Company_3
3_VB Company_1
;
proc print data=have noobs;
	 var var_B var_A;
run;
proc sort data=have;
	 by var_B var_A;
run;
proc print data=have noobs;
	 var var_B var_A;
run;
proc sort nodupkey data=have out=want;
	 by var_A;
run;
proc print data=want noobs;
	 var var_B var_A;
run;

 

The input data...sorted...results

var_B var_A
1_VBCompany_2
1_VBCompany_1
2_VBCompany_2
2_VBCompany_3
3_VBCompany_3
3_VBCompany_1
var_B var_A
1_VBCompany_1
1_VBCompany_2
2_VBCompany_2
2_VBCompany_3
3_VBCompany_1
3_VBCompany_3
var_B var_A
1_VBCompany_1
1_VBCompany_2
2_VBCompany_3

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 2841 views
  • 0 likes
  • 4 in conversation