Re: How can i subsplit a column based on unique value?

Shonesum · Posted 07-23-2019 12:11 PM

Hello all,

How can I add a column into dataset based on 1 column value. How can I divide each rank into 2 splits or 3 splits. If I used Proc rank. it just splits the dataset into 2/3 equal portion. Is there a way to split the each unique ranks into 2/3 split. In the example below I have splited into 2 splits.

Have:

Test1	Test2	Rank
Apple	dd	1
Ball	dd	1
Cat	dd	1
Dog	dd	1
Elephant	dd	2
dsd	dd	2
sdf	dd	2
asdf	e	3
asdf	e	3
asdf	e	3
adsf	e	3

want:

Test1	Test2	Rank	Desired
Apple	dd	1	1
Ball	dd	1	1
Cat	dd	1	2
Dog	dd	1	2
Elephant	dd	2	1
dsd	dd	2	2
sdf	dd	2	2
asdf	e	3	1
asdf	e	3	1
asdf	e	3	2
adsf	e	3	2

proc rank data = have out = want group =2;

var = rank;

ranks =desired;

quit;

This code gave me :

Test1	Test2	Rank	Desired
Apple	dd	1	0
Ball	dd	1	0
Cat	dd	1	0
Dog	dd	1	0
Elephant	dd	2	1
dsd	dd	2	1
sdf	dd	2	1
asdf	e	3	1
asdf	e	3	2
asdf	e	3	2
adsf	e	3	2

Thank you in advance.

PaigeMiller · Posted 07-23-2019 01:47 PM

@Shonesum wrote:

Hello all,

How can I add a column into dataset based on 1 column value. How can I divide each rank into 2 splits or 3 splits. If I used Proc rank. it just splits the dataset into 2/3 equal portion. Is there a way to split the each unique ranks into 2/3 split. In the example below I have splited into 2 splits.

Have:

Test1 Test2 Rank

Apple dd 1

Ball dd 1

Cat dd 1

Dog dd 1

Elephant dd 2

dsd dd 2

sdf dd 2

asdf e 3

asdf e 3

asdf e 3

adsf e 3

want:

Test1 Test2 Rank Desired

Apple dd 1 1

Ball dd 1 1

Cat dd 1 2

Dog dd 1 2

Elephant dd 2 1

dsd dd 2 2

sdf dd 2 2

asdf e 3 1

asdf e 3 1

asdf e 3 2

adsf e 3 2

What is the rule that decides that sometimes you get 1/2 of the rows as 1 and 1/2 as 2 for some ranks, and 1/3 versus 2/3 for other ranks? What happens if there are 5 observations in a Rank? Is this done at random or non-random and sequentially?

--
Paige Miller

Shonesum · Posted 07-23-2019 01:49 PM

Hi Paige it is done sequentially. I solved it by using "by" statement in proc rank 🙂

PaigeMiller · Posted 07-23-2019 01:55 PM

@Shonesum wrote:

Hi Paige it is done sequentially. I solved it by using "by" statement in proc rank 🙂

That's great!

Could you please share your code in case someone else reads this thread and wants to know the answer?

--
Paige Miller

Shonesum · Posted 07-23-2019 02:04 PM

Since there was no numeric column, I added col variable then by rank.

data work.have;
set work.test;
col=_n_;
run;

proc rank data = work.have out = work.want group =2;
by rank;
var  col;

ranks desired;

quit;

PaigeMiller · Posted 07-23-2019 02:09 PM

Nice! Well done!

--
Paige Miller

Shonesum · Posted 07-23-2019 02:40 PM

Thank you 🙂

KachiM · Posted 07-25-2019 06:57 AM

You want the "desired" to start from 1. You also desire 2 or 3 groups within each RANK given in your HAVE data set. Here is a Data Step for SUB SPLITTING. If the RANK numbers is less than 6 I made a split of 2 groups otherwise 3 groups. With your example data set, two groups are made. I have added another 6 rows to show for a split of 3 groups.

data have;
input Test1 :$8. Test2 :$2. Rank;
datalines;
Apple    dd 1
Ball     dd 1
Cat      dd 1
Dog      dd 1
Elephant dd 2
dsd      dd 2
sdf      dd 2
asdf     e  3
asdf     e  3
asdf     e  3
adsf     e  3
;
run;

The following program splits into 2 groups:


data want;
   do _n_ = 1 by 1 until(last.rank);
      set have;
      by rank;
   end;
   ngroups = ceil(ifN(_n_ >= 6, _n_/3, _n_/2));
   do _n_ = 1 by 1 until(last.rank);
      set have;
      by rank;
      if _n_ <= ngroups then desired = 1;
      else desired + mod(_n_, ngroups);
      output;
   end;
drop ngroups;
run;

The statement:

   ngroups = ceil(ifN(_n_ >= 6, _n_/3, _n_/2));

computes the desired number of rows to be present in each split. The following statements

      if _n_ <= ngroups then desired = 1;
      else desired + mod(_n_, ngroups);

creates the values for "desired" based on the number of observations per RANK.

Is this what you want?

Here is the additional observations. The same program works.

data have;
input Test1 :$8. Test2 :$2. Rank;
datalines;
Apple    dd 1
Ball     dd 1
Cat      dd 1
Dog      dd 1
Elephant dd 2
dsd      dd 2
sdf      dd 2
asdf     e  3
asdf     e  3
asdf     e  3
adsf     e  3
aaa      g  4
aaa      g  4
aaa      g  4
aaa      g  4
aaa      g  4
aaa      g  4
;
run;

The output of this:

Obs 	Test1 	Test2 	Rank 	desired
1 	Apple 	dd 	1 	1
2 	Ball 	dd 	1 	1
3 	Cat 	dd 	1 	2
4 	Dog 	dd 	1 	2
5 	Elephant 	dd 	2 	1
6 	dsd 	dd 	2 	1
7 	sdf 	dd 	2 	2
8 	asdf 	e 	3 	1
9 	asdf 	e 	3 	1
10 	asdf 	e 	3 	2
11 	adsf 	e 	3 	2
12 	aaa 	g 	4 	1
13 	aaa 	g 	4 	1
14 	aaa 	g 	4 	2
15 	aaa 	g 	4 	2
16 	aaa 	g 	4 	3
17 	aaa 	g 	4 	3