BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Krislynn742005
Fluorite | Level 6

I have two rows of data that look like this:

 

Assignment                       Players

10_163                              7, 25

10_125                              2, 100

 

I need them to look like this in the end:

Assignment                       Players

10_125_163                      2, 7, 25, 100

 

I know that I could write a simple line of code that would take care of this.  However, these two rows are from a 100,000+ row data set and I need a way to combine all the rows that may have similar assignment numbers.  Any advice is welcome.

 

Thanks

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Below is a more elementary solution, based on an iterative algorithm, developed independently of @Ksharp's solution.
Limitation: If the sum of group numbers g becomes too large (about 9.0E15 on Windows systems), a warning is issued and the criterion "&n1=&n2" should be replaced, e.g., by "equality of t1 and t2 up to sort order" (requires amended code).

 

@Krislynn742005: So, you can test both solutions and compare the run times as well as the results (please note, however, that the sort order of the lists is likely to differ). Please let us know how they performed on the real data.

 

/* Create test data */

data have;
input a p;
cards;
1 7
2 7
10 11
10 12
10 13
10 15
10 19
10 20
24 7
163 8
163 9
163 11
163 14
164 10
164 15
164 16
164 17
164 21
174 11
1063 7
1356 7
3080 6
3099 5
3099 6
3099 18
3099 23
3112 6
3112 18
4947 1
4947 2
4947 3
4947 4
4947 22
4947 24
4948 2
4948 22
4948 24
4949 3
4950 1
4950 4
;


/* Macro to perform the iterative algorithm */

%macro iter( dsin =have  /* input dataset                              */
           , dsout=want  /* output dataset                             */
           , v1=a        /* variable name for assignment IDs           */
           , v2=p        /* variable name for player IDs               */
           , d1=_        /* delimiter for final list of assignment IDs */
           , d2=%str(,)  /* delimiter for final list of player IDs     */
           );
%local i n1 n2;

%let i=1;

proc sort data=&dsin presorted;
by &v1;
run;

/* Assign initial group numbers g */

data t1;
set &dsin;
by &v1;
g+first.&v1;
run;

proc sql noprint;        /* The sum of all group numbers */
select sum(g) into :n1   /* is used to detect changes    */
from t1;                 /* after reassigning groups.    */
quit;

%if &n1>=%sysfunc(constant(EXACTINT))
%then %put %str(WAR)NING: Sum of group numbers has become too large. Results may be incorrect!;

%do %until(&n1=&n2);
  %let i=%eval(3-&i); /* toggle i between 1 and 2 */

  proc sql noprint;         /* If one player has been assigned   */
  create table t&i as       /* multiple group numbers, they are  */
  select &v1, &v2, min(g) as g  /* replaced by their minimum.    */
  from t%eval(3-&i)         /* In the next iteration the same    */
  group by &&v&i;           /* procedure is applied to assign-   */
                            /* ment IDs rather than player IDs.  */
  select sum(g) into :n&i   /* And so on, alternating, until     */
  from t&i;                 /* no reassignments occur, i.e., t1  */
  quit;                     /* and t2 are equal up to sort order */
%end;                       /* and hence &n1=&n2.                */

%do i=1 %to 2;              /* Sorted lists of distinct assign-  */
  proc sql;                 /* ment and player IDs are prepared. */
  create table u&i as
  select distinct g, &&v&i
  from t&i;
  quit;

  data r&i;                 /* Finally, the delimited lists of  */
  do until(last.g);         /* IDs are created for reporting.   */
    set u&i;
    by g;
    length &&v&i..list $32767; /* length might be reduced carefully */
    &&v&i..list=catx("&&d&i", &&v&i..list, &&v&i);
  end;
  drop &&v&i;
  run;
%end;

/* Create result dataset */

data &dsout;
merge r1 r2;
by g;
drop g;
label &v1.list='Assignment ID'
      &v2.list='Players';
run;
%mend iter;

%iter;

proc print data=want noobs label;
run;

Edit: Introduced macro parameters replacing previously hardcoded local macro variables and added sort step at the beginning.

 

View solution in original post

51 REPLIES 51
ballardw
Super User

What are the rules determining "similar assignment numbers"?

 

I'm also curious what will be done with the resulting data.

Krislynn742005
Fluorite | Level 6

The end purpose is for scheduling.

 

Here is the business problem.

 

I have 100,000+ assignments and 10,000+ players.  Assignments can have multiple unique players assigned to them.  players can be in multiple assignments.

 

I have to schedule all of the assignments and players together.  So if an assignment has 5 players they all need to be scheduled on January 1st.  However, if those 5 players also fall into other assignments, they need to be scheduled on January 1st as well.

 

All players in an assignment and all assignments with mutual estids need to be scheduled at the same time.  So if looking at this data set, the final answer should be below.

Assignment ID

Player

1

7

2

7

10

11

10

12

10

13

10

15

10

19

10

20

24

7

163

8

163

9

163

11

163

14

164

10

164

15

164

16

164

17

164

21

174

11

1063

7

1356

7

3080

6

3099

5

3099

6

3099

18

3099

23

3112

6

3112

18

4947

1

4947

2

4947

3

4947

4

4947

22

4947

24

4948

2

4948

22

4948

24

4949

3

4950

1

4950

4

 

 

Assignment ID

Players

1_2_24_1063_1356

7

10_163_164_174

8,9,10,11,12,13,14,15,16,17,19,20,21

3080_3099_3112

5,6,18,23

4947_4948_4949_4950

1,2,3,4,22,24

andreas_lds
Jade | Level 19

I am not sure whether the following solution performs as good as the already posted ones, but at least it is short. The want-dataset needs some final tweaking to get the requested output-format, maybe i can finish this later.

 

data have;
   infile cards ;
   input assignment player;
cards;
1 7
2 7
10 11
10 12
10 13
10 15
10 19
10 20
24 7
163 8
163 9
163 11
163 14
164 10
164 15
164 16
164 17
164 21
174 11
1063 7
1356 7
3080 6
3099 5
3099 6
3099 18
3099 23
3112 6
3112 18
4947 1
4947 2
4947 3
4947 4
4947 22
4947 24
4948 2
4948 22
4948 24
4949 3
4950 1
4950 4
;
run;


data work.want;
   set work.have;

   Game = .;
run;

proc ds2;

   data _null_;

      declare double g a p;

      declare package sqlstmt refresh(
         'update work.want {options locktable=share} set Game=? where Assignment=? or player=? 
            or Game in (select Game from work.want {options locktable=share} where Game is not missing and (Assignment=? or player=?))',
         [g a p a p]
      );

      declare package sqlstmt getter(
         'select * from work.want {options locktable=share}'
      );


      declare double nextGame;
      retain nextGame;

      method init();
         nextGame = 1;
      end;


      method run();
         getter.execute();
         getter.bindresults([a p g]);

         do while (getter.fetch() = 0);

            if missing(g) then do;
               g = nextGame;
               nextGame = nextGame + 1;
            end;

            refresh.execute();
         end;
      end;
   enddata;

run;quit;


proc report data=work.want spanrows missing;
   columns Game assignment player;
   define Game / order;
run;
FreelanceReinh
Jade | Level 19

Hello @Krislynn742005,

 

Character variables containing lists of values are not well suited for further processing. Your example is a case in point: To obtain the result, it is necessary to take the original lists apart, perform some processing, and reassemble the values.


In many cases it makes life much easier to have the individual values in distinct observations. Lists can be created for reporting purposes.

LinusH
Tourmaline | Level 20
Sorry, but I think that you are destroying your data - making it less useful for other uses.
IMO you should go the other direction. Split the observationsame so that each player has it's own row. Then it will be simpler to do any type of reports.
Data never sleeps
Krislynn742005
Fluorite | Level 6

This is for scheduling and in order to the scheduling I need to see all of the assignment/player combinations as a single item.  There is no other use for this data after this is done.

FreelanceReinh
Jade | Level 19

Thanks, @Krislynn742005, for the clarification.

 

So, this has turned out to be a quite interesting algorithmic problem. It should attract experts familiar with the hash object or SAS/IML. Therefore, I could imagine that when I return tomorrow (Central European Time) @Ksharp (from China) will have already provided a brilliant solution. Otherwise, I think I'd be able to come up, e.g., with an array-based solution, which would work for the sample data, but which might not scale well to the full data.

 

Essentially, it's about assigning a group ID to each assignment ID and each player ID. As the last step, the sorted lists of assignment IDs and player IDs with the same group ID could be written to a report.

 

Ksharp
Super User
@FreelanceReinhard
You are trying to push me and give me pressure yet ??  :-)
But I love to solve the tough question.

Assuming I understand what OP want.
BTW, It is real tough question.






data have;
infile cards ;
input from $  to $ ;
cards;
1 7
2 7
10 11
10 12
10 13
10 15
10 19
10 20
24 7
163 8
163 9
163 11
163 14
164 10
164 15
164 16
164 17
164 21
174 11
1063 7
1356 7
3080 6
3099 5
3099 6
3099 18
3099 23
3112 6
3112 18
4947 1
4947 2
4947 3
4947 4
4947 22
4947 24
4948 2
4948 22
4948 24
4949 3
4950 1
4950 4
;
run;

data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node','flag');
     h.definedata('node','flag');
     h.definedone();
  end;
  flag=-1;output;
  node=from; h.replace();
  from=to; to=node;
  flag=1;output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household flag);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last','flag');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key','flag');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from','flag');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node','flag');
 no.definedata('node','flag');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1;output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();_flag=flag;
   do while(rx=0);
     
      key=to;flag=flag*-1;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove();
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;    
      flag=_flag;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

proc sort data=want;
 by household flag;
run;
data want_temp;
 set want;
 by household flag;
 length temp $ 32767 id $ 40;
 retain temp;
 temp=catx(',',temp,node);
 if last.flag then do;
  id=ifc(flag=-1,'Assignment_ID','Players');
  output;
  call missing(temp);
 end;
 drop node flag;
run;
proc transpose data=want_temp out=want_final(drop=_name_);
by household ;
id id;
var temp;
run;


FreelanceReinh
Jade | Level 19

@Ksharp: Brilliant! I knew this would be just the right type of challenge for you.

Looking forward to the day when I'll be able to write something like this code ...

FreelanceReinh
Jade | Level 19

Below is a more elementary solution, based on an iterative algorithm, developed independently of @Ksharp's solution.
Limitation: If the sum of group numbers g becomes too large (about 9.0E15 on Windows systems), a warning is issued and the criterion "&n1=&n2" should be replaced, e.g., by "equality of t1 and t2 up to sort order" (requires amended code).

 

@Krislynn742005: So, you can test both solutions and compare the run times as well as the results (please note, however, that the sort order of the lists is likely to differ). Please let us know how they performed on the real data.

 

/* Create test data */

data have;
input a p;
cards;
1 7
2 7
10 11
10 12
10 13
10 15
10 19
10 20
24 7
163 8
163 9
163 11
163 14
164 10
164 15
164 16
164 17
164 21
174 11
1063 7
1356 7
3080 6
3099 5
3099 6
3099 18
3099 23
3112 6
3112 18
4947 1
4947 2
4947 3
4947 4
4947 22
4947 24
4948 2
4948 22
4948 24
4949 3
4950 1
4950 4
;


/* Macro to perform the iterative algorithm */

%macro iter( dsin =have  /* input dataset                              */
           , dsout=want  /* output dataset                             */
           , v1=a        /* variable name for assignment IDs           */
           , v2=p        /* variable name for player IDs               */
           , d1=_        /* delimiter for final list of assignment IDs */
           , d2=%str(,)  /* delimiter for final list of player IDs     */
           );
%local i n1 n2;

%let i=1;

proc sort data=&dsin presorted;
by &v1;
run;

/* Assign initial group numbers g */

data t1;
set &dsin;
by &v1;
g+first.&v1;
run;

proc sql noprint;        /* The sum of all group numbers */
select sum(g) into :n1   /* is used to detect changes    */
from t1;                 /* after reassigning groups.    */
quit;

%if &n1>=%sysfunc(constant(EXACTINT))
%then %put %str(WAR)NING: Sum of group numbers has become too large. Results may be incorrect!;

%do %until(&n1=&n2);
  %let i=%eval(3-&i); /* toggle i between 1 and 2 */

  proc sql noprint;         /* If one player has been assigned   */
  create table t&i as       /* multiple group numbers, they are  */
  select &v1, &v2, min(g) as g  /* replaced by their minimum.    */
  from t%eval(3-&i)         /* In the next iteration the same    */
  group by &&v&i;           /* procedure is applied to assign-   */
                            /* ment IDs rather than player IDs.  */
  select sum(g) into :n&i   /* And so on, alternating, until     */
  from t&i;                 /* no reassignments occur, i.e., t1  */
  quit;                     /* and t2 are equal up to sort order */
%end;                       /* and hence &n1=&n2.                */

%do i=1 %to 2;              /* Sorted lists of distinct assign-  */
  proc sql;                 /* ment and player IDs are prepared. */
  create table u&i as
  select distinct g, &&v&i
  from t&i;
  quit;

  data r&i;                 /* Finally, the delimited lists of  */
  do until(last.g);         /* IDs are created for reporting.   */
    set u&i;
    by g;
    length &&v&i..list $32767; /* length might be reduced carefully */
    &&v&i..list=catx("&&d&i", &&v&i..list, &&v&i);
  end;
  drop &&v&i;
  run;
%end;

/* Create result dataset */

data &dsout;
merge r1 r2;
by g;
drop g;
label &v1.list='Assignment ID'
      &v2.list='Players';
run;
%mend iter;

%iter;

proc print data=want noobs label;
run;

Edit: Introduced macro parameters replacing previously hardcoded local macro variables and added sort step at the beginning.

 

Krislynn742005
Fluorite | Level 6

GENIUS! 

 

Both of them are genius!  Although I will admit I went with the second "more elementary" method than the first.  My brain almost exploded trying to figure out the first.  All of you are amazing and I can only hope to get even to a tenth of your level!!!

Krislynn742005
Fluorite | Level 6

I need some more help.  I now need to add scores in and sum them.  Where would I put that in this iterative macro?  My data is as follows:

 

Assignment          Player           ScoreA    ScoreB    ScoreC

A                          2                    10            20             30

B                          2                    10            20             30

C                          3                    5             10             15

D                          3                    5             10             15

E                          4                    2             4               30

F                          4                    2             4               30

G                         1                    100             200       30

H                         1                    100             200       30

I                           1                    100             200       30

A                          5                    10             20           30

B                          6                    10             20           30

C                          6                    10             20           30

D                          7                    5             10              30

E                          8                    2              4               30

F                          9                    100           200           30

G                         9                    100           200           30

H                         10                  100           200            30

I                          10                   100           200            30

 

Thanks in advance.

FreelanceReinh
Jade | Level 19

Hi @Krislynn742005,

 

Your "data" suggests that the scores depend only on the player: e.g., player 1 has scores (100, 200, 30) regardless of the assignment (G, H or I). Correct?

 

In the original task we divided players (and assignments) into groups. What are the rules for summing scores? For example, would the scores for players 1, 2, 3, 4, 22 and 24 (who comprised group 4 in your sample data of March 23) be added to obtain, e.g., "SumScoreA" for group 4? That is, SumScoreA(group 4) = 100 + 10 + 5 + 2 + ScoreA(player 22) + ScoreA(player 24)?

 

Or are the assigments involved in the calculation somehow?

Krislynn742005
Fluorite | Level 6

Hi FreelanceReinhard,

 

Yes, the scores are the same for each player regardless of what assignment they are in.  I need to do exactly what the code did before except now I need to total the scores for all of the players in an assignment.

 

We want to sum the scores for all players in an assignment group. So in the data set that I sent earlier today the final result should be:

assignment           player           ScoreA           ScoreB           ScoreC

A_B_C_D             2,3,5,6,7       65                   130                 195

E_F_G_H_I          1,4,8,9,10     706                  1412               2118

 

I still need all players in an assignment and all assignments with mutual players to be grouped together.  But now we have to sum the three scores for all of the players in each assignment/player group.

 

I hope this answered your question.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 51 replies
  • 2236 views
  • 6 likes
  • 6 in conversation