Solved: Combining rows of data

Krislynn742005 · Posted 03-23-2016 03:11 PM

I have two rows of data that look like this:

Assignment Players

10_163 7, 25

10_125 2, 100

I need them to look like this in the end:

Assignment Players

10_125_163 2, 7, 25, 100

I know that I could write a simple line of code that would take care of this. However, these two rows are from a 100,000+ row data set and I need a way to combine all the rows that may have similar assignment numbers. Any advice is welcome.

Thanks

FreelanceReinh · Posted 03-25-2016 09:17 AM

Below is a more elementary solution, based on an iterative algorithm, developed independently of @Ksharp's solution.
Limitation: If the sum of group numbers g becomes too large (about 9.0E15 on Windows systems), a warning is issued and the criterion "&n1=&n2" should be replaced, e.g., by "equality of t1 and t2 up to sort order" (requires amended code).

@Krislynn742005: So, you can test both solutions and compare the run times as well as the results (please note, however, that the sort order of the lists is likely to differ). Please let us know how they performed on the real data.

/* Create test data */

data have;
input a p;
cards;
1 7
2 7
10 11
10 12
10 13
10 15
10 19
10 20
24 7
163 8
163 9
163 11
163 14
164 10
164 15
164 16
164 17
164 21
174 11
1063 7
1356 7
3080 6
3099 5
3099 6
3099 18
3099 23
3112 6
3112 18
4947 1
4947 2
4947 3
4947 4
4947 22
4947 24
4948 2
4948 22
4948 24
4949 3
4950 1
4950 4
;


/* Macro to perform the iterative algorithm */

%macro iter( dsin =have  /* input dataset                              */
           , dsout=want  /* output dataset                             */
           , v1=a        /* variable name for assignment IDs           */
           , v2=p        /* variable name for player IDs               */
           , d1=_        /* delimiter for final list of assignment IDs */
           , d2=%str(,)  /* delimiter for final list of player IDs     */
           );
%local i n1 n2;

%let i=1;

proc sort data=&dsin presorted;
by &v1;
run;

/* Assign initial group numbers g */

data t1;
set &dsin;
by &v1;
g+first.&v1;
run;

proc sql noprint;        /* The sum of all group numbers */
select sum(g) into :n1   /* is used to detect changes    */
from t1;                 /* after reassigning groups.    */
quit;

%if &n1>=%sysfunc(constant(EXACTINT))
%then %put %str(WAR)NING: Sum of group numbers has become too large. Results may be incorrect!;

%do %until(&n1=&n2);
  %let i=%eval(3-&i); /* toggle i between 1 and 2 */

  proc sql noprint;         /* If one player has been assigned   */
  create table t&i as       /* multiple group numbers, they are  */
  select &v1, &v2, min(g) as g  /* replaced by their minimum.    */
  from t%eval(3-&i)         /* In the next iteration the same    */
  group by &&v&i;           /* procedure is applied to assign-   */
                            /* ment IDs rather than player IDs.  */
  select sum(g) into :n&i   /* And so on, alternating, until     */
  from t&i;                 /* no reassignments occur, i.e., t1  */
  quit;                     /* and t2 are equal up to sort order */
%end;                       /* and hence &n1=&n2.                */

%do i=1 %to 2;              /* Sorted lists of distinct assign-  */
  proc sql;                 /* ment and player IDs are prepared. */
  create table u&i as
  select distinct g, &&v&i
  from t&i;
  quit;

  data r&i;                 /* Finally, the delimited lists of  */
  do until(last.g);         /* IDs are created for reporting.   */
    set u&i;
    by g;
    length &&v&i..list $32767; /* length might be reduced carefully */
    &&v&i..list=catx("&&d&i", &&v&i..list, &&v&i);
  end;
  drop &&v&i;
  run;
%end;

/* Create result dataset */

data &dsout;
merge r1 r2;
by g;
drop g;
label &v1.list='Assignment ID'
      &v2.list='Players';
run;
%mend iter;

%iter;

proc print data=want noobs label;
run;

Edit: Introduced macro parameters replacing previously hardcoded local macro variables and added sort step at the beginning.

View solution in original post

ballardw · Posted 03-23-2016 03:57 PM

What are the rules determining "similar assignment numbers"?

I'm also curious what will be done with the resulting data.

Krislynn742005 · Posted 03-23-2016 04:33 PM

The end purpose is for scheduling.

Here is the business problem.

I have 100,000+ assignments and 10,000+ players. Assignments can have multiple unique players assigned to them. players can be in multiple assignments.

I have to schedule all of the assignments and players together. So if an assignment has 5 players they all need to be scheduled on January 1st. However, if those 5 players also fall into other assignments, they need to be scheduled on January 1st as well.

All players in an assignment and all assignments with mutual estids need to be scheduled at the same time. So if looking at this data set, the final answer should be below.

Assignment ID	Player
1	7
2	7
10	11
10	12
10	13
10	15
10	19
10	20
24	7
163	8
163	9
163	11
163	14
164	10
164	15
164	16
164	17
164	21
174	11
1063	7
1356	7
3080	6
3099	5
3099	6
3099	18
3099	23
3112	6
3112	18
4947	1
4947	2
4947	3
4947	4
4947	22
4947	24
4948	2
4948	22
4948	24
4949	3
4950	1
4950	4

Assignment ID	Players
1_2_24_1063_1356	7
10_163_164_174	8,9,10,11,12,13,14,15,16,17,19,20,21
3080_3099_3112	5,6,18,23
4947_4948_4949_4950	1,2,3,4,22,24

andreas_lds · Posted 03-30-2016 03:23 AM

I am not sure whether the following solution performs as good as the already posted ones, but at least it is short. The want-dataset needs some final tweaking to get the requested output-format, maybe i can finish this later.

data have;
   infile cards ;
   input assignment player;
cards;
1 7
2 7
10 11
10 12
10 13
10 15
10 19
10 20
24 7
163 8
163 9
163 11
163 14
164 10
164 15
164 16
164 17
164 21
174 11
1063 7
1356 7
3080 6
3099 5
3099 6
3099 18
3099 23
3112 6
3112 18
4947 1
4947 2
4947 3
4947 4
4947 22
4947 24
4948 2
4948 22
4948 24
4949 3
4950 1
4950 4
;
run;


data work.want;
   set work.have;

   Game = .;
run;

proc ds2;

   data _null_;

      declare double g a p;

      declare package sqlstmt refresh(
         'update work.want {options locktable=share} set Game=? where Assignment=? or player=? 
            or Game in (select Game from work.want {options locktable=share} where Game is not missing and (Assignment=? or player=?))',
         [g a p a p]
      );

      declare package sqlstmt getter(
         'select * from work.want {options locktable=share}'
      );


      declare double nextGame;
      retain nextGame;

      method init();
         nextGame = 1;
      end;


      method run();
         getter.execute();
         getter.bindresults([a p g]);

         do while (getter.fetch() = 0);

            if missing(g) then do;
               g = nextGame;
               nextGame = nextGame + 1;
            end;

            refresh.execute();
         end;
      end;
   enddata;

run;quit;


proc report data=work.want spanrows missing;
   columns Game assignment player;
   define Game / order;
run;

FreelanceReinh · Posted 03-23-2016 04:29 PM

Hello @Krislynn742005,

Character variables containing lists of values are not well suited for further processing. Your example is a case in point: To obtain the result, it is necessary to take the original lists apart, perform some processing, and reassemble the values.

In many cases it makes life much easier to have the individual values in distinct observations. Lists can be created for reporting purposes.

LinusH · Posted 03-23-2016 04:33 PM

Sorry, but I think that you are destroying your data - making it less useful for other uses.
IMO you should go the other direction. Split the observationsame so that each player has it's own row. Then it will be simpler to do any type of reports.

Data never sleeps

Krislynn742005 · Posted 03-23-2016 04:36 PM

This is for scheduling and in order to the scheduling I need to see all of the assignment/player combinations as a single item. There is no other use for this data after this is done.

FreelanceReinh · Posted 03-23-2016 06:46 PM

Thanks, @Krislynn742005, for the clarification.

So, this has turned out to be a quite interesting algorithmic problem. It should attract experts familiar with the hash object or SAS/IML. Therefore, I could imagine that when I return tomorrow (Central European Time) @Ksharp (from China) will have already provided a brilliant solution. Otherwise, I think I'd be able to come up, e.g., with an array-based solution, which would work for the sample data, but which might not scale well to the full data.

Essentially, it's about assigning a group ID to each assignment ID and each player ID. As the last step, the sorted lists of assignment IDs and player IDs with the same group ID could be written to a report.

Ksharp · Posted 03-24-2016 12:41 AM

@FreelanceReinhard

You are trying to push me and give me pressure yet ??  :-)
But I love to solve the tough question.

Assuming I understand what OP want.
BTW, It is real tough question.






data have;
infile cards ;
input from $  to $ ;
cards;
1 7
2 7
10 11
10 12
10 13
10 15
10 19
10 20
24 7
163 8
163 9
163 11
163 14
164 10
164 15
164 16
164 17
164 21
174 11
1063 7
1356 7
3080 6
3099 5
3099 6
3099 18
3099 23
3112 6
3112 18
4947 1
4947 2
4947 3
4947 4
4947 22
4947 24
4948 2
4948 22
4948 24
4949 3
4950 1
4950 4
;
run;

data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node','flag');
     h.definedata('node','flag');
     h.definedone();
  end;
  flag=-1;output;
  node=from; h.replace();
  from=to; to=node;
  flag=1;output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household flag);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last','flag');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key','flag');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata&colon;'y');
 from_to.definekey('from','flag');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node','flag');
 no.definedata('node','flag');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1;output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();_flag=flag;
   do while(rx=0);
     
      key=to;flag=flag*-1;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove();
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;    
      flag=_flag;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

proc sort data=want;
 by household flag;
run;
data want_temp;
 set want;
 by household flag;
 length temp $ 32767 id $ 40;
 retain temp;
 temp=catx(',',temp,node);
 if last.flag then do;
  id=ifc(flag=-1,'Assignment_ID','Players');
  output;
  call missing(temp);
 end;
 drop node flag;
run;
proc transpose data=want_temp out=want_final(drop=_name_);
by household ;
id id;
var temp;
run;

FreelanceReinh · Posted 03-24-2016 06:39 AM

@Ksharp: Brilliant! I knew this would be just the right type of challenge for you.

Looking forward to the day when I'll be able to write something like this code ...

FreelanceReinh · Posted 03-25-2016 09:17 AM

Below is a more elementary solution, based on an iterative algorithm, developed independently of @Ksharp's solution.
Limitation: If the sum of group numbers g becomes too large (about 9.0E15 on Windows systems), a warning is issued and the criterion "&n1=&n2" should be replaced, e.g., by "equality of t1 and t2 up to sort order" (requires amended code).

@Krislynn742005: So, you can test both solutions and compare the run times as well as the results (please note, however, that the sort order of the lists is likely to differ). Please let us know how they performed on the real data.

/* Create test data */

data have;
input a p;
cards;
1 7
2 7
10 11
10 12
10 13
10 15
10 19
10 20
24 7
163 8
163 9
163 11
163 14
164 10
164 15
164 16
164 17
164 21
174 11
1063 7
1356 7
3080 6
3099 5
3099 6
3099 18
3099 23
3112 6
3112 18
4947 1
4947 2
4947 3
4947 4
4947 22
4947 24
4948 2
4948 22
4948 24
4949 3
4950 1
4950 4
;


/* Macro to perform the iterative algorithm */

%macro iter( dsin =have  /* input dataset                              */
           , dsout=want  /* output dataset                             */
           , v1=a        /* variable name for assignment IDs           */
           , v2=p        /* variable name for player IDs               */
           , d1=_        /* delimiter for final list of assignment IDs */
           , d2=%str(,)  /* delimiter for final list of player IDs     */
           );
%local i n1 n2;

%let i=1;

proc sort data=&dsin presorted;
by &v1;
run;

/* Assign initial group numbers g */

data t1;
set &dsin;
by &v1;
g+first.&v1;
run;

proc sql noprint;        /* The sum of all group numbers */
select sum(g) into :n1   /* is used to detect changes    */
from t1;                 /* after reassigning groups.    */
quit;

%if &n1>=%sysfunc(constant(EXACTINT))
%then %put %str(WAR)NING: Sum of group numbers has become too large. Results may be incorrect!;

%do %until(&n1=&n2);
  %let i=%eval(3-&i); /* toggle i between 1 and 2 */

  proc sql noprint;         /* If one player has been assigned   */
  create table t&i as       /* multiple group numbers, they are  */
  select &v1, &v2, min(g) as g  /* replaced by their minimum.    */
  from t%eval(3-&i)         /* In the next iteration the same    */
  group by &&v&i;           /* procedure is applied to assign-   */
                            /* ment IDs rather than player IDs.  */
  select sum(g) into :n&i   /* And so on, alternating, until     */
  from t&i;                 /* no reassignments occur, i.e., t1  */
  quit;                     /* and t2 are equal up to sort order */
%end;                       /* and hence &n1=&n2.                */

%do i=1 %to 2;              /* Sorted lists of distinct assign-  */
  proc sql;                 /* ment and player IDs are prepared. */
  create table u&i as
  select distinct g, &&v&i
  from t&i;
  quit;

  data r&i;                 /* Finally, the delimited lists of  */
  do until(last.g);         /* IDs are created for reporting.   */
    set u&i;
    by g;
    length &&v&i..list $32767; /* length might be reduced carefully */
    &&v&i..list=catx("&&d&i", &&v&i..list, &&v&i);
  end;
  drop &&v&i;
  run;
%end;

/* Create result dataset */

data &dsout;
merge r1 r2;
by g;
drop g;
label &v1.list='Assignment ID'
      &v2.list='Players';
run;
%mend iter;

%iter;

proc print data=want noobs label;
run;

Edit: Introduced macro parameters replacing previously hardcoded local macro variables and added sort step at the beginning.

Krislynn742005 · Posted 03-29-2016 01:57 PM

GENIUS!

Both of them are genius! Although I will admit I went with the second "more elementary" method than the first. My brain almost exploded trying to figure out the first. All of you are amazing and I can only hope to get even to a tenth of your level!!!

Krislynn742005 · Posted 05-17-2016 11:56 AM

I need some more help. I now need to add scores in and sum them. Where would I put that in this iterative macro? My data is as follows:

Assignment Player ScoreA ScoreB ScoreC

A 2 10 20 30

B 2 10 20 30

C 3 5 10 15

D 3 5 10 15

E 4 2 4 30

F 4 2 4 30

G 1 100 200 30

H 1 100 200 30

I 1 100 200 30

A 5 10 20 30

B 6 10 20 30

C 6 10 20 30

D 7 5 10 30

E 8 2 4 30

F 9 100 200 30

G 9 100 200 30

H 10 100 200 30

I 10 100 200 30

Thanks in advance.

FreelanceReinh · Posted 05-17-2016 04:28 PM

Hi @Krislynn742005,

Your "data" suggests that the scores depend only on the player: e.g., player 1 has scores (100, 200, 30) regardless of the assignment (G, H or I). Correct?

In the original task we divided players (and assignments) into groups. What are the rules for summing scores? For example, would the scores for players 1, 2, 3, 4, 22 and 24 (who comprised group 4 in your sample data of March 23) be added to obtain, e.g., "SumScoreA" for group 4? That is, SumScoreA(group 4) = 100 + 10 + 5 + 2 + ScoreA(player 22) + ScoreA(player 24)?

Or are the assigments involved in the calculation somehow?

Krislynn742005 · Posted 05-17-2016 04:41 PM

Hi FreelanceReinhard,

Yes, the scores are the same for each player regardless of what assignment they are in. I need to do exactly what the code did before except now I need to total the scores for all of the players in an assignment.

We want to sum the scores for all players in an assignment group. So in the data set that I sent earlier today the final result should be:

assignment player ScoreA ScoreB ScoreC

A_B_C_D 2,3,5,6,7 65 130 195

E_F_G_H_I 1,4,8,9,10 706 1412 2118

I still need all players in an assignment and all assignments with mutual players to be grouped together. But now we have to sum the three scores for all of the players in each assignment/player group.

I hope this answered your question.

Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Re: Combining rows of data

Register Today!

SAS Training: Just a Click Away