I have two rows of data that look like this:
Assignment Players
10_163 7, 25
10_125 2, 100
I need them to look like this in the end:
Assignment Players
10_125_163 2, 7, 25, 100
I know that I could write a simple line of code that would take care of this. However, these two rows are from a 100,000+ row data set and I need a way to combine all the rows that may have similar assignment numbers. Any advice is welcome.
Thanks
Below is a more elementary solution, based on an iterative algorithm, developed independently of @Ksharp's solution.
Limitation: If the sum of group numbers g becomes too large (about 9.0E15 on Windows systems), a warning is issued and the criterion "&n1=&n2" should be replaced, e.g., by "equality of t1 and t2 up to sort order" (requires amended code).
@Krislynn742005: So, you can test both solutions and compare the run times as well as the results (please note, however, that the sort order of the lists is likely to differ). Please let us know how they performed on the real data.
/* Create test data */
data have;
input a p;
cards;
1 7
2 7
10 11
10 12
10 13
10 15
10 19
10 20
24 7
163 8
163 9
163 11
163 14
164 10
164 15
164 16
164 17
164 21
174 11
1063 7
1356 7
3080 6
3099 5
3099 6
3099 18
3099 23
3112 6
3112 18
4947 1
4947 2
4947 3
4947 4
4947 22
4947 24
4948 2
4948 22
4948 24
4949 3
4950 1
4950 4
;
/* Macro to perform the iterative algorithm */
%macro iter( dsin =have /* input dataset */
, dsout=want /* output dataset */
, v1=a /* variable name for assignment IDs */
, v2=p /* variable name for player IDs */
, d1=_ /* delimiter for final list of assignment IDs */
, d2=%str(,) /* delimiter for final list of player IDs */
);
%local i n1 n2;
%let i=1;
proc sort data=&dsin presorted;
by &v1;
run;
/* Assign initial group numbers g */
data t1;
set &dsin;
by &v1;
g+first.&v1;
run;
proc sql noprint; /* The sum of all group numbers */
select sum(g) into :n1 /* is used to detect changes */
from t1; /* after reassigning groups. */
quit;
%if &n1>=%sysfunc(constant(EXACTINT))
%then %put %str(WAR)NING: Sum of group numbers has become too large. Results may be incorrect!;
%do %until(&n1=&n2);
%let i=%eval(3-&i); /* toggle i between 1 and 2 */
proc sql noprint; /* If one player has been assigned */
create table t&i as /* multiple group numbers, they are */
select &v1, &v2, min(g) as g /* replaced by their minimum. */
from t%eval(3-&i) /* In the next iteration the same */
group by &&v&i; /* procedure is applied to assign- */
/* ment IDs rather than player IDs. */
select sum(g) into :n&i /* And so on, alternating, until */
from t&i; /* no reassignments occur, i.e., t1 */
quit; /* and t2 are equal up to sort order */
%end; /* and hence &n1=&n2. */
%do i=1 %to 2; /* Sorted lists of distinct assign- */
proc sql; /* ment and player IDs are prepared. */
create table u&i as
select distinct g, &&v&i
from t&i;
quit;
data r&i; /* Finally, the delimited lists of */
do until(last.g); /* IDs are created for reporting. */
set u&i;
by g;
length &&v&i..list $32767; /* length might be reduced carefully */
&&v&i..list=catx("&&d&i", &&v&i..list, &&v&i);
end;
drop &&v&i;
run;
%end;
/* Create result dataset */
data &dsout;
merge r1 r2;
by g;
drop g;
label &v1.list='Assignment ID'
&v2.list='Players';
run;
%mend iter;
%iter;
proc print data=want noobs label;
run;
Edit: Introduced macro parameters replacing previously hardcoded local macro variables and added sort step at the beginning.
What are the rules determining "similar assignment numbers"?
I'm also curious what will be done with the resulting data.
The end purpose is for scheduling.
Here is the business problem.
I have 100,000+ assignments and 10,000+ players. Assignments can have multiple unique players assigned to them. players can be in multiple assignments.
I have to schedule all of the assignments and players together. So if an assignment has 5 players they all need to be scheduled on January 1st. However, if those 5 players also fall into other assignments, they need to be scheduled on January 1st as well.
All players in an assignment and all assignments with mutual estids need to be scheduled at the same time. So if looking at this data set, the final answer should be below.
Assignment ID | Player |
1 | 7 |
2 | 7 |
10 | 11 |
10 | 12 |
10 | 13 |
10 | 15 |
10 | 19 |
10 | 20 |
24 | 7 |
163 | 8 |
163 | 9 |
163 | 11 |
163 | 14 |
164 | 10 |
164 | 15 |
164 | 16 |
164 | 17 |
164 | 21 |
174 | 11 |
1063 | 7 |
1356 | 7 |
3080 | 6 |
3099 | 5 |
3099 | 6 |
3099 | 18 |
3099 | 23 |
3112 | 6 |
3112 | 18 |
4947 | 1 |
4947 | 2 |
4947 | 3 |
4947 | 4 |
4947 | 22 |
4947 | 24 |
4948 | 2 |
4948 | 22 |
4948 | 24 |
4949 | 3 |
4950 | 1 |
4950 | 4 |
Assignment ID | Players |
1_2_24_1063_1356 | 7 |
10_163_164_174 | 8,9,10,11,12,13,14,15,16,17,19,20,21 |
3080_3099_3112 | 5,6,18,23 |
4947_4948_4949_4950 | 1,2,3,4,22,24 |
I am not sure whether the following solution performs as good as the already posted ones, but at least it is short. The want-dataset needs some final tweaking to get the requested output-format, maybe i can finish this later.
data have;
infile cards ;
input assignment player;
cards;
1 7
2 7
10 11
10 12
10 13
10 15
10 19
10 20
24 7
163 8
163 9
163 11
163 14
164 10
164 15
164 16
164 17
164 21
174 11
1063 7
1356 7
3080 6
3099 5
3099 6
3099 18
3099 23
3112 6
3112 18
4947 1
4947 2
4947 3
4947 4
4947 22
4947 24
4948 2
4948 22
4948 24
4949 3
4950 1
4950 4
;
run;
data work.want;
set work.have;
Game = .;
run;
proc ds2;
data _null_;
declare double g a p;
declare package sqlstmt refresh(
'update work.want {options locktable=share} set Game=? where Assignment=? or player=?
or Game in (select Game from work.want {options locktable=share} where Game is not missing and (Assignment=? or player=?))',
[g a p a p]
);
declare package sqlstmt getter(
'select * from work.want {options locktable=share}'
);
declare double nextGame;
retain nextGame;
method init();
nextGame = 1;
end;
method run();
getter.execute();
getter.bindresults([a p g]);
do while (getter.fetch() = 0);
if missing(g) then do;
g = nextGame;
nextGame = nextGame + 1;
end;
refresh.execute();
end;
end;
enddata;
run;quit;
proc report data=work.want spanrows missing;
columns Game assignment player;
define Game / order;
run;
Hello @Krislynn742005,
Character variables containing lists of values are not well suited for further processing. Your example is a case in point: To obtain the result, it is necessary to take the original lists apart, perform some processing, and reassemble the values.
In many cases it makes life much easier to have the individual values in distinct observations. Lists can be created for reporting purposes.
This is for scheduling and in order to the scheduling I need to see all of the assignment/player combinations as a single item. There is no other use for this data after this is done.
Thanks, @Krislynn742005, for the clarification.
So, this has turned out to be a quite interesting algorithmic problem. It should attract experts familiar with the hash object or SAS/IML. Therefore, I could imagine that when I return tomorrow (Central European Time) @Ksharp (from China) will have already provided a brilliant solution. Otherwise, I think I'd be able to come up, e.g., with an array-based solution, which would work for the sample data, but which might not scale well to the full data.
Essentially, it's about assigning a group ID to each assignment ID and each player ID. As the last step, the sorted lists of assignment IDs and player IDs with the same group ID could be written to a report.
You are trying to push me and give me pressure yet ?? :-) But I love to solve the tough question. Assuming I understand what OP want. BTW, It is real tough question. data have; infile cards ; input from $ to $ ; cards; 1 7 2 7 10 11 10 12 10 13 10 15 10 19 10 20 24 7 163 8 163 9 163 11 163 14 164 10 164 15 164 16 164 17 164 21 174 11 1063 7 1356 7 3080 6 3099 5 3099 6 3099 18 3099 23 3112 6 3112 18 4947 1 4947 2 4947 3 4947 4 4947 22 4947 24 4948 2 4948 22 4948 24 4949 3 4950 1 4950 4 ; run; data full; set have end=last; if _n_ eq 1 then do; declare hash h(); h.definekey('node','flag'); h.definedata('node','flag'); h.definedone(); end; flag=-1;output; node=from; h.replace(); from=to; to=node; flag=1;output; node=from; h.replace(); if last then h.output(dataset:'node'); drop node; run; data want(keep=node household flag); declare hash ha(ordered:'a'); declare hiter hi('ha'); ha.definekey('count'); ha.definedata('last','flag'); ha.definedone(); declare hash _ha(hashexp: 20); _ha.definekey('key','flag'); _ha.definedone(); if 0 then set full; declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y'); from_to.definekey('from','flag'); from_to.definedata('to'); from_to.definedone(); if 0 then set node; declare hash no(dataset:'node'); declare hiter hi_no('no'); no.definekey('node','flag'); no.definedata('node','flag'); no.definedone(); do while(hi_no.next()=0); household+1;output; count=1; key=node;_ha.add(); last=node;ha.add(); rc=hi.first(); do while(rc=0); from=last;rx=from_to.find();_flag=flag; do while(rx=0); key=to;flag=flag*-1;ry=_ha.check(); if ry ne 0 then do; node=to;output;rr=no.remove(); key=to;_ha.add(); count+1; last=to;ha.add(); end; flag=_flag; rx=from_to.find_next(); end; rc=hi.next(); end; ha.clear();_ha.clear(); end; stop; run; proc sort data=want; by household flag; run; data want_temp; set want; by household flag; length temp $ 32767 id $ 40; retain temp; temp=catx(',',temp,node); if last.flag then do; id=ifc(flag=-1,'Assignment_ID','Players'); output; call missing(temp); end; drop node flag; run; proc transpose data=want_temp out=want_final(drop=_name_); by household ; id id; var temp; run;
@Ksharp: Brilliant! I knew this would be just the right type of challenge for you.
Looking forward to the day when I'll be able to write something like this code ...
Below is a more elementary solution, based on an iterative algorithm, developed independently of @Ksharp's solution.
Limitation: If the sum of group numbers g becomes too large (about 9.0E15 on Windows systems), a warning is issued and the criterion "&n1=&n2" should be replaced, e.g., by "equality of t1 and t2 up to sort order" (requires amended code).
@Krislynn742005: So, you can test both solutions and compare the run times as well as the results (please note, however, that the sort order of the lists is likely to differ). Please let us know how they performed on the real data.
/* Create test data */
data have;
input a p;
cards;
1 7
2 7
10 11
10 12
10 13
10 15
10 19
10 20
24 7
163 8
163 9
163 11
163 14
164 10
164 15
164 16
164 17
164 21
174 11
1063 7
1356 7
3080 6
3099 5
3099 6
3099 18
3099 23
3112 6
3112 18
4947 1
4947 2
4947 3
4947 4
4947 22
4947 24
4948 2
4948 22
4948 24
4949 3
4950 1
4950 4
;
/* Macro to perform the iterative algorithm */
%macro iter( dsin =have /* input dataset */
, dsout=want /* output dataset */
, v1=a /* variable name for assignment IDs */
, v2=p /* variable name for player IDs */
, d1=_ /* delimiter for final list of assignment IDs */
, d2=%str(,) /* delimiter for final list of player IDs */
);
%local i n1 n2;
%let i=1;
proc sort data=&dsin presorted;
by &v1;
run;
/* Assign initial group numbers g */
data t1;
set &dsin;
by &v1;
g+first.&v1;
run;
proc sql noprint; /* The sum of all group numbers */
select sum(g) into :n1 /* is used to detect changes */
from t1; /* after reassigning groups. */
quit;
%if &n1>=%sysfunc(constant(EXACTINT))
%then %put %str(WAR)NING: Sum of group numbers has become too large. Results may be incorrect!;
%do %until(&n1=&n2);
%let i=%eval(3-&i); /* toggle i between 1 and 2 */
proc sql noprint; /* If one player has been assigned */
create table t&i as /* multiple group numbers, they are */
select &v1, &v2, min(g) as g /* replaced by their minimum. */
from t%eval(3-&i) /* In the next iteration the same */
group by &&v&i; /* procedure is applied to assign- */
/* ment IDs rather than player IDs. */
select sum(g) into :n&i /* And so on, alternating, until */
from t&i; /* no reassignments occur, i.e., t1 */
quit; /* and t2 are equal up to sort order */
%end; /* and hence &n1=&n2. */
%do i=1 %to 2; /* Sorted lists of distinct assign- */
proc sql; /* ment and player IDs are prepared. */
create table u&i as
select distinct g, &&v&i
from t&i;
quit;
data r&i; /* Finally, the delimited lists of */
do until(last.g); /* IDs are created for reporting. */
set u&i;
by g;
length &&v&i..list $32767; /* length might be reduced carefully */
&&v&i..list=catx("&&d&i", &&v&i..list, &&v&i);
end;
drop &&v&i;
run;
%end;
/* Create result dataset */
data &dsout;
merge r1 r2;
by g;
drop g;
label &v1.list='Assignment ID'
&v2.list='Players';
run;
%mend iter;
%iter;
proc print data=want noobs label;
run;
Edit: Introduced macro parameters replacing previously hardcoded local macro variables and added sort step at the beginning.
GENIUS!
Both of them are genius! Although I will admit I went with the second "more elementary" method than the first. My brain almost exploded trying to figure out the first. All of you are amazing and I can only hope to get even to a tenth of your level!!!
I need some more help. I now need to add scores in and sum them. Where would I put that in this iterative macro? My data is as follows:
Assignment Player ScoreA ScoreB ScoreC
A 2 10 20 30
B 2 10 20 30
C 3 5 10 15
D 3 5 10 15
E 4 2 4 30
F 4 2 4 30
G 1 100 200 30
H 1 100 200 30
I 1 100 200 30
A 5 10 20 30
B 6 10 20 30
C 6 10 20 30
D 7 5 10 30
E 8 2 4 30
F 9 100 200 30
G 9 100 200 30
H 10 100 200 30
I 10 100 200 30
Thanks in advance.
Hi @Krislynn742005,
Your "data" suggests that the scores depend only on the player: e.g., player 1 has scores (100, 200, 30) regardless of the assignment (G, H or I). Correct?
In the original task we divided players (and assignments) into groups. What are the rules for summing scores? For example, would the scores for players 1, 2, 3, 4, 22 and 24 (who comprised group 4 in your sample data of March 23) be added to obtain, e.g., "SumScoreA" for group 4? That is, SumScoreA(group 4) = 100 + 10 + 5 + 2 + ScoreA(player 22) + ScoreA(player 24)?
Or are the assigments involved in the calculation somehow?
Hi FreelanceReinhard,
Yes, the scores are the same for each player regardless of what assignment they are in. I need to do exactly what the code did before except now I need to total the scores for all of the players in an assignment.
We want to sum the scores for all players in an assignment group. So in the data set that I sent earlier today the final result should be:
assignment player ScoreA ScoreB ScoreC
A_B_C_D 2,3,5,6,7 65 130 195
E_F_G_H_I 1,4,8,9,10 706 1412 2118
I still need all players in an assignment and all assignments with mutual players to be grouped together. But now we have to sum the three scores for all of the players in each assignment/player group.
I hope this answered your question.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.