Can anyone help me to solve below problem? I want to count number of students that take all combinations of courses by gender. SAS Proc, data step or sql please.
sample data that I have, in the sample there are only 4 courses, but there are more than 100 courses in my dataset.
data student;
input Student_ID$ Course$ Gender$;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
Table that I want to get, combination is all the possible combinations of courses, Count_F is number of Female students take the combination of courses, Count_M is number of Male students take the combination of courses,
Combination | Count_F | Count_M |
Math, Computer | 1 | 1 |
Math, Physics | 1 | 2 |
Math, Chemstry | 0 | 1 |
Computer, Physics | 1 | 2 |
Computer, Chemstry | 0 | 2 |
Physics, Chemstry | 0 | 2 |
Math, Computer, Physics | 1 | 1 |
… | … | … |
Math, Computer, Physics, Chemstry | 0 | 1 |
You'll need to explain more how your output is derived.
I see a Math F, and a Computer F and two Computer M's.
I see a result of Math, Computer 1M 1F.
I don't know your rules.
Also, post your data in a data step.
Thanks HB.
I have explained more about the question.
I'm sure you realize ... if you have 10 possible courses, you are getting a table with roughly 1,000 rows. At any rate, here is one approach.
data want;
array courses {4} Chemistry Computer Math Physics;
do until (last.Student_ID);
set have;
by Student_ID;
if course='Chemistry' then chemistry=1;
else if course='Computer' then computer=1;
else if course='Math' then math=1;
else if course='Physics' then physics=1;
end;
length combination $ 40;
if chemistry=computer=math=physics=1 then do;
Combination = 'Chemistry, Computer, Math, Physics';
output;
end;
do k1=1 to 3;
first_course = vname(courses{k1});
if courses{k1}=1 then do k2=k1+1 to 4;
if courses{k2}=1 then do;
combination = strip(first_course) || ', ' || vname(courses{k2});
output;
end;
end;
end;
do k1=1 to 2;
first_course = vname(courses{k1});
if courses{k1}=1 then do k2=k1+1 to 3;
first_2 = strip(first_course) || ', ' || vname(courses{k2});
do k3=k2+1 to 4;
if courses{k3}=1 then do;
combination = strip(first_2) || ', ' || vname(courses{k3});
output;
end;
end;
end;
end;
keep combination gender;
run;
proc freq data=want;
tables combination * gender / norow nocol nopercent;
run;
It's untested code at the moment ... I will try to test it and update it if necessary.
Testing results:
It gets the right numbers, but ...
It's not pretty. (You can get a nicer looking table by switching from PROC FREQ to PROC TABULATE.)
It doesn't print the zeros. This also means if a combination doesn't exist in the data, there is no row in the report for that combination.
For more than 4 courses, there may be better approaches that aren't so lengthy and clumsy. SAS does include functions that get all combinations (grayscale?).
How many distinct courses are included in your data?
I may have understood your requirements differently. Here's a solution that solves what I think is the problem...but I could be wrong!
Tom
proc sort data=student;
by Student_ID Course;
run;
data Inter01;
length Course_List $200;
retain Course_List Count_F Count_M;
drop Student_ID Course Gender;
set student;
by Student_ID;
if first.Student_ID
then do;
Count_F = 0;
Count_M = 0;
if Gender = 'F' then
Count_F = 1;
else Count_M = 1;
Course_List = Course;
end;
else Course_List = strip(Course_List)||", "||strip(Course);
if last.Student_ID then
output;
run;
proc means data=Inter01 nway noprint;
var Count_F Count_M;
class Course_List;
output out=Want(drop=_TYPE_ _FREQ_) sum()=;
run;
There are more than 100 courses in the my dataset.
You want a frequency, by sex, of all within-student 2-way, 3-way, ... combinations of classes taken. This is a good case for using the CALL LEXCOMBI call routine, which gets all non-missing combinations user-specified size, from a larger list of values.
data student;
input Student_ID$ Course$ Gender$;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
proc sort data=student; by student_id course;run;
data need (keep=student_id gender crse: combo_size crse_combo);
array _crse{10} $8 _temporary_;
format crse_combo $char100.;
retain NC 0;
set student;
by student_id;
if first.student_id then do;
call missing(of _crse{*});
NC=0;
end;
NC+1;
_crse{NC}=course;
if last.student_id and NC>1;
array crse{10} $9;
array _i2[2];
array _i3[3];
array _i4[4];
array _i5[5];
array _i6[6];
array _i7[7];
array _i8[8];
array _i9[9];
array _i10[10];
array all_i{2:10,10}
_i2: dum dum dum dum dum dum dum dum
_i3: dum dum dum dum dum dum dum
_i4: dum dum dum dum dum dum
_i5: dum dum dum dum dum
_i6: dum dum dum dum
_i7: dum dum dum
_i8: dum dum
_i9: dum
_i10: ;
if nc>1 then do combo_size=2 to NC;
all_i{combo_size,1}=0;
ncomb=comb(nc, combo_size);
do j=1 to ncomb;
select (combo_size);
when (2) rc=lexcombi(nc, combo_size, of _i2[*]);
when (3) rc=lexcombi(nc, combo_size, of _i3[*]);
when (4) rc=lexcombi(nc, combo_size, of _i4[*]);
when (5) rc=lexcombi(nc, combo_size, of _i5[*]);
when (6) rc=lexcombi(nc, combo_size, of _i6[*]);
when (7) rc=lexcombi(nc, combo_size, of _i7[*]);
when (8) rc=lexcombi(nc, combo_size, of _i8[*]);
when (9) rc=lexcombi(nc, combo_size, of _i9[*]);
when (10) rc=lexcombi(nc, combo_size, of _i10[*]);
end;
do h=1 to combo_size;
crse[h]=_crse[all_i{combo_size,h}];
end;
crse_combo=put(combo_size,2.) || ' ' || cat(of crse{*});
output;
end;
end;
run;
proc tabulate data=need noseps;
class gender crse_combo;
tables crse_combo=' ',gender=' '*N=' '*f=4.
/ rts=100 box='Size / Course List' misstext='0';
run;
Notes:
@mkeintzHere is the smart solution you provided, with slight macro usage twist to enable data driven code generation for the array processing.
data student;
input Student_ID$ Course$ Gender$;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
run;
/*Extract the count of unique courses and store it in a macro variable*/
proc sql noprint;
select count(distinct (course))
into :g_numUnqCourse
from work.student;
quit;
proc sort data=student;
by student_id course;
run;
/* Declare a macro to auto generate SAS statements */
%macro InitDynArray(p_max);
%local l_n;
%do l_n=2 %to &p_max;
%str(ARRAY _i&l_n[&l_n];)
%end;
ARRAY all_i{2:&p_max,&p_max}
%do l_n=2 %to &p_max;
_i&l_n:
%do j=1 %to %eval(&p_max - &l_n);
%str(dum )
%end;
%end;
%str(;)
%mend;
/* Declare a macro to auto generate SAS statements */
%macro dynArrayCombo(p_max);
%local l_n;
%do l_n=2 %to &p_max;
WHEN (&l_n) rc=LEXCOMBI(nc, combo_size, of _i&l_n[*]);
%end;
%mend;
OPTIONS MPRINT; /*Display generated statements by the macro calls*/
data need (keep=student_id gender crse: combo_size crse_combo);
array _crse{&g_numUnqCourse} $8 _temporary_;
format crse_combo $char100.;
retain NC 0;
set student;
by student_id;
if first.student_id then
do;
call missing(of _crse{*});
NC=0;
end;
nc+1;
_crse{nc}=course;
if ((last.student_id) and (nc > 1));
ARRAY crse{&g_numUnqCourse} $9;
%InitDynArray(&g_numUnqCourse); /* Call a macro */
if (nc > 1) then
do combo_size=2 to nc;
all_i{combo_size,1}=0;
ncomb=comb(nc, combo_size);
do j=1 to ncomb;
select (combo_size);
%dynArrayCombo(&g_numUnqCourse) /* Call a macro */
otherwise;
end; /* End of Select */
do h=1 to combo_size;
crse[h]=_crse[all_i{combo_size,h}];
end;
crse_combo=put(combo_size,2.) || ': ' || catx(',',of crse{*});
output;
end; /* End of j=1 to ncomb */
end; /* End of combo_size=2 to NC */
run;
proc tabulate data=need noseps;
class gender crse_combo;
tables crse_combo=' ',gender=' '*N=' '*f=4.
/ rts=100 box='Size / Course List' misstext='0';
run;
Regards,
Ahmed
I appreciate your macro-izing of the code I provided.
But to be honest, if the dataset is large, I would run a PROC SQL against the entire data set just to generate the maximum number of possible courses. I would just parameterize the macro it with a number that would surely exceed all possible course counts.
After all once you've macro-ized, you aren't subject to typing errors when coding for each combination size, so the penalty for a too-large number becomes negligible
Of course, if I had been true to efficiency-uber-alles, I would have avoided the proc sort, since the data were already sorted by student_id, but not course-within-student_id. Instead, after reading in the array of courses for a given student, I would have used CALL SORTC of the array, followed by left-shifting of the results to avoid blank course values at the lower bound of the array.
Thanks a lot Ahmed, your solution is great. I have one question for you, your code is working well when I test it in sample data set which has about 10 courses, but it gives me an error when I test in another data set which has about 100 courses in it.
Could you please give me a hint to make it work for large data set?
data student;
input Student_ID$ Course$ Gender$;
retain value 1;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
proc transpose data=student out=temp ;
by Student_ID gender;
var value;
id course;
run;
proc sql noprint;
select distinct Course into : list separated by ' '
from student;
select count(distinct course) into : n
from student;
quit;
data temp1;
set temp;
array x{*} &list ;
array y{&n};
length combination $ 2000;
k=-1;
do i=1 to 2**&n ;
rc=graycode(k,of y{*});
yes=1;
do j=1 to &n;
if y{j}=1 and x{j}=. then yes=0;
end;
call missing(combination);
if yes then do;
do j=1 to &n;
if y{j}=1 then combination=catx(',',combination,vname(x{j})) ;
end;
if not missing(combination) then output;
end;
end;
keep gender combination;
run;
proc freq data=temp1 noprint;
table gender*combination/out=temp2 list;
run;
proc sort data=temp2;by combination;run;
proc transpose data=temp2 out=want prefix=count_;
by combination;
var count;
id gender;
run;
P.S. modified. It seems it is not so easy.
Hi,
Here is yet another solution, While I have used some code from previous contributors, I've used Custom written function via Proc FCMP to produce and return all possible course combinations.
The getCombos(values [*] $, sep $) $; user function, accepts an array and a separator character, and returns a separated String of course combinations. The logic implemented inside that function was inspired from an old SUGI 29 paper .
proc fcmp
outlib=sasuser.functions.customFunctions;
function getCombos(values [*] $, sep $) $;
length combos $100 l_sep $1;
length i vCount binaryDigit 4;
length max_Type_ start i_tmp 8;
length final_combo $1000;
l_sep=ifc (sep='','I',sep) ;
*put values[*];
vCount=0;
do i=1 to dim(values);
vCount+ifc(values[i]='',0,1);
end;
max_Type_ = 2**vCount - 1;
*put max_Type_;
/* Loop through till max_Type_ is reached*/
do start=0 to max_Type_;
i_tmp = start;
/* Loop through all passed-in values */
do j=vCount to 1 by -1;
binaryDigit= int(i_tmp/(2**(j-1)) );
if (binaryDigit = 1) then
do;
combos = catx(',',combos,values[(vCount-j+1)]);
* Decrement i_tmp if binaryDigit is in types binary representation;
i_tmp = i_tmp - 2**(j-1);
end;
end; /*End of j=vCount to 1 by -1*/
/* Only maintain combinations of more than 1 value */
if (index(combos,',') GT 0) then
do;
*put combos;
final_combo = catx(l_sep,final_combo,combos);
*put final_combo;
end;
combos='';
end; /* End of start=0 to max_Type_*/
return (final_combo);
endsub;
run;
options cmplib=(sasuser.functions);
data student;
input Student_ID$ Course$ Gender$;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
run;
proc sort data=student;
by student_id course;
run;
data want(keep=student_id gender crse_combo);
array _crse{10} $9 _temporary_;
retain nc 0;
set student;
by student_id;
length courses $1000;
if first.student_id then
do;
call missing(of _crse{*});
nc=0;
end;
nc+1;
_crse{nc}=course;
if ((last.student_id) and (nc > 1));
courses = getCombos(_crse, "~");
*put courses=;
comboCount=countc(courses,'~')+1;
do c=1 to comboCount;
crse_combo=scan(courses,c,'~');
combo_size=countc(crse_combo,',')+1;
crse_combo= put(combo_size,2.) || ': ' || crse_combo;
output;
end;
run;
proc tabulate data=want noseps;
class gender crse_combo;
tables crse_combo=' ',gender=' '*N=' '*f=4.
/ rts=100 box='Size / Course List' misstext='0';
run;
Hope this helps,
Ahmed
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.