Can anyone help me to solve below problem? I want to count number of students that take all combinations of courses by gender. SAS Proc, data step or sql please.
sample data that I have, in the sample there are only 4 courses, but there are more than 100 courses in my dataset.
data student;
input Student_ID$ Course$ Gender$;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
Table that I want to get, combination is all the possible combinations of courses, Count_F is number of Female students take the combination of courses, Count_M is number of Male students take the combination of courses,
Combination | Count_F | Count_M |
Math, Computer | 1 | 1 |
Math, Physics | 1 | 2 |
Math, Chemstry | 0 | 1 |
Computer, Physics | 1 | 2 |
Computer, Chemstry | 0 | 2 |
Physics, Chemstry | 0 | 2 |
Math, Computer, Physics | 1 | 1 |
… | … | … |
Math, Computer, Physics, Chemstry | 0 | 1 |
You'll need to explain more how your output is derived.
I see a Math F, and a Computer F and two Computer M's.
I see a result of Math, Computer 1M 1F.
I don't know your rules.
Also, post your data in a data step.
Thanks HB.
I have explained more about the question.
I'm sure you realize ... if you have 10 possible courses, you are getting a table with roughly 1,000 rows. At any rate, here is one approach.
data want;
array courses {4} Chemistry Computer Math Physics;
do until (last.Student_ID);
set have;
by Student_ID;
if course='Chemistry' then chemistry=1;
else if course='Computer' then computer=1;
else if course='Math' then math=1;
else if course='Physics' then physics=1;
end;
length combination $ 40;
if chemistry=computer=math=physics=1 then do;
Combination = 'Chemistry, Computer, Math, Physics';
output;
end;
do k1=1 to 3;
first_course = vname(courses{k1});
if courses{k1}=1 then do k2=k1+1 to 4;
if courses{k2}=1 then do;
combination = strip(first_course) || ', ' || vname(courses{k2});
output;
end;
end;
end;
do k1=1 to 2;
first_course = vname(courses{k1});
if courses{k1}=1 then do k2=k1+1 to 3;
first_2 = strip(first_course) || ', ' || vname(courses{k2});
do k3=k2+1 to 4;
if courses{k3}=1 then do;
combination = strip(first_2) || ', ' || vname(courses{k3});
output;
end;
end;
end;
end;
keep combination gender;
run;
proc freq data=want;
tables combination * gender / norow nocol nopercent;
run;
It's untested code at the moment ... I will try to test it and update it if necessary.
Testing results:
It gets the right numbers, but ...
It's not pretty. (You can get a nicer looking table by switching from PROC FREQ to PROC TABULATE.)
It doesn't print the zeros. This also means if a combination doesn't exist in the data, there is no row in the report for that combination.
For more than 4 courses, there may be better approaches that aren't so lengthy and clumsy. SAS does include functions that get all combinations (grayscale?).
How many distinct courses are included in your data?
I may have understood your requirements differently. Here's a solution that solves what I think is the problem...but I could be wrong!
Tom
proc sort data=student;
by Student_ID Course;
run;
data Inter01;
length Course_List $200;
retain Course_List Count_F Count_M;
drop Student_ID Course Gender;
set student;
by Student_ID;
if first.Student_ID
then do;
Count_F = 0;
Count_M = 0;
if Gender = 'F' then
Count_F = 1;
else Count_M = 1;
Course_List = Course;
end;
else Course_List = strip(Course_List)||", "||strip(Course);
if last.Student_ID then
output;
run;
proc means data=Inter01 nway noprint;
var Count_F Count_M;
class Course_List;
output out=Want(drop=_TYPE_ _FREQ_) sum()=;
run;
There are more than 100 courses in the my dataset.
You want a frequency, by sex, of all within-student 2-way, 3-way, ... combinations of classes taken. This is a good case for using the CALL LEXCOMBI call routine, which gets all non-missing combinations user-specified size, from a larger list of values.
data student;
input Student_ID$ Course$ Gender$;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
proc sort data=student; by student_id course;run;
data need (keep=student_id gender crse: combo_size crse_combo);
array _crse{10} $8 _temporary_;
format crse_combo $char100.;
retain NC 0;
set student;
by student_id;
if first.student_id then do;
call missing(of _crse{*});
NC=0;
end;
NC+1;
_crse{NC}=course;
if last.student_id and NC>1;
array crse{10} $9;
array _i2[2];
array _i3[3];
array _i4[4];
array _i5[5];
array _i6[6];
array _i7[7];
array _i8[8];
array _i9[9];
array _i10[10];
array all_i{2:10,10}
_i2: dum dum dum dum dum dum dum dum
_i3: dum dum dum dum dum dum dum
_i4: dum dum dum dum dum dum
_i5: dum dum dum dum dum
_i6: dum dum dum dum
_i7: dum dum dum
_i8: dum dum
_i9: dum
_i10: ;
if nc>1 then do combo_size=2 to NC;
all_i{combo_size,1}=0;
ncomb=comb(nc, combo_size);
do j=1 to ncomb;
select (combo_size);
when (2) rc=lexcombi(nc, combo_size, of _i2[*]);
when (3) rc=lexcombi(nc, combo_size, of _i3[*]);
when (4) rc=lexcombi(nc, combo_size, of _i4[*]);
when (5) rc=lexcombi(nc, combo_size, of _i5[*]);
when (6) rc=lexcombi(nc, combo_size, of _i6[*]);
when (7) rc=lexcombi(nc, combo_size, of _i7[*]);
when (8) rc=lexcombi(nc, combo_size, of _i8[*]);
when (9) rc=lexcombi(nc, combo_size, of _i9[*]);
when (10) rc=lexcombi(nc, combo_size, of _i10[*]);
end;
do h=1 to combo_size;
crse[h]=_crse[all_i{combo_size,h}];
end;
crse_combo=put(combo_size,2.) || ' ' || cat(of crse{*});
output;
end;
end;
run;
proc tabulate data=need noseps;
class gender crse_combo;
tables crse_combo=' ',gender=' '*N=' '*f=4.
/ rts=100 box='Size / Course List' misstext='0';
run;
Notes:
@mkeintzHere is the smart solution you provided, with slight macro usage twist to enable data driven code generation for the array processing.
data student;
input Student_ID$ Course$ Gender$;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
run;
/*Extract the count of unique courses and store it in a macro variable*/
proc sql noprint;
select count(distinct (course))
into :g_numUnqCourse
from work.student;
quit;
proc sort data=student;
by student_id course;
run;
/* Declare a macro to auto generate SAS statements */
%macro InitDynArray(p_max);
%local l_n;
%do l_n=2 %to &p_max;
%str(ARRAY _i&l_n[&l_n];)
%end;
ARRAY all_i{2:&p_max,&p_max}
%do l_n=2 %to &p_max;
_i&l_n:
%do j=1 %to %eval(&p_max - &l_n);
%str(dum )
%end;
%end;
%str(;)
%mend;
/* Declare a macro to auto generate SAS statements */
%macro dynArrayCombo(p_max);
%local l_n;
%do l_n=2 %to &p_max;
WHEN (&l_n) rc=LEXCOMBI(nc, combo_size, of _i&l_n[*]);
%end;
%mend;
OPTIONS MPRINT; /*Display generated statements by the macro calls*/
data need (keep=student_id gender crse: combo_size crse_combo);
array _crse{&g_numUnqCourse} $8 _temporary_;
format crse_combo $char100.;
retain NC 0;
set student;
by student_id;
if first.student_id then
do;
call missing(of _crse{*});
NC=0;
end;
nc+1;
_crse{nc}=course;
if ((last.student_id) and (nc > 1));
ARRAY crse{&g_numUnqCourse} $9;
%InitDynArray(&g_numUnqCourse); /* Call a macro */
if (nc > 1) then
do combo_size=2 to nc;
all_i{combo_size,1}=0;
ncomb=comb(nc, combo_size);
do j=1 to ncomb;
select (combo_size);
%dynArrayCombo(&g_numUnqCourse) /* Call a macro */
otherwise;
end; /* End of Select */
do h=1 to combo_size;
crse[h]=_crse[all_i{combo_size,h}];
end;
crse_combo=put(combo_size,2.) || ': ' || catx(',',of crse{*});
output;
end; /* End of j=1 to ncomb */
end; /* End of combo_size=2 to NC */
run;
proc tabulate data=need noseps;
class gender crse_combo;
tables crse_combo=' ',gender=' '*N=' '*f=4.
/ rts=100 box='Size / Course List' misstext='0';
run;
Regards,
Ahmed
I appreciate your macro-izing of the code I provided.
But to be honest, if the dataset is large, I would run a PROC SQL against the entire data set just to generate the maximum number of possible courses. I would just parameterize the macro it with a number that would surely exceed all possible course counts.
After all once you've macro-ized, you aren't subject to typing errors when coding for each combination size, so the penalty for a too-large number becomes negligible
Of course, if I had been true to efficiency-uber-alles, I would have avoided the proc sort, since the data were already sorted by student_id, but not course-within-student_id. Instead, after reading in the array of courses for a given student, I would have used CALL SORTC of the array, followed by left-shifting of the results to avoid blank course values at the lower bound of the array.
Thanks a lot Ahmed, your solution is great. I have one question for you, your code is working well when I test it in sample data set which has about 10 courses, but it gives me an error when I test in another data set which has about 100 courses in it.
Could you please give me a hint to make it work for large data set?
data student;
input Student_ID$ Course$ Gender$;
retain value 1;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
proc transpose data=student out=temp ;
by Student_ID gender;
var value;
id course;
run;
proc sql noprint;
select distinct Course into : list separated by ' '
from student;
select count(distinct course) into : n
from student;
quit;
data temp1;
set temp;
array x{*} &list ;
array y{&n};
length combination $ 2000;
k=-1;
do i=1 to 2**&n ;
rc=graycode(k,of y{*});
yes=1;
do j=1 to &n;
if y{j}=1 and x{j}=. then yes=0;
end;
call missing(combination);
if yes then do;
do j=1 to &n;
if y{j}=1 then combination=catx(',',combination,vname(x{j})) ;
end;
if not missing(combination) then output;
end;
end;
keep gender combination;
run;
proc freq data=temp1 noprint;
table gender*combination/out=temp2 list;
run;
proc sort data=temp2;by combination;run;
proc transpose data=temp2 out=want prefix=count_;
by combination;
var count;
id gender;
run;
P.S. modified. It seems it is not so easy.
Hi,
Here is yet another solution, While I have used some code from previous contributors, I've used Custom written function via Proc FCMP to produce and return all possible course combinations.
The getCombos(values [*] $, sep $) $; user function, accepts an array and a separator character, and returns a separated String of course combinations. The logic implemented inside that function was inspired from an old SUGI 29 paper .
proc fcmp
outlib=sasuser.functions.customFunctions;
function getCombos(values [*] $, sep $) $;
length combos $100 l_sep $1;
length i vCount binaryDigit 4;
length max_Type_ start i_tmp 8;
length final_combo $1000;
l_sep=ifc (sep='','I',sep) ;
*put values[*];
vCount=0;
do i=1 to dim(values);
vCount+ifc(values[i]='',0,1);
end;
max_Type_ = 2**vCount - 1;
*put max_Type_;
/* Loop through till max_Type_ is reached*/
do start=0 to max_Type_;
i_tmp = start;
/* Loop through all passed-in values */
do j=vCount to 1 by -1;
binaryDigit= int(i_tmp/(2**(j-1)) );
if (binaryDigit = 1) then
do;
combos = catx(',',combos,values[(vCount-j+1)]);
* Decrement i_tmp if binaryDigit is in types binary representation;
i_tmp = i_tmp - 2**(j-1);
end;
end; /*End of j=vCount to 1 by -1*/
/* Only maintain combinations of more than 1 value */
if (index(combos,',') GT 0) then
do;
*put combos;
final_combo = catx(l_sep,final_combo,combos);
*put final_combo;
end;
combos='';
end; /* End of start=0 to max_Type_*/
return (final_combo);
endsub;
run;
options cmplib=(sasuser.functions);
data student;
input Student_ID$ Course$ Gender$;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
run;
proc sort data=student;
by student_id course;
run;
data want(keep=student_id gender crse_combo);
array _crse{10} $9 _temporary_;
retain nc 0;
set student;
by student_id;
length courses $1000;
if first.student_id then
do;
call missing(of _crse{*});
nc=0;
end;
nc+1;
_crse{nc}=course;
if ((last.student_id) and (nc > 1));
courses = getCombos(_crse, "~");
*put courses=;
comboCount=countc(courses,'~')+1;
do c=1 to comboCount;
crse_combo=scan(courses,c,'~');
combo_size=countc(crse_combo,',')+1;
crse_combo= put(combo_size,2.) || ': ' || crse_combo;
output;
end;
run;
proc tabulate data=want noseps;
class gender crse_combo;
tables crse_combo=' ',gender=' '*N=' '*f=4.
/ rts=100 box='Size / Course List' misstext='0';
run;
Hope this helps,
Ahmed
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.