BookmarkSubscribeRSS Feed
vicky1229
Calcite | Level 5

Can anyone help me to solve below problem? I want to count number of students that take all combinations of courses by gender. SAS Proc, data step or sql please.

 

sample data that I have, in the sample there are only 4 courses, but there are more than 100 courses in my dataset. 

data student;

input Student_ID$ Course$ Gender$;

datalines;

1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M

;

 

Table that I want to get, combination is all the possible combinations of courses, Count_F is number of Female students take the combination of courses, Count_M is number of Male students take the combination of courses, 

CombinationCount_FCount_M
Math, Computer11
Math, Physics12
Math, Chemstry01
Computer, Physics12
Computer, Chemstry02
Physics, Chemstry02
Math, Computer, Physics11
Math, Computer, Physics, Chemstry01
13 REPLIES 13
HB
Barite | Level 11 HB
Barite | Level 11

You'll need to explain more how your output is derived.

 

I see a Math F, and a Computer F and two Computer M's. 

 

I see a result of Math, Computer 1M 1F. 

 

I don't know your rules.

 

Also, post your data in a data step.

vicky1229
Calcite | Level 5

Thanks HB.

I have explained more about the question.

Astounding
PROC Star

I'm sure you realize ... if you have 10 possible courses, you are getting a table with roughly 1,000 rows.  At any rate, here is one approach.

 

data want;

array courses {4} Chemistry Computer Math Physics;

do until (last.Student_ID);

   set have;

   by Student_ID;

   if course='Chemistry' then chemistry=1;

   else if course='Computer' then computer=1;

   else if course='Math' then math=1;

   else if course='Physics' then physics=1;

end;

length combination $ 40;

if chemistry=computer=math=physics=1 then do;

   Combination = 'Chemistry, Computer, Math, Physics';

   output;

end;

do k1=1 to 3;

   first_course = vname(courses{k1});

   if courses{k1}=1 then do k2=k1+1 to 4;

      if courses{k2}=1 then do;

         combination = strip(first_course) || ', ' || vname(courses{k2});

         output;

      end;

   end;

end;

do k1=1 to 2;

   first_course = vname(courses{k1});

   if courses{k1}=1 then do k2=k1+1 to 3;

      first_2 = strip(first_course) || ', ' || vname(courses{k2});

      do k3=k2+1 to 4;

         if courses{k3}=1 then do;

            combination = strip(first_2) || ', ' || vname(courses{k3});

            output;

         end;

      end;

   end;

end;

keep combination gender;

run;

 

proc freq data=want;

tables combination * gender / norow nocol nopercent;

run;

 

It's untested code at the moment ... I will try to test it and update it if necessary. 

 

Testing results:

 

It gets the right numbers, but ...

 

It's not pretty.  (You can get a nicer looking table by switching from PROC FREQ to PROC TABULATE.)

 

It doesn't print the zeros.  This also means if a combination doesn't exist in the data, there is no row in the report for that combination.

 

For more than 4 courses, there may be better approaches that aren't so lengthy and clumsy.  SAS does include functions that get all combinations (grayscale?).

 

ballardw
Super User

How many distinct courses are included in your data?

TomKari
Onyx | Level 15

I may have understood your requirements differently. Here's a solution that solves what I think is the problem...but I could be wrong!

 

Tom

 

proc sort data=student;
	by Student_ID Course;
run;

data Inter01;
	length Course_List $200;
	retain Course_List Count_F Count_M;
	drop Student_ID Course Gender;
	set student;
	by Student_ID;

	if first.Student_ID
		then do;
		Count_F = 0;
		Count_M = 0;

		if Gender = 'F' then
			Count_F = 1;
		else Count_M = 1;
		Course_List = Course;
	end;
	else Course_List = strip(Course_List)||", "||strip(Course);

	if last.Student_ID then
		output;
run;

proc means data=Inter01 nway noprint;
	var Count_F Count_M;
	class Course_List;
	output out=Want(drop=_TYPE_ _FREQ_) sum()=;
run;
vicky1229
Calcite | Level 5

There are more than 100 courses in the my dataset.

mkeintz
PROC Star

You want a frequency, by sex, of all within-student 2-way, 3-way, ... combinations of classes taken.  This is a good case for using the CALL LEXCOMBI call routine, which gets all non-missing combinations user-specified size, from a larger list of values.

 

data student;
  input Student_ID$ Course$ Gender$;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;


proc sort data=student; by student_id course;run;

data need (keep=student_id gender crse: combo_size crse_combo);
  array _crse{10} $8 _temporary_;
  format crse_combo $char100.;
  retain NC 0;

  set student;
  by student_id;
  if first.student_id then do;
    call missing(of _crse{*});
    NC=0;
  end;
  NC+1;
  _crse{NC}=course;

  if last.student_id and NC>1;

  array crse{10} $9;

  array _i2[2];
  array _i3[3];
  array _i4[4];
  array _i5[5];
  array _i6[6];
  array _i7[7];
  array _i8[8];
  array _i9[9];
  array _i10[10];

  array all_i{2:10,10}
    _i2:  dum dum dum dum dum dum dum dum
    _i3:  dum dum dum dum dum dum dum
    _i4:  dum dum dum dum dum dum
    _i5:  dum dum dum dum dum
    _i6:  dum dum dum dum
    _i7:  dum dum dum 
    _i8:  dum dum 
    _i9:  dum 
    _i10:  ;

 if nc>1 then do combo_size=2 to NC;
   all_i{combo_size,1}=0;
   ncomb=comb(nc, combo_size);
   do j=1 to ncomb;
      select (combo_size);
        when (2)  rc=lexcombi(nc, combo_size, of _i2[*]);
        when (3)  rc=lexcombi(nc, combo_size, of _i3[*]);
        when (4)  rc=lexcombi(nc, combo_size, of _i4[*]);
        when (5)  rc=lexcombi(nc, combo_size, of _i5[*]);
        when (6)  rc=lexcombi(nc, combo_size, of _i6[*]);
        when (7)  rc=lexcombi(nc, combo_size, of _i7[*]);
        when (8)  rc=lexcombi(nc, combo_size, of _i8[*]);
        when (9)  rc=lexcombi(nc, combo_size, of _i9[*]);
        when (10) rc=lexcombi(nc, combo_size, of _i10[*]);
      end;
      do h=1 to combo_size;
        crse[h]=_crse[all_i{combo_size,h}];
      end;
      crse_combo=put(combo_size,2.) || ' ' || cat(of crse{*});
      output;
   end;
 end;  
run;

proc tabulate data=need noseps;
  class gender crse_combo;
  tables crse_combo=' ',gender=' '*N=' '*f=4. 
    / rts=100 box='Size / Course List' misstext='0';
run;

Notes:

 

 

  1. The proc sort is done, so that all combinations are in alphabetical order  (i.e. there are math/physics, but no physics/math).
  2. I declare index arrays for each expected combination size (I used a maximum of size 10).
  3. Each array (_i2, _i3, etc.) is used with call lexcombi for its own size.  You can't use "of _I{*}" with call lexcombi for a size other than its own.   That's why you see the "select (combo_size)" group of statements.
  4. By declaring a 2-dimensional array ALL_i  (one row per expected combo_size), constructed of the individual _i arrays and repeated use of the DUM variable, I can use compact statements such as   "crse[h]=_crse[all_i{combo_size,h}];" instead of another set of select statements.  Array all_i has an upper triangle of repeated use of the dum vars.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
AhmedAl_Attar
Rhodochrosite | Level 12

@mkeintzHere is the smart solution you provided, with slight macro usage twist to enable data driven code generation for the array processing.

data student;
    input Student_ID$ Course$ Gender$;
    datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
run;

/*Extract the count of unique courses and store it in a macro variable*/ proc sql noprint; select count(distinct (course)) into :g_numUnqCourse from work.student; quit; proc sort data=student; by student_id course; run; /* Declare a macro to auto generate SAS statements */ %macro InitDynArray(p_max); %local l_n; %do l_n=2 %to &p_max; %str(ARRAY _i&l_n[&l_n];) %end; ARRAY all_i{2:&p_max,&p_max} %do l_n=2 %to &p_max; _i&l_n: %do j=1 %to %eval(&p_max - &l_n); %str(dum ) %end; %end; %str(;) %mend; /* Declare a macro to auto generate SAS statements */ %macro dynArrayCombo(p_max); %local l_n; %do l_n=2 %to &p_max; WHEN (&l_n) rc=LEXCOMBI(nc, combo_size, of _i&l_n[*]); %end; %mend;
OPTIONS MPRINT; /*Display generated statements by the macro calls*/
data need (keep=student_id gender crse: combo_size crse_combo); array _crse{&g_numUnqCourse} $8 _temporary_; format crse_combo $char100.; retain NC 0; set student; by student_id; if first.student_id then do; call missing(of _crse{*}); NC=0; end; nc+1; _crse{nc}=course; if ((last.student_id) and (nc > 1)); ARRAY crse{&g_numUnqCourse} $9; %InitDynArray(&g_numUnqCourse); /* Call a macro */ if (nc > 1) then do combo_size=2 to nc; all_i{combo_size,1}=0; ncomb=comb(nc, combo_size); do j=1 to ncomb; select (combo_size); %dynArrayCombo(&g_numUnqCourse) /* Call a macro */ otherwise; end; /* End of Select */ do h=1 to combo_size; crse[h]=_crse[all_i{combo_size,h}]; end; crse_combo=put(combo_size,2.) || ': ' || catx(',',of crse{*}); output; end; /* End of j=1 to ncomb */ end; /* End of combo_size=2 to NC */ run; proc tabulate data=need noseps; class gender crse_combo; tables crse_combo=' ',gender=' '*N=' '*f=4. / rts=100 box='Size / Course List' misstext='0'; run;

Regards,

Ahmed

mkeintz
PROC Star

@AhmedAl_Attar

 

I appreciate your macro-izing of the code I provided.

 

But to be honest, if the dataset is large, I would run a PROC SQL against the entire data set just to generate the maximum number of possible courses.  I would just parameterize the macro it with a number that would surely exceed all possible course counts.

 

After all once you've macro-ized, you aren't subject to typing errors when coding for each combination size, so the penalty for a too-large number becomes negligible

 

Of course, if I had been true to efficiency-uber-alles, I would have avoided the proc sort, since the data were already sorted by student_id, but not course-within-student_id.  Instead, after reading in the array of courses for a given student, I would have used CALL SORTC of the array, followed by left-shifting of the results to avoid blank course values at the lower bound of the array.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
vicky1229
Calcite | Level 5

Thanks a lot Ahmed, your solution is great. I have one question for you, your code is working well when I test it in sample data set which has about 10 courses, but it gives me an error when I test in another data set which has about 100 courses in it. 

Could you please give me a hint to make it work for large data set?

 

 

AhmedAl_Attar
Rhodochrosite | Level 12
Hi Vicky,
Without knowing the exact nor the nature of the error you are getting, I'm guessing it would have to do with
- Character length ----> Change this line: format crse_combo $char100.;
- Memory allocated to your SAS session ---> use -MEMSIZE 4GB on your SAS invocation command line. Increase the allocated GB as needed.

With 100 courses --> ((2**100) -1) permutations!!
Which is really a very big number: 1.26765060023E30 !!

Hope this helps,
Ahmed
Ksharp
Super User
data student;
  input Student_ID$ Course$ Gender$;
  retain value 1;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;

proc transpose data=student out=temp ;
by Student_ID gender;
var value;
id course;
run;

proc sql noprint;
select distinct Course into : list separated by ' '
 from student;
select count(distinct course) into : n
 from student;
quit;

data temp1;
 set temp;
 array x{*} &list ;
 array y{&n};
 length combination $ 2000;
 k=-1;
 do i=1 to 2**&n ;
  rc=graycode(k,of y{*});
  yes=1;
  do j=1 to &n;
   if y{j}=1 and x{j}=. then yes=0;
  end;
  call missing(combination);
  if yes then do;
    do j=1 to &n;
      if y{j}=1 then combination=catx(',',combination,vname(x{j})) ;
    end;
	if not missing(combination) then output;
  end;
 end;
 keep gender combination;
run;
proc freq data=temp1 noprint;
table  gender*combination/out=temp2 list;
run;
proc sort data=temp2;by combination;run;
proc transpose data=temp2 out=want prefix=count_;
by combination;
var count;
id gender;
run;

 

P.S. modified. It seems it is not so easy.

AhmedAl_Attar
Rhodochrosite | Level 12

Hi,

Here is yet another solution, While I have used some code from previous contributors, I've used Custom written function via Proc FCMP to produce and return all possible course combinations.

 

The  getCombos(values [*] $, sep $) $; user function, accepts an array and a separator character, and returns a separated String of course combinations. The logic implemented inside that function was inspired from an old SUGI 29 paper .

 


proc fcmp
    outlib=sasuser.functions.customFunctions;

    function getCombos(values [*] $, sep $) $;
        length combos $100 l_sep $1;
        length i vCount binaryDigit 4;
        length max_Type_  start i_tmp 8;
        length final_combo $1000;

        l_sep=ifc (sep='','I',sep) ;

        *put values[*];
        vCount=0;
        do i=1 to dim(values);
            vCount+ifc(values[i]='',0,1);
        end;

        max_Type_ = 2**vCount - 1;
        *put max_Type_;

        /*  Loop through till max_Type_ is reached*/
        do start=0 to max_Type_;
            i_tmp = start;

            /* Loop through all passed-in values */
            do j=vCount to 1 by -1;
                binaryDigit= int(i_tmp/(2**(j-1)) );
                if (binaryDigit = 1) then
                do;
                    combos = catx(',',combos,values[(vCount-j+1)]);

                    * Decrement i_tmp if binaryDigit is in types binary representation;
                    i_tmp = i_tmp - 2**(j-1);
                end;
            end; /*End of  j=vCount to 1 by -1*/

            /* Only maintain combinations of more than 1 value */
            if (index(combos,',') GT 0) then
            do;
                *put combos;
                final_combo = catx(l_sep,final_combo,combos);
                *put final_combo;
            end;
            combos='';
        end; /* End of start=0 to max_Type_*/

        return (final_combo);
    endsub;
run;

options cmplib=(sasuser.functions);

data student;
    input Student_ID$ Course$ Gender$;
    datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
run;
proc sort data=student;
    by student_id course;
run;

data want(keep=student_id gender crse_combo);
    array _crse{10} $9 _temporary_;
    retain nc 0;
    set student;
    by student_id;
    length courses $1000;

    if first.student_id then
    do;
        call missing(of _crse{*});
        nc=0;
    end;

    nc+1;
    _crse{nc}=course;

    if ((last.student_id) and (nc > 1));
    courses = getCombos(_crse, "~");
    *put courses=;
    comboCount=countc(courses,'~')+1;
    do c=1 to comboCount;
        crse_combo=scan(courses,c,'~');
        combo_size=countc(crse_combo,',')+1;
        crse_combo= put(combo_size,2.) || ': ' || crse_combo;
        output;
    end;
run;

proc tabulate data=want noseps;
    class gender crse_combo;
    tables crse_combo=' ',gender=' '*N=' '*f=4. 
        / rts=100 box='Size / Course List' misstext='0';
run;

Hope this helps,

Ahmed

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 4341 views
  • 1 like
  • 8 in conversation