SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Produce and count all possible combinations of the values in a column

Reply
New Contributor
Posts: 4

Produce and count all possible combinations of the values in a column

[ Edited ]

Can anyone help me to solve below problem? I want to count number of students that take all combinations of courses by gender. SAS Proc, data step or sql please.

 

sample data that I have, in the sample there are only 4 courses, but there are more than 100 courses in my dataset. 

data student;

input Student_ID$ Course$ Gender$;

datalines;

1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M

;

 

Table that I want to get, combination is all the possible combinations of courses, Count_F is number of Female students take the combination of courses, Count_M is number of Male students take the combination of courses, 

CombinationCount_FCount_M
Math, Computer11
Math, Physics12
Math, Chemstry01
Computer, Physics12
Computer, Chemstry02
Physics, Chemstry02
Math, Computer, Physics11
Math, Computer, Physics, Chemstry01
Regular Contributor
Regular Contributor
Posts: 201

Re: Produce and count all possible combinations of the values in a column

[ Edited ]
Posted in reply to vicky1229

You'll need to explain more how your output is derived.

 

I see a Math F, and a Computer F and two Computer M's. 

 

I see a result of Math, Computer 1M 1F. 

 

I don't know your rules.

 

Also, post your data in a data step.

New Contributor
Posts: 4

Re: Produce and count all possible combinations of the values in a column

Thanks HB.

I have explained more about the question.

Super User
Posts: 5,699

Re: Produce and count all possible combinations of the values in a column

[ Edited ]
Posted in reply to vicky1229

I'm sure you realize ... if you have 10 possible courses, you are getting a table with roughly 1,000 rows.  At any rate, here is one approach.

 

data want;

array courses {4} Chemistry Computer Math Physics;

do until (last.Student_ID);

   set have;

   by Student_ID;

   if course='Chemistry' then chemistry=1;

   else if course='Computer' then computer=1;

   else if course='Math' then math=1;

   else if course='Physics' then physics=1;

end;

length combination $ 40;

if chemistry=computer=math=physics=1 then do;

   Combination = 'Chemistry, Computer, Math, Physics';

   output;

end;

do k1=1 to 3;

   first_course = vname(courses{k1});

   if courses{k1}=1 then do k2=k1+1 to 4;

      if courses{k2}=1 then do;

         combination = strip(first_course) || ', ' || vname(courses{k2});

         output;

      end;

   end;

end;

do k1=1 to 2;

   first_course = vname(courses{k1});

   if courses{k1}=1 then do k2=k1+1 to 3;

      first_2 = strip(first_course) || ', ' || vname(courses{k2});

      do k3=k2+1 to 4;

         if courses{k3}=1 then do;

            combination = strip(first_2) || ', ' || vname(courses{k3});

            output;

         end;

      end;

   end;

end;

keep combination gender;

run;

 

proc freq data=want;

tables combination * gender / norow nocol nopercent;

run;

 

It's untested code at the moment ... I will try to test it and update it if necessary. 

 

Testing results:

 

It gets the right numbers, but ...

 

It's not pretty.  (You can get a nicer looking table by switching from PROC FREQ to PROC TABULATE.)

 

It doesn't print the zeros.  This also means if a combination doesn't exist in the data, there is no row in the report for that combination.

 

For more than 4 courses, there may be better approaches that aren't so lengthy and clumsy.  SAS does include functions that get all combinations (grayscale?).

 

Super User
Posts: 11,752

Re: Produce and count all possible combinations of the values in a column

Posted in reply to vicky1229

How many distinct courses are included in your data?

PROC Star
Posts: 1,185

Re: Produce and count all possible combinations of the values in a column

I may have understood your requirements differently. Here's a solution that solves what I think is the problem...but I could be wrong!

 

Tom

 

proc sort data=student;
	by Student_ID Course;
run;

data Inter01;
	length Course_List $200;
	retain Course_List Count_F Count_M;
	drop Student_ID Course Gender;
	set student;
	by Student_ID;

	if first.Student_ID
		then do;
		Count_F = 0;
		Count_M = 0;

		if Gender = 'F' then
			Count_F = 1;
		else Count_M = 1;
		Course_List = Course;
	end;
	else Course_List = strip(Course_List)||", "||strip(Course);

	if last.Student_ID then
		output;
run;

proc means data=Inter01 nway noprint;
	var Count_F Count_M;
	class Course_List;
	output out=Want(drop=_TYPE_ _FREQ_) sum()=;
run;
New Contributor
Posts: 4

Re: Produce and count all possible combinations of the values in a column

There are more than 100 courses in the my dataset.

Trusted Advisor
Posts: 1,115

Re: Produce and count all possible combinations of the values in a column

Posted in reply to vicky1229

You want a frequency, by sex, of all within-student 2-way, 3-way, ... combinations of classes taken.  This is a good case for using the CALL LEXCOMBI call routine, which gets all non-missing combinations user-specified size, from a larger list of values.

 

data student;
  input Student_ID$ Course$ Gender$;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;


proc sort data=student; by student_id course;run;

data need (keep=student_id gender crse: combo_size crse_combo);
  array _crse{10} $8 _temporary_;
  format crse_combo $char100.;
  retain NC 0;

  set student;
  by student_id;
  if first.student_id then do;
    call missing(of _crse{*});
    NC=0;
  end;
  NC+1;
  _crse{NC}=course;

  if last.student_id and NC>1;

  array crse{10} $9;

  array _i2[2];
  array _i3[3];
  array _i4[4];
  array _i5[5];
  array _i6[6];
  array _i7[7];
  array _i8[8];
  array _i9[9];
  array _i10[10];

  array all_i{2:10,10}
    _i2:  dum dum dum dum dum dum dum dum
    _i3:  dum dum dum dum dum dum dum
    _i4:  dum dum dum dum dum dum
    _i5:  dum dum dum dum dum
    _i6:  dum dum dum dum
    _i7:  dum dum dum 
    _i8:  dum dum 
    _i9:  dum 
    _i10:  ;

 if nc>1 then do combo_size=2 to NC;
   all_i{combo_size,1}=0;
   ncomb=comb(nc, combo_size);
   do j=1 to ncomb;
      select (combo_size);
        when (2)  rc=lexcombi(nc, combo_size, of _i2[*]);
        when (3)  rc=lexcombi(nc, combo_size, of _i3[*]);
        when (4)  rc=lexcombi(nc, combo_size, of _i4[*]);
        when (5)  rc=lexcombi(nc, combo_size, of _i5[*]);
        when (6)  rc=lexcombi(nc, combo_size, of _i6[*]);
        when (7)  rc=lexcombi(nc, combo_size, of _i7[*]);
        when (8)  rc=lexcombi(nc, combo_size, of _i8[*]);
        when (9)  rc=lexcombi(nc, combo_size, of _i9[*]);
        when (10) rc=lexcombi(nc, combo_size, of _i10[*]);
      end;
      do h=1 to combo_size;
        crse[h]=_crse[all_i{combo_size,h}];
      end;
      crse_combo=put(combo_size,2.) || ' ' || cat(of crse{*});
      output;
   end;
 end;  
run;

proc tabulate data=need noseps;
  class gender crse_combo;
  tables crse_combo=' ',gender=' '*N=' '*f=4. 
    / rts=100 box='Size / Course List' misstext='0';
run;

Notes:

 

 

  1. The proc sort is done, so that all combinations are in alphabetical order  (i.e. there are math/physics, but no physics/math).
  2. I declare index arrays for each expected combination size (I used a maximum of size 10).
  3. Each array (_i2, _i3, etc.) is used with call lexcombi for its own size.  You can't use "of _I{*}" with call lexcombi for a size other than its own.   That's why you see the "select (combo_size)" group of statements.
  4. By declaring a 2-dimensional array ALL_i  (one row per expected combo_size), constructed of the individual _i arrays and repeated use of the DUM variable, I can use compact statements such as   "crse[h]=_crse[all_i{combo_size,h}];" instead of another set of select statements.  Array all_i has an upper triangle of repeated use of the dum vars.
Regular Contributor
Posts: 243

Re: Produce and count all possible combinations of the values in a column

@mkeintzHere is the smart solution you provided, with slight macro usage twist to enable data driven code generation for the array processing.

data student;
    input Student_ID$ Course$ Gender$;
    datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
run;

/*Extract the count of unique courses and store it in a macro variable*/ proc sql noprint; select count(distinct (course)) into :g_numUnqCourse from work.student; quit; proc sort data=student; by student_id course; run; /* Declare a macro to auto generate SAS statements */ %macro InitDynArray(p_max); %local l_n; %do l_n=2 %to &p_max; %str(ARRAY _i&l_n[&l_n];) %end; ARRAY all_i{2:&p_max,&p_max} %do l_n=2 %to &p_max; _i&l_n: %do j=1 %to %eval(&p_max - &l_n); %str(dum ) %end; %end; %str(;) %mend; /* Declare a macro to auto generate SAS statements */ %macro dynArrayCombo(p_max); %local l_n; %do l_n=2 %to &p_max; WHEN (&l_n) rc=LEXCOMBI(nc, combo_size, of _i&l_n[*]); %end; %mend;
OPTIONS MPRINT; /*Display generated statements by the macro calls*/
data need (keep=student_id gender crse: combo_size crse_combo); array _crse{&g_numUnqCourse} $8 _temporary_; format crse_combo $char100.; retain NC 0; set student; by student_id; if first.student_id then do; call missing(of _crse{*}); NC=0; end; nc+1; _crse{nc}=course; if ((last.student_id) and (nc > 1)); ARRAY crse{&g_numUnqCourse} $9; %InitDynArray(&g_numUnqCourse); /* Call a macro */ if (nc > 1) then do combo_size=2 to nc; all_i{combo_size,1}=0; ncomb=comb(nc, combo_size); do j=1 to ncomb; select (combo_size); %dynArrayCombo(&g_numUnqCourse) /* Call a macro */ otherwise; end; /* End of Select */ do h=1 to combo_size; crse[h]=_crse[all_i{combo_size,h}]; end; crse_combo=put(combo_size,2.) || ': ' || catx(',',of crse{*}); output; end; /* End of j=1 to ncomb */ end; /* End of combo_size=2 to NC */ run; proc tabulate data=need noseps; class gender crse_combo; tables crse_combo=' ',gender=' '*N=' '*f=4. / rts=100 box='Size / Course List' misstext='0'; run;

Regards,

Ahmed

Trusted Advisor
Posts: 1,115

Re: Produce and count all possible combinations of the values in a column

Posted in reply to AhmedAl_Attar

@AhmedAl_Attar

 

I appreciate your macro-izing of the code I provided.

 

But to be honest, if the dataset is large, I would run a PROC SQL against the entire data set just to generate the maximum number of possible courses.  I would just parameterize the macro it with a number that would surely exceed all possible course counts.

 

After all once you've macro-ized, you aren't subject to typing errors when coding for each combination size, so the penalty for a too-large number becomes negligible

 

Of course, if I had been true to efficiency-uber-alles, I would have avoided the proc sort, since the data were already sorted by student_id, but not course-within-student_id.  Instead, after reading in the array of courses for a given student, I would have used CALL SORTC of the array, followed by left-shifting of the results to avoid blank course values at the lower bound of the array.

New Contributor
Posts: 4

Re: Produce and count all possible combinations of the values in a column

Posted in reply to AhmedAl_Attar

Thanks a lot Ahmed, your solution is great. I have one question for you, your code is working well when I test it in sample data set which has about 10 courses, but it gives me an error when I test in another data set which has about 100 courses in it. 

Could you please give me a hint to make it work for large data set?

 

 

Regular Contributor
Posts: 243

Re: Produce and count all possible combinations of the values in a column

Posted in reply to vicky1229
Hi Vicky,
Without knowing the exact nor the nature of the error you are getting, I'm guessing it would have to do with
- Character length ----> Change this line: format crse_combo $char100.;
- Memory allocated to your SAS session ---> use -MEMSIZE 4GB on your SAS invocation command line. Increase the allocated GB as needed.

With 100 courses --> ((2**100) -1) permutations!!
Which is really a very big number: 1.26765060023E30 !!

Hope this helps,
Ahmed
Super User
Posts: 10,194

Re: Produce and count all possible combinations of the values in a column

[ Edited ]
Posted in reply to vicky1229
data student;
  input Student_ID$ Course$ Gender$;
  retain value 1;
datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;

proc transpose data=student out=temp ;
by Student_ID gender;
var value;
id course;
run;

proc sql noprint;
select distinct Course into : list separated by ' '
 from student;
select count(distinct course) into : n
 from student;
quit;

data temp1;
 set temp;
 array x{*} &list ;
 array y{&n};
 length combination $ 2000;
 k=-1;
 do i=1 to 2**&n ;
  rc=graycode(k,of y{*});
  yes=1;
  do j=1 to &n;
   if y{j}=1 and x{j}=. then yes=0;
  end;
  call missing(combination);
  if yes then do;
    do j=1 to &n;
      if y{j}=1 then combination=catx(',',combination,vname(x{j})) ;
    end;
	if not missing(combination) then output;
  end;
 end;
 keep gender combination;
run;
proc freq data=temp1 noprint;
table  gender*combination/out=temp2 list;
run;
proc sort data=temp2;by combination;run;
proc transpose data=temp2 out=want prefix=count_;
by combination;
var count;
id gender;
run;

 

P.S. modified. It seems it is not so easy.

Regular Contributor
Posts: 243

Re: Produce and count all possible combinations of the values in a column

Posted in reply to vicky1229

Hi,

Here is yet another solution, While I have used some code from previous contributors, I've used Custom written function via Proc FCMP to produce and return all possible course combinations.

 

The  getCombos(values [*] $, sep $) $; user function, accepts an array and a separator character, and returns a separated String of course combinations. The logic implemented inside that function was inspired from an old SUGI 29 paper .

 


proc fcmp
    outlib=sasuser.functions.customFunctions;

    function getCombos(values [*] $, sep $) $;
        length combos $100 l_sep $1;
        length i vCount binaryDigit 4;
        length max_Type_  start i_tmp 8;
        length final_combo $1000;

        l_sep=ifc (sep='','I',sep) ;

        *put values[*];
        vCount=0;
        do i=1 to dim(values);
            vCount+ifc(values[i]='',0,1);
        end;

        max_Type_ = 2**vCount - 1;
        *put max_Type_;

        /*  Loop through till max_Type_ is reached*/
        do start=0 to max_Type_;
            i_tmp = start;

            /* Loop through all passed-in values */
            do j=vCount to 1 by -1;
                binaryDigit= int(i_tmp/(2**(j-1)) );
                if (binaryDigit = 1) then
                do;
                    combos = catx(',',combos,values[(vCount-j+1)]);

                    * Decrement i_tmp if binaryDigit is in types binary representation;
                    i_tmp = i_tmp - 2**(j-1);
                end;
            end; /*End of  j=vCount to 1 by -1*/

            /* Only maintain combinations of more than 1 value */
            if (index(combos,',') GT 0) then
            do;
                *put combos;
                final_combo = catx(l_sep,final_combo,combos);
                *put final_combo;
            end;
            combos='';
        end; /* End of start=0 to max_Type_*/

        return (final_combo);
    endsub;
run;

options cmplib=(sasuser.functions);

data student;
    input Student_ID$ Course$ Gender$;
    datalines;
1001 Math F
1001 Computer F
1001 Physics F
1002 Math M
1002 Physics M
1003 Math M
1003 Chemstry M
1003 Computer M
1003 Physics M
1004 Chemstry M
1004 Computer M
1004 Physics M
;
run;
proc sort data=student;
    by student_id course;
run;

data want(keep=student_id gender crse_combo);
    array _crse{10} $9 _temporary_;
    retain nc 0;
    set student;
    by student_id;
    length courses $1000;

    if first.student_id then
    do;
        call missing(of _crse{*});
        nc=0;
    end;

    nc+1;
    _crse{nc}=course;

    if ((last.student_id) and (nc > 1));
    courses = getCombos(_crse, "~");
    *put courses=;
    comboCount=countc(courses,'~')+1;
    do c=1 to comboCount;
        crse_combo=scan(courses,c,'~');
        combo_size=countc(crse_combo,',')+1;
        crse_combo= put(combo_size,2.) || ': ' || crse_combo;
        output;
    end;
run;

proc tabulate data=want noseps;
    class gender crse_combo;
    tables crse_combo=' ',gender=' '*N=' '*f=4. 
        / rts=100 box='Size / Course List' misstext='0';
run;

Hope this helps,

Ahmed

Ask a Question
Discussion stats
  • 13 replies
  • 317 views
  • 1 like
  • 8 in conversation