Re: Combinations Macro

jdserbon1 · Posted 06-09-2020 06:34 AM

I have a workable code that imputes the field name into a single var per row of data if the field has an indicator equal to 1. There are 190 values for the X_IND field and 190 associated MODEL vars. Not sure where to start here, but can this be automated/"macroized" - at least the MODEL vars? The Model vars are then concatenated and counted to get the number of all unique combinations.

In essence, what the code does is below - basic imputation if an indicator is present - this is the analytic dataset that will be used to tally combinations.

ID A_IND B_IND C_IND D_IND AA Model1 Model2 Model3 Model4 Combo

X 1 0 1 0 1 A C AC

Y 0 1 1 0 1 B C BC

Z 0 0 1 1 1 C D DC

PaigeMiller · Posted 06-09-2020 06:44 AM

Macros do not seem to be a solution here. Maybe ARRAYs would work better. But why do this? If the goal is to "this is the analytic dataset that will be used to tally combinations", I think PROC SUMMARY does this for you without even creating these specially derived columns, and saves you some pretty complicated programming. But explain that "tally combinations" part further, please.

--
Paige Miller

jdserbon1 · Posted 06-09-2020 06:51 AM

So, this is a request to produce a more automated approach as opposed to potentially copying and pasting the indicators into the code, as the names of the indicators will change each quarter (they are masked in the code, aliases). The current code works just fine. The tally portion is the third step in the code, it simply produces a count for the unique combinations.

PaigeMiller · Posted 06-09-2020 07:02 AM

@jdserbon1 wrote:

The tally portion is the third step in the code, it simply produces a count for the unique combinations.

Either PROC FREQ or PROC SUMMARY ought to do this for you without creating these combinations of variables, all you need is A_IND through D_IND.

--
Paige Miller

ballardw · Posted 06-09-2020 10:06 AM

@jdserbon1 wrote:

So, this is a request to produce a more automated approach as opposed to potentially copying and pasting the indicators into the code, as the names of the indicators will change each quarter (they are masked in the code, aliases). The current code works just fine. The tally portion is the third step in the code, it simply produces a count for the unique combinations.

If the "names of the indicators will change each quarter " is because the date is part of the variable name then this is an indication of poor data structuring, perhaps another attempt to force spreadsheet thinking to a different approach to working with data. If this is the case you are better off with a VARIABLE that indicates the time period and keeping all of the "model" variables the same. Then when doing the model you filter the data on the desired period by selecting with a WHERE clause or even using BY group processing to get results for each "quarter".

ed_sas_member · Posted 06-09-2020 06:47 AM

Hi @jdserbon1

Here is an approach to automatize steps1 and 2.

The idea is to leverage the use of VNAME(), which reference the variable name in an array:

%let variables = A_IND B_IND C_IND D_IND; /* To be adapted */
%let count_variables = %sysfunc(countw(&variables.));

data test1;
	set test;
	array _ind (*) &variables;
	array MODEL(&count_variables) $;
	do i=1 to dim(_ind);
		if AA=1 and _ind(i)=1 then MODEL(i) = substr(vname(_ind(i)),1,1);
	end;
	COMBO=CATX(',', OF MODEL:);
	MOD_NUM=COUNTW(COMBO,',');
	drop i &variables.;
run;

PROC SQL;
	CREATE TABLE Test3 AS SELECT
	COMBO,
	MOD_NUM,
	COUNT(*) AS AA_COUNT
	FROM Test1
	WHERE (COMBO NE ' ' AND MOD_NUM GE 2)
	GROUP BY 1, 2
	ORDER BY 3 DESC;
QUIT;

Kurt_Bremser · Posted 06-09-2020 07:10 AM

Once again, a case for Maxim 33: Intelligent Data Makes for Intelligent Programs, and Maxim 12: Long Beats Wide.

data have;
input ID $ A_IND B_IND C_IND D_IND;
datalines;
X 1 0 1 0   
Y 0 1 1 0   
Z 0 0 1 1 
;

proc transpose data=have out=long;
by id;
var _numeric_;
run;

data want;
set long (where=(col1 = 1));
by id;
length combo $10;
retain combo;
if first.id then combo = "";
combo = cats(combo,ifc(indexc(substr(_name_,1,1),combo),'',substr(_name_,1,1)));
if last.id;
keep id combo;
run;

The final data step makes only two assumptions: that the first character of the original variable name is the indicator, and there's maximally 10 observations. No literal names need to be written, no arrays, no macro.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PeterClemmensen · Posted 06-09-2020 07:33 AM

data have;
input ID $ A_IND B_IND C_IND D_IND;
datalines;
X 1 0 1 0
Y 0 1 1 0
Z 0 0 1 1
;

data want(drop=i);
    set have;
    array in {*} A_IND--D_IND;
    array Model {4} $; 
    do i = 1 to dim(in);
        if in[i] then Model[i] = char(vname(in[i]), 1);
    end;
    Combo = cats(of Model[*]);
run;

The DATA to DATA Step Macro
Blog: SASnrd

yabwon · Posted 06-09-2020 08:47 AM

If I may, since there are 190 columns `xx_IND` it seems to be more than alphabet letters, maybe a `scan(vname(in[i]), 1, "_");` would work better?
Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Kurt_Bremser · Posted 06-09-2020 08:57 AM

Missed the "190 columns" part. (Insert another comment about proper data modeling here)

What if there's columns A_IND, B_IND and AB_IND?

We surely need more information about the real column names to come up with viable suggestions.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

RichardDeVen · Posted 06-09-2020 09:18 AM

How many rows does the data have ?

RichardDeVen · Posted 06-09-2020 11:23 AM

Use VNAME to obtain the variable corresponding to an array reference. If the arrayed variables have a consistent naming convention *_ID use SCAN to obtain the value to place in the corresponding model array element.

Create sample data (100k rows, 191 columns)

Spoiler

* create 190 identifiers (like excel sheet column names);

data ids(keep=id);
  length id $2;
  do code1 = 65 to 90;  
    id = byte(code1); k+1;
    output;
  end;

  do code1 = 65 to 90;  
  do code2 = 65 to 90;
    id = byte(code1)||byte(code2); k+1;
    output;
    if k = 190 then stop;
  end;
  end;
run;

* create variable names for the ids;

data names(keep=name);
  set ids;
  length name $32;
  name = cats(id,'_ID');
run;

* pivot to wide;

proc transpose data=names out=havebase(drop=_name_);
  id name;
run;

* fill the wide structure with some data;
* upto 7 random ids are 1;

data have(keep=id a_id--gh_id);
  length id 8;
  if 0 then set have havebase;

  call streaminit(2020);
  do id = 1 to 100000;
    array ids a_id--gh_id;
    call missing (of ids(*));
    do candidate = 1 to rand('integer',1,7);
      ids(rand('integer',1,dim(ids))) = 1;
    end;
    output;
  end;

  stop;
run;

* create 190 identifiers (like excel sheet column names); data ids(keep=id); length id $2; do code1 = 65 to 90; id = byte(code1); k+1; output; end; do code1 = 65 to 90; do code2 = 65 to 90; id = byte(code1)||byte(code2); k+1; output; if k = 190 then stop; end; end; run; * create variable names for the ids; data names(keep=name); set ids; length name $32; name = cats(id,'_ID'); run; * pivot to wide; proc transpose data=names out=havebase(drop=_name_); id name; run; * fill the wide structure with some data; * upto 7 random ids are 1; data have(keep=id a_id--gh_id); length id 8; if 0 then set have havebase; call streaminit(2020); do id = 1 to 100000; array ids a_id--gh_id; call missing (of ids(*)); do candidate = 1 to rand('integer',1,7); ids(rand('integer',1,dim(ids))) = 1; end; output; end; stop; run;

Sample code:

* compute csv string that lists 1 flagged ids as combo value;

data stage1;
  set have;
  array ids a_id--gh_id;

  length model1-model190 $10;
  array models model1-model190;

  do over ids;
    name = vname(ids); drop name;
    if length(name)-3 > lengthc(models) then do;
      put 'ERROR: Model needs more length ' name=;
      abort cancel;
    end;
    if ids then models=scan(vname(ids),1,'_');
  end;

  combo = catx(',', of models(*));
run;

* frequency table for combos;

proc sql;
  create table combo_freq as 
  select combo, count(*) as freq
  from stage1
  group by combo
  order by freq desc
  ;
quit;

Reeza · Posted 06-09-2020 11:30 AM

I would recommend transposing your data to a long format. Processing it as desired and then transposing back to your wide format for reporting if needed. In general wide data sets are clunky and a bad design for this exact reason. It's infinitely easier to add rows to a data set than columns.

Register Today!

SAS Training: Just a Click Away