I have a workable code that imputes the field name into a single var per row of data if the field has an indicator equal to 1. There are 190 values for the X_IND field and 190 associated MODEL vars. Not sure where to start here, but can this be automated/"macroized" - at least the MODEL vars? The Model vars are then concatenated and counted to get the number of all unique combinations.
In essence, what the code does is below - basic imputation if an indicator is present - this is the analytic dataset that will be used to tally combinations.
ID A_IND B_IND C_IND D_IND AA Model1 Model2 Model3 Model4 Combo
X 1 0 1 0 1 A C AC
Y 0 1 1 0 1 B C BC
Z 0 0 1 1 1 C D DC
Macros do not seem to be a solution here. Maybe ARRAYs would work better. But why do this? If the goal is to "this is the analytic dataset that will be used to tally combinations", I think PROC SUMMARY does this for you without even creating these specially derived columns, and saves you some pretty complicated programming. But explain that "tally combinations" part further, please.
So, this is a request to produce a more automated approach as opposed to potentially copying and pasting the indicators into the code, as the names of the indicators will change each quarter (they are masked in the code, aliases). The current code works just fine. The tally portion is the third step in the code, it simply produces a count for the unique combinations.
@jdserbon1 wrote:
The tally portion is the third step in the code, it simply produces a count for the unique combinations.
Either PROC FREQ or PROC SUMMARY ought to do this for you without creating these combinations of variables, all you need is A_IND through D_IND.
@jdserbon1 wrote:
So, this is a request to produce a more automated approach as opposed to potentially copying and pasting the indicators into the code, as the names of the indicators will change each quarter (they are masked in the code, aliases). The current code works just fine. The tally portion is the third step in the code, it simply produces a count for the unique combinations.
If the "names of the indicators will change each quarter " is because the date is part of the variable name then this is an indication of poor data structuring, perhaps another attempt to force spreadsheet thinking to a different approach to working with data. If this is the case you are better off with a VARIABLE that indicates the time period and keeping all of the "model" variables the same. Then when doing the model you filter the data on the desired period by selecting with a WHERE clause or even using BY group processing to get results for each "quarter".
Hi @jdserbon1
Here is an approach to automatize steps1 and 2.
The idea is to leverage the use of VNAME(), which reference the variable name in an array:
%let variables = A_IND B_IND C_IND D_IND; /* To be adapted */
%let count_variables = %sysfunc(countw(&variables.));
data test1;
set test;
array _ind (*) &variables;
array MODEL(&count_variables) $;
do i=1 to dim(_ind);
if AA=1 and _ind(i)=1 then MODEL(i) = substr(vname(_ind(i)),1,1);
end;
COMBO=CATX(',', OF MODEL:);
MOD_NUM=COUNTW(COMBO,',');
drop i &variables.;
run;
PROC SQL;
CREATE TABLE Test3 AS SELECT
COMBO,
MOD_NUM,
COUNT(*) AS AA_COUNT
FROM Test1
WHERE (COMBO NE ' ' AND MOD_NUM GE 2)
GROUP BY 1, 2
ORDER BY 3 DESC;
QUIT;
Once again, a case for Maxim 33: Intelligent Data Makes for Intelligent Programs, and Maxim 12: Long Beats Wide.
data have;
input ID $ A_IND B_IND C_IND D_IND;
datalines;
X 1 0 1 0
Y 0 1 1 0
Z 0 0 1 1
;
proc transpose data=have out=long;
by id;
var _numeric_;
run;
data want;
set long (where=(col1 = 1));
by id;
length combo $10;
retain combo;
if first.id then combo = "";
combo = cats(combo,ifc(indexc(substr(_name_,1,1),combo),'',substr(_name_,1,1)));
if last.id;
keep id combo;
run;
The final data step makes only two assumptions: that the first character of the original variable name is the indicator, and there's maximally 10 observations. No literal names need to be written, no arrays, no macro.
data have;
input ID $ A_IND B_IND C_IND D_IND;
datalines;
X 1 0 1 0
Y 0 1 1 0
Z 0 0 1 1
;
data want(drop=i);
set have;
array in {*} A_IND--D_IND;
array Model {4} $;
do i = 1 to dim(in);
if in[i] then Model[i] = char(vname(in[i]), 1);
end;
Combo = cats(of Model[*]);
run;
Missed the "190 columns" part. (Insert another comment about proper data modeling here)
What if there's columns A_IND, B_IND and AB_IND?
We surely need more information about the real column names to come up with viable suggestions.
Use VNAME to obtain the variable corresponding to an array reference. If the arrayed variables have a consistent naming convention *_ID use SCAN to obtain the value to place in the corresponding model array element.
Create sample data (100k rows, 191 columns)
* create 190 identifiers (like excel sheet column names);
data ids(keep=id); length id $2; do code1 = 65 to 90; id = byte(code1); k+1; output; end; do code1 = 65 to 90; do code2 = 65 to 90; id = byte(code1)||byte(code2); k+1; output; if k = 190 then stop; end; end; run; * create variable names for the ids;
data names(keep=name); set ids; length name $32; name = cats(id,'_ID'); run; * pivot to wide;
proc transpose data=names out=havebase(drop=_name_); id name; run; * fill the wide structure with some data; * upto 7 random ids are 1;
data have(keep=id a_id--gh_id); length id 8; if 0 then set have havebase; call streaminit(2020); do id = 1 to 100000; array ids a_id--gh_id; call missing (of ids(*)); do candidate = 1 to rand('integer',1,7); ids(rand('integer',1,dim(ids))) = 1; end; output; end; stop; run;
Sample code:
* compute csv string that lists 1 flagged ids as combo value;
data stage1; set have; array ids a_id--gh_id; length model1-model190 $10; array models model1-model190; do over ids; name = vname(ids); drop name; if length(name)-3 > lengthc(models) then do; put 'ERROR: Model needs more length ' name=; abort cancel; end; if ids then models=scan(vname(ids),1,'_'); end; combo = catx(',', of models(*)); run;
* frequency table for combos;
proc sql; create table combo_freq as select combo, count(*) as freq from stage1 group by combo order by freq desc ; quit;
I would recommend transposing your data to a long format. Processing it as desired and then transposing back to your wide format for reporting if needed. In general wide data sets are clunky and a bad design for this exact reason. It's infinitely easier to add rows to a data set than columns.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.