Hi folks,
I've got a large dataset of character values for which I want to create a category variable. What I decided would be the easiest approach would be to create arrays for each valid category and then create the variable based on if the value is found in a given array. A simplified example is shown below:
data categorize;
set originaldata;
category = "none";
array categoryone (50) $ ("a", "b", "c", "d",...);
array categorytwo (200) $ ("aaa", "aab", "aac",...);
if readvariable in categoryone then category = "1";
else if readvariable in categorytwo then category = "2";
run;
What I'm noticing when checking proc freq is that my category variable is "None" for the whole dataset. In trying to figure out why I decided to check my created arrays.
Upon checking the arrays I saw that the values were assigned incorrectly such as categoryone1 given a value of "s" instead of the expected value of "a" and that categoryone2-categoryone6 was also given a value of "s". I've tried looking around for what I'm doing wrong in my array assignment and have tried things such as removing the ","s, but to no avail. I realize that my issue is likely multifaceted, but I can't find documentation that states where or how. Help me SAS Community boards. You're my only hope.
Would you be able to provide a representative sample of your data and your required output?
Sure thing. The actual data looks close to this:
data sample1;
set sample;
array secgen (192) $ ("50580073001" "50580073005" "50580073006" "50580072105" "50580078201" "50580078212" "50580078224" ...)
array firgen (1318) $ firgen1-firgen1318 ("50580037001", "76413033224", "66715970602", "66715970603", "61786086636", "50580022650" ...)
array cromo (20) $ cromo1-cromo20 ("17478029111", "59779007513", "70556010260", "50090312500", "69784020096" ...)
if ndcnum in secgen then drug = "secgen";
else if ndcnum in firgen then drug = "firgen";
else if ndcnum in cromo then drug = "cromo";
run;
As I have it now I tried removing commas and not specifically naming array variables in the first array, but am still seeing the same issues as prior. The ndcnum variable is also a character variable of length 11 formatted to match the other values listed in the created arrays. There are several more categories as well, but they all have the same initialization and utilization as demonstrated above. Seriously scratching my head as to why I'm seeing the issues I'm seeing.
You've given us the code rather than the data. A sample (mock) of the data you have and the output you want would let us copy and paste it into our SAS environment and test your logic (code).
What does this set sample contain?
The ndcnum variable should be in set sample, right?
Ah! Sorry. I have a tab-delimited text file attached. There are some other data, but this is all I'm actively using in my process. The array data is extracted from another file that has been verified as correct and the ndcnum data is what I'd like to check.
A couple of small pieces just to get them out of the way:
Since you need a set of constants, not a set of variables, make the array elements temporary:
array secgen {192} $ _temporary_ (..................................);
And confirm that NDCNUM is actually character. If it's numeric, the quotes around the array elements should be removed.
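To make those two points concrete, here is a minimal sketch, not the poster's full program. It uses only a handful of the NDC codes shown above (the real lists are much longer), the data set name ndc_demo and the test rows are made up for illustration, and it assumes NDCNUM is character. I've also given the array elements an explicit length of 11; as far as I know, character array elements default to a length of 8 when no length is given, which would truncate 11-character codes, so treat that as something to verify against your own log.

```sas
data ndc_demo;                       /* hypothetical data set name */
   length ndcnum $ 11 drug $ 6;
   input ndcnum $;

   /* _temporary_ arrays hold constants and create no variables, */
   /* so nothing read in from another data set can overwrite them. */
   /* $ 11 is stated explicitly so the codes are not cut short.   */
   array secgen {3} $ 11 _temporary_ ("50580073001" "50580073005" "50580073006");
   array firgen {3} $ 11 _temporary_ ("50580037001" "76413033224" "66715970602");

   /* WHICHC returns the position of the first match, or 0 if none */
   if whichc(ndcnum, of secgen{*}) > 0 then drug = "secgen";
   else if whichc(ndcnum, of firgen{*}) > 0 then drug = "firgen";
   else drug = "none";
datalines;
50580073005
66715970602
00000000000
;
run;
```

Running PROC PRINT on ndc_demo is a quick way to check whether each row landed in the category you expect.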
Your sample code works. You may not have all of the array values set in your arrays.
data originaldata;
input readvariable $;
cards;
a
b
c
d
f
g
aa1
aaa
aab
aac
aad
;
data categorize;
set originaldata;
category = "none";
array categoryone (50) $ ("a", "b", "c", "d");
array categorytwo (200) $ ("aaa", "aab", "aac");
if readvariable in categoryone then category = "1";
else if readvariable in categorytwo then category = "2";
run;
You had a lot of "..." placeholders in the way, which I removed.
If you insist on arrays perhaps something like:
data example;
   input readvariable $;
   array categoryone (4) $ 1 _temporary_ ("a", "b", "c", "d");
   array categorytwo (3) $ 3 _temporary_ ("aaa", "aab", "aac");
   length category $ 4;
   if whichc(readvariable, of categoryone(*)) > 0 then category='1';
   else if whichc(readvariable, of categorytwo(*)) > 0 then category='2';
   else category='None';
datalines;
a
b
q
aaa
aac
bbb
;
run;
Though I would be more likely to do something like
proc format library=work;
   value $cat
      "a", "b", "c", "d" = '1'
      "aaa", "aab", "aac" = '2'
      other = 'None'
   ;
run;

data example2;
   input readvariable $;
   category = put(readvariable, $cat.);
datalines;
a
b
q
aaa
aac
bbb
;
run;
Especially if your values you are currently placing in the arrays are available in a data set as formats can be built from datasets.
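As a rough sketch of what building a format from a data set can look like: PROC FORMAT accepts a control data set through the CNTLIN= option, with variables such as FMTNAME, TYPE, START, LABEL and HLO. The data set names cats, ctrl and coded and the format name ndccat are made up for illustration, and the three codes are just samples taken from the lists earlier in the thread, so adapt it to however your category lists are actually stored.

```sas
/* Hypothetical lookup data set: one row per code with its category */
data cats;
   length start $ 11 label $ 6;
   input start $ label $;
datalines;
50580073001 secgen
50580037001 firgen
17478029111 cromo
;
run;

/* Add the variables that CNTLIN= expects, plus an OTHER row */
data ctrl;
   set cats end=last;
   retain fmtname 'ndccat' type 'C';   /* type 'C' = character format */
   output;
   if last then do;
      hlo = 'O';                       /* marks the OTHER range */
      label = 'None';
      output;
   end;
run;

proc format library=work cntlin=ctrl;
run;

/* Apply the format the same way as a hand-written one */
data coded;
   length ndcnum $ 11;
   input ndcnum $;
   drug = put(ndcnum, $ndccat.);
datalines;
50580073001
99999999999
;
run;
```

The advantage is that the category lists live in data rather than in code, so adding a code means adding a row instead of editing an array.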
Or even instead of assignment of a new variable just use the format as needed.
proc freq data=example2;
   tables readvariable;
   format readvariable $cat.;
run;
Categories assigned by formats are honored by almost all of the SAS analysis procedures.