jdchang
Fluorite | Level 6

Hi folks,

 

I've got a large dataset of character values for which I want to create a category variable. The easiest approach seemed to be to create an array for each valid category and then assign the category based on which array contains the value. A simplified example is shown below:

 

 

data categorize;
   set originaldata;
   category = "none";

   array categoryone (50) $ ("a", "b", "c", "d",...);
   array categorytwo (200) $ ("aaa", "aab", "aac",...);

   if readvariable in categoryone then category = "1";
      else if readvariable in categorytwo then category = "2";
run;

 

 

What I'm noticing when checking proc freq is that my category variable is "none" for the whole dataset. In trying to figure out why, I decided to check the arrays I created.

 

Upon checking the arrays, I saw that the values were assigned incorrectly: categoryone1 had a value of "s" instead of the expected "a", and categoryone2-categoryone6 were also "s". I've tried looking around for what I'm doing wrong in my array assignment and have tried things such as removing the ","s, but to no avail. I realize that my issue is likely multifaceted, but I can't find documentation that states where or how. Help me SAS Community boards. You're my only hope.

10 REPLIES
novinosrin
Tourmaline | Level 20

Would you be able to provide a representative sample of your data and your required output?

jdchang
Fluorite | Level 6

Sure thing. The actual data looks close to this:

 

data sample1; 
	set sample;

	array secgen (192) $ 	("50580073001" "50580073005" "50580073006" "50580072105" "50580078201" "50580078212" "50580078224" ...)
	array firgen (1318) $ firgen1-firgen1318 ("50580037001", "76413033224", "66715970602", "66715970603", "61786086636",	"50580022650" ...)
        array cromo (20) $ cromo1-cromo20 ("17478029111", "59779007513", "70556010260", "50090312500", "69784020096" ...)

	if ndcnum in secgen then drug = "secgen";
		else if ndcnum in firgen then drug = "firgen";
		else if ndcnum in cromo then drug = "cromo";

run;

 

As I have it now I tried removing commas and not specifically naming array variables in the first array, but am still seeing the same issues as prior. The ndcnum variable is also a character variable of length 11 formatted to match the other values listed in the created arrays. There are several more categories as well, but they all have the same initialization and utilization as demonstrated above. Seriously scratching my head as to why I'm seeing the issues I'm seeing.

novinosrin
Tourmaline | Level 20

You are giving the code rather than the data. Sample (mock) data showing what you have and what you want in your output would let us copy and paste it into our SAS environment and test your logic (code).

 

What does the data set sample contain?

The ndcnum variable should be in sample, right?

jdchang
Fluorite | Level 6

Ah! Sorry. I have a tab-delimited text file attached. There are some other data, but this is all I'm actively using in my process. The array data is extracted from another file that has been verified as correct and the ndcnum data is what I'd like to check.

Reeza
Super User
What's the type, length and format of the ndcnum variable?
Reeza
Super User
It would help if you could generate a full sample and clearly show the problematic output. I can't quite seem to understand what the issue is yet.
Astounding
PROC Star

A couple of small pieces just to get them out of the way:

 

Since you need a set of constants, not a set of variables, make the array elements temporary:

 

array secgen {192} $ _temporary_ (..................................);

 

And confirm that NDCNUM is actually character.  If it's numeric, the quotes around the array elements should be removed.
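
One way to confirm the type is to ask SAS directly. This is only a rough sketch: the small SAMPLE data set below is a hypothetical stand-in for the real one, and ndc_type and ndc_len are names made up for the illustration. VTYPE returns "C" for character and "N" for numeric, and VLENGTH returns the storage length.

```sas
/* Hypothetical stand-in for the poster's SAMPLE data set */
data sample;
   input ndcnum :$11.;
   datalines;
50580073001
;

/* Write the type and length of NDCNUM to the log */
data _null_;
   set sample (obs=1);
   ndc_type = vtype(ndcnum);    /* "C" = character, "N" = numeric */
   ndc_len  = vlength(ndcnum);  /* storage length in bytes */
   put ndc_type= ndc_len=;
run;
```

PROC CONTENTS on the data set reports the same information, along with any format attached to the variable.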

VDD
Ammonite | Level 13

Your sample code works.  You may not have all of the array values set in your arrays.

 

data originaldata;
input readvariable $;
cards;
a
b
c
d 
f
g
aa1
aaa
aab
aac
aad
;


data categorize;
   set originaldata;
   category = "none";

   array categoryone (50) $ ("a", "b", "c", "d");
   array categorytwo (200) $ ("aaa", "aab", "aac");

   if readvariable in categoryone then category = "1";
   else if readvariable in categorytwo then category = "2";
run;

You had a lot of "..." placeholders in the way, which I removed.

Reeza
Super User
Which likely means it's either a source data issue or something we're not seeing (previous code) that's causing the issues.

FYI - you can speed this up by using temporary arrays and loading them from a data set, if you have one, or by using a format, which would be significantly faster.
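
A minimal sketch of the temporary-array idea, assuming the codes for one category sit in a data set. SECGEN_CODES, its variable CODE, and the tiny SAMPLE data set are all hypothetical stand-ins for the real data; the array is sized to the three mock rows and would need to be at least as large as the real lookup table.

```sas
/* Hypothetical lookup table: one row per code in the category */
data secgen_codes;
   input code :$11.;
   datalines;
50580073001
50580073005
50580073006
;

/* Hypothetical data to classify */
data sample;
   input ndcnum :$11.;
   datalines;
50580073005
99999999999
;

data sample1;
   /* Sized to hold the lookup rows; 3 matches the mock table above */
   array secgen {3} $ 11 _temporary_;
   /* Fill the array once, on the first iteration of the step */
   if _n_ = 1 then do i = 1 to n_codes;
      set secgen_codes nobs=n_codes;
      secgen{i} = code;
   end;
   set sample;
   length drug $ 6;
   if whichc(ndcnum, of secgen{*}) > 0 then drug = "secgen";
   else drug = "none";
   drop i code;
run;
```

Temporary array elements are retained across observations, so the lookup table is read only once rather than for every row of SAMPLE.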
ballardw
Super User

If you insist on arrays perhaps something like:

data example;
   input readvariable $;
   array categoryone (4) $ 1 _temporary_ ("a", "b", "c", "d");
   array categorytwo (3) $ 3 _temporary_  ("aaa", "aab", "aac");
   length category $ 4;
   if whichc(readvariable,of categoryone(*)) > 0 then category='1';
   else if whichc(readvariable,of categorytwo(*)) > 0 then category='2';
   else category='None';

datalines;
a
b
q
aaa
aac
bbb
;
run;

Though I would be more likely to do something like

 

proc format library=work;
value $cat
"a", "b", "c", "d"= '1'
"aaa", "aab", "aac"='2'
other='None'
;
run;
data example2;
   input readvariable $;
   category = put(readvariable,$cat.);
datalines;
a
b
q
aaa
aac
bbb
;
run;

This is especially useful if the values you are currently placing in the arrays are available in a data set, as formats can be built from data sets.
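
As a rough sketch of building a format from a data set, PROC FORMAT accepts a CNTLIN= data set with the variables FMTNAME, TYPE, START, and LABEL (plus HLO for the "other" row). The DRUG_CODES and SAMPLE data sets and the format name $drugcat are hypothetical; the three codes are taken from the arrays posted earlier in the thread.

```sas
/* Hypothetical lookup table: one row per code with its category */
data drug_codes;
   input code :$11. drug :$6.;
   datalines;
50580073001 secgen
50580037001 firgen
17478029111 cromo
;

/* Reshape into the layout PROC FORMAT expects for CNTLIN= */
data drug_fmt;
   set drug_codes (rename=(code=start drug=label)) end=last;
   retain fmtname "$drugcat" type "C";
   output;
   /* One extra row maps every unlisted value to "none" */
   if last then do;
      hlo = "O";
      label = "none";
      output;
   end;
run;

proc format library=work cntlin=drug_fmt;
run;

/* Hypothetical data to classify */
data sample;
   input ndcnum :$11.;
   datalines;
50580037001
99999999999
;

data sample1;
   set sample;
   drug = put(ndcnum, $drugcat.);
run;
```

With this layout, adding or changing codes only means updating the lookup data set and rerunning PROC FORMAT; the classifying data step stays the same.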

 

Or, instead of assigning a new variable, just use the format where needed.

proc freq data=example2;
   tables readvariable;
   format readvariable $cat.;
run;

Categories assigned by formats are honored by almost all of the SAS analysis procedures.

 

