SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Categorizing huge amount of variables

Reply
Occasional Contributor
Posts: 8

Categorizing huge amount of variables

Hi I have a dataset consiting of large amount of variables (50+)

Some of those variables are text and some of them are numeric. I want to categorize every of those variables into 2 or 3 categories. So lets say the second column of my dataset which is age would be categorized into young (age<25) older (25<age<40) oldest (age>40).

Another variable which is gender would be categorized into male and female. And the third is number of cars (0=none, 1 car= 1, 2+ cars= multiple) and so on.

So the thing is every variable woud require writing different categorization criteria which will result in a quite long code if done manually.

I was thinking of creating a table where I would put a criteria how every variable should be split and then create conditions based on that. But I am not sure if it is the simplest approach and maybe you can give me some insight how to deal with this problem.

BTW im on SAS 9.4.

PROC Star
Posts: 1,192

Re: Categorizing huge amount of variables

[ Edited ]

Take a look at formats. They can do a very nice job of recoding data.

 

Here's an example for Age, but you could use the same for all of the other variables.

 

Tom

 

data have;
	input Age;
	cards;
1
15
24
25
39
40
41
89
;
run;

proc format;
	value Agef
		low-24 = "Young"
		25-39 = "Older"
		40-high = "Oldest"
	;
run;

data want;
	set have;
	length AgeReformat $20;
	AgeReformat = put(Age, Agef.);
run;
Super User
Posts: 21,546

Re: Categorizing huge amount of variables

Tom, I edited your post only to see if the text highlighting could be corrected. Apparently not Smiley Sad
Occasional Contributor
Posts: 8

Re: Categorizing huge amount of variables

And lets say I have a table where I keep conditions using which i have to cateogrize?

Like I have table in this format:

Category Condition

1               not missing(variable_name) and variable_name<=20

2               20<variable_name <=40

3               40<variable_name

 

Then based on the column condition I would categorize those values.

Highlighted
PROC Star
Posts: 7,800

Re: Categorizing huge amount of variables

I agree with @TomKari but, depending upon what you want to do, you likely don't have to create new variables .. just create and apply the formats. e.g., using Tom's example:

 

data have;
	input Age;
	cards;
1
15
24
25
39
40
41
89
run;

proc format;
	value Agef
		low-24 = "Young"
		25-39 = "Older"
		40-high = "Oldest"
	;
run;

data want;
	set have;
	format Age Agef.;
run;

proc freq data=want;
  tables age;
run;

Art, CEO, AnalystFinder.com

 

Occasional Contributor
Posts: 8

Re: Categorizing huge amount of variables

Is it possible to parse conditions from another table and then use them to create format? As I have this kind of table created before and Im required to use it:

Category Condition
1 not missing(variable_name) and variable_name<=20
2 20<variable_name <=40
3 40<variable_name

Then based on the column condition I would categorize those values.
PROC Star
Posts: 7,800

Re: Categorizing huge amount of variables

Yes, you can create formats from a table. See, e.g., www2.sas.com/proceedings/forum2007/068-2007.pdf

 

and then you can apply the same format(s) to as many variables as needed.

 

Art, CEO, AnalystFinder.com

 

PROC Star
Posts: 1,935

Re: Categorizing huge amount of variables

You'll have to define the character categories yourself. Formats are a good option to do this as explained.

For numerics, this may be a good option to automate the intervals if you only want counts:


proc summary data=SASHELP.CLASS ; 
  var _numeric_  ;
  output out=SUM min= p25= p50= p75= max= /autoname;
run;

 

 

Ask a Question
Discussion stats
  • 7 replies
  • 312 views
  • 2 likes
  • 5 in conversation