BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Banke
Pyrite | Level 9

Hello all,

 

Please, I have a list of variables (about 11 of them) in a dataset, and i want to categorize the observations in each of the variables by creating another 11 variables with the categories.

 

I have used the proc format procedure to create the categories. If I dont want to assign the formats one at a time to each of the categories, does anyone know how i can go about it please, perhaps using macros or array 

I have shared my proc format codes and a macro i tried creating, but doesnt seem to answer my question and not correct. The table below explains what I want. Thank you very much

var 1 var 2 var 3 var11 var1_cat var2_cat var3_cat var11_cat
1 1 1 1 male single Hispanic northeast
2 2 2 2 female married nonHispanic southwest
1 3 1 3 male divorced Hispanic midwest
2 4 2 4 female widowed nonHispanic north

 

/***my formats**/
proc format library = meps_p;
value $sex_cat 1 = "Male" 2 = "Female" ;
value $race_cat 1 = 'White' 2 = 'Black' 3 = 'American Indian/Alaska Native' 4 = 'Asian/Native Hawaiin/Pacific Islander' 6 = 'Mixed Race';
value $eth_cat 1 = 'Hispanic' 2 = 'non-Hispanic';
value $marr_cat 1 = 'Married' 2 = 'Widowed' 3 = 'Divorced/Seperated' 5 = 'Single' 6 = 'Under 16';
value $edu_cat '1' , '2', '9' = 'No High School degree'  '13', '14' = 'High School degree' 15 = 'Bachelors degree' 16 = 'Postgraduate degree' others = 'Unknown';
value $region_cat 1 = 'North-east' 2 = 'Mid-west' 3 = 'South' 4 = 'West' others = 'Unknown';
value $insur_cat 1 = 'Private' 2 = 'Public' others = 'Uninsured';
value $age_cat low - 17 = 'Below 18' 18 - 39 = 'Young Adults' 40- 64 = 'Middle aged' 65 - high = 'Elderly';
value $income_cat low - 0 = 'No income' 1 - 37999 = 'Less than $38,000' 38000 - 47999 = '$38,000 - $47,999' 48000 - 62999 = '$48,000 - $62,999' 63000 - high = '$63,000+';
value $bmi_cat  0 - 18.4 = 'Underweight'  18.5 - 24.9 = 'Healthy Weight' 25 - 29.9 = 'Overweight' 30 - high = 'Obesity'  '-9', '-1' = 'Unknown';
value $pov_cat  1 = 'Poor/Negative'  2 = 'Near Poor' 3 = 'Low income' 4 = 'Middle income'  5 = 'High income';
;
run;

/**the macros i tried to create **/
%macro fmt(newdatafile, datafile, newfmtvar, var)  ;
%do;
data &newdatafile.;
set &datafile.;
length &newfmtvar._fmt $37.;
options fmtsearch = (meps_p.formats); 
&newfmtvar._fmt = put(&var.,$sex_cat.);        /*newdatafile is dataset that is created with formatted variables */
run;                                           /**datafile is dataset from which the the dataset is read**/
%end;                                          /**newfmtvar is the new variable created containing the formatted obs**/
%mend fmt;                                   /**var is the variable common to all datasets e.g., sex_cat created using regular coding**/
                                             /** var_cat is format applied**/
%fmt (meps_p.trial, meps_p.csdpooled2, sex, sex);

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

After cleaning up the format code, it works for me.

 

 109        
 110        data want;
 111        set sample.csdpooled2;
 NOTE: Data file SAMPLE.CSDPOOLED2.DATA is in a format that is native to another host, or the file encoding does not match the 
       session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce 
       performance.
 112        length sex_cat $8. age_cat $25. race_cat $50.;
 113        *original variables;
 114        array orig(*) sex age race;
 115        *new variables;
 116        array new(*)  sex_cat age_cat race_cat;
 117        *formats to be applied;
 118        array var_fmts(3) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');
 119        
 120        do i=1 to dim(orig);
 121        new(i)=putn(orig(i), var_fmts(i));
 122        *may need to be putc();
 123        end;
 124        run;
 
 NOTE: There were 433143 observations read from the data set SAMPLE.CSDPOOLED2.
 NOTE: The data set WORK.WANT has 433143 observations and 23 variables.
 NOTE: DATA statement used (Total process time):
       real time           1.70 seconds
       user cpu time       1.60 seconds
       system cpu time     0.11 seconds
       memory              2973.12k
       OS Memory           29608.00k
       Timestamp           11/24/2021 10:49:14 PM
       Step Count                        91  Switch Count  2
       Page Faults                       0
       Page Reclaims                     448
       Page Swaps                        0
       Voluntary Context Switches        14
       Involuntary Context Switches      3
       Block Input Operations            0
       Block Output Operations           189960

@Reeza wrote:

Two or three arrays. 

Three if you don't have a consistent naming convention for the variables. 

Use PUTN() or PUTC() Can't recall which to apply the formats.

 

data want;
set have;
*original variables; array orig(*) sex age race;
*new variables;
array new(*) $ sex_cat age_cat race_cat;
*formats to be applied;
array var_fmts(*) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');


do i=1 to dim(orig);
new(i) = putn(orig(i), var_fmts(i)); *may need to be putc();
end;

run;

@Banke wrote:
Yeah, I figured too. would i need two arrays? i dont know how to construct it

 


 

View solution in original post

15 REPLIES 15
PaigeMiller
Diamond | Level 26

Please, I have a list of variables (about 11 of them) in a dataset, and i want to categorize the observations in each of the variables by creating another 11 variables with the categories.


Show us a portion of your data set. Show us the desired output.

--
Paige Miller
Banke
Pyrite | Level 9
data have;

input sex, marital_status, ethnicity, region;

datalines;

1,1,1,1

2,2,2,2

1,3,1,3

2,4,2,4

;

run;

/**output i want**/
data want;
datalines sex, sex_cat ,marital_status ,maritalstatus_cat, ethnicity ,ethnicity_cat ,region, region_cat;

1,  male, 1, single, 1, Hispanic, 1, north

2, female, 2, married, 2, non-Hispanic, 2, south

1, male, 3, divorced, 1, Hispanic, 3, east

2 female 4 widowed 2 non-Hispanic 4 west

sorry, here it is.Apologies if its not very correct

Reeza
Super User

I think all should be numeric formats not character formats and ensure your data is also numeric.

proc format library = meps_p;
value sex_cat 1 = "Male" 2 = "Female" ;
value race_cat 1 = 'White' 2 = 'Black' 3 = 'American Indian/Alaska Native' 4 = 'Asian/Native Hawaiin/Pacific Islander' 6 = 'Mixed Race';
value eth_cat 1 = 'Hispanic' 2 = 'non-Hispanic';
value marr_cat 1 = 'Married' 2 = 'Widowed' 3 = 'Divorced/Seperated' 5 = 'Single' 6 = 'Under 16';
value edu_cat '1' , '2', '9' = 'No High School degree' '13', '14' = 'High School degree' 15 = 'Bachelors degree' 16 = 'Postgraduate degree' others = 'Unknown';
value region_cat 1 = 'North-east' 2 = 'Mid-west' 3 = 'South' 4 = 'West' others = 'Unknown';
value insur_cat 1 = 'Private' 2 = 'Public' others = 'Uninsured';
value age_cat low - 17 = 'Below 18' 18 - 39 = 'Young Adults' 40- 64 = 'Middle aged' 65 - high = 'Elderly';
value income_cat low - 0 = 'No income' 1 - 37999 = 'Less than $38,000' 38000 - 47999 = '$38,000 - $47,999' 48000 - 62999 = '$48,000 - $62,999' 63000 - high = '$63,000+';
value bmi_cat 0 - 18.4 = 'Underweight' 18.5 - 24.9 = 'Healthy Weight' 25 - 29.9 = 'Overweight' 30 - high = 'Obesity' '-9', '-1' = 'Unknown';
value pov_cat 1 = 'Poor/Negative' 2 = 'Near Poor' 3 = 'Low income' 4 = 'Middle income' 5 = 'High income';
;
run;
Reeza
Super User
FYI - there isn't a huge value to using a macro here. A single data step is probably as much code, or an array would also work to do the conversion.

Banke
Pyrite | Level 9
Yeah, I figured too. would i need two arrays? i dont know how to construct it
Reeza
Super User

Two or three arrays. 

Three if you don't have a consistent naming convention for the variables. 

Use PUTN() or PUTC() Can't recall which to apply the formats.

 

data want;
set have;
*original variables; array orig(*) sex age race;
*new variables;
array new(*) $ sex_cat age_cat race_cat;
*formats to be applied;
array var_fmts(*) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');


do i=1 to dim(orig);
new(i) = putn(orig(i), var_fmts(i)); *may need to be putc();
end;

run;

@Banke wrote:
Yeah, I figured too. would i need two arrays? i dont know how to construct it

 

Banke
Pyrite | Level 9
/**log**/
data meps_p.trial3;
670  set meps_p.csdpooled2;
671  *original variables;
672  array orig(*) sex age race;
673  *new variables;
674  array new(*) $ sex_cat age_cat race_cat;
675  *formats to be applied;
676  array var_fmts(*) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');
ERROR: The non-variable based array var_fmts has been defined with zero elements.
677  do i=1 to dim(orig);
678     new(i) = putn(orig(i), var_fmts(i)); *may need to be putc();
ERROR: Too many array subscripts specified for array var_fmts.
679  end;
680
681  run;

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set MEPS_P.TRIAL3 may be incomplete.  When this step was stopped there were 0 observations and 23 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
data meps_p.trial3;
set meps_p.csdpooled2;
*original variables;
array orig(*) sex age race;
*new variables;
array new(*) $ sex_cat age_cat race_cat;
*formats to be applied;
array var_fmts(*) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');
do i=1 to dim(orig);
   new(i) = putn(orig(i), var_fmts(i)); *may need to be putc();
end;

run;

Thank you so much,

when I used putc or putn, I got the same errors ("The non-variable based array var_fmts has been defined with zero elements" AND "Too many array subscripts specified for array var_fmts" ).

 

When I tried to define the elements by adding the number e.g. array var_fmts(3) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');

It processes for a really long time with errors that i have to abort it. i have copied a part of the really long errors in the log

/**error when i define array elements for the format array**/

data meps_p.trial3;
710  set meps_p.csdpooled2;
711  *original variables;
712  array orig(*) sex age race;
713  *new variables;
714  array new(*) $ sex_cat age_cat race_cat;
715  *formats to be applied;
716  options fmtsearch = (meps_p.formats);
717  array var_fmts(3) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');
718  do i=1 to dim(orig);
719     new(i) = putc(orig(i), var_fmts(i)); *may need to be putc();
720  end;
721
722  run;

NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
      719:18
NOTE: Argument 2 to function PUTC('           1','sex_cat ') at line 719 column 13 is invalid.
NOTE: Argument 2 to function PUTC('          39','age_cat ') at line 719 column 13 is invalid.
NOTE: Argument 2 to function PUTC('           1','race_cat') at line 719 column 13 is invalid.
PID=101 DUPERSID=00002018 PANEL=1 SEX=1 HISPANX=1 EDRECODE=14 BMI=35.5 race=1 marry=1 age=39 ttlp=53709 inscov=1 region=3 povcat=5
povlev=457.19 perwt=6594.801768 STRA9619=23 PSU9619=4 perwt_av=2198.267256 sex_cat=  age_cat=  race_cat=  i=4 _ERROR_=1 _N_=1
NOTE: Argument 2 to function PUTC('           2','sex_cat ') at line 719 column 13 is invalid.
NOTE: Argument 2 to function PUTC('          40','age_cat ') at line 719 column 13 is invalid.
NOTE: Argument 2 to function PUTC('           1','race_cat') at line 719 column 13 is invalid.
PID=102 DUPERSID=00002025 PANEL=1 SEX=2 HISPANX=1 EDRECODE=13 BMI=28.5 race=1 marry=1 age=40 ttlp=32019 inscov=1 region=3 povcat=5
povlev=457.19 perwt=7144.183132 STRA9619=23 PSU9619=4 perwt_av=2381.3943773 sex_cat=  age_cat=  race_cat=  i=4 _ERROR_=1 _N_=2
NOTE: Argument 2 to function PUTC('           2','sex_cat ') at line 719 column 13 is invalid.
NOTE: Argument 2 to function PUTC('          10','age_cat ') at line 719 column 13 is invalid.
NOTE: Argument 2 to function PUTC('           1','race_cat') at line 719 column 13 is invalid.
PID=103 DUPERSID=00002032 PANEL=1 SEX=2 HISPANX=1 EDRECODE=1 BMI=-1 race=1 marry=6 age=10 ttlp=0 inscov=1 region=3 povcat=5
povlev=457.19 perwt=5616.278666 STRA9619=23 P

 

I've been trying to twick the code but still getting errors. when I put "options fmtsearch" , I still got the same errors. 

Pl

Reeza
Super User
If you provide data I can test it.
Otherwise you can either try and debug or try a different approach.
Banke
Pyrite | Level 9
Hi Reeza, thank you. I tried debugging but It didn't work. I tried hard coding but it did. I would however like to know how to use an array or other fancy methods for doing this. That is the essence of the exercise. I am pleased to share the data with you. Can I do that here? Thank you.
Banke
Pyrite | Level 9
Hi Reeza, thank you. I tried debugging but It didn't work. I tried hard coding but it did. I would however like to know how to use an array or other fancy methods for doing this. That is the essence of the exercise. I am pleased to share the data with you. Can I do that here? Thank you.
Tom
Super User Tom
Super User

The error message is pretty clear.  You tried to create a TEMPORARY array (non variable based array) but you did not tell it how many elements the array should have.  

array var_fmts [3]  $32 _temporary_ ('sex_cat', 'age_cat', 'race_cat');

You don't need the have the (*) in the other array definitions.  SAS knows it has to count them without you telling it to count them, but for a _TEMPORARY_ array there aren't any variables to count so you have to tell it the dimension(s). 

 

Also if you are using the array to define the variable then set the length you want for the variables in the ARRAY statement. Otherwise if you just use $ with no length if the variables are not already defined by the time the compiler sees the ARRAY statement they will be defined as length $8. 

array orig sex age race;
array new $20 sex_cat age_cat race_cat;

 

Banke
Pyrite | Level 9
Thanks for the information Tom, I didnt know that. I applied it and it worked perfectly! Thanks
Reeza
Super User

After cleaning up the format code, it works for me.

 

 109        
 110        data want;
 111        set sample.csdpooled2;
 NOTE: Data file SAMPLE.CSDPOOLED2.DATA is in a format that is native to another host, or the file encoding does not match the 
       session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce 
       performance.
 112        length sex_cat $8. age_cat $25. race_cat $50.;
 113        *original variables;
 114        array orig(*) sex age race;
 115        *new variables;
 116        array new(*)  sex_cat age_cat race_cat;
 117        *formats to be applied;
 118        array var_fmts(3) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');
 119        
 120        do i=1 to dim(orig);
 121        new(i)=putn(orig(i), var_fmts(i));
 122        *may need to be putc();
 123        end;
 124        run;
 
 NOTE: There were 433143 observations read from the data set SAMPLE.CSDPOOLED2.
 NOTE: The data set WORK.WANT has 433143 observations and 23 variables.
 NOTE: DATA statement used (Total process time):
       real time           1.70 seconds
       user cpu time       1.60 seconds
       system cpu time     0.11 seconds
       memory              2973.12k
       OS Memory           29608.00k
       Timestamp           11/24/2021 10:49:14 PM
       Step Count                        91  Switch Count  2
       Page Faults                       0
       Page Reclaims                     448
       Page Swaps                        0
       Voluntary Context Switches        14
       Involuntary Context Switches      3
       Block Input Operations            0
       Block Output Operations           189960

@Reeza wrote:

Two or three arrays. 

Three if you don't have a consistent naming convention for the variables. 

Use PUTN() or PUTC() Can't recall which to apply the formats.

 

data want;
set have;
*original variables; array orig(*) sex age race;
*new variables;
array new(*) $ sex_cat age_cat race_cat;
*formats to be applied;
array var_fmts(*) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');


do i=1 to dim(orig);
new(i) = putn(orig(i), var_fmts(i)); *may need to be putc();
end;

run;

@Banke wrote:
Yeah, I figured too. would i need two arrays? i dont know how to construct it

 


 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 15 replies
  • 1041 views
  • 2 likes
  • 4 in conversation