Hello all,
Please, I have a list of variables (about 11 of them) in a dataset, and i want to categorize the observations in each of the variables by creating another 11 variables with the categories.
I have used the proc format procedure to create the categories. If I dont want to assign the formats one at a time to each of the categories, does anyone know how i can go about it please, perhaps using macros or array
?
I have shared my proc format codes and a macro i tried creating, but doesnt seem to answer my question and not correct. The table below explains what I want. Thank you very much
var 1 | var 2 | var 3 | var11 | var1_cat | var2_cat | var3_cat | var11_cat |
1 | 1 | 1 | 1 | male | single | Hispanic | northeast |
2 | 2 | 2 | 2 | female | married | nonHispanic | southwest |
1 | 3 | 1 | 3 | male | divorced | Hispanic | midwest |
2 | 4 | 2 | 4 | female | widowed | nonHispanic | north |
/***my formats**/
proc format library = meps_p;
value $sex_cat 1 = "Male" 2 = "Female" ;
value $race_cat 1 = 'White' 2 = 'Black' 3 = 'American Indian/Alaska Native' 4 = 'Asian/Native Hawaiin/Pacific Islander' 6 = 'Mixed Race';
value $eth_cat 1 = 'Hispanic' 2 = 'non-Hispanic';
value $marr_cat 1 = 'Married' 2 = 'Widowed' 3 = 'Divorced/Seperated' 5 = 'Single' 6 = 'Under 16';
value $edu_cat '1' , '2', '9' = 'No High School degree' '13', '14' = 'High School degree' 15 = 'Bachelors degree' 16 = 'Postgraduate degree' others = 'Unknown';
value $region_cat 1 = 'North-east' 2 = 'Mid-west' 3 = 'South' 4 = 'West' others = 'Unknown';
value $insur_cat 1 = 'Private' 2 = 'Public' others = 'Uninsured';
value $age_cat low - 17 = 'Below 18' 18 - 39 = 'Young Adults' 40- 64 = 'Middle aged' 65 - high = 'Elderly';
value $income_cat low - 0 = 'No income' 1 - 37999 = 'Less than $38,000' 38000 - 47999 = '$38,000 - $47,999' 48000 - 62999 = '$48,000 - $62,999' 63000 - high = '$63,000+';
value $bmi_cat 0 - 18.4 = 'Underweight' 18.5 - 24.9 = 'Healthy Weight' 25 - 29.9 = 'Overweight' 30 - high = 'Obesity' '-9', '-1' = 'Unknown';
value $pov_cat 1 = 'Poor/Negative' 2 = 'Near Poor' 3 = 'Low income' 4 = 'Middle income' 5 = 'High income';
;
run;
/**the macros i tried to create **/
%macro fmt(newdatafile, datafile, newfmtvar, var) ;
%do;
data &newdatafile.;
set &datafile.;
length &newfmtvar._fmt $37.;
options fmtsearch = (meps_p.formats);
&newfmtvar._fmt = put(&var.,$sex_cat.); /*newdatafile is dataset that is created with formatted variables */
run; /**datafile is dataset from which the the dataset is read**/
%end; /**newfmtvar is the new variable created containing the formatted obs**/
%mend fmt; /**var is the variable common to all datasets e.g., sex_cat created using regular coding**/
/** var_cat is format applied**/
%fmt (meps_p.trial, meps_p.csdpooled2, sex, sex);
After cleaning up the format code, it works for me.
109
110 data want;
111 set sample.csdpooled2;
NOTE: Data file SAMPLE.CSDPOOLED2.DATA is in a format that is native to another host, or the file encoding does not match the
session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce
performance.
112 length sex_cat $8. age_cat $25. race_cat $50.;
113 *original variables;
114 array orig(*) sex age race;
115 *new variables;
116 array new(*) sex_cat age_cat race_cat;
117 *formats to be applied;
118 array var_fmts(3) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');
119
120 do i=1 to dim(orig);
121 new(i)=putn(orig(i), var_fmts(i));
122 *may need to be putc();
123 end;
124 run;
NOTE: There were 433143 observations read from the data set SAMPLE.CSDPOOLED2.
NOTE: The data set WORK.WANT has 433143 observations and 23 variables.
NOTE: DATA statement used (Total process time):
real time 1.70 seconds
user cpu time 1.60 seconds
system cpu time 0.11 seconds
memory 2973.12k
OS Memory 29608.00k
Timestamp 11/24/2021 10:49:14 PM
Step Count 91 Switch Count 2
Page Faults 0
Page Reclaims 448
Page Swaps 0
Voluntary Context Switches 14
Involuntary Context Switches 3
Block Input Operations 0
Block Output Operations 189960
@Reeza wrote:
Two or three arrays.
Three if you don't have a consistent naming convention for the variables.
Use PUTN() or PUTC() Can't recall which to apply the formats.
data want; set have;
*original variables; array orig(*) sex age race;
*new variables;
array new(*) $ sex_cat age_cat race_cat;
*formats to be applied;
array var_fmts(*) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');
do i=1 to dim(orig);
new(i) = putn(orig(i), var_fmts(i)); *may need to be putc();
end;
run;
@Banke wrote:
Yeah, I figured too. would i need two arrays? i dont know how to construct it
Please, I have a list of variables (about 11 of them) in a dataset, and i want to categorize the observations in each of the variables by creating another 11 variables with the categories.
Show us a portion of your data set. Show us the desired output.
data have;
input sex, marital_status, ethnicity, region;
datalines;
1,1,1,1
2,2,2,2
1,3,1,3
2,4,2,4
;
run;
/**output i want**/
data want;
datalines sex, sex_cat ,marital_status ,maritalstatus_cat, ethnicity ,ethnicity_cat ,region, region_cat;
1, male, 1, single, 1, Hispanic, 1, north
2, female, 2, married, 2, non-Hispanic, 2, south
1, male, 3, divorced, 1, Hispanic, 3, east
2 female 4 widowed 2 non-Hispanic 4 west
sorry, here it is.Apologies if its not very correct
I think all should be numeric formats not character formats and ensure your data is also numeric.
proc format library = meps_p;
value sex_cat 1 = "Male" 2 = "Female" ;
value race_cat 1 = 'White' 2 = 'Black' 3 = 'American Indian/Alaska Native' 4 = 'Asian/Native Hawaiin/Pacific Islander' 6 = 'Mixed Race';
value eth_cat 1 = 'Hispanic' 2 = 'non-Hispanic';
value marr_cat 1 = 'Married' 2 = 'Widowed' 3 = 'Divorced/Seperated' 5 = 'Single' 6 = 'Under 16';
value edu_cat '1' , '2', '9' = 'No High School degree' '13', '14' = 'High School degree' 15 = 'Bachelors degree' 16 = 'Postgraduate degree' others = 'Unknown';
value region_cat 1 = 'North-east' 2 = 'Mid-west' 3 = 'South' 4 = 'West' others = 'Unknown';
value insur_cat 1 = 'Private' 2 = 'Public' others = 'Uninsured';
value age_cat low - 17 = 'Below 18' 18 - 39 = 'Young Adults' 40- 64 = 'Middle aged' 65 - high = 'Elderly';
value income_cat low - 0 = 'No income' 1 - 37999 = 'Less than $38,000' 38000 - 47999 = '$38,000 - $47,999' 48000 - 62999 = '$48,000 - $62,999' 63000 - high = '$63,000+';
value bmi_cat 0 - 18.4 = 'Underweight' 18.5 - 24.9 = 'Healthy Weight' 25 - 29.9 = 'Overweight' 30 - high = 'Obesity' '-9', '-1' = 'Unknown';
value pov_cat 1 = 'Poor/Negative' 2 = 'Near Poor' 3 = 'Low income' 4 = 'Middle income' 5 = 'High income';
;
run;
Two or three arrays.
Three if you don't have a consistent naming convention for the variables.
Use PUTN() or PUTC() Can't recall which to apply the formats.
data want;
set have;
*original variables;
array orig(*) sex age race;
*new variables;
array new(*) $ sex_cat age_cat race_cat;
*formats to be applied;
array var_fmts(*) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');
do i=1 to dim(orig);
new(i) = putn(orig(i), var_fmts(i)); *may need to be putc();
end;
run;
@Banke wrote:
Yeah, I figured too. would i need two arrays? i dont know how to construct it
/**log**/ data meps_p.trial3; 670 set meps_p.csdpooled2; 671 *original variables; 672 array orig(*) sex age race; 673 *new variables; 674 array new(*) $ sex_cat age_cat race_cat; 675 *formats to be applied; 676 array var_fmts(*) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat'); ERROR: The non-variable based array var_fmts has been defined with zero elements. 677 do i=1 to dim(orig); 678 new(i) = putn(orig(i), var_fmts(i)); *may need to be putc(); ERROR: Too many array subscripts specified for array var_fmts. 679 end; 680 681 run; NOTE: The SAS System stopped processing this step because of errors. WARNING: The data set MEPS_P.TRIAL3 may be incomplete. When this step was stopped there were 0 observations and 23 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
data meps_p.trial3;
set meps_p.csdpooled2;
*original variables;
array orig(*) sex age race;
*new variables;
array new(*) $ sex_cat age_cat race_cat;
*formats to be applied;
array var_fmts(*) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');
do i=1 to dim(orig);
new(i) = putn(orig(i), var_fmts(i)); *may need to be putc();
end;
run;
Thank you so much,
when I used putc or putn, I got the same errors ("The non-variable based array var_fmts has been defined with zero elements" AND "Too many array subscripts specified for array var_fmts" ).
When I tried to define the elements by adding the number e.g. array var_fmts(3) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');
It processes for a really long time with errors that i have to abort it. i have copied a part of the really long errors in the log
/**error when i define array elements for the format array**/ data meps_p.trial3; 710 set meps_p.csdpooled2; 711 *original variables; 712 array orig(*) sex age race; 713 *new variables; 714 array new(*) $ sex_cat age_cat race_cat; 715 *formats to be applied; 716 options fmtsearch = (meps_p.formats); 717 array var_fmts(3) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat'); 718 do i=1 to dim(orig); 719 new(i) = putc(orig(i), var_fmts(i)); *may need to be putc(); 720 end; 721 722 run; NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column). 719:18 NOTE: Argument 2 to function PUTC(' 1','sex_cat ') at line 719 column 13 is invalid. NOTE: Argument 2 to function PUTC(' 39','age_cat ') at line 719 column 13 is invalid. NOTE: Argument 2 to function PUTC(' 1','race_cat') at line 719 column 13 is invalid. PID=101 DUPERSID=00002018 PANEL=1 SEX=1 HISPANX=1 EDRECODE=14 BMI=35.5 race=1 marry=1 age=39 ttlp=53709 inscov=1 region=3 povcat=5 povlev=457.19 perwt=6594.801768 STRA9619=23 PSU9619=4 perwt_av=2198.267256 sex_cat= age_cat= race_cat= i=4 _ERROR_=1 _N_=1 NOTE: Argument 2 to function PUTC(' 2','sex_cat ') at line 719 column 13 is invalid. NOTE: Argument 2 to function PUTC(' 40','age_cat ') at line 719 column 13 is invalid. NOTE: Argument 2 to function PUTC(' 1','race_cat') at line 719 column 13 is invalid. PID=102 DUPERSID=00002025 PANEL=1 SEX=2 HISPANX=1 EDRECODE=13 BMI=28.5 race=1 marry=1 age=40 ttlp=32019 inscov=1 region=3 povcat=5 povlev=457.19 perwt=7144.183132 STRA9619=23 PSU9619=4 perwt_av=2381.3943773 sex_cat= age_cat= race_cat= i=4 _ERROR_=1 _N_=2 NOTE: Argument 2 to function PUTC(' 2','sex_cat ') at line 719 column 13 is invalid. NOTE: Argument 2 to function PUTC(' 10','age_cat ') at line 719 column 13 is invalid. NOTE: Argument 2 to function PUTC(' 1','race_cat') at line 719 column 13 is invalid. PID=103 DUPERSID=00002032 PANEL=1 SEX=2 HISPANX=1 EDRECODE=1 BMI=-1 race=1 marry=6 age=10 ttlp=0 inscov=1 region=3 povcat=5 povlev=457.19 perwt=5616.278666 STRA9619=23 P
I've been trying to twick the code but still getting errors. when I put "options fmtsearch" , I still got the same errors.
Pl
The error message is pretty clear. You tried to create a TEMPORARY array (non variable based array) but you did not tell it how many elements the array should have.
array var_fmts [3] $32 _temporary_ ('sex_cat', 'age_cat', 'race_cat');
You don't need the have the (*) in the other array definitions. SAS knows it has to count them without you telling it to count them, but for a _TEMPORARY_ array there aren't any variables to count so you have to tell it the dimension(s).
Also if you are using the array to define the variable then set the length you want for the variables in the ARRAY statement. Otherwise if you just use $ with no length if the variables are not already defined by the time the compiler sees the ARRAY statement they will be defined as length $8.
array orig sex age race;
array new $20 sex_cat age_cat race_cat;
After cleaning up the format code, it works for me.
109
110 data want;
111 set sample.csdpooled2;
NOTE: Data file SAMPLE.CSDPOOLED2.DATA is in a format that is native to another host, or the file encoding does not match the
session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce
performance.
112 length sex_cat $8. age_cat $25. race_cat $50.;
113 *original variables;
114 array orig(*) sex age race;
115 *new variables;
116 array new(*) sex_cat age_cat race_cat;
117 *formats to be applied;
118 array var_fmts(3) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');
119
120 do i=1 to dim(orig);
121 new(i)=putn(orig(i), var_fmts(i));
122 *may need to be putc();
123 end;
124 run;
NOTE: There were 433143 observations read from the data set SAMPLE.CSDPOOLED2.
NOTE: The data set WORK.WANT has 433143 observations and 23 variables.
NOTE: DATA statement used (Total process time):
real time 1.70 seconds
user cpu time 1.60 seconds
system cpu time 0.11 seconds
memory 2973.12k
OS Memory 29608.00k
Timestamp 11/24/2021 10:49:14 PM
Step Count 91 Switch Count 2
Page Faults 0
Page Reclaims 448
Page Swaps 0
Voluntary Context Switches 14
Involuntary Context Switches 3
Block Input Operations 0
Block Output Operations 189960
@Reeza wrote:
Two or three arrays.
Three if you don't have a consistent naming convention for the variables.
Use PUTN() or PUTC() Can't recall which to apply the formats.
data want; set have;
*original variables; array orig(*) sex age race;
*new variables;
array new(*) $ sex_cat age_cat race_cat;
*formats to be applied;
array var_fmts(*) $ _Temporary_ ('sex_cat', 'age_cat', 'race_cat');
do i=1 to dim(orig);
new(i) = putn(orig(i), var_fmts(i)); *may need to be putc();
end;
run;
@Banke wrote:
Yeah, I figured too. would i need two arrays? i dont know how to construct it
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.