I'm trying to create a new column by grouping variables from a different column to a new output. I'm also trying to identify those variables just by their first 2 digits using a wildcard. When I run the following, the output just puts "1" in the new column for all rows, even if doesn't abide by the intended "if" statement. I'm not quite sure why it does that and fails to have the intended output in the new column.
Any help would be greatly appreciated!!
data test;
set test;
if HOSPSTCO=:'23' or '33' or '50' or '25' or '44' or '09' then HOSP_DIVISION=1;
else if HOSPSTCO=:'36' or '42' or '34' then HOSP_DIVISION=2;
else if HOSPSTCO=:'55' or '26' or '17' or '18' or '39' then HOSP_DIVISION=3;
else if HOSPSTCO=:'29' or '38' or '46' or '31' or '20' or '27' or '19' then HOSP_DIVISION=4;
else if HOSPSTCO=:'10' or '24' or '11' or '51' or '54' or '37' or '45' or '13' or '12' then HOSP_DIVISION=5;
else if HOSPSTCO=:'21' or '47' or '28' or '01' then HOSP_DIVISION=6;
else if HOSPSTCO=:'40' or '48' or '05' or '22' then HOSP_DIVISION=7;
else if HOSPSTCO=:'16' or '30' or '56' or '32' or '49' or '08' or '04' or '35' then HOSP_DIVISION=8;
else if HOSPSTCO=:'02' or '53' or '41' or '06' or '15' then HOSP_DIVISION=9;
run;
The proper syntax is:
if HOSPSTCO=:'23' or HOSPSTCO=:'33' or HOSPSTCO=:'50' or
HOSPSTCO=:'25' or HOSPSTCO=:'44' or HOSPSTCO=:'09'
then HOSP_DIVISION=1;
Alternative
if substr(hospstco,1,2) in ('23','33','50','25','44','09') then hosp_division=1;
@lturian wrote:
When I try this, it doesn't output any variable into HOSP_DIVISION. Every row just has "." instead of the desired output. I wonder if the problem lies with the HOSPSTCO data
Then LOOK AT the data in data set TEST with your own eyes.
The proper operator to use is IN .
if substr(HOSPSTCO,1,2) in ('23' '33' '50' '25' '44' '09') then HOSP_DIVISION=1;
Or use the SELECT When for doing lots of comparisons instead of If/then/else
data test2; set test; select (substr(HOSPSTCO,1,2)); When ('23' , '33' , '50' , '25' , '44' , '09' ) HOSP_DIVISION=1; When ('36' , '42' , '34' ) HOSP_DIVISION=2; When ('55' , '26' , '17' , '18' , '39' ) HOSP_DIVISION=3; When ('29' , '38' , '46' , '31' , '20' , '27' , '19' ) HOSP_DIVISION=4; When ('10' , '24' , '11' , '51' , '54' , '37' , '45' , '13' , '12' ) HOSP_DIVISION=5; When ('21' , '47' , '28' , '01' ) HOSP_DIVISION=6; When ('40' , '48' , '05' , '22' ) HOSP_DIVISION=7; When ('16' , '30' , '56' , '32' , '49' , '08' , '04' , '35' ) HOSP_DIVISION=8; When ('02' , '53' , '41' , '06' , '15' ) HOSP_DIVISION=9; otherwise; end; run;
The otherwise is not required but there will be a warning if not present. It is to do what ever you might want if none of the values appear in the other When conditions. Such as perhaps put a warning to the log about an unexpected value or missing value.
Another alternative if you have a data set with the code values and categories would be to create a format to assign the division code and just use the format instead of adding another variable.
This takes advantage of an oddity in range comparisons of character values:
proc format; value $hospdiv '23' - '2399'= 'Division 1' '33' - '3399'= 'Division 1' '36' - '3699'= 'Division 2' '42' - '4299'= 'Division 2' ; run; data junk; input x $; put x= $hospdiv.; datalines; 23 231 2345 33 3356 36 367 3678 42 4244 ;
I have several formats that are used to turn location codes into regions, nested region/subregion (multilabel formats), long and short descriptions.
Groups created with formats will be honored by most analysis, reporting and graphing procedures. They have a big advantage that if you need to make a small "tweek" you only need to modify Format code and do not have to go back to the data set to add yet another variable.
Formats are your friend.
Untested but something like this is a lot cleaner and easier to maintain and test. My nickel at least.
proc format;
invalue hosp_division
'23' '33' '50' '25' '44' '09' = 1
'36' '42' '34' = 2
'55' '26' '17' '18' '39' =3;
run;
data test2;*use a different name otherwise it's hard to track changes over time and harder to debug;
set test;
hosp_division = input(substr(hospstco, 1, 2), hosp_division.);
run;
@lturian wrote:
I'm trying to create a new column by grouping variables from a different column to a new output. I'm also trying to identify those variables just by their first 2 digits using a wildcard. When I run the following, the output just puts "1" in the new column for all rows, even if doesn't abide by the intended "if" statement. I'm not quite sure why it does that and fails to have the intended output in the new column.
Any help would be greatly appreciated!!
data test; set test; if HOSPSTCO=:'23' or '33' or '50' or '25' or '44' or '09' then HOSP_DIVISION=1; else if HOSPSTCO=:'36' or '42' or '34' then HOSP_DIVISION=2; else if HOSPSTCO=:'55' or '26' or '17' or '18' or '39' then HOSP_DIVISION=3; else if HOSPSTCO=:'29' or '38' or '46' or '31' or '20' or '27' or '19' then HOSP_DIVISION=4; else if HOSPSTCO=:'10' or '24' or '11' or '51' or '54' or '37' or '45' or '13' or '12' then HOSP_DIVISION=5; else if HOSPSTCO=:'21' or '47' or '28' or '01' then HOSP_DIVISION=6; else if HOSPSTCO=:'40' or '48' or '05' or '22' then HOSP_DIVISION=7; else if HOSPSTCO=:'16' or '30' or '56' or '32' or '49' or '08' or '04' or '35' then HOSP_DIVISION=8; else if HOSPSTCO=:'02' or '53' or '41' or '06' or '15' then HOSP_DIVISION=9; run;
All good replies.
And if in the real-world example there are cases where you sometimes want the first two characters of HOSPSTCO and other cases where you want the first three (or four or ... ) characters to match, then I think all of the solutions fail except the IF/THEN construct. But maybe I'm wrong, maybe there's a way to get the SUBSTR or FORMAT solution to work there, I just can think of it right now.
Just curious whether you tried to replace this:
hosp_division = input(substr(hospstco, 1, 2), hosp_division.);
Is it possible this would work?
hosp_division = input(hospstco, hosp_division2.);
@Astounding wrote:
Just curious whether you tried to replace this:
hosp_division = input(substr(hospstco, 1, 2), hosp_division.);
Is it possible this would work?
hosp_division = input(hospstco, hosp_division2.);
That will work.
@lturian wrote:
I do like formats. But for some reason I'm not getting anything the formatted variables in the new column. I'm just getting "." in all rows.
42
Values? Format / informat definition? Code used?
I used the invalue format you provided:
proc format;
invalue FIPS
'23' '33' '50' '25' '44' '09' = 1
'36' '42' '34' = 2
'55' '26' '17' '18' '39' = 3
'29' '38' '46' '31' '20' '27' '19' = 4
'10' '24' '11' '51' '54' '37' '45' '13' '12' = 5
'21' '47' '28' '01' = 6
'40' '48' '05' '22' = 7
'16' '30' '56' '32' '49' '08' '04' '35' = 8
'02' '53' '41' '06' '15' = 9
;
run;
data test2;
set test;
HOSP_DIVISION = input(substr(HOSPSTCO,1,2), FIPS.);
run;
@lturian wrote:
I used the invalue format you provided:
proc format; invalue FIPS '23' '33' '50' '25' '44' '09' = 1 '36' '42' '34' = 2 '55' '26' '17' '18' '39' = 3 '29' '38' '46' '31' '20' '27' '19' = 4 '10' '24' '11' '51' '54' '37' '45' '13' '12' = 5 '21' '47' '28' '01' = 6 '40' '48' '05' '22' = 7 '16' '30' '56' '32' '49' '08' '04' '35' = 8 '02' '53' '41' '06' '15' = 9 ; run; data test2; set test; HOSP_DIVISION = input(substr(HOSPSTCO,1,2), FIPS.); run;
And the result was good? bad? ugly? What?
Are you sure the variable HOSPSTCO is defined as character?
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.