I am trying to create a new variable “Languages” fro.m the variable “Demographics__hc_languages”
However the created variable does not come out well despite the code ran without error
data CFI1;
set CFI;
if Demographics__hc_languages in ("English Spanish" "Spanish English") then languages="EnglishSpanish";
else if Demographics__hc_languages in ("English") then languages="EnglishOnly";
else if Demographics__hc_languages in ("Spanish") then languages="SpanishOnly";
else if Demographics__hc_languages in ("English Creole Other","English Creole","English Other",
"English Spanish Creole","English Spanish Other","Other English", "Other", "Spanish Other") then languages="Other";
run;
Original Variable
The FREQ Procedure
Demographics::hc_languages |
||||
Demographics__hc_languages |
Frequency |
Percent |
Cumulative |
Cumulative |
English Creole Other |
2 |
0.37 |
2 |
0.37 |
English Creole |
7 |
1.29 |
9 |
1.66 |
English Other |
10 |
1.84 |
19 |
3.50 |
English Spanish Creole |
2 |
0.37 |
21 |
3.87 |
English Spanish Other |
8 |
1.47 |
29 |
5.34 |
English Spanish |
139 |
25.60 |
168 |
30.94 |
English |
206 |
37.94 |
374 |
68.88 |
Other English |
1 |
0.18 |
375 |
69.06 |
Other |
6 |
1.10 |
381 |
70.17 |
Spanish English |
19 |
3.50 |
400 |
73.66 |
Spanish Other |
1 |
0.18 |
401 |
73.85 |
Spanish |
142 |
26.15 |
543 |
100.00 |
Frequency Missing = 273 |
New Variable Not COMING OUT WELL??????????????????????????????
The SAS System |
The FREQ Procedure
languages |
Frequency |
Percent |
Cumulative |
Cumulative |
EnglishOnly |
206 |
58.19 |
206 |
58.19 |
Other |
6 |
1.69 |
212 |
59.89 |
SpanishOnly |
142 |
40.11 |
354 |
100.00 |
Frequency Missing = 462 |
Any chance of leading or trailing blank that may need stripping?
if strip(Demographics__hc_languages )
data CFI1;
set CFI;
if Demographics__hc_languages in ("English Spanish" ,"Spanish English") then languages="EnglishSpanish";
else if Demographics__hc_languages in ("English") then languages="EnglishOnly";
else if Demographics__hc_languages in ("Spanish") then languages="SpanishOnly";
else languages="Other";
run;
Your eyes might be sharper 🙂
what's the correction?
I am confused because I do not see anything wrong with the code but the results if not coming as aspected;
if Demographics__hc_languages in ("English Spanish" ,"Spanish English") then languages="SpanishEnglish";
else if Demographics__hc_languages in ("English") then languages="EnglishOnly";
else if Demographics__hc_languages in ("Spanish") then languages="SpanishOnly";
else if Demographics__hc_languages^="" then languages="Other";
THE SPANISHENGLISH IS NOT COMING OUT;
Demographics::hc_languages | ||||
---|---|---|---|---|
Demographics__hc_languages | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
English Creole Other | 2 | 0.37 | 2 | 0.37 |
English Creole | 7 | 1.29 | 9 | 1.66 |
English Other | 10 | 1.84 | 19 | 3.50 |
English Spanish Creole | 2 | 0.37 | 21 | 3.87 |
English Spanish Other | 8 | 1.47 | 29 | 5.34 |
English Spanish | 139 | 25.60 | 168 | 30.94 |
English | 206 | 37.94 | 374 | 68.88 |
Other English | 1 | 0.18 | 375 | 69.06 |
Other | 6 | 1.10 | 381 | 70.17 |
Spanish English | 19 | 3.50 | 400 | 73.66 |
Spanish Other | 1 | 0.18 | 401 | 73.85 |
Spanish | 142 | 26.15 | 543 | 100.00 |
Frequency Missing = 273 |
is my suggestion failing too?
Nope, not working. Also your suggestion add missing data to the 'Other" group
Oh sorry no idea
if Demographics__hc_languages in ("English Spanish" ,"Spanish English") then languages="SpanishEnglish";
else if Demographics__hc_languages in ("English") then languages="EnglishOnly";
else if Demographics__hc_languages in ("Spanish") then languages="SpanishOnly";
else if Demographics__hc_languages^="" then languages="Other";
else languages = "MISSING";
how about if you account for the missing values.
The mismatched is on "English Spanish" and "Spanish English" that are classified under "Others"
languages | Demographics__hc_languages | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|---|
273 | 33.46 | 273 | 33.46 | ||
EnglishOnly | English | 206 | 25.25 | 479 | 58.70 |
Other | English Creole Other | 2 | 0.25 | 481 | 58.95 |
Other | English Creole | 7 | 0.86 | 488 | 59.80 |
Other | English Other | 10 | 1.23 | 498 | 61.03 |
Other | English Spanish Creole | 2 | 0.25 | 500 | 61.27 |
Other | English Spanish Other | 8 | 0.98 | 508 | 62.25 |
Other | English Spanish | 139 | 17.03 | 647 | 79.29 |
Other | Other English | 1 | 0.12 | 648 | 79.41 |
Other | Other | 6 | 0.74 | 654 | 80.15 |
Other | Spanish English | 19 | 2.33 | 673 | 82.48 |
Other | Spanish Other | 1 | 0.12 | 674 | 82.60 |
SpanishOnly | Spanish | 142 | 17.40 | 816 | 100.00 |
OK, a good start. Now it's time to roll up your sleeves and dig into the mismatches.
Do the data values contain two blanks between "English" and "Spanish" ?
Do the data values contain a tab character instead of a space?
Do the data values contain a leading blank?
I'm assuming here that the program is correct, and the program uses one blank between "English" and "Spanish".
SAS normally removes leading spaces when printing values for values like " English Spanish" might look in the print-out like "English Spanish".
There are also other characters that could be in the variable that look like spaces (or do not print at all) but are not.
Or it could be the variable is coded and what you are seeing is the formatted value and not what is actually in the variable.
Languages=strip(compbl(vvalue(Demographics__hc_languages)));
if languages = "English" then languages="EnglishOnly";
else if languages = "Spanish" then languages="SpanishOnly";
else if languages in ("English Spanish" "Spanish English") then languages="EnglishSpanish";
else languages="Other";
In order to know your data you should scrutinize these character values with the $HEXw. format:
proc freq data=cfi1;
format _char_ $hex60.;
tables Demographics__hc_languages;
run;
This should reveal the true contents of this variable, provided that its length is <=30. (A likely culprit is the word separator, so format length 60 should be more than sufficient.)
If you're not curious about what exactly causes the issue, you may want to try and modify your IF conditions like
if compress(Demographics__hc_languages,'','ka') in: ('EnglishSpanish' 'SpanishEnglish')
This excludes all non-alphabetic characters from the comparison, which is why you need to remove the embedded blanks from the value lists as well. (Edit: Inserted the colon after IN just in case there are truncated alphabetic characters. But my recommendation is to apply the $HEXw. format. Edit 2: With the in: operator you'd need to check for the longest, i.e. three-word values first, then continue with the two-word values etc.!)
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.