BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
sasgorilla
Pyrite | Level 9

I am creating indicator variables for character variables with a dataset imported from excel. In this dataset there are multiple misspellings, extra spaces, etc. so I copied and pasted each value of the variable from a PROC FREQ to get something like the below: 

 

 

IF oldvar='oldvaluex' OR IF oldvar='olddvaluex' OR IF oldvar='oldvaluex ' ...etc. then newvar=0;

ELSE if oldvar='oldvaluey' OR IF oldvar='old valuey' OR IF oldvar=' oldvaluey '... then newvar=1;

ELSE newvar='.';

 

 

In comparing oldvar with newvar using PROC FREQ I noticed some variations of 'oldvar' were being coded as missing. In searching through these forums I found that it was likely a spacing problem and by using oldvar=compress(oldvar,,'C') prior to creating my indicator variable, these values were now appropriately coded to newvar. 

 

However, I am now working on another variable which is still producing missing values after using oldvar2=compress (oldvar2,,'C') and also attempting oldvar2= Compress(oldvar2,'0D0A'x). 

 

Any ideas on what else I could try? 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

I will place a wager that the "code" is not what you are actually running.

One IF per statement please:

IF oldvar='oldvaluex' OR IF oldvar='olddvaluex' OR IF oldvar='oldvaluex ' ...etc. then newvar=0;

Will throw errors (even without the ... etc). Read the LOG.

The proper code would look like:

    IF oldvar='oldvaluex' OR  oldvar='olddvaluex' OR  oldvar='oldvaluex '  then newvar=0;

Or if you want to check if a variable has one of a list of values use the IN operator:

    IF oldvar in ('oldvaluex' 'olddvaluex' 'oldvaluex ')  then newvar=0;

 

You apparently know that some of your character values have leading spaces. Strongly suggest running STRIP function on those to permanently remove the leading spaces:

 

oldvar=strip(oldvar);

That will remove the leading spaces so you don't need to code around them.

It is very likely that when you have one leading space sometimes that you may have two or more. It can be very hard to tell just looking at output. Strip function would remove all of them.

 

Also, if a character value has trailing spaces SAS pretty much ignores them for comparisons.

Example:

data example;
    x='abc         ';
    if x='abc' then put "ERROR: X is abc";
    else put "X is not abc";
 run;

Run the code and read the log.

 

So you do not need all the extra values with trailing spaces in the condition.

 


@sasgorilla wrote:

I am creating indicator variables for character variables with a dataset imported from excel. In this dataset there are multiple misspellings, extra spaces, etc. so I copied and pasted each value of the variable from a PROC FREQ to get something like the below: 

 

 

IF oldvar='oldvaluex' OR IF oldvar='olddvaluex' OR IF oldvar='oldvaluex ' ...etc. then newvar=0;

ELSE if oldvar='oldvaluey' OR IF oldvar='old valuey' OR IF oldvar=' oldvaluey '... then newvar=1;

ELSE newvar='.';

 

 

In comparing oldvar with newvar using PROC FREQ I noticed some variations of 'oldvar' were being coded as missing. In searching through these forums I found that it was likely a spacing problem and by using oldvar=compress(oldvar,,'C') prior to creating my indicator variable, these values were now appropriately coded to newvar. 

 

However, I am now working on another variable which is still producing missing values after using oldvar2=compress (oldvar2,,'C') and also attempting oldvar2= Compress(oldvar2,'0D0A'x). 

 

Any ideas on what else I could try? 


 

View solution in original post

12 REPLIES 12
PaigeMiller
Diamond | Level 26

However, I am now working on another variable which is still producing missing values after using oldvar2=compress (oldvar2,,'C') and also attempting oldvar2= Compress(oldvar2,'0D0A'x). 

 

What is different about this variable? Please provide example data.

--
Paige Miller
sasgorilla
Pyrite | Level 9

I am not sure. They are both character variables, one a variable for race and another for ethnicity. The one getting missed is a variation of an ethnicity variable with a value of "Hispanic or Latino". Other values for that variable are appropriately getting coded, such as "Hispanic or Latno" and "Hispanic or latino". 

PaigeMiller
Diamond | Level 26

Please provide example data. Please show us your existing code for this ethnicity variable (in fact, show us the entire data step or SQL where you try to recode ethnicity).

--
Paige Miller
sasgorilla
Pyrite | Level 9

Code to capture all values of variable: 

IF ethnicity='Hispanic or Latino' OR ethnicity='Hispanic or Latino' OR ethnicity='Hispanic or Latino' OR ethnicity="HIspanic or Latino" OR ethnicity='Hispanic or Latino' OR ethnicity='Hispanic or Latno' OR ethnicity='Hispanic or latino' OR ethnicity='hispanic or latino' then ethnew=1;
	ELSE IF ethnicity='Non Hispanic or Latino' OR ethnicity='Not Hispanic or Latino' OR ethnicity='Not Hispanic or Latio' OR ethnicity='Not Hispanic or Latno' OR ethnicity='Not hispanic or latino' OR ethnicity='not Hispanic or Latino' OR ethnicity='not hispanic or latino' then ethnew=0;
	ELSE ethnew=".";
run;

Sample data as captured in original excel. *Note variations in case, spelling, and spacing : 

 

ObsEthnicity
1Not Hispanic or Latino
2not hispanic or latino
3 Hispanic or Latino
4 
5Hispanic or Latno
6HIspanic or Latino
7Hispanic or Latino
8Not Hispanic or Latino
9Not Hispanic or Latno
etc.etc.

 

 

PaigeMiller
Diamond | Level 26
if upcase(strip(ethnicity))='HISPANIC OR LATINO' or upcase(strip(ethnicity))='HISPANIC OR LATNO' then ethnew=1;
else if not missing(ethnicity) then ethnew=0;
else ethnew=.;

 

By the way, @sasgorilla  , for your future benefit when posting new problems, there was no possible way we could provide decent answers from your original post. You need to provide details about what isn't working, but instead you provided details about what was working. 

--
Paige Miller
ballardw
Super User

I will place a wager that the "code" is not what you are actually running.

One IF per statement please:

IF oldvar='oldvaluex' OR IF oldvar='olddvaluex' OR IF oldvar='oldvaluex ' ...etc. then newvar=0;

Will throw errors (even without the ... etc). Read the LOG.

The proper code would look like:

    IF oldvar='oldvaluex' OR  oldvar='olddvaluex' OR  oldvar='oldvaluex '  then newvar=0;

Or if you want to check if a variable has one of a list of values use the IN operator:

    IF oldvar in ('oldvaluex' 'olddvaluex' 'oldvaluex ')  then newvar=0;

 

You apparently know that some of your character values have leading spaces. Strongly suggest running STRIP function on those to permanently remove the leading spaces:

 

oldvar=strip(oldvar);

That will remove the leading spaces so you don't need to code around them.

It is very likely that when you have one leading space sometimes that you may have two or more. It can be very hard to tell just looking at output. Strip function would remove all of them.

 

Also, if a character value has trailing spaces SAS pretty much ignores them for comparisons.

Example:

data example;
    x='abc         ';
    if x='abc' then put "ERROR: X is abc";
    else put "X is not abc";
 run;

Run the code and read the log.

 

So you do not need all the extra values with trailing spaces in the condition.

 


@sasgorilla wrote:

I am creating indicator variables for character variables with a dataset imported from excel. In this dataset there are multiple misspellings, extra spaces, etc. so I copied and pasted each value of the variable from a PROC FREQ to get something like the below: 

 

 

IF oldvar='oldvaluex' OR IF oldvar='olddvaluex' OR IF oldvar='oldvaluex ' ...etc. then newvar=0;

ELSE if oldvar='oldvaluey' OR IF oldvar='old valuey' OR IF oldvar=' oldvaluey '... then newvar=1;

ELSE newvar='.';

 

 

In comparing oldvar with newvar using PROC FREQ I noticed some variations of 'oldvar' were being coded as missing. In searching through these forums I found that it was likely a spacing problem and by using oldvar=compress(oldvar,,'C') prior to creating my indicator variable, these values were now appropriately coded to newvar. 

 

However, I am now working on another variable which is still producing missing values after using oldvar2=compress (oldvar2,,'C') and also attempting oldvar2= Compress(oldvar2,'0D0A'x). 

 

Any ideas on what else I could try? 


 

sasgorilla
Pyrite | Level 9

@ballardw The STRIP function resolved my issue in this case, thanks.  Also, thank you for the insight on the 'in' operator. I will experiment with that for a cleaner code. I will also experiment with the trailing zeroes example you provided. Thank you. 

 

And you were right about the multiple IF's not being in my code, sorry about that.

 

 

ballardw
Super User

@sasgorilla wrote:

@ballardw The STRIP function resolved my issue in this case, thanks.  Also, thank you for the insight on the 'in' operator. I will experiment with that for a cleaner code. I will also experiment with the trailing zeroes example you provided. Thank you. 

 

And you were right about the multiple IF's not being in my code, sorry about that.

 

 


You example with ethnicity also brings up modifying the CASE of words. For comparison purposes with problematic data entry you might try using UPCASE, LOWCASE or PROPCASE on the variable in a comparison.

data example;
   input ethnicity :$15.;
   Uindicator = (upcase(ethnicity)='HISPANIC');
   Lindicator = (lowcase(ethnicity)='hispanic');
   Pindicator = (propcase(ethnicity)='Hispanic');

datalines;
Hispanic
HIspanic
hISPANIC
HisPanic
nonhispanic
NONHISPANIC
noNhISpanic
;

Propcase will make the first letter of each "word" of the value capital and the rest of the word lower case.

 

Can't fix stupid data entry but can reduce the code involved.

Personally I often use custom informats to read values that should be limited values of text:

proc format library=work;
invalue $myethnic (Upcase)
'HISPANIC' = 'Hispanic'
'NONHISPANIC'= 'non-Hispanic'
other = _error_
;
run;

data example;
   input ethnicity :$myethnic.;
datalines;
Hispanic
hispnaic
HIspanic
hISPANIC
HisPanic
nonhispanic
NONHISPANIC
noNhISpanic
nohispanic
;

The above will show invalid data for two of the values where the spelling is incorrect.

Instead of a bunch of IF/THEN/ELSE code the invalid data messages show me the values with problems and I can add them to the Invalue statement:

proc format library=work;
invalue $myethnic (Upcase)
'HISPANIC','HISPNAIC' = 'Hispanic'
'NONHISPANIC',
'NOHISPANIC'= 'non-Hispanic'
other = _error_
;
run;

data example;
   input ethnicity :$myethnic.;
datalines;
Hispanic
hispnaic
HIspanic
hISPANIC
HisPanic
nonhispanic
NONHISPANIC
noNhISpanic
nohispanic
;

Note that the UPCASE option on the Invalue means the values are all converted to upper case before comparing to the values on the left of the = .

 

I have one project where supposedly identical "lists" are provided for values to enter.

I get

Masters Degree

Master's Degree

Master's  Degree (extra space)

Master#33&s Degree

And then added Spanish spellings...

The informat means the values in my data set are consistent AND with the Other=_error_ option I don't even have to run Proc Freq or other proc to find them.

Admittedly this works best when there should be some standard spelling involved.

 

Some of my custom informats like this have over 100 values because the source keeps coming with stupid ways to say the same thing but they insist on inserting spaces, commas, "and", "and/or" and such. I just keep adding the extra values to the invalue as the expected values haven't changed( coding to a 1 to 5 numeric scale in this case for reporting).

sasgorilla
Pyrite | Level 9
This is very helpful. Thank you!!
dxiao2017
Pyrite | Level 9

Hi, I found some of your post and questions very interesting. Are you working on some multi-national clinical datasets? Below is my steps to solve your question. First you can use PROC FREQ to create dataset that has all distinct values of the ethnic column, and then assign an indicator of 0 or 1 to the values. Second you generate a table that can be used as format through PROC FORMAT. And then you apply this format to the original dataset. You can further create macros to process the similar steps base on these codes. The codes and output are as follows.

data test_ethnic;
input id ethnicity $3-24;
datalines;
1 Not Hispanic or Latino
2 not hispanic or latino
3 Hispanic or Latino
4
5 Hispanic or Latno
6 HIspanic or Latino
7 Hispanic or Latino
8 Not Hispanic or Latino
9 Not Hispanic or Latno
;
run;
proc freq data=test_ethnic noprint;
tables ethnicity/nocum nopercent
out=ethlevel;
run;
proc print data=ethlevel;run;

data ethlevel1;
set ethlevel;
n=_n_;
run;
proc print data=ethlevel1;run;

data ethlevel2;
set ethlevel1;
if ethnicity=' ' then indicator=.;
if 2<=n<=4 then indicator=1;
else if 5<=n<=7 then indicator=0;
drop count percent n;
run;
proc print data=ethlevel2;run;

data ethfmt;
retain fmtname '$ethfmt';
set ethlevel2 (rename=(ethnicity=start
indicator=label)
);
keep fmtname start label;
run;
proc print data=ethfmt;run;

proc format cntlin=ethfmt;
data test_ethnic_final;
set test_ethnic;
indicator=ethnicity;
format indicator $ethfmt.;
run;
proc print data=test_ethnic_final;run;

 

Untitled12.png

Untitled34.png

Untitled05.png

SAS help cars; we are cars; that is why my default image;
dxiao2017
Pyrite | Level 9
PS: the reference codes of creating tables to be used as formats were on page 195 (part no.4 of the question) and page 184 (part no.4 of demo p204d02) of SAS Programming2: Data Manipulation Techniques course note PDF
SAS help cars; we are cars; that is why my default image;

sas-innovate-white.png

Missed SAS Innovate in Orlando?

Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.

 

Register now

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 12 replies
  • 3470 views
  • 2 likes
  • 4 in conversation