Hello,
I am trying to convert a variable from a character to a numeric variable.
The original character variable has numeric values, special symbols ":, *), numeric values with symbols ">3, -3, >=10", and text such as "see comment" and " ____MARKED". Ultimately, I want to convert this to a numeric variable because I will need to do summary statistics on it later. However, I do need to keep the "-3" the other values such as ">3" can become "3".
I wrote the following program:
DATA L1;
LENGTH RV $ 100;
SET Labs;
RESULT_VALUE = COMPRESS(RESULT_VALUE,'<> >=+');
IF RESULT_VALUE IN('' ':') THEN RV = '';
ELSE RV=RESULT_VALUE;
IF anyalpha(RV) THEN DELETE;
ELSE RV = INPUT(RV, 12.);
RUN;
I was able to successfully remove the text values and symbols, but, "-3" got converted to "3" and the variable is still character. Additionally, on doing a subsequent PROC FREQ it the original variable "result_value" is missing its original text and special values which I thought should not happen.
Could you please help me understand how to do this differently?
Thank you.
I would use TRANSLATE() instead of COMPRESS() to get rid of the characters you don't want. That will reduce the risk of forming a number from separate digit strings in the string. But probably remove the commas so strings like 1,234,567 will be recognized.
data have;
input string $32. ;
cards;
2.8
25
Slight
>=60
0.3
4.56E2
Moderate
37
Marked
7.0000000000000007-2
1,230
+45
-3
100-200
34>56
;
data want;
set have;
clean=left(translate(compress(string,','),' ',compress(string,'+-. E','d')));
number=input(clean,??32.);
run;
proc print;
run;
Result
Obs string clean number 1 2.8 2.8 2.8 2 25 25 25.0 3 Slight . 4 >=60 60 60.0 5 0.3 0.3 0.3 6 4.56E2 4.56E2 456.0 7 Moderate . 8 37 37 37.0 9 Marked . 10 7.0000000000000007-2 7.0000000000000007-2 . 11 1,230 1230 1230.0 12 +45 +45 45.0 13 -3 -3 -3.0 14 100-200 100-200 . 15 34>56 34 56 .
@moni2 wrote:
The original character variable has numeric values, special symbols ":, *), numeric values with symbols ">3, -3, >=10", and text such as "see comment" and " ____MARKED". Ultimately, I want to convert this to a numeric variable because I will need to do summary statistics on it later. However, I do need to keep the "-3" the other values such as ">3" can become "3".
I think we would need a complete set of rules about what to do with non-numeric characters that are found in the data. The above discusses just one such rule.
Add some sample data in a working data step along with desired results
First, this makes no sense:
ELSE RV = INPUT(RV, 12.);
because you have defined RV to be a 100 length character variable. So you are going to input something, that may actually be numeric in appearance and then use internal default conversions to make it fit into a character value.
Your -3 disappearing is likely because of the COMPRESS you have chosen. You are removing the SPACE character. So if there is anything before the -3 (or similar) then "a -3" after compress becomes "a-3" which will not read as numeric with a 12. informat.
So look at the full value of variable with the disappearing -3. Subtle issues might be the presence of TAB or NULL characters that appear to be spaces but are not actually.
I might consider, lacking anything resembling many examples
RESULT_VALUE = COMPRESS(RESULT_VALUE,'0123456789-.','k');
to KEEP digits, the - and . (assumes some decimal values) and discard everything else.
thank you to everyone who offered suggestions;
Adding the " RESULT_VALUE = COMPRESS(RESULT_VALUE,'0123456789-.','k'); " was helpful but then it created a value which was not there before "7.0000000000000007-2" in either then old or the new variable and the old and the text values from the old variable are missing.
Here is a sample of the values in the variable for those who asked:
RESULT_VALUE |
2.8 |
8.5 |
8.3 |
25 |
26.3 |
33.4 |
Slight |
140 |
138 |
248 |
0.96 |
9.4 |
78.6 |
15.3 |
43.8 |
100 |
20 |
14.6 |
4.8 |
0 |
19 |
>=60 |
5.57 |
0.3 |
59 |
15 |
3.7 |
0.2 |
2 |
16.5 |
47.1 |
0.5 |
79.9 |
15.4 |
0.1 |
75.5 |
4 |
Moderate |
2 |
3.41 |
0.3 |
Slight |
37 |
Marked |
Moderate |
27.9 |
2 |
What is the numeric value for "Moderate"?
I would use TRANSLATE() instead of COMPRESS() to get rid of the characters you don't want. That will reduce the risk of forming a number from separate digit strings in the string. But probably remove the commas so strings like 1,234,567 will be recognized.
data have;
input string $32. ;
cards;
2.8
25
Slight
>=60
0.3
4.56E2
Moderate
37
Marked
7.0000000000000007-2
1,230
+45
-3
100-200
34>56
;
data want;
set have;
clean=left(translate(compress(string,','),' ',compress(string,'+-. E','d')));
number=input(clean,??32.);
run;
proc print;
run;
Result
Obs string clean number 1 2.8 2.8 2.8 2 25 25 25.0 3 Slight . 4 >=60 60 60.0 5 0.3 0.3 0.3 6 4.56E2 4.56E2 456.0 7 Moderate . 8 37 37 37.0 9 Marked . 10 7.0000000000000007-2 7.0000000000000007-2 . 11 1,230 1230 1230.0 12 +45 +45 45.0 13 -3 -3 -3.0 14 100-200 100-200 . 15 34>56 34 56 .
Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.
Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.