- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I am trying to convert a variable from a character to a numeric variable.
The original character variable has numeric values, special symbols ":, *), numeric values with symbols ">3, -3, >=10", and text such as "see comment" and " ____MARKED". Ultimately, I want to convert this to a numeric variable because I will need to do summary statistics on it later. However, I do need to keep the "-3" the other values such as ">3" can become "3".
I wrote the following program:
DATA L1;
LENGTH RV $ 100;
SET Labs;
RESULT_VALUE = COMPRESS(RESULT_VALUE,'<> >=+');
IF RESULT_VALUE IN('' ':') THEN RV = '';
ELSE RV=RESULT_VALUE;
IF anyalpha(RV) THEN DELETE;
ELSE RV = INPUT(RV, 12.);
RUN;
I was able to successfully remove the text values and symbols, but, "-3" got converted to "3" and the variable is still character. Additionally, on doing a subsequent PROC FREQ it the original variable "result_value" is missing its original text and special values which I thought should not happen.
Could you please help me understand how to do this differently?
Thank you.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I would use TRANSLATE() instead of COMPRESS() to get rid of the characters you don't want. That will reduce the risk of forming a number from separate digit strings in the string. But probably remove the commas so strings like 1,234,567 will be recognized.
data have;
input string $32. ;
cards;
2.8
25
Slight
>=60
0.3
4.56E2
Moderate
37
Marked
7.0000000000000007-2
1,230
+45
-3
100-200
34>56
;
data want;
set have;
clean=left(translate(compress(string,','),' ',compress(string,'+-. E','d')));
number=input(clean,??32.);
run;
proc print;
run;
Result
Obs string clean number 1 2.8 2.8 2.8 2 25 25 25.0 3 Slight . 4 >=60 60 60.0 5 0.3 0.3 0.3 6 4.56E2 4.56E2 456.0 7 Moderate . 8 37 37 37.0 9 Marked . 10 7.0000000000000007-2 7.0000000000000007-2 . 11 1,230 1230 1230.0 12 +45 +45 45.0 13 -3 -3 -3.0 14 100-200 100-200 . 15 34>56 34 56 .
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@moni2 wrote:
The original character variable has numeric values, special symbols ":, *), numeric values with symbols ">3, -3, >=10", and text such as "see comment" and " ____MARKED". Ultimately, I want to convert this to a numeric variable because I will need to do summary statistics on it later. However, I do need to keep the "-3" the other values such as ">3" can become "3".
I think we would need a complete set of rules about what to do with non-numeric characters that are found in the data. The above discusses just one such rule.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Add some sample data in a working data step along with desired results
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
First, this makes no sense:
ELSE RV = INPUT(RV, 12.);
because you have defined RV to be a 100 length character variable. So you are going to input something, that may actually be numeric in appearance and then use internal default conversions to make it fit into a character value.
Your -3 disappearing is likely because of the COMPRESS you have chosen. You are removing the SPACE character. So if there is anything before the -3 (or similar) then "a -3" after compress becomes "a-3" which will not read as numeric with a 12. informat.
So look at the full value of variable with the disappearing -3. Subtle issues might be the presence of TAB or NULL characters that appear to be spaces but are not actually.
I might consider, lacking anything resembling many examples
RESULT_VALUE = COMPRESS(RESULT_VALUE,'0123456789-.','k');
to KEEP digits, the - and . (assumes some decimal values) and discard everything else.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
thank you to everyone who offered suggestions;
Adding the " RESULT_VALUE = COMPRESS(RESULT_VALUE,'0123456789-.','k'); " was helpful but then it created a value which was not there before "7.0000000000000007-2" in either then old or the new variable and the old and the text values from the old variable are missing.
Here is a sample of the values in the variable for those who asked:
RESULT_VALUE |
2.8 |
8.5 |
8.3 |
25 |
26.3 |
33.4 |
Slight |
140 |
138 |
248 |
0.96 |
9.4 |
78.6 |
15.3 |
43.8 |
100 |
20 |
14.6 |
4.8 |
0 |
19 |
>=60 |
5.57 |
0.3 |
59 |
15 |
3.7 |
0.2 |
2 |
16.5 |
47.1 |
0.5 |
79.9 |
15.4 |
0.1 |
75.5 |
4 |
Moderate |
2 |
3.41 |
0.3 |
Slight |
37 |
Marked |
Moderate |
27.9 |
2 |
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What is the numeric value for "Moderate"?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I would use TRANSLATE() instead of COMPRESS() to get rid of the characters you don't want. That will reduce the risk of forming a number from separate digit strings in the string. But probably remove the commas so strings like 1,234,567 will be recognized.
data have;
input string $32. ;
cards;
2.8
25
Slight
>=60
0.3
4.56E2
Moderate
37
Marked
7.0000000000000007-2
1,230
+45
-3
100-200
34>56
;
data want;
set have;
clean=left(translate(compress(string,','),' ',compress(string,'+-. E','d')));
number=input(clean,??32.);
run;
proc print;
run;
Result
Obs string clean number 1 2.8 2.8 2.8 2 25 25 25.0 3 Slight . 4 >=60 60 60.0 5 0.3 0.3 0.3 6 4.56E2 4.56E2 456.0 7 Moderate . 8 37 37 37.0 9 Marked . 10 7.0000000000000007-2 7.0000000000000007-2 . 11 1,230 1230 1230.0 12 +45 +45 45.0 13 -3 -3 -3.0 14 100-200 100-200 . 15 34>56 34 56 .