BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
moni2
Calcite | Level 5

Hello,

I am trying to convert a variable from a character to a numeric variable. 

The original character variable has numeric values, special symbols ":, *), numeric values with symbols ">3, -3, >=10", and text such as "see comment" and " ____MARKED". Ultimately, I want to convert this to a numeric variable because I will need to do summary statistics on it later. However, I do need to keep the "-3" the other values such as ">3" can become "3". 

 

I wrote the following program:

DATA L1;
LENGTH RV $ 100; 
SET Labs;
RESULT_VALUE = COMPRESS(RESULT_VALUE,'<> >=+'); 
IF RESULT_VALUE IN('' ':') THEN RV = ''; 
ELSE RV=RESULT_VALUE;
IF anyalpha(RV) THEN DELETE;
ELSE RV = INPUT(RV, 12.); 
RUN;

I was able to successfully remove the text values and symbols, but, "-3" got converted to "3" and the variable is still character. Additionally, on doing a subsequent  PROC FREQ it the original variable "result_value" is missing its original text and special values which I thought should not happen. 

 

Could you please help me understand how to do this differently?

 

Thank you. 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

I would use TRANSLATE() instead of COMPRESS() to get rid of the characters you don't want.  That will reduce the risk of forming a number from separate digit strings in the string.  But probably remove the commas so strings like 1,234,567 will be recognized.

 

data have;
  input string $32. ;
cards;
2.8
25
Slight
>=60
0.3
4.56E2
Moderate
37
Marked
7.0000000000000007-2
1,230
+45
-3
100-200
34>56
;

data want;
  set have;
  clean=left(translate(compress(string,','),' ',compress(string,'+-. E','d')));
  number=input(clean,??32.);
run;

proc print;
run;

Result

Obs    string                  clean                    number

  1    2.8                     2.8                         2.8
  2    25                      25                         25.0
  3    Slight                                               .
  4    >=60                    60                         60.0
  5    0.3                     0.3                         0.3
  6    4.56E2                  4.56E2                    456.0
  7    Moderate                                             .
  8    37                      37                         37.0
  9    Marked                                               .
 10    7.0000000000000007-2    7.0000000000000007-2         .
 11    1,230                   1230                     1230.0
 12    +45                     +45                        45.0
 13    -3                      -3                         -3.0
 14    100-200                 100-200                      .
 15    34>56                   34 56                        .

 

 

 

View solution in original post

6 REPLIES 6
PaigeMiller
Diamond | Level 26

@moni2 wrote:

The original character variable has numeric values, special symbols ":, *), numeric values with symbols ">3, -3, >=10", and text such as "see comment" and " ____MARKED". Ultimately, I want to convert this to a numeric variable because I will need to do summary statistics on it later. However, I do need to keep the "-3" the other values such as ">3" can become "3". 


I think we would need a complete set of rules about what to do with non-numeric characters that are found in the data. The above discusses just one such rule.

--
Paige Miller
HB
Barite | Level 11 HB
Barite | Level 11

Add some sample data in a working data step along with desired results

ballardw
Super User

First, this makes no sense:

ELSE RV = INPUT(RV, 12.); 

because you have defined RV to be a 100 length character variable. So you are going to input something, that may actually be numeric in appearance and then use internal default conversions to make it fit into a character value.

 

Your -3 disappearing is likely because of the COMPRESS you have chosen. You are removing the SPACE character. So if there is anything before the -3 (or similar) then  "a    -3" after compress becomes "a-3" which will not read as numeric with a 12. informat.

So look at the full value of variable with the disappearing -3. Subtle issues might be the presence of TAB or NULL characters that appear to be spaces but are not actually.

 

I might consider, lacking anything resembling many examples

RESULT_VALUE = COMPRESS(RESULT_VALUE,'0123456789-.','k'); 

to KEEP digits, the - and . (assumes some decimal values) and discard everything else.

 

 

moni2
Calcite | Level 5

thank you to everyone who offered suggestions; 

 

Adding the " RESULT_VALUE = COMPRESS(RESULT_VALUE,'0123456789-.','k'); " was helpful but then it created a value which was not there before "7.0000000000000007-2" in either then old or the new variable and the old and the text values from the old variable are missing. 

 

 Here is a sample of the values in the variable for those who asked:

RESULT_VALUE
2.8
8.5
8.3
25
26.3
33.4
Slight
140
138
248
0.96
9.4
78.6
15.3
43.8
100
20
14.6
4.8
0
19
>=60
5.57
0.3
59
15
3.7
0.2
2
16.5
47.1
0.5
79.9
15.4
0.1
75.5
4
Moderate
2
3.41
0.3
Slight
37
Marked
Moderate
27.9
2

 

Again, I don't need the text values in my new variable, "-3" should be preserved, and "=>60" can be just 60. And ultimately, I want to create a new numeric variable free from text. 
thank you everyone for your patience.
HB
Barite | Level 11 HB
Barite | Level 11

What is the numeric value for "Moderate"?

Tom
Super User Tom
Super User

I would use TRANSLATE() instead of COMPRESS() to get rid of the characters you don't want.  That will reduce the risk of forming a number from separate digit strings in the string.  But probably remove the commas so strings like 1,234,567 will be recognized.

 

data have;
  input string $32. ;
cards;
2.8
25
Slight
>=60
0.3
4.56E2
Moderate
37
Marked
7.0000000000000007-2
1,230
+45
-3
100-200
34>56
;

data want;
  set have;
  clean=left(translate(compress(string,','),' ',compress(string,'+-. E','d')));
  number=input(clean,??32.);
run;

proc print;
run;

Result

Obs    string                  clean                    number

  1    2.8                     2.8                         2.8
  2    25                      25                         25.0
  3    Slight                                               .
  4    >=60                    60                         60.0
  5    0.3                     0.3                         0.3
  6    4.56E2                  4.56E2                    456.0
  7    Moderate                                             .
  8    37                      37                         37.0
  9    Marked                                               .
 10    7.0000000000000007-2    7.0000000000000007-2         .
 11    1,230                   1230                     1230.0
 12    +45                     +45                        45.0
 13    -3                      -3                         -3.0
 14    100-200                 100-200                      .
 15    34>56                   34 56                        .