Solved: Using an array to convert character variables to numeric

_maldini_ · Posted 01-11-2022 04:59 PM

I’m trying to understand some basic DATA step functions using fabricated data (See below). Among other things, I’m trying to write an array that converts character variables to numeric.

DATA new;

	LENGTH var_5 $ 2;
	/*  Change the length of var_5 */
	/*  LENGTH statment will change the length of var_5 to 2 characters by truncating any characters after 2 */
	
	SET output (DROP=var_2);
	/* 	Drop var_2 */
	/* 	DROP option (not statement). var_2 will not be read into the "new" data set */

/* 	Change character variables to numeric */
/* 	var_3_new = INPUT(var_3, 2.); */
/* 	var_4_new = INPUT(var_4, 2.); */
	
/* 	Trying to change character to numeric using an array */
	ARRAY array_from(*)
	var_3
	var_4
	;
	
	ARRAY array_to(*) 
	var_3_new 
	var_4_new
	;
	
	DO i=1 to dim(array_from);
    array_to(i) = INPUT(array_from(i),2.); 
/*     What does the "2." refer to here? */
    END;
	
	DROP 
	var_3
	var_4
	i;
	
	RENAME
	var_5 = state
/* 	Rename var_5 to "state" */
	
/* 	Renaming the new variables so they simply replace the old variables */
	var_3_new = var_3
	var_4_new = var_4;
	
RUN;

The array seems to work, but I would appreciate any feedback. If it works, it works because I got lucky.

Also, the newly converted numeric variables do not have Formats or Informats (please screen PROC CONTENTS screenshot). Is this a problem? How would I add them?

Original PROC CONTENTS:

New PROC CONTENTS:

The log states: “NOTE: Invalid argument to function INPUT at line 95 column 19.” (See below)

1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
68
69 DATA new;
70
71 LENGTH var_5 $ 2;
72 /* Change the length of var_5 */
73 /* LENGTH statment will change the length of var_5 to 2 characters by truncating any characters after 2 */
74
75 SET output (DROP=var_2);
76 /* Drop var_2 */
77 /* DROP option (not statement). var_2 will not be read into the "new" data set */
78
79 /* Change character variables to numeric */
80 /* var_3_new = INPUT(var_3, 2.); */
81 /* var_4_new = INPUT(var_4, 2.); */
82
83 /* Trying to change character to numeric using an array */
84 ARRAY array_from(*)
85 var_3
86 var_4
87 ;
88
89 ARRAY array_to(*)
90 var_3_new
91 var_4_new
92 ;
93
94 DO i=1 to dim(array_from);
95 array_to(i) = INPUT(array_from(i),2.);
96 /* What does the "2." refer to here? */
97 END;
98
99 DROP
100 var_3
101 var_4
102 i;
103
104 RENAME
105 var_5 = state
106 /* Rename var_5 to "state" */
107
108 /* Renaming the new variables so they simply replace the old variables */
109 var_3_new = var_3
110 var_4_new = var_4;
111
112 RUN;

WARNING: Multiple lengths were specified for the variable var_5 by input data set(s). This can cause truncation of data.
NOTE: Invalid argument to function INPUT at line 95 column 19.
var_5=Mi id=1 var_1=1.2 var_3=NA var_4=1 var_3_new=. var_4_new=1 i=3 _ERROR_=1 _N_=1
NOTE: Invalid argument to function INPUT at line 95 column 19.
var_5=NY id=3 var_1=. var_3= var_4=NA var_3_new=. var_4_new=. i=3 _ERROR_=1 _N_=3
NOTE: Invalid argument to function INPUT at line 95 column 19.
var_5=CA id=4 var_1=4.3445 var_3=NA var_4=0 var_3_new=. var_4_new=0 i=3 _ERROR_=1 _N_=4
NOTE: Mathematical operations could not be performed at the following places. The results of the operations have been set to
missing values.
Each place is given by: (Number of times) at (Line):(Column).
3 at 95:19
NOTE: There were 5 observations read from the data set WORK.OUTPUT.
NOTE: The data set WORK.NEW has 5 observations and 5 variables.

This statement is repeated 3 times. Each time for an observation that has a character value.

What is this message telling me?

The log also states: “Mathematical operations could not be performed at the following places. The results of the operations have been set to missing values.”

This message occurs for the same 3 observations.

What mathematical operation is being referred to here? Is this an error?

Thanks for any assistance you can provide.

Create table output(id float, var_1 float, var_2 varchar(1), var_3 varchar(3), var_4 varchar(2), var_5 varchar(11));
Insert into output(id, var_1, var_2, var_3, var_4, var_5) Values(1, 1.2, 'M', 'NA', '1', 'Mississippi');
Insert into output(id, var_1, var_2, var_3, var_4, var_5) Values(2, 2.388888, 'F', '3', '0', 'NJ');
Insert into output(id, var_1, var_2, var_3, var_4, var_5) Values(3, ., '', '', 'NA', 'NY');
Insert into output(id, var_1, var_2, var_3, var_4, var_5) Values(4, 4.3445, 'F', 'NA', '0', 'CA');
Insert into output(id, var_1, var_2, var_3, var_4, var_5) Values(5, 3.1, 'M', '3.1', '1', 'OR');

ballardw · Posted 01-11-2022 05:24 PM

The error messages are rather straight-forward:

WARNING: Multiple lengths were specified for the variable var_5 by input data set(s). This can cause truncation of data.
NOTE: Invalid argument to function INPUT at line 95 column 19.
var_5=Mi id=1 var_1=1.2 var_3=NA var_4=1 var_3_new=. var_4_new=1 i=3 _ERROR_=1 _N_=1

You are using the INPUT function on variables Var_3 and Var_4 (Your array_from definition). In the line above it show VAR_3=NA. NA is not by default going to turn into a numeric value. The statement fragment "input(array_from, 2.)" uses the SAS supplied informat that expects to read 2 characters that represent a number. So NA, not being a number throws the error.

The line 95 refers to the line of code with the number 95 preceding. Column 19 refers to the position in that line with an issue. Since you pasted that into the main message window on the forum the window may have reformatted your text so that the position isn't as expected but the 19 should be the start of the INPUT function as that is what fails.

If you know that all of the errors thrown by a Input function call are caused by this specific issue, such as = NA, you can suppress the messages by adding either the ? or ?? input modifier:

array_to(i) = INPUT(array_from(i), ? 2.);

or

array_to(i) = INPUT(array_from(i), ?? 2.);

The first probably is sufficient for your purpose.

The better method is to use a custom informat that knows how to handle NA as text input for text-to-numeric conversion.

The mathematical operation that couldn't be performed is the assignment to the numeric variable since the conversion failed at Input.

View solution in original post

ballardw · Posted 01-11-2022 05:24 PM

The error messages are rather straight-forward:

WARNING: Multiple lengths were specified for the variable var_5 by input data set(s). This can cause truncation of data.
NOTE: Invalid argument to function INPUT at line 95 column 19.
var_5=Mi id=1 var_1=1.2 var_3=NA var_4=1 var_3_new=. var_4_new=1 i=3 _ERROR_=1 _N_=1

You are using the INPUT function on variables Var_3 and Var_4 (Your array_from definition). In the line above it show VAR_3=NA. NA is not by default going to turn into a numeric value. The statement fragment "input(array_from, 2.)" uses the SAS supplied informat that expects to read 2 characters that represent a number. So NA, not being a number throws the error.

The line 95 refers to the line of code with the number 95 preceding. Column 19 refers to the position in that line with an issue. Since you pasted that into the main message window on the forum the window may have reformatted your text so that the position isn't as expected but the 19 should be the start of the INPUT function as that is what fails.

If you know that all of the errors thrown by a Input function call are caused by this specific issue, such as = NA, you can suppress the messages by adding either the ? or ?? input modifier:

array_to(i) = INPUT(array_from(i), ? 2.);

or

array_to(i) = INPUT(array_from(i), ?? 2.);

The first probably is sufficient for your purpose.

The better method is to use a custom informat that knows how to handle NA as text input for text-to-numeric conversion.

The mathematical operation that couldn't be performed is the assignment to the numeric variable since the conversion failed at Input.

_maldini_ · Posted 01-12-2022 12:20 PM

@ballardw Thank you.

Does column 19 mean 19 spaces/characters from the left? So to locate the error I would go to line 95 and count 19 spaces/characters to the right?

ballardw · Posted 01-12-2022 12:27 PM

@_maldini_ wrote:

@ballardw Thank you.

<Column 19 refers to the position in that line with an issue.>

Does column 19 mean 19 spaces/characters from the left? So to locate the error I would go to line 95 and count 19 spaces/characters to the right?

That is correct.

So likely just emphasizing the the INPUT function is the cause of the error when the I of Input is at position 19. Not much else on line 95 to go wrong but knowing this is helpful when you have a line of code that includes multiple, and especially nested, functions.

Tom · Posted 01-11-2022 05:57 PM

Where did the data come from? Why not just read those variables as numeric to begin with instead of having to try to convert them from character to numeric.

The is no need to attach either a FORMAT or an INFORMAT to any of those variables. A FORMAT is special instructions for how to print the value. SAS's already does a good job of printing character strings and numbers without any need for special instructions. INFORMAT is special instructions for how to read character strings into values. Your values are already in the variables, so attaching an INFORMAT to any of those variables would not serve any purpose, plus SAS does not need any special instructions on how to read normal number or character strings.

In your INPUT() function call the `2.` is the INFORMAT to use to convert the string in the character variable into a new value. Since '2.' is a numeric informat the result will be a number. But note there is no need to limit the INPUT() function to reading just the 2 two characters from the character variable. The INPUT() function does not care if the width used on the informat is more than the length of the string being read. So just use the maximum allowed width for the numeric informat.

array_to(i) = INPUT(array_from(i),32.);

The string NA does not represent a numeric value to SAS. Why does your variable VAR_3 have the string NA in it? Do you want that to mean that the value is missing? If so you could ignore the note, since the result is a missing value. Or you could add logic to prevent the note. For example by testing the value first.

if upcase(array_form(i)) ne 'NA' then array_to(i) = INPUT(array_from(i),32.);

_maldini_ · Posted 01-12-2022 12:33 PM

@Tom Thank you.

The data is fabricated, but represents some of the situations I have faced in the past (e.g., "NA", truncated character strings). I'm trying learn how to change attributes using a DATA step so that I can fix errors that arise by using PROC IMPORT. I've received several data sets in the past w/ "NA" as a value for an otherwise numeric variable.

I realize you can use a DATA step to read-in the data in the first place, but I don't really know how to do that. Also, some of my data sets in the past have had many variables. It seems like setting the attributes of 86 variables, as an example, might be more work than fixing those that weren't read-in correctly. I'm open to better ideas if you have an opinion on that, however.

I'm not sure I understand this. I thought INFORMATS were how the values were stored. What do you mean by "read character strings into values"?

Got it. Thanks.

Tom · Posted 01-12-2022 12:56 PM

How the values are stored is determined by the TYPE of the variable.
SAS has only two types of variables: floating point numbers and fixed length character strings.

Formats convert values into text. Numeric formats convert numbers into text. Character formats convert character strings into text.

Informats convert text into values. Numeric informats create floating point numbers. Character informats create character strings.

In general you really only need to attach formats to DATE, TIME or DATETIME value so that the way the numbers used to represent those things will print in a way that humans will understand. DATE values are stored as the number of days since 1960, Time values as the number of seconds since midnight. DATETIME values as the number of seconds since 1960. Without a special format being used to display the value humans would not understand that the number 22,657 means the date 12JAN2022.

515   data _null_;
516     today=date();
517     put today = comma12. +1 today date9.;
518   run;

today=22,657  12JAN2022

Similarly when reading text into values in general you only need to worry about DATE, TIME and DATETIME values. Otherwise SAS does not know to interpret the string "12JAN2022" as meaning 22,657 days. But you might also find the COMMA informat useful as it will read strings that contain commas and/or dollar signs by just ignoring those characters and reading the remaining digits and optional decimal point as a number.

_maldini_ · Posted 01-12-2022 11:45 PM

@Tom

What about the other direction (i.e., numeric to character)?

NewID = PUT(StudentID, 3.);

What does '3.' represent here? Is it also a numeric informat that would be better off being set '32.'?

Tom · Posted 01-13-2022 07:08 AM

The PUT() function use FORMATs, not INFORMATs.

Formats convert values to text. Informats convert text to values.

The width used to read a character string into a number does not make a difference since there is only one way to store any given number. (Note if the width used on the informat is shorter than the string being read there could be an impact when converting text into numbers, since then the trailing characters in the string are ignored when deciding what number they represent. If you read the string '1234' with the informat 3. you get the number 123 instead of the number 1,234 since the '4' is not read.)

The width used to write a number makes a big difference in the value being stored because there are many ways to write the same number as a character string. The value 10 written with 5 characters is a different string than the same number written using 6 characters, there will be an another space character written. Or you could write the number with leading zeros instead of leading spaces. Or trailing spaces instead of leading spaces.

If the values can only be integers (or at least you want the value printed into the character variable to be rounded to an integer) then using a format like 32. might be appropriate.

But if you know the values are integers that are always between 100 and 999 then using the format specification of 3. makes sense. If the the values can be less then 100 do you want leading zeros instead of leading spaces? Use the format specification Z3. instead to get leading zeros. Of use the format modifier -L to remove the leading spaces.

CharID = PUT(StudentID, Z3.);
CharID = PUT(StudentID, 3.-L);

If the values contain fractional parts then you might want to use a fixed number of decimal places by using a format specification like 10.3. Or if the range of values is extreme you might want to use the BEST format so that SAS will try to use the best format for the value that can fit into the number of characters specified by the width used in the format specification.

Kurt_Bremser · Posted 01-13-2022 04:21 AM

NOTE: Invalid argument to function INPUT at line 95 column 19.
var_5=Mi id=1 var_1=1.2 var_3=NA var_4=1 var_3_new=. var_4_new=1 i=3 _ERROR_=1 _N_=1

This tells you where in the code the problem happened, so let's look at log line 95:

95 array_to(i) = INPUT(array_from(i),2.);

To find the offending value, we need to know in which data line it happened, and in which member of the array. The first log excerpt tells us that:

_N_=1

It happened in the first iteration of the data step, so in the first observation of the incoming dataset.

i=3

It happened with the third member of the array, which is

var_3=NA

Oh yeah, "NA" is surely not a valid numeric value. You can work around this by defining a custom informat:

proc format;
invalue mynum
  "NA" = .
  other = [2.]
;
run;

and use that instead of the simple 2. informat in the INPUT function:

array_to(i) = INPUT(array_from(i),mynum2.);

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Using an array to convert character variables to numeric

Re: Using an array to convert character variables to numeric

Re: Using an array to convert character variables to numeric

Re: Using an array to convert character variables to numeric

Re: Using an array to convert character variables to numeric

Re: Using an array to convert character variables to numeric

Re: Using an array to convert character variables to numeric

Re: Using an array to convert character variables to numeric

Re: Using an array to convert character variables to numeric

Re: Using an array to convert character variables to numeric

Re: Using an array to convert character variables to numeric

SAS Innovate 2025: Call for Content