BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Indescribled
Obsidian | Level 7

 

I am working through Tidy Tuesday data to learn how to do stuff I know in R in SAS.

 

I am having trouble converting values from characters to numeric. After doing the conversion, there are still periods in the data. I tried filtering to values greater than 0 or filtering where the values are not ".", but neither worked.

 

How can I filter these issue values out so I can make a boxplot?  Also, if you have any efficiency tips to make my code smoother/better I would also welcome those. 

 

Below code is should be reproducible. 

* Get data 1;
filename test1234 temp;
proc http
 url="https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-07/pit_stops.csv"
 method="GET"
 out=test1234;
run;

proc import out=pit_stops datafile=test1234 dbms=csv replace; 
	guessingrows = max; 
run;

* Get data 2;
filename racecsv temp;
proc http
 url="https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-07/races.csv"
 method="GET"
 out=racecsv;
run;

proc import out=races datafile=racecsv dbms=csv replace; 
	guessingrows = max; 
run;

* Merge/join the data;
* Sorting is required in SAS to merge correctly;
proc sort data=pit_stops;
	by raceId;
run;

proc sort data=races;
	by raceId;
run;

data pit_stopdf;
	merge pit_stops races;
	by raceId;
run;

* Drop all(?) blank rows;
data pit_stopdf;
	set pit_stopdf;
	where year > 2010;
run;

* Change duration to numeric;
* Something is still wrong, there are values with just . in the duration data;
data pit_stopdf;
	set pit_stopdf;
	duration = input(duration, comma5.3);
run;

* Attempt 1 to remove "." from duration;
* This did not work;
data pit_stopsdf;
	set pit_stopsdf;
	where duration > 0;
run;

* Attempt 2 to remove ".";
* Also did not work;
data pit_stopsdf;
	set pit_stopsdf;
	where duration ^= ".";
run;

* Plot the data, currently errors;
proc sgplot data=pit_stopdf;
	vbox duration / category=year;
	title 'Formula1 pit stop duration';
	xaxis label = "Year";
	yaxis label = "Duration";
run;
title;

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

The "period" in a numeric variable is the default displayed character to indicate missing.

You can change the display character with the Option missing=' '; put whichever single character you want to see between the quotes.

 

Before attempting to convert variables from character to numeric you should first confirm that the variable is indeed character. Proc Contents is your friend.

 

 

data pit_stopdf;
	set pit_stopdf;
	duration = input(duration, comma5.3);
run;

The LOG for the data step above, if duration is character will show some sort of interesting message like this:

233  data want;
234     set junk;
235     duration = input(duration,comma8.);
236  run;

NOTE: Numeric values have been converted to character values at the places given by:
      (Line):(Column).
      235:15

The line is 235 in the log, the column 15 is the start of Input. Which means that something input converted to numeric was placed into a character (because that is the type of Duration in this case).

 

When using numeric formats you seldom want to use a decimal position because it will imply a decimal exists if there is not actually one in the text read. Example:

data junk;
   input duration $;
datalines;
123,456
123456
123.456
;

data want;
   set junk;
   num_duration = input(duration,comma8.3);
run;

If you use the Comma8. informat you will get the expected values.

 

 

Second, SAS will not let you change the type of a variable from character to numeric or vice versa. To get a numeric value, assuming duration is character, you should create a new variable as you can't use the old one.

 

Also "missing" is less than any actual value.

 

Anything using the numeric value will ignore those values except for specific places that allow you to specify "use missing" in one form or another. For a box plot your data would be just fine without any adjustments.

 

The notion of missing is important as you would typically not want those values used in calculations.

 

There are also "special" missing values that you can assign as .A, .B, ... .Z and ._  See that each of those has a preceding dot. These are also not used for calculations but could be used to indicate different reasons why data is missing such as "Not asked", "respondent refused", "instrument failure", "biologically implausible value" (such as person with a height supposed of 9 feet. You would then see letters or the _ character in your data (Hint: don't use letters or _ as the missing character) in the data but all would not be used for calculation.

 

You could then create custom formats to associate specific text with the meaning for the missing values.

View solution in original post

2 REPLIES 2
ballardw
Super User

The "period" in a numeric variable is the default displayed character to indicate missing.

You can change the display character with the Option missing=' '; put whichever single character you want to see between the quotes.

 

Before attempting to convert variables from character to numeric you should first confirm that the variable is indeed character. Proc Contents is your friend.

 

 

data pit_stopdf;
	set pit_stopdf;
	duration = input(duration, comma5.3);
run;

The LOG for the data step above, if duration is character will show some sort of interesting message like this:

233  data want;
234     set junk;
235     duration = input(duration,comma8.);
236  run;

NOTE: Numeric values have been converted to character values at the places given by:
      (Line):(Column).
      235:15

The line is 235 in the log, the column 15 is the start of Input. Which means that something input converted to numeric was placed into a character (because that is the type of Duration in this case).

 

When using numeric formats you seldom want to use a decimal position because it will imply a decimal exists if there is not actually one in the text read. Example:

data junk;
   input duration $;
datalines;
123,456
123456
123.456
;

data want;
   set junk;
   num_duration = input(duration,comma8.3);
run;

If you use the Comma8. informat you will get the expected values.

 

 

Second, SAS will not let you change the type of a variable from character to numeric or vice versa. To get a numeric value, assuming duration is character, you should create a new variable as you can't use the old one.

 

Also "missing" is less than any actual value.

 

Anything using the numeric value will ignore those values except for specific places that allow you to specify "use missing" in one form or another. For a box plot your data would be just fine without any adjustments.

 

The notion of missing is important as you would typically not want those values used in calculations.

 

There are also "special" missing values that you can assign as .A, .B, ... .Z and ._  See that each of those has a preceding dot. These are also not used for calculations but could be used to indicate different reasons why data is missing such as "Not asked", "respondent refused", "instrument failure", "biologically implausible value" (such as person with a height supposed of 9 feet. You would then see letters or the _ character in your data (Hint: don't use letters or _ as the missing character) in the data but all would not be used for calculation.

 

You could then create custom formats to associate specific text with the meaning for the missing values.

Indescribled
Obsidian | Level 7

Thanks for the explanation, that helps me understand how missing values work.

I tried running the code without any filtering for the missing values, but I still get an error. Specifically: ERROR: The VBOX variable must be numeric.

I assumed this error was from the missing values. Perhaps there is some other error that is causing it?

 

EDIT: I found the solution after the responder edited his post. This line is the correct solution "Second, SAS will not let you change the type of a variable from character to numeric or vice versa. To get a numeric value, assuming duration is character, you should create a new variable as you can't use the old one." As someone coming from R you can change one back and forth, so this was not intuitive for me in SAS.

 

Here is the corrected code that works if anyone stumbles on this in the future:

* Get data 1;
filename test1234 temp;
proc http
 url="https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-07/pit_stops.csv"
 method="GET"
 out=test1234;
run;

proc import out=pit_stops datafile=test1234 dbms=csv replace; 
	guessingrows = max; 
run;

* Get data 2;
filename racecsv temp;
proc http
 url="https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-07/races.csv"
 method="GET"
 out=racecsv;
run;

proc import out=races datafile=racecsv dbms=csv replace; 
	guessingrows = max; 
run;

* Merge/join the data;
* Sorting is required in SAS to merge correctly;
proc sort data=pit_stops;
	by raceId;
run;

proc sort data=races;
	by raceId;
run;

data pit_stopdf;
	merge pit_stops races;
	by raceId;
run;

* Drop all(?) blank rows;
data pit_stopdf;
	set pit_stopdf;
	where year > 2010;
run;

* Change duration to numeric;
data pit_stopdf;
	set pit_stopdf;
	duration_numeric = input(duration, comma8.);
run;

* Plot the data;
proc sgplot data=pit_stopdf;
	vbox duration_numeric / category=year;
	title 'Formula1 pit stop duration';
	xaxis label = "Year";
	yaxis label = "Duration";
run;
title;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 2 replies
  • 2948 views
  • 1 like
  • 2 in conversation