BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Vicente95
Obsidian | Level 7

Hello Team,

 

I need assistance with performing the following task.

  • I have a string/char column called cylinders in my used cars dataset that is formatted as a string with a length of 12 characters. It resembles the following:

Vicente95_2-1624814962084.png

 

Vicente95_1-1624814824759.png

 

  • I need to strip the last ten characters (space followed by cylinders) where the last character is the number of cylinders (i.e., 8, 6, or 4). The new column name for this task should be called number_of_cylinders.
  • I also need to preserve the old cylinders column with all its original values and properties so it remains intact. 

Any help or guidance is much appreciated. 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

I would instead just take the first word in the string and convert that to a number.

data want;
  set have;
  if cylinders ne ' ' then number_of_cylinders=input(scan(cylinders,1,' '),32.);
run;

View solution in original post

15 REPLIES 15
Vicente95
Obsidian | Level 7
Note, I need to convert all the observations in the new column, number_of_cylinders, to numeric as well. The cylinders column has observations such as 10 cylinders, 8 cylinders, 6 cylinders, 4 cylinders, other, blank (i.e., missing values), etc.
Tom
Super User Tom
Super User

I would instead just take the first word in the string and convert that to a number.

data want;
  set have;
  if cylinders ne ' ' then number_of_cylinders=input(scan(cylinders,1,' '),32.);
run;
Vicente95
Obsidian | Level 7

Thank you, Tom! I will give this a try and see if it works. 

Vicente95
Obsidian | Level 7
Ok, your SAS code worked. Thank you very much 😊.

A couple more follow up questions if you don't mind:

1. There are thousands of cars where their cylinders are marked as "other". After creating a new column, how would the values with "other" be handled in number_of_cylinders? Ideally, I think they should be left as "other" right? I am thinking if I should replace them with the mean number of cylinders as shown below.

2. How would I fill in the missing values in the column, number_of_cylinders with the mean of the number of cylinders? I calculated the mean as 6 (rounded up to the nearest while number) using PROC MEANS.
SASKiwi
PROC Star

What proportion of your data has "other" for number of cylinders? Just applying a mean will likely skew your results particularly if the "other" category is large. Why don't you profile cylinders by model? There are likely models which only have 4 or 6 or 8 cylinders so you can assign these accurately. 

Vicente95
Obsidian | Level 7

So, I have a total of 426,880 observations with missing values. After filtering the extreme observations from the price, year, and odometer variables, I am left with 349,188 observations with missing values. To answer your question, 

 

What proportion of your data has "other" for the number of cylinders?

I have 784 observations or 0.38% as "other" in the cylinders column.

 

Vicente95_0-1624840464487.png

 

However, in the number_of_cylinders column, I have a total of 144,722 missing values.

 

Vicente95_1-1624840518430.png

 

I think it is OK to impute the mean number of cylinders for the 144,722 missing values since there is a small number of observations as "other."

 

Vicente95
Obsidian | Level 7

As for the question, "why don't you profile cylinders by model," I am using the following dataset: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

 

I would not know how I would profile each cylinder by the model where there are hundred of thousands of them.

 

 

SASKiwi
PROC Star

What I mean is for example Ford F-150 models might be all 8 cylinder so you can assign 8 to all of those. A PROC FREQ like this will allow you to discover what models have what cylinders:

proc freq data = have;
  table model * cylinders / missing list;
run;
Vicente95
Obsidian | Level 7

That makes sense. However, is there a "simpler" or more optimal process on how to do this for all 349K observations? I get that the Ford F-150 is 8 cylinders, but how I would possibly accurately fill in all the remaining cylinders for each and every model in the dataset? 

SASKiwi
PROC Star

Did you run the program I posted? That will list all unique models by all unique cylinders within each model. From that you could derive a list of all models that only have a certain number of cylinders, and then apply that to your data.

 

However, from what you have posted of your data, model appears to have a lot of inconsistencies. I suggest it would be time well spent to standardise your model text before proceeding with any logic or analysis using it. 

Vicente95
Obsidian | Level 7

Yes, I did. It worked just fine however there are hundreds of thousands of results that my PC lags when scrolling down to view the results. 

 

P.S. I am no SAS expert or statistician let alone possess the knowledge that you have. I am a novice who started learning at the start of this year. Hence, why I asked since I am stuck in the woodworks. 

SASKiwi
PROC Star

That suggests to me that you have a lot of model inconsistencies. You need to standardise these first before making further progress. If you try this:

proc freq data = have;
  where upcase(model) contains 'F150' 
  or upcase(model) contains 'F-150'
  or upcase(model) contains 'F-150';
  table model / missing;
run;

If the result is hundreds of rows then you need to clean model up by assigning a new model category that is probably just a few rows. Checking a Ford website should get you a list of the official mode names.

ballardw
Super User

@Vicente95 wrote:

As for the question, "why don't you profile cylinders by model," I am using the following dataset: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

 

I would not know how I would profile each cylinder by the model where there are hundred of thousands of them.

 

 


Craigslist is not what I would call a reliable data source. There is no requirement for anyone to post much of anything about any a particular item listed. You don't even know if the engine in the car when offered for sale is the original engine. You may well have a Chevrolet engine in a Ford.

Vicente95
Obsidian | Level 7

I agree with what you said, however I was assigned this dataset as my personal project. There is not much that I could do other than use it. 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 1119 views
  • 2 likes
  • 4 in conversation