Solved: Re: Removing the last 10 characters from a string and converting it to...

Vicente95 · Posted 06-27-2021 01:29 PM

Hello Team,

I need assistance with performing the following task.

I have a string/char column called cylinders in my used cars dataset that is formatted as a string with a length of 12 characters. It resembles the following:

I need to strip the last ten characters (space followed by cylinders) where the last character is the number of cylinders (i.e., 8, 6, or 4). The new column name for this task should be called number_of_cylinders.
I also need to preserve the old cylinders column with all its original values and properties so it remains intact.

Any help or guidance is much appreciated.

Tom · Posted 06-27-2021 01:40 PM

I would instead just take the first word in the string and convert that to a number.

data want;
  set have;
  if cylinders ne ' ' then number_of_cylinders=input(scan(cylinders,1,' '),32.);
run;

View solution in original post

Vicente95 · Posted 06-27-2021 01:38 PM

Note, I need to convert all the observations in the new column, number_of_cylinders, to numeric as well. The cylinders column has observations such as 10 cylinders, 8 cylinders, 6 cylinders, 4 cylinders, other, blank (i.e., missing values), etc.

Tom · Posted 06-27-2021 01:40 PM

I would instead just take the first word in the string and convert that to a number.

data want;
  set have;
  if cylinders ne ' ' then number_of_cylinders=input(scan(cylinders,1,' '),32.);
run;

Vicente95 · Posted 06-27-2021 01:42 PM

Thank you, Tom! I will give this a try and see if it works.

Vicente95 · Posted 06-27-2021 02:50 PM

Ok, your SAS code worked. Thank you very much 😊.

A couple more follow up questions if you don't mind:

1. There are thousands of cars where their cylinders are marked as "other". After creating a new column, how would the values with "other" be handled in number_of_cylinders? Ideally, I think they should be left as "other" right? I am thinking if I should replace them with the mean number of cylinders as shown below.

2. How would I fill in the missing values in the column, number_of_cylinders with the mean of the number of cylinders? I calculated the mean as 6 (rounded up to the nearest while number) using PROC MEANS.

SASKiwi · Posted 06-27-2021 04:41 PM

What proportion of your data has "other" for number of cylinders? Just applying a mean will likely skew your results particularly if the "other" category is large. Why don't you profile cylinders by model? There are likely models which only have 4 or 6 or 8 cylinders so you can assign these accurately.

Vicente95 · Posted 06-27-2021 08:36 PM

So, I have a total of 426,880 observations with missing values. After filtering the extreme observations from the price, year, and odometer variables, I am left with 349,188 observations with missing values. To answer your question,

What proportion of your data has "other" for the number of cylinders?

I have 784 observations or 0.38% as "other" in the cylinders column.

However, in the number_of_cylinders column, I have a total of 144,722 missing values.

I think it is OK to impute the mean number of cylinders for the 144,722 missing values since there is a small number of observations as "other."

Vicente95 · Posted 06-27-2021 08:38 PM

As for the question, "why don't you profile cylinders by model," I am using the following dataset: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

I would not know how I would profile each cylinder by the model where there are hundred of thousands of them.

SASKiwi · Posted 06-27-2021 09:05 PM

What I mean is for example Ford F-150 models might be all 8 cylinder so you can assign 8 to all of those. A PROC FREQ like this will allow you to discover what models have what cylinders:

proc freq data = have;
  table model * cylinders / missing list;
run;

Vicente95 · Posted 06-28-2021 12:36 AM

That makes sense. However, is there a "simpler" or more optimal process on how to do this for all 349K observations? I get that the Ford F-150 is 8 cylinders, but how I would possibly accurately fill in all the remaining cylinders for each and every model in the dataset?

SASKiwi · Posted 06-28-2021 02:25 AM

Did you run the program I posted? That will list all unique models by all unique cylinders within each model. From that you could derive a list of all models that only have a certain number of cylinders, and then apply that to your data.

However, from what you have posted of your data, model appears to have a lot of inconsistencies. I suggest it would be time well spent to standardise your model text before proceeding with any logic or analysis using it.

Vicente95 · Posted 06-28-2021 11:45 AM

Yes, I did. It worked just fine however there are hundreds of thousands of results that my PC lags when scrolling down to view the results.

P.S. I am no SAS expert or statistician let alone possess the knowledge that you have. I am a novice who started learning at the start of this year. Hence, why I asked since I am stuck in the woodworks.

SASKiwi · Posted 06-28-2021 06:52 PM

That suggests to me that you have a lot of model inconsistencies. You need to standardise these first before making further progress. If you try this:

proc freq data = have;
  where upcase(model) contains 'F150' 
  or upcase(model) contains 'F-150'
  or upcase(model) contains 'F-150';
  table model / missing;
run;

If the result is hundreds of rows then you need to clean model up by assigning a new model category that is probably just a few rows. Checking a Ford website should get you a list of the official mode names.

ballardw · Posted 06-28-2021 05:34 AM

@Vicente95 wrote:

As for the question, "why don't you profile cylinders by model," I am using the following dataset: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

I would not know how I would profile each cylinder by the model where there are hundred of thousands of them.

Craigslist is not what I would call a reliable data source. There is no requirement for anyone to post much of anything about any a particular item listed. You don't even know if the engine in the car when offered for sale is the original engine. You may well have a Chevrolet engine in a Ford.

Vicente95 · Posted 06-28-2021 11:47 AM

I agree with what you said, however I was assigned this dataset as my personal project. There is not much that I could do other than use it.

Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Re: Removing the last 10 characters from a string and converting it to a numeric variable

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!