Hello Team,
I need assistance with performing the following task.
Any help or guidance is much appreciated.
I would instead just take the first word in the string and convert that to a number.
data want;
set have;
if cylinders ne ' ' then number_of_cylinders=input(scan(cylinders,1,' '),32.);
run;
I would instead just take the first word in the string and convert that to a number.
data want;
set have;
if cylinders ne ' ' then number_of_cylinders=input(scan(cylinders,1,' '),32.);
run;
Thank you, Tom! I will give this a try and see if it works.
What proportion of your data has "other" for number of cylinders? Just applying a mean will likely skew your results particularly if the "other" category is large. Why don't you profile cylinders by model? There are likely models which only have 4 or 6 or 8 cylinders so you can assign these accurately.
So, I have a total of 426,880 observations with missing values. After filtering the extreme observations from the price, year, and odometer variables, I am left with 349,188 observations with missing values. To answer your question,
What proportion of your data has "other" for the number of cylinders?
I have 784 observations or 0.38% as "other" in the cylinders column.
However, in the number_of_cylinders column, I have a total of 144,722 missing values.
I think it is OK to impute the mean number of cylinders for the 144,722 missing values since there is a small number of observations as "other."
As for the question, "why don't you profile cylinders by model," I am using the following dataset: https://www.kaggle.com/austinreese/craigslist-carstrucks-data
I would not know how I would profile each cylinder by the model where there are hundred of thousands of them.
What I mean is for example Ford F-150 models might be all 8 cylinder so you can assign 8 to all of those. A PROC FREQ like this will allow you to discover what models have what cylinders:
proc freq data = have;
table model * cylinders / missing list;
run;
That makes sense. However, is there a "simpler" or more optimal process on how to do this for all 349K observations? I get that the Ford F-150 is 8 cylinders, but how I would possibly accurately fill in all the remaining cylinders for each and every model in the dataset?
Did you run the program I posted? That will list all unique models by all unique cylinders within each model. From that you could derive a list of all models that only have a certain number of cylinders, and then apply that to your data.
However, from what you have posted of your data, model appears to have a lot of inconsistencies. I suggest it would be time well spent to standardise your model text before proceeding with any logic or analysis using it.
Yes, I did. It worked just fine however there are hundreds of thousands of results that my PC lags when scrolling down to view the results.
P.S. I am no SAS expert or statistician let alone possess the knowledge that you have. I am a novice who started learning at the start of this year. Hence, why I asked since I am stuck in the woodworks.
That suggests to me that you have a lot of model inconsistencies. You need to standardise these first before making further progress. If you try this:
proc freq data = have;
where upcase(model) contains 'F150'
or upcase(model) contains 'F-150'
or upcase(model) contains 'F-150';
table model / missing;
run;
If the result is hundreds of rows then you need to clean model up by assigning a new model category that is probably just a few rows. Checking a Ford website should get you a list of the official mode names.
@Vicente95 wrote:
As for the question, "why don't you profile cylinders by model," I am using the following dataset: https://www.kaggle.com/austinreese/craigslist-carstrucks-data
I would not know how I would profile each cylinder by the model where there are hundred of thousands of them.
Craigslist is not what I would call a reliable data source. There is no requirement for anyone to post much of anything about any a particular item listed. You don't even know if the engine in the car when offered for sale is the original engine. You may well have a Chevrolet engine in a Ford.
I agree with what you said, however I was assigned this dataset as my personal project. There is not much that I could do other than use it.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.