Hello Team,
I need assistance with performing the following task.
Any help or guidance is much appreciated.
I would instead just take the first word in the string and convert that to a number.
data want;
set have;
if cylinders ne ' ' then number_of_cylinders=input(scan(cylinders,1,' '),32.);
run;
I would instead just take the first word in the string and convert that to a number.
data want;
set have;
if cylinders ne ' ' then number_of_cylinders=input(scan(cylinders,1,' '),32.);
run;
Thank you, Tom! I will give this a try and see if it works.
What proportion of your data has "other" for number of cylinders? Just applying a mean will likely skew your results particularly if the "other" category is large. Why don't you profile cylinders by model? There are likely models which only have 4 or 6 or 8 cylinders so you can assign these accurately.
So, I have a total of 426,880 observations with missing values. After filtering the extreme observations from the price, year, and odometer variables, I am left with 349,188 observations with missing values. To answer your question,
What proportion of your data has "other" for the number of cylinders?
I have 784 observations or 0.38% as "other" in the cylinders column.
However, in the number_of_cylinders column, I have a total of 144,722 missing values.
I think it is OK to impute the mean number of cylinders for the 144,722 missing values since there is a small number of observations as "other."
As for the question, "why don't you profile cylinders by model," I am using the following dataset: https://www.kaggle.com/austinreese/craigslist-carstrucks-data
I would not know how I would profile each cylinder by the model where there are hundred of thousands of them.
What I mean is for example Ford F-150 models might be all 8 cylinder so you can assign 8 to all of those. A PROC FREQ like this will allow you to discover what models have what cylinders:
proc freq data = have;
table model * cylinders / missing list;
run;
That makes sense. However, is there a "simpler" or more optimal process on how to do this for all 349K observations? I get that the Ford F-150 is 8 cylinders, but how I would possibly accurately fill in all the remaining cylinders for each and every model in the dataset?
Did you run the program I posted? That will list all unique models by all unique cylinders within each model. From that you could derive a list of all models that only have a certain number of cylinders, and then apply that to your data.
However, from what you have posted of your data, model appears to have a lot of inconsistencies. I suggest it would be time well spent to standardise your model text before proceeding with any logic or analysis using it.
Yes, I did. It worked just fine however there are hundreds of thousands of results that my PC lags when scrolling down to view the results.
P.S. I am no SAS expert or statistician let alone possess the knowledge that you have. I am a novice who started learning at the start of this year. Hence, why I asked since I am stuck in the woodworks.
That suggests to me that you have a lot of model inconsistencies. You need to standardise these first before making further progress. If you try this:
proc freq data = have;
where upcase(model) contains 'F150'
or upcase(model) contains 'F-150'
or upcase(model) contains 'F-150';
table model / missing;
run;
If the result is hundreds of rows then you need to clean model up by assigning a new model category that is probably just a few rows. Checking a Ford website should get you a list of the official mode names.
@Vicente95 wrote:
As for the question, "why don't you profile cylinders by model," I am using the following dataset: https://www.kaggle.com/austinreese/craigslist-carstrucks-data
I would not know how I would profile each cylinder by the model where there are hundred of thousands of them.
Craigslist is not what I would call a reliable data source. There is no requirement for anyone to post much of anything about any a particular item listed. You don't even know if the engine in the car when offered for sale is the original engine. You may well have a Chevrolet engine in a Ford.
I agree with what you said, however I was assigned this dataset as my personal project. There is not much that I could do other than use it.
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Select SAS Training centers are offering in-person courses. View upcoming courses for: