simmaee
Calcite | Level 5

Hello all, 

Need help with this class assignment to better understand the two types of variables:

 

Please help answer the question below:

 

Screen Shot 2023-01-26 at 2.01.36 PM.png

 

4 REPLIES
ballardw
Super User

Which question? There are three.

 

In general, I want to know the units of a measure, if any, and how the data were collected before deciding.

I ask because I am aware of many surveys that collect data in Yes / No / Refused / Don't know categories but store the values as, for example, 1, 2, 7 and 9. Summary statistics for such a variable could show a minimum of 1, a median of 2 and a maximum of 9.
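
A minimal sketch of that situation, using made-up responses and a hypothetical format name (ynfmt); none of it comes from the screenshot. PROC MEANS summarizes the codes without complaint (here minimum 1, median 2, maximum 9), while PROC FREQ with the format applied shows that the values are really category labels.

```sas
/* Hypothetical coding: 1=Yes, 2=No, 7=Refused, 9=Don't know */
proc format;
  value ynfmt 1 = 'Yes'
              2 = 'No'
              7 = 'Refused'
              9 = "Don't know";
run;

/* Made-up responses, for illustration only */
data survey;
  input answer @@;
  datalines;
1 1 2 2 2 7 9
;
run;

/* Numeric summaries run fine but are not meaningful for category codes */
proc means data=survey n min median max;
  var answer;
run;

/* The formatted frequency table shows what the codes stand for */
proc freq data=survey nlevels;
  tables answer / missing;
  format answer ynfmt.;
run;
```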

 

If I were not allowed any information about how the data were collected, I would use a tool such as PROC FREQ with the NLEVELS option to see how many distinct values are involved. If the number of distinct values comes back as 4, as in my example above, that would push the decision towards categorical.

 

Some variables could be treated as either type, depending on the specific analysis attempted or the question asked.

 

sbxkoenk
SAS Super FREQ

"Smoking"-var has the label "Weight" which is weird.
I would expect "Smoking"-var to be a YES vs. NO variable (1 versus 0).
That might be the case, but the < Maximum > equals 60??
The < median > however equals 1.
This means that :

  • at least 50% of the 5173 non-missing values (records) have 1 or higher as a value for "Smoking"-var and
  • at least 50% of the 5173 non-missing values (records) have 1 or lower as a value for "Smoking"-var.

I think "Smoking"-var should be categorical (i.e. a CLASS effect if you build a model, or a binary target if you want to explain / predict it).

 

Check with PROC FREQ and the NLEVELS option to find the cardinality of "Smoking"-var.

 

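/* NLEVELS adds a table with the number of distinct values per variable */
PROC FREQ data=have NLEVELS;
  tables Smoking / missing;  /* MISSING treats missing values as a valid level */
run;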
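
To compare the cardinality of every numeric variable at once, the NLEVELS table can be captured as a data set. This is only a sketch: it assumes the output comes from a data set like SASHELP.HEART (a guess based on the variable names; substitute your own data set), and work.levels is a hypothetical output name.

```sas
/* Assumption: SASHELP.HEART stands in for the poster's data set */
ods output nlevels=work.levels;   /* capture the NLEVELS table as a data set */

proc freq data=sashelp.heart nlevels;
  tables _numeric_ / noprint;     /* skip the one-way tables, keep the level counts */
run;

/* One row per variable with its number of distinct levels */
proc print data=work.levels noobs;
run;
```

A variable with only a handful of levels is a candidate for categorical treatment; one with dozens of levels is more likely a count or a measurement.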

ballardw
Super User

@sbxkoenk wrote:

"Smoking"-var has the label "Weight" which is weird.

 

That's just poor rendering. I looked at it several times before realizing that the label that appears to belong to Smoking is the last word of the label for the variable above it, MRW, where the label "Metropolitan Relative Weight" makes some sense.

The "Age at Death" label for Cholesterol shows the same lack of care with alignment by whoever prepared that image.

PaigeMiller
Diamond | Level 26

So what do you think about Smoking? You didn't tell us.

 

Perhaps you should go back to whoever provided this example output and ask why the label for SMOKING is WEIGHT. Or ask how SMOKING can have a mean of 9.366. I think that is the next step.

 

This isn't our data, and I doubt anyone here can explain what variable SMOKING represents or why the mean is 9.366. As it is, I think this is an extremely poor class example.

--
Paige Miller
