Re: How do I correct for missing values?

trash · Posted 10-10-2017 11:09 AM

So here's a data set:

This is a set of observations on 27 people. Several things were noted. The first variable is the hospital in which the data

was collected, the second variable is the gender of the person, then a measure of the amount of iron in the blood called

hemoglobin (Haemo), then packed cell volume (Pcv), white blood cell count (Wbc), the number of lymphocytes (Lympho), neutrophil (Neutro), and serum lead concentration (Lead).

hosp	gender	haemo	pcv	wbc	lympho	neutro	lead
A	M	13.4	39	4100	14	25	17
B	f	14.6	46	5000	15	30	20
a	F	13.5	42	4500	19	21	18
A	M	15	46	4600	23	16	18
b	m	14.6	44	5100	17	.	19
B	m	14	44	4900	20	24	19
b	f	16.4	49	4300	21	17	18
A	f	14.8	44	4400	16	26	29
a	F	15.2	46	4100	27	13	27
B	M	15.5	48	8400	34	42	36
A	M	15.2	47	.	26	27	22
B	F	16.9	50	5100	28	17	23
a	F	14.8	44	4700	24	20	23
A	M	16.2	45	5600	26	25	19
b	M	14.7	43	4000	23	13	17
A	F	14.7	42	3400	9	22	13
B	m	16.5	45	5400	18	32	17
B	f	15.4	45	6900	28	36	24
B	M	15.1	45	4600	.	29	17
a	M	14.2	46	4200	14	25	28
b	F	15.9	46	5200	8	34	16
B	f	16	47	4700	25	14	18
A	F	17.4	50	8600	37	.	17
A	M	14.3	43	5500	20	31	19
B	F	14.8	44	4200	15	24	19
a	f	15.5	.	4900	17	27	16
a	m	26.5	.	5200	19	32	21

1) the missing data is noted with a period (.). How do write a statement for this? I would import data as a .csv, after would I write the missing data statement, then run it? I also need to account for the upper case letters vs lower case in the first two columns. Would I write a if-else statement? or would I just need to proc format or proc label? and what order should it be in?

here's my start:

data Hospital;
infile 'D:\hospital.csv'
delimiter= ','
firstobs=2;
input hosp $ gender $ haemo pcv wbc lympho neutro lead;
run;

mkeintz · Posted 10-10-2017 11:18 AM

What else did the log say, aside from 0 observations?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Reeza · Posted 10-10-2017 11:19 AM

@trash wrote:

1) the missing data is noted with a period (.). How do write a statement for this?

When you import the data, check your imported data set. Either use PROC PRINT to display it or open the data set and explore it. What you want to 'write a statement' for is not clear.

I also need to account for the upper case letters vs lower case in the first two columns. Would I write a if-else statement? or would I just need to proc format or proc label? (what even is the difference?) and what order should it be in?

Look at the UPCASE() and LOWCASE() function that will convert all to the same case. Yes, this matters if they are equivalent and you want to treat them the same.

http://video.sas.com/detail/videos/sas-analytics-u/video/4573023399001/creating-a-new-column-in-sas?...

I had trouble importing data (kept saying 0 observations) so I couldn't work much farther.

Please see this video on SAS UE on how to import your data:

http://video.sas.com/detail/videos/sas-analytics-u/video/4573016758001/creating-a-sas-table-from-a-c...

Without the log we have no idea what went wrong. Always read your log. If your code doesn't work and you need help from here make sure to include your code and log.

In general, it sounds like you're just starting with SAS. The first SAS programming course is free and there are many videos here:

http://video.sas.com/#category/videos/sas-analytics-u

And a lot of great tutorials are here:

https://stats.idre.ucla.edu/sas/modules/

ballardw · Posted 10-10-2017 11:23 AM

It isn't clear if you successfully read the data or not. It might help to post the LOG with the code for the data step and any messages. Post into a code box opened with the forum {I} menu icon.

If the data is not sensitive, (containing personally identifiable info for example) you might copy a few lines from the CSV file using a text editor like Notepad or Word Pad (NOT Excel it reformats data) and post that into a code box as well.

Easiest way to address case issues like "m", "M", "f" and "F" in your gender field would be to add this like of code to the read program:

gender = upcase(gender); if you want consistent "M" and "F". If you want "m" and "f" use the lowcase function. You may want to do the same thing for your hospital variable unless "a" is a different hospital than "A". The way you are reading hospital will limit the value to 8 characters. If it should be longer you need to specify a longer length with one of an informat, length or attrib statement for hospital prior to the input statement.

In a csv file a missing value would most likely look like two commas together or two commas and a space: ,, or , ,

and SAS will treat either of those a missing for numeric variable.

trash · Posted 10-11-2017 07:11 PM

when you say 8 characters is that a default? It doesn't need to be longer. A=a for this data set.

here is the data set again from a notepad:

A,M,13.4,39,4100,14,25,17
B,f,14.6,46,5000,15,30,20
a,F,13.5,42,4500,19,21,18
A,M,15,46,4600,23,16,18
b,m,14.6,44,5100,17,,19
B,m,14,44,4900,20,24,19
b,f,16.4,49,4300,21,17,18
A,f,14.8,44,4400,16,26,29
a,F,15.2,46,4100,27,13,27
B,M,15.5,48,8400,34,42,36
A,M,15.2,47,,26,27,22
B,F,16.9,50,5100,28,17,23
a,F,14.8,44,4700,24,20,23
A,M,16.2,45,5600,26,25,19
b,M,14.7,43,4000,23,13,17
A,F,14.7,42,3400,9,22,13
B,m,16.5,45,5400,18,32,17
B,f,15.4,45,6900,28,36,24
B,M,15.1,45,4600,,29,17
a,M,14.2,46,4200,14,25,28
b,F,15.9,46,5200,8,34,16
B,f,16,47,4700,25,14,18
A,F,17.4,50,8600,37,,17
A,M,14.3,43,5500,20,31,19
B,F,14.8,44,4200,15,24,19
a,f,15.5,,4900,17,27,16
a,m,26.5,,5200,19,32,21

I see what you now mean with the double commas, but I'm not sure how to handle that when it comes to importing the data.

s_lassen · Posted 10-13-2017 04:53 AM

@trash:

The "double commas" you can account for by using the dsd option on your INFILE statement.

The easy way to account for lower/upper cases is to use the $UPCASE. informat.

That way, your program would be something like:

data Hospital;
  infile 'D:\hospital.csv' delimiter= ',' dsd firstobs=2;
  length hosp gender $1;
  informat hosp gender $upcase.;
  input hosp $ gender $ haemo pcv wbc lympho neutro lead;
run;

How do I correct for missing values?