Arrays in SAS: missing cases

Ashwini_uci · Posted 03-13-2012 11:23 AM

Hi,

In my dataset I have 15 variables, PR1 through PR15. They contain data about medical conditions in the form of codes. I want to create a new variable, PCI, coded as 1 or 0. If any of the variables PR1-PR15 contains any of these codes ('0066', '3604', '3606', '3607'), then PCI should be coded as 1, otherwise 0.

I wrote the following DATA step with an array, which is working fine.

data library.nismicathcabg4;
    set library.nismicathcabg4;
    array dxtrialpci{15} pr1-pr15;
    pci=0;
    /* flag the record if any of the 15 codes is a PCI code */
    do i=1 to 15;
        if dxtrialpci{i} in ('0066', '3604', '3606', '3607') then pci=1;
    end;
run;

But now I need to add a statement that takes care of missing cases. I want PCI to be coded as missing ( . ) if all 15 variables PR1-PR15 have no data listed, that is, if they are all blank.

I am not sure what statement to add to this step, or where, to handle the missing cases.

Any idea how to do this?

Any help will be appreciated.

Thanks

Ashwini

Haikuo · Posted 03-13-2012 11:31 AM

Try this:

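data library.nismicathcabg4;
    set library.nismicathcabg4;
    array dxtrialpci{15} pr1-pr15;
    /* all 15 codes blank: leave PCI missing */
    if cmiss(of pr:)=15 then call missing(pci);
    else do;
        do i=1 to 15;
            if dxtrialpci{i} in ('0066', '3604', '3606', '3607') then do;
                pci=1;
                return; /* done with this record; it is still written because the step has no explicit OUTPUT */
            end;
            else pci=0;
        end;
    end; /* closes the ELSE DO block */
run;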

Regards,

Haikuo

Ashwini_uci · Posted 03-13-2012 11:46 AM

Hi Haikuo,

This works, thanks a lot. But I have 2 questions:

1. What does cmiss in the 4th line mean, and what is "pr"? Could you explain the meaning of this statement in the program?

2. The data I am working on is huge, with over 80000 cases, so I cannot manually check whether the numbers I got after running the above program are right. What would be an easy way to verify the numbers, at least the numbers for missing data? For my sanity, I just want to verify that I am getting the right numbers after running the above program.

Thanks,

Ashwini

Doc_Duke · Posted 03-13-2012 11:55 AM

On 1), CMISS is a function and you can look up how it is used. "pr" is not correct; it is "pr:", as Haikuo showed. That is shorthand for all of the variables whose names start with the letters "pr". (That is also in the documentation, but a lot harder to find.)

On 2), you could do a couple of things.

a) You could select a random sample for manual checking.

b) You could run some frequency tables with the original variables and the derived ones. I find that the /LIST MISSING options make them a quick read.

Doc Muhlbaier

Duke
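A rough sketch of both checks, for anyone following along. The dataset names work.pci_check and work.pci_sample, the sample size, and the seed are made up for illustration, and PROC SURVEYSELECT assumes SAS/STAT is licensed. The idea is to count the blank PR variables on each record and tabulate that count against PCI, so that a missing PCI should appear only where the count is 15.

```sas
/* Sketch only: work.pci_check, work.pci_sample, sampsize and seed are assumptions. */

/* Count how many of PR1-PR15 are blank on each record. */
data work.pci_check;
    set library.nismicathcabg4;
    n_missing = cmiss(of pr1-pr15);
run;

/* One row per combination of missing count and PCI.
   LIST gives a compact listing; MISSING shows a missing PCI as its own level. */
proc freq data=work.pci_check;
    tables n_missing*pci / list missing;
run;

/* Simple random sample of 50 records for checking by eye. */
proc surveyselect data=work.pci_check out=work.pci_sample
                  method=srs sampsize=50 seed=12345;
run;

proc print data=work.pci_sample;
    var pr1-pr15 n_missing pci;
run;
```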

Ashwini_uci · Posted 03-13-2012 12:04 PM

Thanks Doc.

But in my data there are other variables that start with "pr" too, such as procedure1, procedure2, procedure3, and procedure class 1, procedure class 2, and so on. Are these also included in the above program?

Haikuo · Posted 03-13-2012 12:06 PM

In that case, use cmiss(of pr1-pr15).

Haikuo
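To see how the two variable lists differ, here is a toy sketch. The dataset work.demo, the variable procedure1, and the three-variable layout are invented for illustration; they are not taken from the real data.

```sas
/* Toy data: work.demo and procedure1 are hypothetical names. */
data work.demo;
    infile datalines dsd truncover;
    input pr1 :$4. pr2 :$4. pr3 :$4. procedure1 :$10.;
    datalines;
0066,3604,,CATH
,,,CABG
,,,
;
run;

data work.demo_counts;
    set work.demo;
    miss_prefix = cmiss(of pr:);      /* pr1, pr2, pr3 and also procedure1 */
    miss_range  = cmiss(of pr1-pr3);  /* only pr1, pr2, pr3 */
run;

proc print data=work.demo_counts;
run;
```

On this toy data the third row is blank everywhere, so miss_prefix is 4 while miss_range is 3. A test such as cmiss(of pr:)=3 would flag the second row, where only the PR codes are blank, but not the third. The numbered range avoids that ambiguity.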

Haikuo · Posted 03-13-2012 12:03 PM

Hi Ashwini,

1. CMISS() returns the count of missing values among its arguments. 'pr:' refers to every variable whose name starts with 'pr'. In your case, 'pr:' is the short way to address your variables from pr1 to pr15. So if cmiss() returns 15, it means that all 15 of your variables from pr1 to pr15 are missing.

2. You can check it by limiting the input table to a manageable number of records, such as 100:

data library.nismicathcabg4;
    set library.nismicathcabg4 (obs=100);
.....

3. Additional point: when you are doing a test run, I suggest that you assign different names to the input and output tables, so it is easier to track down the problem if something goes wrong.

my 2 cents,

Haikuo

Ashwini_uci · Posted 03-13-2012 12:14 PM

Thanks again, Haikuo. But after changing if cmiss(of pr:)=15 to if cmiss(of pr1-pr15)=15, the output is not showing any missing cases anymore. Earlier it showed 2910 cases with missing data, but now there are none. I am sure there have got to be some cases with missing data, so what is going wrong?

Haikuo · Posted 03-13-2012 12:21 PM

Well, all I can guess is that you don't have a record in which all of pr1-pr15 are missing at the same time. The reason you had hits before is probably that 'pr:' included more variables, and there happen to be cases in which exactly 15 of them are missing.

The straightforward way to figure it out is to use the previous version to single out some observations, then check them:

if cmiss(of pr:)=15 then output;

Haikuo
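A minimal sketch of that check, assuming a scratch dataset named work.flagged (a made-up name) so the original table is left untouched:

```sas
/* Sketch only: work.flagged is a hypothetical scratch dataset. */
data work.flagged;
    set library.nismicathcabg4;
    /* keep only the records the earlier version treated as missing */
    if cmiss(of pr:) = 15 then output;
run;

/* Look at the first 20 flagged records and see which PR-prefixed variables are blank. */
proc print data=work.flagged (obs=20);
    var pr:;
run;
```

If PR1 is populated on these records, the 15 blanks are coming from a mix of PR variables and the other variables whose names start with "pr".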

Ashwini_uci · Posted 03-13-2012 12:58 PM

Yes, you are right. I just checked using some if-then statements, and there is not a single case with missing data for PR1.

I would like to know whether you think the following program is right for the same purpose of handling the missing data. However, it shows me one missing case.

option missing=' ';

data library.nismicathcabg4;
    set library.nismicathcabg4;
    array dxtrialpci{15} pr1-pr15;
    pcitrial=0;
    do i=1 to 15;
        if dxtrialpci{i} in ('0066', '3604', '3606', '3607') then pcitrial=1;
    end;
    if n(pr1, pr2, pr3, pr4, pr5, pr6, pr7, pr8, pr9, pr10, pr11, pr12, pr13, pr14, pr15)=0 then pcitrial=. ;
run;

Thanks,

Ashwini

Haikuo · Posted 03-13-2012 01:07 PM

It will work; however, the efficiency is not optimized.

1. You could write if n(of pr1-pr15)=0 then ...; this does not affect run-time efficiency, though.

2. You evaluate every record twice by using two parallel 'if' statements instead of 'if-else'.

3. In your array do-loop, you should use the 'leave' statement when you get a hit, so you don't have to run through the rest of the loop unnecessarily.

Kindly Regards,

Haikuo

Ashwini_uci · Posted 03-13-2012 01:31 PM

I had created some arrays before which no longer seem to be useful. Should I just replace those earlier arrays with the one above (suggested by you), but keep the old array and variable names?

Would that cause any conflict or error, or is it fine to do that?

I just do not want to create any more new variables for the original variables.

Haikuo · Posted 03-13-2012 01:38 PM

To be safe, you would have to evaluate them on a case-by-case basis.

But what I have suggested does not modify your array setup; it is just a change in flow to help efficiency.

Haikuo

Edit: 80k records does not seem like a lot to me (if the table is not too wide), so I doubt you will notice much of a difference anyway.

Ashwini_uci · Posted 03-13-2012 01:56 PM

Sorry for the confusion here.

Actually, the earlier arrays, which did not account for cases with missing data, have already been used, and the variables have already been created in the datasets.

Now I need to use the correct arrays that account for missing cases, and I am thinking of using the one you posted in your first response to this post. That is why I am thinking of not creating an altogether new array with new names, but just reusing the names of the arrays and variables that were already created by the earlier flawed arrays.

But then I was doubtful whether that would cause any conflict in the program and give me wrong output.

Ashwini_uci · Posted 03-13-2012 01:49 PM

Where exactly does the 'leave' statement go?
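One possible placement, sketched under the assumption that the flag is still called PCI and that the result is written to a separate table. The name work.pci_leave is made up, following the earlier advice to keep input and output names different during testing. LEAVE goes inside the THEN DO block, right after the flag is set, so the loop stops at the first matching code.

```sas
/* Sketch only: work.pci_leave is a hypothetical output name. */
data work.pci_leave;
    set library.nismicathcabg4;
    array dxtrialpci{15} pr1-pr15;

    if cmiss(of pr1-pr15) = 15 then pci = .;   /* every code blank: PCI missing */
    else do;
        pci = 0;                               /* default when no code matches */
        do i = 1 to 15;
            if dxtrialpci{i} in ('0066', '3604', '3606', '3607') then do;
                pci = 1;
                leave;                         /* exit the DO loop at the first hit */
            end;
        end;
    end;

    drop i;                                    /* loop counter is not needed in the output */
run;
```

Unlike RETURN, LEAVE only exits the loop, so any statements placed after the loop still run for that record.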