Hi,
In my dataset I have 15 variables, PR1 through PR15. They contain data about medical conditions in the form of codes. I want to create a new variable, "PCI", coded as 1 or 0. If any of the variables PR1 through PR15 contains any of these codes ('0066', '3604', '3606', '3607'), then PCI should be coded as 1, otherwise 0.
I wrote the following array code, which works fine:
data library.nismicathcabg4;
  set library.nismicathcabg4;
  array dxtrialpci{15} pr1-pr15;
  pci=0;
  do i=1 to 15;
    if dxtrialpci{i} in ('0066', '3604', '3606', '3607') then pci=1;
  end;
run;
But NOW I need to include a statement that takes care of missing cases. I want PCI to be coded as " . " if all 15 variables PR1-PR15 have no data listed or are blank.
I am not sure which statement to include in this array code, or where to put it, to handle the missing cases.
Any idea how to do this?
Any help will be appreciated.
Thanks
Ashwini
Try this:
data library.nismicathcabg4;
  set library.nismicathcabg4;
  array dxtrialpci{15} pr1-pr15;
  if cmiss(of pr:)=15 then call missing(pci);
  else do;
    do i=1 to 15;
      if dxtrialpci{i} in ('0066', '3604', '3606', '3607')
      then do;
        pci=1;
        return;
      end;
      else pci=0;
    end;
  end;
run;
Regards,
Haikuo
Hi Haikuo,
This works, thanks a lot. But I have 2 questions:
1. What does cmiss in the 4th line mean, and what is "pr"? Could you explain the meaning of this statement in the program?
2. The data I am working on is huge, with over 80000 cases, so it is not possible for me to check manually whether the numbers I got after running the above program are right. What would be an easy way to verify the numbers, at least the numbers for missing data? For my sanity, I just want to verify that I am getting the right numbers after running the above program.
Thanks,
Ashwini
On 1), CMISS is a function, and you can look up how it is used. "pr" is not correct; it is "pr:", as Haikuo showed. That is shorthand for all of the variables whose names start with the letters "pr". (That is also in the documentation, but a lot harder to find.)
On 2), you could do a couple of things.
a) You could select a random sample for manual checking.
b) You could run some frequency tables with the original variables and the derived ones. I find that the /LIST MISSING options make them a quick read.
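Both suggestions could be sketched roughly as follows. This is only an illustration: the output table name work.pci_sample, the sample size of 50, and the seed are made-up values, and PROC SURVEYSELECT assumes SAS/STAT is available at your site.

```sas
/* a) draw a simple random sample of records for manual review */
proc surveyselect data=library.nismicathcabg4
                  out=work.pci_sample      /* hypothetical output name */
                  method=srs               /* simple random sampling   */
                  sampsize=50              /* arbitrary sample size    */
                  seed=12345;              /* arbitrary seed           */
run;

/* b) frequency tables of the derived flag, with missing values shown */
proc freq data=library.nismicathcabg4;
  tables pci / missing;              /* counts of 1, 0 and missing      */
  tables pr1*pci / list missing;     /* one source variable vs the flag */
run;
```

The LIST option prints the cross-tabulation as one row per combination, which is usually easier to scan than the default grid when a variable has many codes.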
Doc Muhlbaier
Duke
Thanks Doc.
But in my data there are other variables that start with "pr" too, such as procedure1, procedure2, procedure3, and procedure class 1, procedure class 2, etc. Are these also included in the above program?
In that case, use cmiss(of pr1-pr15).
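To see why the two forms can give different counts, here is a small made-up example. The table name work.demo, the three-variable range pr1-pr3, and the variable procedure1 are only for illustration.

```sas
data work.demo;                        /* hypothetical demo table */
  length pr1-pr3 procedure1 $4;
  pr1 = '0066';                        /* one code present         */
  pr2 = ' ';
  pr3 = ' ';
  procedure1 = ' ';                    /* also starts with "pr"    */
  n_prefix = cmiss(of pr:);            /* counts pr2, pr3 and procedure1 */
  n_range  = cmiss(of pr1-pr3);        /* counts pr2 and pr3 only        */
  put n_prefix= n_range=;              /* log shows n_prefix=3 n_range=2 */
run;
```

Here a test such as cmiss(of pr:)=3 would flag the record as "all missing" even though pr1 holds a code, while cmiss(of pr1-pr3)=3 would not.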
Haikuo
Hi Ashwini,
1. CMISS() returns the count of missing values among its arguments. 'pr:' refers to every variable whose name starts with 'pr'; in your case it is a short way to address your variables pr1 to pr15. So if cmiss() returns 15, all 15 of your variables from pr1 to pr15 are missing.
2. You can check it by limiting the input table to a manageable number of records, such as 100:
data library.nismicathcabg4;
set library.nismicathcabg4 (obs=100);
.....
3. Additional point: when you are doing a test run, I suggest that you assign different names to the input and output tables, so it is easier to track down the problem if something goes wrong.
my 2 cents,
Haikuo
Thanks again, Haikuo. But after changing if cmiss(of pr:)=15 to if cmiss(of pr1-pr15)=15, the output is not showing any missing cases anymore. Earlier it showed 2910 cases with missing data, but now there are none. And I am sure there have got to be some cases with missing data, so what is going wrong?
Well, all I can guess is that you don't have any record in which all of pr1-pr15 are missing at the same time. The reason you had hits before is probably that pr: included more variables, and there happened to be records in which exactly 15 of them were missing.
The straightforward way to figure it out is to use the previous version to single out some observations, then check them:
if cmiss(of pr:)=15 then output;
Haikuo
Yes, you are right. I just checked using some if-then statements, and there is not a single case with missing data for PR1.
I would like to know whether you think the following program is right for the same purpose of handling the missing data:
option missing= ' ';
However, this shows me one missing case.
data library.nismicathcabg4;
  set library.nismicathcabg4;
  array dxtrialpci{15} pr1-pr15;
  pcitrial=0;
  do i=1 to 15;
    if dxtrialpci{i} in ('0066', '3604', '3606', '3607') then pcitrial=1;
  end;
  if n(pr1, pr2, pr3, pr4, pr5, pr6, pr7, pr8, pr9, pr10, pr11, pr12, pr13, pr14, pr15)=0 then pcitrial=.;
run;
Thanks,
Ashwini
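It will work; however, it is not optimized for efficiency.
1. You could write n(of pr1-pr15)=0 instead of listing every variable; this does not affect running efficiency, though. Note that N() counts non-missing numeric values, so if pr1-pr15 are character, SAS has to convert them to numeric first; cmiss(of pr1-pr15)=15 avoids that conversion.
2. You evaluate every record twice by using two parallel 'if' statements instead of 'if-else'.
3. In your array do-loop, you should use the 'leave' statement when you get a hit, so you don't have to run through every iteration unnecessarily.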
Kindly Regards,
Haikuo
I had created some arrays earlier that now seem to be no longer useful. Should I just replace those earlier arrays with the one you suggested above, while keeping the OLD array and variable names?
Would that cause any conflict or error, or is it fine to do that?
I just do not want to create any more new variables alongside the original ones.
To be safe, you would have to evaluate them on a case-by-case basis.
But what I have suggested does not modify your array setup; it is just a change in flow to help efficiency.
Haikuo
Edit: 80k records does not seem like a lot to me (if the table is not too wide), so I doubt you will notice much of a difference anyway.
Sorry for the confusion here.
Actually, the earlier arrays, which did not account for cases with missing data, have already been used, and the variables they create already exist in the datasets.
Now I need to use the correct arrays that account for missing cases, and I am thinking of using the one you posted in your first response to this post. That is why I am thinking of NOT creating an altogether new array with new names, but just reusing the names of the arrays and variables already created by the earlier flawed arrays.
But then I was doubtful whether that would cause any conflict with the program and give me wrong output.
Where exactly does the 'leave' statement go?
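One possible placement, as a sketch only: 'leave' goes inside the do-loop, right after the test that sets the flag, so the loop stops at the first matching code. The output table name work.pci_check is hypothetical (a separate name, in line with the earlier advice to keep test output apart from the input), and the range pr1-pr15 is used instead of pr: because of the other variables that start with "pr".

```sas
data work.pci_check;                      /* hypothetical output name */
  set library.nismicathcabg4;
  array dxtrialpci{15} pr1-pr15;

  if cmiss(of pr1-pr15)=15 then pci=.;    /* all 15 blank: flag is missing */
  else do;
    pci=0;                                /* default: no matching code     */
    do i=1 to 15;
      if dxtrialpci{i} in ('0066', '3604', '3606', '3607') then pci=1;
      if pci=1 then leave;                /* stop looping at the first hit */
    end;
  end;

  drop i;                                 /* keep the loop index out of the output */
run;
```

Reusing the existing variable name pci should simply overwrite the old values on every record, since each branch assigns it; comparing the old and new tables with a frequency table before replacing the original would be a cautious way to confirm that.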