Solved: Re: Calculate error with variable known and unknown scenarios using re... - Page 3

novinosrin · Posted 11-16-2018 11:59 AM

Ah ok, I was thinking it's only CODE that did it. Well, this has made us learn not to edit the sql, but do it in an exclusive proc sort. Of course i understand and acknowledge, another pass or step is not optimal from a performance point of view, but on the other hand helps us debug easily. For example, we can always go back to the sql and identify the cause but an edited sql may make it more tedious to debug

PS

We shall get to high performance computing when we deal with atleast 100 million++ records. Until then small few extra steps can only help

Cruise · Posted 11-16-2018 08:36 AM

@novinosrin

Good morning Novinosrin,

I just ran two proc sqls on the HAVE1 (N=90,000) attached to this post. WGT_DX_FULL_SAS created, with 4326 rows and P.WGT_DX_PARTIAL_SAS created, with 4205 rows while HAVE1 data has 3,977 distinct IDs(patients). Yet, we intended to produce one row per ID(patient). I'm trying to understand why each proc sql output different number of rows and both being greater then actual number of distinct ID in the data.

Please let me know if you find out why.

Thanks.

Cruise · Posted 11-16-2018 09:37 AM

@novinosrin

I keep investigating and found triplet output by ID in WGT_DX_DULL. Please don't get overwhelmed by my multiple posts here. I'm just posting as I go.

Cruise · Posted 11-16-2018 09:45 AM

I added 'distinct' every where I can 🙂 with no change in the final output.


proc sql;
create table WGT_DX_FULL as 
select distinct *,min(WGT_DX) as WGT_DX_FULL
from
(select DISTINCT *, min(DISTINCT(DOD_DIFF)) as min_DOD_DIFF
from HAVE1
group by id 
having DOD_DIFF=min_DOD_DIFF)
group by id, min_DOD_DIFF
having min(DISTINCT(wgt_dx))=wgt_dx;
quit;

novinosrin · Posted 11-16-2018 09:57 AM

Just made it to my lab. Train delays

Let me test thoroughly

novinosrin · Posted 11-16-2018 10:42 AM

Here is a test

Problem stmt: There are duplicate ID's in the results

Check 1:

Find how many distinct id'sexist and of which how many duplicates are produced in the results

proc sql;
create table dup_check as
select id, count(id) as count
from WGT_DX_FULL
group by id
having count>1;
quit;

proc print noobs;run;

Quite a lot of duplicates happens to be in results

ID count
43 2
52 2
59 2
67 2
75 2
106 2
116 2
126 2
128 2
131 2
153 2
161 2

and others

Now, lets take ID=43


proc sql;
create table test as
select *,min(dod_diff)as min_DOD_DIFF
from have1(where=(id=43))
group by id
having dod_diff=min_DOD_DIFF
order by WGT_DX;
quit;
proc print noobs;run;

SER_DATE_DX DOD_DIFF CODE ID WGT_DX min_DOD_DIFF
07OCT2014 0 153 43 1 0
07OCT2014 0 153 43 1 0
07OCT2014 0 154 43 1 0
07OCT2014 0 154 43 1 0
07OCT2014 0 V76 43 4 0
07OCT2014 0 V76 43 4 0
07OCT2014 0 V76 43 4 0
07OCT2014 0 V76 43 4 0

In the above you would notice there are more than once distinct code, for the lowest WGT_DX , therefore even if you choose lowest WGT_DX and distinct, the code 153 and 154 both satisfy and hence you see duplicates.

Now it is up to you to take a call on that

Cruise · Posted 11-16-2018 10:46 AM

Yes, I noticed that as well. CODE 153 and 154 take identical score 1, for example. However, that is fine and we're telling SAS to pick the distinct of WGT_DX regardless of CODEs, right? How to tell SAS pick the dictinct values for WGT_DX even if CODES took the same values?

Some codes take identical scores because they offer same level of evidence on the diagnosis even they're different in values

IF CODE IN ('153') THEN WGT_DX=1; ELSE
IF CODE IN ('154') THEN WGT_DX=1; ELSE
IF CODE IN ('159') THEN WGT_DX=2; ELSE
IF CODE IN ('197') THEN WGT_DX=3; ELSE
IF CODE IN ('199') THEN WGT_DX=3; ELSE
IF CODE IN ('209') THEN WGT_DX=5; ELSE
IF CODE IN ('211') THEN WGT_DX=6; ELSE
IF CODE IN ('230') THEN WGT_DX=2; ELSE
IF CODE IN ('235') THEN WGT_DX=3; ELSE
IF CODE IN ('239') THEN WGT_DX=3; ELSE
IF CODE IN ('V10') THEN WGT_DX=4; ELSE
IF CODE IN ('V76') THEN WGT_DX=4;

novinosrin · Posted 11-16-2018 10:51 AM

The distinct applies to the entire record I am afraid , so "we're telling SAS no matter CODEs " will not serve here coz SAS will not know which one to pick 153 or 154.

You could do a proc sort nodupkey but it's unethical for me to recommend such blatant ideas without knowing business requirement. Hence I dropped the ball back in your court as 153 could mean something imp and vice versa for 154

Cruise · Posted 11-16-2018 10:58 AM

Thank you very much for explaining.

The health condition I'm interested here is the colorectal cancer. It can be either colon or rectum or both colon and rectum cancer.

If patient takes 153 then colorectal

if patients takes 154 then colorectal

if patient takes 153 and 154 then colorectal too. Because colon and rectum are anatomically connected. Similar logic applies to the CODEs that take same values. I'm not able to score 153 and 154 in hierarchical order because they're are both valid evidence for colorectal cancer with the same weight of evidence. proc nodupkey should work for me. However, I'm not sure about the triplet outputs for the WGT_DX_FULL. Image attached below. I think this also because of the same problem.

novinosrin · Posted 11-16-2018 11:02 AM

What i am by duplicates is not just count of 2, merely count greater than 1.

So you may any number of duplicates depending upon varying code. Since it appears that code are the same

I think you may apply

proc sort data=results out=new_results nodupkey;

by id;

run;

This would pick the first where you written select

Cruise · Posted 11-16-2018 11:02 AM

I think, i can go around this problem if i specify the variables I want instead select *

novinosrin · Posted 11-16-2018 11:04 AM

Yes, if you choose to specify variables and exclude code in select list, you will get distinct records. The problem is only when you want to keep code column

Bear in mind, you would have to do it everywhere. Rather, I would just go for a proc sort dupkey.

Cruise · Posted 11-16-2018 11:08 AM

Thank you so much for reassuring on this. I don't need codes at this level. But i will need CODEs at the next round when I want to estimate the error for each CODEs.

novinosrin · Posted 11-16-2018 11:10 AM

I find editing select clauses all over is boring than to safely do proc sort

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

SAS Innovate 2025: Call for Content

Classroom Training Available!