Ah ok, I was thinking it's only CODE that did it. Well, this has made us learn not to edit the sql, but do it in an exclusive proc sort. Of course i understand and acknowledge, another pass or step is not optimal from a performance point of view, but on the other hand helps us debug easily. For example, we can always go back to the sql and identify the cause but an edited sql may make it more tedious to debug
PS
We shall get to high performance computing when we deal with atleast 100 million++ records. Until then small few extra steps can only help
Good morning Novinosrin,
I just ran two proc sqls on the HAVE1 (N=90,000) attached to this post. WGT_DX_FULL_SAS created, with 4326 rows and P.WGT_DX_PARTIAL_SAS created, with 4205 rows while HAVE1 data has 3,977 distinct IDs(patients). Yet, we intended to produce one row per ID(patient). I'm trying to understand why each proc sql output different number of rows and both being greater then actual number of distinct ID in the data.
Please let me know if you find out why.
Thanks.
I keep investigating and found triplet output by ID in WGT_DX_DULL. Please don't get overwhelmed by my multiple posts here. I'm just posting as I go.
I added 'distinct' every where I can 🙂 with no change in the final output.
proc sql;
create table WGT_DX_FULL as
select distinct *,min(WGT_DX) as WGT_DX_FULL
from
(select DISTINCT *, min(DISTINCT(DOD_DIFF)) as min_DOD_DIFF
from HAVE1
group by id
having DOD_DIFF=min_DOD_DIFF)
group by id, min_DOD_DIFF
having min(DISTINCT(wgt_dx))=wgt_dx;
quit;
Just made it to my lab. Train delays
Let me test thoroughly
Here is a test
Problem stmt: There are duplicate ID's in the results
Check 1:
Find how many distinct id'sexist and of which how many duplicates are produced in the results
proc sql;
create table dup_check as
select id, count(id) as count
from WGT_DX_FULL
group by id
having count>1;
quit;
proc print noobs;run;
Quite a lot of duplicates happens to be in results
ID count
43 2
52 2
59 2
67 2
75 2
106 2
116 2
126 2
128 2
131 2
153 2
161 2
and others
Now, lets take ID=43
proc sql;
create table test as
select *,min(dod_diff)as min_DOD_DIFF
from have1(where=(id=43))
group by id
having dod_diff=min_DOD_DIFF
order by WGT_DX;
quit;
proc print noobs;run;
SER_DATE_DX DOD_DIFF CODE ID WGT_DX min_DOD_DIFF
07OCT2014 0 153 43 1 0
07OCT2014 0 153 43 1 0
07OCT2014 0 154 43 1 0
07OCT2014 0 154 43 1 0
07OCT2014 0 V76 43 4 0
07OCT2014 0 V76 43 4 0
07OCT2014 0 V76 43 4 0
07OCT2014 0 V76 43 4 0
In the above you would notice there are more than once distinct code, for the lowest WGT_DX , therefore even if you choose lowest WGT_DX and distinct, the code 153 and 154 both satisfy and hence you see duplicates.
Now it is up to you to take a call on that
Yes, I noticed that as well. CODE 153 and 154 take identical score 1, for example. However, that is fine and we're telling SAS to pick the distinct of WGT_DX regardless of CODEs, right? How to tell SAS pick the dictinct values for WGT_DX even if CODES took the same values?
Some codes take identical scores because they offer same level of evidence on the diagnosis even they're different in values
IF CODE IN ('153') THEN WGT_DX=1; ELSE
IF CODE IN ('154') THEN WGT_DX=1; ELSE
IF CODE IN ('159') THEN WGT_DX=2; ELSE
IF CODE IN ('197') THEN WGT_DX=3; ELSE
IF CODE IN ('199') THEN WGT_DX=3; ELSE
IF CODE IN ('209') THEN WGT_DX=5; ELSE
IF CODE IN ('211') THEN WGT_DX=6; ELSE
IF CODE IN ('230') THEN WGT_DX=2; ELSE
IF CODE IN ('235') THEN WGT_DX=3; ELSE
IF CODE IN ('239') THEN WGT_DX=3; ELSE
IF CODE IN ('V10') THEN WGT_DX=4; ELSE
IF CODE IN ('V76') THEN WGT_DX=4;
The distinct applies to the entire record I am afraid , so "we're telling SAS no matter CODEs " will not serve here coz SAS will not know which one to pick 153 or 154.
You could do a proc sort nodupkey but it's unethical for me to recommend such blatant ideas without knowing business requirement. Hence I dropped the ball back in your court as 153 could mean something imp and vice versa for 154
Thank you very much for explaining.
The health condition I'm interested here is the colorectal cancer. It can be either colon or rectum or both colon and rectum cancer.
If patient takes 153 then colorectal
if patients takes 154 then colorectal
if patient takes 153 and 154 then colorectal too. Because colon and rectum are anatomically connected. Similar logic applies to the CODEs that take same values. I'm not able to score 153 and 154 in hierarchical order because they're are both valid evidence for colorectal cancer with the same weight of evidence. proc nodupkey should work for me. However, I'm not sure about the triplet outputs for the WGT_DX_FULL. Image attached below. I think this also because of the same problem.
What i am by duplicates is not just count of 2, merely count greater than 1.
So you may any number of duplicates depending upon varying code. Since it appears that code are the same
I think you may apply
proc sort data=results out=new_results nodupkey;
by id;
run;
This would pick the first where you written select
Yes, if you choose to specify variables and exclude code in select list, you will get distinct records. The problem is only when you want to keep code column
Bear in mind, you would have to do it everywhere. Rather, I would just go for a proc sort dupkey.
I find editing select clauses all over is boring than to safely do proc sort
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.