BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
novinosrin
Tourmaline | Level 20

Ah ok, I was thinking it's only CODE that did it. Well, this has made us learn not to edit the sql, but do it in an exclusive proc sort. Of course i understand and acknowledge, another pass or step is not optimal from a performance point of view, but on the other hand helps us debug easily. For example, we can always go back to the sql and identify the cause but an edited sql may make it more tedious to debug

 

PS

 

We shall get to high performance computing when we deal with atleast 100 million++ records. Until then small few extra steps can only help

Cruise
Ammonite | Level 13

@novinosrin

Good morning Novinosrin, 

 

I just ran two proc sqls on the HAVE1 (N=90,000) attached to this post. WGT_DX_FULL_SAS created, with 4326 rows and P.WGT_DX_PARTIAL_SAS created, with 4205 rows while HAVE1 data has 3,977 distinct IDs(patients). Yet, we intended to produce one row per ID(patient). I'm trying to understand why each proc sql output different number of rows and both being greater then actual number of distinct ID in the data. 

Please let me know if you find out why. 

Thanks.

Cruise
Ammonite | Level 13

@novinosrin

I keep investigating and found triplet output by ID in WGT_DX_DULL. Please don't get overwhelmed by my multiple posts here. I'm just posting as I go. 

 

triplets.png

Cruise
Ammonite | Level 13

I added 'distinct' every where I can 🙂 with no change in the final output. 

 


proc sql;
create table WGT_DX_FULL as 
select distinct *,min(WGT_DX) as WGT_DX_FULL
from
(select DISTINCT *, min(DISTINCT(DOD_DIFF)) as min_DOD_DIFF
from HAVE1
group by id 
having DOD_DIFF=min_DOD_DIFF)
group by id, min_DOD_DIFF
having min(DISTINCT(wgt_dx))=wgt_dx;
quit;
novinosrin
Tourmaline | Level 20

Just made it to my lab. Train delaysSmiley Frustrated

 

Let me test thoroughly

novinosrin
Tourmaline | Level 20

Here is a test

 

Problem stmt: There are duplicate ID's in the results

 

Check 1:

Find how many distinct id'sexist and of which how many duplicates are produced in the results

 

proc sql;
create table dup_check as
select id, count(id) as count
from WGT_DX_FULL
group by id
having count>1;
quit;

proc print noobs;run;

Quite a lot of duplicates happens to be in results

ID count
43 2
52 2
59 2
67 2
75 2
106 2
116 2
126 2
128 2
131 2
153 2
161 2

and others

 

Now, lets take ID=43


proc sql;
create table test as
select *,min(dod_diff)as min_DOD_DIFF
from have1(where=(id=43))
group by id
having dod_diff=min_DOD_DIFF
order by WGT_DX;
quit;
proc print noobs;run;

 

SER_DATE_DX DOD_DIFF CODE ID WGT_DX min_DOD_DIFF
07OCT2014 0 153 43 1 0
07OCT2014 0 153 43 1 0
07OCT2014 0 154 43 1 0
07OCT2014 0 154 43 1 0
07OCT2014 0 V76 43 4 0
07OCT2014 0 V76 43 4 0
07OCT2014 0 V76 43 4 0
07OCT2014 0 V76 43 4 0

 

In the above you would notice there are more than once distinct code, for the lowest WGT_DX , therefore even if you choose lowest WGT_DX and distinct, the code 153 and 154 both satisfy and hence you see duplicates.

 

Now it is up to you to take a call on that

Cruise
Ammonite | Level 13

Yes, I noticed that as well. CODE 153 and 154 take identical score 1, for example. However, that is fine and we're telling SAS to pick the distinct of WGT_DX regardless of CODEs, right? How to tell SAS pick the dictinct values for WGT_DX even if CODES took the same values?

 

Some codes take identical scores because they offer same level of evidence on the diagnosis even they're different in values

 

IF CODE IN ('153') THEN WGT_DX=1; ELSE
IF CODE IN ('154') THEN WGT_DX=1; ELSE
IF CODE IN ('159') THEN WGT_DX=2; ELSE
IF CODE IN ('197') THEN WGT_DX=3; ELSE
IF CODE IN ('199') THEN WGT_DX=3; ELSE
IF CODE IN ('209') THEN WGT_DX=5; ELSE
IF CODE IN ('211') THEN WGT_DX=6; ELSE
IF CODE IN ('230') THEN WGT_DX=2; ELSE
IF CODE IN ('235') THEN WGT_DX=3; ELSE
IF CODE IN ('239') THEN WGT_DX=3; ELSE
IF CODE IN ('V10') THEN WGT_DX=4; ELSE
IF CODE IN ('V76') THEN WGT_DX=4;

novinosrin
Tourmaline | Level 20

The distinct applies to the entire record I am afraid , so "we're telling SAS no matter CODEs " will not serve here coz SAS will not know which one to pick 153 or 154. 

You could do a proc sort nodupkey but it's unethical for me to recommend such blatant ideas without knowing business requirement. Hence I dropped the ball back in your court as 153 could mean something imp and vice versa for 154

Cruise
Ammonite | Level 13

Thank you very much for explaining. 

 

The health condition I'm interested here is the colorectal cancer. It can be either colon or rectum or both colon and rectum cancer. 

If patient takes 153 then colorectal 

if patients takes 154 then colorectal 

if patient takes 153 and 154 then colorectal too. Because colon and rectum are anatomically connected. Similar logic applies to the CODEs that take same values. I'm not able to score 153 and 154 in hierarchical order because they're are both valid evidence for colorectal cancer with the same weight of evidence. proc nodupkey should work for me. However, I'm not sure about the triplet outputs for the WGT_DX_FULL. Image attached below. I think this also because of the same problem. 

 

triplets.png

novinosrin
Tourmaline | Level 20

What i am by duplicates is not just count of 2, merely count greater than 1.

 

So you may any number of duplicates depending upon varying code. Since it appears that code are the same

 

I think you may apply

 

proc sort data=results out=new_results nodupkey;

by id;

run;

 

This would pick the first where you written select

Cruise
Ammonite | Level 13
I think, i can go around this problem if i specify the variables I want instead select *
novinosrin
Tourmaline | Level 20

Yes, if you choose to specify variables and exclude code in select list, you will get distinct records. The problem is only when you want to keep code column

 

Bear in mind, you would have to do it everywhere. Rather, I would just go for a proc sort dupkey.

Cruise
Ammonite | Level 13
Thank you so much for reassuring on this. I don't need codes at this level. But i will need CODEs at the next round when I want to estimate the error for each CODEs.
novinosrin
Tourmaline | Level 20

I find editing select clauses all over is boring than to safely do proc sort

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 43 replies
  • 1873 views
  • 18 likes
  • 3 in conversation