Solved: How to select the distinct rows when variables are not one-to-one with...

Cruise · Posted 11-12-2018 11:51 AM

Hi All,

I'd like to:

- Select the row associated with the smallest diff_range

- Among the patients who took diagnosis codes 1 or 2 or 6

- Select the row associated with the distinct smallest diff_range

The want is below and for example why ID=1 was selected was that:

ID=1 took target diagnosis codes 1 and 2. But 1, 1, 2, row is selected because of minimum value for a diff_range.

My current codes produce more observations than I need because code and diff_range variables are not one-to-one with the ID and min_diff. But I really don't want min_diff repeated in the final want data and also not willing to deduplicate data using extra step afterwards.

Any solutions? please. I'll greatly appreciate your help.

data p.have;
input id code diff_range;
cards;
1 1 2
1 2 3
1 3 4
1 3 4
1 5 7
2 1 1
2 2 3
2 2 3
2 4 5
2 5 6
3 3 1
3 4 2
3 5 3
4 8 8
4 9 9
5 6 10
5 10 12
6 2 2
6 2 1
6 6 9
6 6 4
;

data want;
input id code diff_range;
cards;
1 1 2
2 1 1
5 6 10
6 2 1
;

proc sql;
create table p.want_min1 as
select distinct * , min(diff_range) as min_diff from p.have
where id in
(select id from p.have where code in (1,2,6))
group by id;
quit;

novinosrin · Posted 11-12-2018 05:23 PM

I think the my sql will do that, test and let us know. Also, if you want to make your life easy, of course sort and datastep is by far simplest.

The below makes sql my style with fancy boolean although not much of big deal from the previous


data have;
input id code diff_range weight_evidence;
cards;
1 1 2 10 
1 2 2 8 
1 3 4 0
1 3 4 0
1 5 7 0
2 1 1 10
2 2 3 8
2 2 3 8
2 4 5 0
2 5 6 0
3 3 1 0
3 4 2 0
3 5 3 0
4 8 8 0
4 9 9 0
5 6 10 6
5 1 10 10 
5 10 12 0
6 1 2 10 
6 2 2 8
6 1 9 10
6 6 9 6 
;
proc sql;
create table want(drop=_:) as
select *
from 
(select *, max(code in (1,2,6)) as _c,min(diff_range) as _min from have group by id)
group by id,_min
having _c and max(weight_evidence)=weight_evidence and _min=diff_range;
quit;

View solution in original post

r_behata · Posted 11-12-2018 12:04 PM

data have;
input id code diff_range;
cards;
1 1 2
1 2 3
1 3 4
1 3 4
1 5 7
2 1 1
2 2 3
2 2 3
2 4 5
2 5 6
3 3 1
3 4 2
3 5 3
4 8 8
4 9 9
5 6 10
5 10 12
6 2 2 
6 2 1
6 6 9
6 6 4
;
run;

proc sort data=have(where=(code in (1,2,6)));
	by id diff_range;
run;

data want;
	set have;
	by id diff_range;

	if first.id then output;
run;

novinosrin · Posted 11-12-2018 12:06 PM

That's happening because of remerging behaviour of proc sql

try this change highlighted

proc sql;
create table p.want_min1 as
select distinct * , min(diff_range) as min_diff from p.have
where id in
(select id from p.have where code in (1,2,6))
group by id

having diff_range=min_diff;
quit;

novinosrin · Posted 11-12-2018 12:08 PM

One question you would have to let know is what if there is a tie in min_diff?

And if you do not want min_diff in your output, the code can be as simple as

data have;
input id code diff_range;
cards;
1 1 2
1 2 3
1 3 4
1 3 4
1 5 7
2 1 1
2 2 3
2 2 3
2 4 5
2 5 6
3 3 1
3 4 2
3 5 3
4 8 8
4 9 9
5 6 10
5 10 12
6 2 2 
6 2 1
6 6 9
6 6 4
;

proc sql;
create table want_min2 as
select  *  from have
where id in 
(select id from have where code in (1,2,6))
group by id

having diff_range= min(diff_range);
quit;

proc print noobs;run;

Results:

SAS Output

The SAS System

id	code	diff_range
1	1	2
2	1	1
5	6	10
6	2	1

Cruise · Posted 11-12-2018 03:25 PM

@novinosrin

so true, it doesn't work when min_diff ties up, just like below. When min_diff are the same across the different diagnosis codes that single patient took, I have to assign the weight to each diagnosis codes in terms of the strength of the evidence to prove of the disease. And the row associated wit the higher score for diagnosis code would be selected among the ties by min_diff.

data p.have;
input id code diff_range;
cards;
1 1 2
1 2 3
1 3 4
1 3 4
1 5 7
2 1 1
2 2 3
2 2 3
2 4 5
2 5 6
3 3 1
3 4 2
3 5 3
4 8 8
4 9 9
5 6 10
5 10 12
6 1 2
6 2 2
6 1 9
6 6 9
;

proc sql;
create table p.want_min1 as
select distinct * , min(diff_range) as min_diff from p.have
where id in
(select id from p.have where code in (1,2,6))
group by id
having diff_range=min_diff;
quit;

novinosrin · Posted 11-12-2018 03:34 PM

Ok,so do you have weighting variable in place?

Cruise · Posted 11-12-2018 03:47 PM

@novinosrin

Yes, please see below:

data p.have;
input id code diff_range weight_evidence;
cards;
1 1 2 10
1 2 2 8
1 3 4 0
1 3 4 0
1 5 7 0
2 1 1 10
2 2 3 8
2 2 3 8
2 4 5 0
2 5 6 0
3 3 1 0
3 4 2 0
3 5 3 0
4 8 8 0
4 9 9 0
5 6 10 6
5 1 10 10
5 10 12 0
6 1 2 10
6 2 2 8
6 1 9 10
6 6 9 6
;

novinosrin · Posted 11-12-2018 04:02 PM


data have;
input id code diff_range weight_evidence;
cards;
1 1 2 10 
1 2 2 8 
1 3 4 0
1 3 4 0
1 5 7 0
2 1 1 10
2 2 3 8
2 2 3 8
2 4 5 0
2 5 6 0
3 3 1 0
3 4 2 0
3 5 3 0
4 8 8 0
4 9 9 0
5 6 10 6
5 1 10 10 
5 10 12 0
6 1 2 10 
6 2 2 8
6 1 9 10
6 6 9 6 
;
proc sql;
create table want_min1 as
select *
from 
(select distinct * , min(diff_range)as min_diff from have
where id in 
(select id from have where code in (1,2,6))
group by id)
group by id,min_diff
having weight_evidence=max(weight_evidence);
quit;

PS I will simplify the above if i can

Reeza · Posted 11-12-2018 04:11 PM

Cruise, can you post the expected output as well, how the weights need to be handled is unclear. Is it just part of the selection or are you reweighting?

Reeza · Posted 11-12-2018 04:09 PM

Does the minimum code have to be with diagnosis 1, 2, or 6? Or just a patient with one of those diagnosis? That will help novinosrin simplify his code.

mkeintz · Posted 11-12-2018 03:43 PM

Then make a minor modification to the proc sort solution. Assuming your weight variable is WGT:

proc sort data= have (where=(code in (1,2,6))) out=need;
  by Id diff_range descending wgt;
run;

data want;
  set need; 
  by id diff_range;
  if first.id;
run;

which will work perfectly unless you have an ID containing multiple records at the lowest diff_range having tied values at the highest wgt.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Reeza · Posted 11-12-2018 01:57 PM

Sort and use FIRST.

Proc sort data= have;
By Id diff_range code;
Run;

Data want;
Set have; 
By id diff_range;
Where code in (1, 2, 6);

If first.id;

Run;

This will not keep duplicates. You could also sort by code to ensure you always get the smallest or largest. Though this may not be necessary, it’s easier to have a consistent logic so that you can replicate this if ever needed.

Cruise · Posted 11-12-2018 05:17 PM

hI

WEIGHT_EVIDENCE IS CREATED BASED ON THE CODE LIKE FOLLOWING:

if code=1 then weight_evidence=10; else
if code=2 then weight_evidence=8; else
if code=6 then weight_evidence=6;

HOWEVER, MIN_DIFF CAN RANGE 1 THRU 30.

In ideal world, where life goes our way, maximum weight of evidence would be corresponded with the minimum difference. However, there are discordant cases, which I thought of following as shown int he image. Please help me think of more discordant scenario if seem not considered here.

The Rationale for Selection and the Selected Cases should answer @Reeza and @novinosrin to understand how weight should be handled in the final output data.

Anything apparently incorrect? please let me know. I needed to think of what discordant cases might be. My actual data is huge and I have to resolve the issue on these hypothetical small dataset first. Hope that makes sense.

Reeza · Posted 11-12-2018 05:20 PM

Rather than try and think of all circumstances, what happens if you find all the cases that don't have issues, ie where min diff and max_weight and then see how many you have to deal with and consider them case by case?

novinosrin · Posted 11-12-2018 05:23 PM

I think the my sql will do that, test and let us know. Also, if you want to make your life easy, of course sort and datastep is by far simplest.

The below makes sql my style with fancy boolean although not much of big deal from the previous


data have;
input id code diff_range weight_evidence;
cards;
1 1 2 10 
1 2 2 8 
1 3 4 0
1 3 4 0
1 5 7 0
2 1 1 10
2 2 3 8
2 2 3 8
2 4 5 0
2 5 6 0
3 3 1 0
3 4 2 0
3 5 3 0
4 8 8 0
4 9 9 0
5 6 10 6
5 1 10 10 
5 10 12 0
6 1 2 10 
6 2 2 8
6 1 9 10
6 6 9 6 
;
proc sql;
create table want(drop=_:) as
select *
from 
(select *, max(code in (1,2,6)) as _c,min(diff_range) as _min from have group by id)
group by id,_min
having _c and max(weight_evidence)=weight_evidence and _min=diff_range;
quit;

How to select the distinct rows when variables are not one-to-one with 'group by' variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Re: How to select the distinct rows with when variables are not one-to-one with group by variable

Classroom Training Available!