Solved: how to selecet some ID out?

xiangpang · Posted 08-08-2018 10:12 PM

Hello,

I want to select the IDs with same visit and bmi difference is less than 10%.

not proc sql. Anyone knows how to do that?

Thanks a lot

data new;
input id visit bmi ;
cards;
1 day1 45
1 day12 66
1 day60 53
2 day1 45
2 day6 14
2 day12 36
2 day60 83
3 day1 745
3 day12 66
3 day60 50
;
run;

want;
1 day1 45
2 day1 45
1 day12 66
3 day12 66
1 day60 53
3 day60 50

s_manoj · Posted 08-09-2018 06:53 AM

Hello,

Check this code, this could help your need.

proc sort;
by visit;
run;

data want (keep = id visit bmi);
set new (rename = (id = id1 visit = visit1 bmi = bmi1));
do i =1 to n;
	set new point = i nobs = n;
 	v = divide(abs(bmi1 - bmi), bmi1);
	if (visit = visit1) & (id ne id1) & v le 0.1 then output  ;
end;
run;

Regards

Manoj

View solution in original post

xiangpang · Posted 08-08-2018 10:21 PM

visit should add $ , sorry for that

PGStats · Posted 08-08-2018 11:14 PM

So what if your data contains

1 day1 45

2 day1 49

3 day1 53

what obs should be selected?

PG

hashman · Posted 08-09-2018 03:18 AM

@xiangpang:

Actually, it looks more like a job for SQL, and in all candor, the tool should fit the job, not the other way around.

But if you don't want SQL, the job can be done with other means - it's SAS, after all.

However, as your specs are rather vague, I have to make assumptions, to wit:

- If there's only one record in a visit, you don't want the visit in the output; your sample output without DAY6 seems to suggest that, too.

- If there're no BMIs differing by less than 10% in the same visit (not in your sample input data), you don't want that visit in the output, either.

- No visit has more than 9999 BMI measurements. Methinks it's a reasonable assumption ;).

With these assumptions:

data new ;                                                     
  input id visit $ bmi ;                                       
  cards ;                                                      
1 day1   45                                                    
1 day12  66                                                    
1 day60  53                                                    
2 day1   45                                                    
2 day6   14                                                    
2 day12  36                                                    
2 day60  83                                                    
3 day1  745                                                    
3 day12  66                                                    
3 day60  50                                                    
run ;                                                          
                                                               
proc sort data = new out = newsort ;                           
  by visit bmi ;                                               
run ;                                                          
                                                               
data want (keep = id visit bmi) ;                              
  array abmi [9999] _temporary_ ;                               
  array aid  [9999] _temporary_ ;                               
  do n = 1 by 1 until (last.visit) ;                           
    set newsort ;                                              
    by visit ;                                                 
    abmi [n] = bmi ;                                           
    aid  [n] = id ;                                            
  end ;                                                        
  if n > 1 then do i = 1 to n - 1 ;                            
    do j = i + 1 to n ;                                        
      if divide (abmi[j] - abmi[i], abmi[j]) > 0.1 then leave ;
      bmi = abmi [i] ; id = aid [i] ; output ;                 
      bmi = abmi [j] ; id = aid [j] ; output ;                 
    end ;                                                      
  end ;                                                        
run ;

Note that the sorting, though not strictly necessary, technically speaking, greatly simplifies the code. Also, it helps avoid comparing ALL BMI measurements within the same visit pairwise since in the sorted BMI sequence, the difference between item [i] and item [i+1] is always no greater than between item [i] and item [i+(>1)].

HTH

Paul D.

xiangpang · Posted 08-09-2018 03:19 AM

In this case, all three should be included. Since day1 is same, and 49-53 difference is less than 10% while 45-49 difference is less than 10%.

xiangpang · Posted 08-09-2018 03:24 AM

Thanks, I will try it later.

Kurt_Bremser · Posted 08-09-2018 04:28 AM

Given the brevity of a SQL solution:

proc sql;
create table want as
select distinct a.*
from new a, new b
where a.visit = b.visit and a.id ne b.id and abs(a.bmi - b.bmi) <= min(a.bmi,b.bmi) / 10
order by visit,id;
quit;

not using it is dumb, IMHO. Unless the performance of SQL because of dataset size is intolerable.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

hashman · Posted 08-09-2018 04:56 AM

@Kurt_Bremser:

I doubt your SQL will underperform against a much more sizable input since the optimizer plan for this query uses the SQXJHSH method and doesn't resort to any Cartesian expansions. But even if it did, the simplicity of coding (or, rather, nearly not needing to code) would more than make up for that. To put together a viable DATA step solution in this case, one needs to know infinitely more about procedural programming and think its details out than to write a simple SQL query hardly requiring any programming at all. Like I said, the tool should fit the job, not vice versa. I'm not quite an SQL head; but in this case, it's exactly what the doctor ordered (and my having offered a DATA step solution notwithstanding).

Best

Paul D.

s_manoj · Posted 08-09-2018 06:53 AM

Hello,

Check this code, this could help your need.

proc sort;
by visit;
run;

data want (keep = id visit bmi);
set new (rename = (id = id1 visit = visit1 bmi = bmi1));
do i =1 to n;
	set new point = i nobs = n;
 	v = divide(abs(bmi1 - bmi), bmi1);
	if (visit = visit1) & (id ne id1) & v le 0.1 then output  ;
end;
run;

Regards

Manoj

hashman · Posted 08-09-2018 01:03 PM

@s_manoj:

Your code runs a de facto Cartesian product reading every obs from one copy of NEW for every obs in another, and it doesn't even take advantage of the sorted order. Hence, it will scale extremely poorly as the input size grows. At nobs=100000, an efficient solution would run in under a second, yet your program would take good 30 minutes. And at nobs=1000000 (which is a very small data set by today's standards) it would take upward of 50 hours.

The point= approach could be viable (though still slower than anything else) if you take advantage of the sorted order BY VISIT BMI and (a) use point= to read only the observations between the endpoints of the current BY group and (b) leave the point= loop as soon as the 10% condition is false.

Paul D.

how to selecet some ID out?

Re: how to selecet some ID out?

Re: how to selecet some ID out?

Re: how to selecet some ID out?

Re: how to selecet some ID out?

Re: how to selecet some ID out?

Re: how to selecet some ID out?

Re: how to selecet some ID out?

Re: how to selecet some ID out?

Re: how to selecet some ID out?

Re: how to selecet some ID out?

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away