Re: SAS Program to De-Duplicate with a Condition

Sapkota · Posted 02-09-2015 03:02 PM

I need to create a dataset on which there are two systolic blood pressure measurements taken from the same individual. I need to take only the lowest of the two measurements only if the measurements are taken within 4 weeks from each other. If the measurements are not within four weeks then the measurement is not taken. The names and date of birth are not real.

ID Name Date of Birth Systolic Blood Pressure Date of measurement

0001 abc 12/12/1965 145 8/10/2010

0002 abc 12/12/1965 139 8/20/2010

0003 def 8/30/1962 132 8/04/2010

0004 def 8/30/1962 137 8/21/2010

0005 ghi 3/12/1964 140 02/07/2010

0006 ghi 3/12/1964 129 03/17/2010

007 jkl 5/10/1963 127 01/07/2010

008 jkl 5/10/1963 121 02/13/2010

As can be seen, abc, def, ghi and jkl are four different persons. Each of them have two readings of systolic blood pressure. I need to write SAS program to take the lowest of the two measurements, only if the date of measurements are within four weeks from each other. If not, then they are not included. In this case the first two are seen to meet that criteria of four weeks while the last two does not.

Any help will be appreciated. Thanks in advance.

ballardw · Posted 02-09-2015 04:16 PM

This may get you started:

proc summary data=have nway;

class name date_of_birth;

var Systolic_Blood_Pressure;

output out=temp (drop=_type_ _freq_) min=

idgroup (min(Systolic_Blood_Pressure) out[2]

(Systolic_Blood_Pressure id Date_of_measurement)=)

/autolabel autoname;

run;

This gets the two lowest values.

Since you don't guarantee there are two or more readings the deduplication may be a bit more.

You will have two variable in the output set above for each date of measurement.

It could be:

data want;

set temp;

where Date_of_measurement_1 - Date_of_measurement2 le 28;

run;

Sapkota · Posted 02-10-2015 08:28 AM

Thank you. Your code did help in getting started but I still need to get the desired outputs which are 1) first get the two measurements per person that are 12 weeks apart and then 2) get rid of the one that is the highest (and keep the lowest). I am playing around with this code though to get to the outputs.

RW9 · Posted 02-10-2015 08:47 AM

Well, depends on your definitions, does it mean that the two observations are exactly 12 weeks apart? Will there only be 2 results, or many results, what happens if more than 2 results overlapping etc. Clearly define the logic needed first. If its just two and they are exact:

proc sql;

create table WANT as

select COALESCE(A.ID,B.ID) as ID,

...

case when A.SYSBP is not null and A.SYSBP < B.SYSBP then A.SYSBP

else B.SYSBP end as SYSBP

from HAVE A

full join HAVE B

on A.ID=B.ID

and A.DOB=B.DOB

and A.NAME=B.NAME

and ...

and B.MEAS=intnx('week',A.MEAS,12,'same');

quit;

Sapkota · Posted 02-11-2015 10:45 AM

Thank you. It is 4 weeks and not 12 weeks. But they need not be exactly 4 weeks apart. The dates could be within 4 weeks. There will be only two results, and not multiple. Your code is helpful if it needed an exact time difference (4 weeks in this case). thanks again.

ballardw · Posted 02-10-2015 11:09 AM

12 week? earlier it was 4 weeks...

The data set I show has 2 minimum values and the associated dates. I thought you might be able to determine which ones were of interest by comparing the dates.

I think you are possibly hung up on "deduplicate first" then find min as the order to accomplish the task. Sometimes the order may work better in a different sequence.

Sapkota · Posted 02-11-2015 10:39 AM

Thank you. It is 4 weeks. Sorry for the typo. I will try the different sequence as you suggested.

Ksharp · Posted 02-10-2015 05:14 AM

You need to post the output you need according to your sample data. Otherwise We just are guessing what you want.


data have;
input ID                   $         Name    $      DateofBirth : mmddyy12.                       SystolicBloodPressure                   Dateofmeasurement : mmddyy12.;
format DateofBirth        Dateofmeasurement     mmddyy10.;
cards;
0001                      abc                                       12/12/1965                               145                                                     8/10/2010
0002                      abc                                       12/12/1965                               139                                                     8/20/2010
0003                      def                                        8/30/1962                                 132                                                    8/04/2010
0004                      def                                       8/30/1962                                  137                                                    8/21/2010
0005                      ghi                                       3/12/1964                                  140                                                   02/07/2010
0006                      ghi                                       3/12/1964                                  129                                                   03/17/2010
007                        jkl                                        5/10/1963                                  127                                                  01/07/2010
008                        jkl                                       5/10/1963                                   121                                                  02/13/2010     
;
run;
proc sql;
 create table want as
  select * 
   from have
    group by Name
      having range(Dateofmeasurement) le 28 and SystolicBloodPressure=min(SystolicBloodPressure);
quit;

Xia Keshan

Sapkota · Posted 02-10-2015 08:41 AM

Thank you much. Here is the desired final output:

id name DateofBirth SystolicBloodPressire DateofMeasurement

0001 abc 12/12/1965 139 8/20/2010

0003 def 8/30/1962 132 8/04/2010

To get this output there are two steps I guess. The first step will remove all the duplicates that have blood pressure measurements that are beyond 4 weeks apart from each other and keep those duplicates' measurements that are

within 4 weeks.

The second step would then give the above output where only the lowest of the measurement is kept.

Your SQL code will get me the second step i.e. the lowest of the two duplicate's measurement, but may not get me the first desired output which is list of duplicates that have measurement within 4 weeks and that are beyond four weeks.

Ksharp · Posted 02-10-2015 08:47 AM

No. I can get you the first desired output . The question is what output you are looking for ?

proc sql;

create table want as

select *

from have

group by Name

having range(Dateofmeasurement) le 28 ;

quit;

Sapkota · Posted 02-11-2015 10:49 AM

Thank you very much. This was helpful.

ballardw · Posted 02-10-2015 11:11 AM

Suppose you have a patient with 16 readings all exactly 3 weeks apart. Which are kept by your deduplication rule?

Also you did not specify that your output requirement was a list of duplicates. Plus you need to provide a much more comprehensive example to show what you mean by

duplicates that have measurement within 4 weeks and that are beyond four weeks.

because that at face value is a null set.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away