Solved: file sorted by ID, EDD=expected delivery date, visit dates,

johnowen · Posted 07-22-2021 04:18 PM

I am running 9.4 on my laptop. I have a SAS file sorted by patient ID-MRN, her due date=EDD, and one or more clinic visit dates for each pregnancy-EDD; There can be multiple lines of data for each visit date as each OBS for that visit date contains different info. Each EDD identifies a single pregnancy. Unfortunately the data contains some "glitches" whereby the EDD can change by +/- a few days (say <30) over the course of the pregnancy. Thus if I sort by EDD and increment pregnancy number when I encounter a first.EDD it "counts" too many pregnancies. I need code to allow for a small programmable "fuzz factor" where if the change in the next sorted EDD is small compared to the previous, it will not increment the pregnancy #. However if the increment to the next EDD is great (usually >9 months!), it will count the next pregnancy.

This is what I want the output to look like, where X&Y are generally small "errors" <30 or so. The visit dates are sorted in order by the EDD and I just used D1... as crude examples:

ID-MRN	EDD	PREGNANCY#	VISIT DATE	VISIT DETAILS
1	DATE1-1	1	D1	X.....
1	DATE1-1	1	D1	Y.....
1	DATE1-1	1	D1	Z......
1	DATE1-1	1	D2	ETC
1	DATE1-1±X	1	D2	ETC
1	DATE1-1±X	1	D2	ETC
1	DATE1-2	2	D3	ETC
1	DATE1-2	2	D3	ETC
1	DATE1-2	2	D4	ETC
1	DATE1-2±Y	2	D4	ETC
1	DATE1-2±Y	2	D5	ETC
2	DATE2-1	1	D1	ETC
2	DATE2-1	1	D1	ETC
2	DATE2-1	1	D2	ETC
2	DATE2-1	1	D2	ETC
2	DATE2-1±X	1	D3	ETC
2	DATE2-1±X	1	D17	ETC
2	DATE2-1±X	1	D18	ETC
2	DATE2-1±X	1	D19	ETC
2	DATE2-2	2	D20	ETC
2	DATE2-2	2	D21	ETC
2	DATE2-2±Y	2	D22	ETC
2	DATE2-2±Y	2	D23	ETC
2	DATE2-2±Y	2	D24	ETC

This is my current code but it only "fixes" the first occasion of the wonky EDDs.

DATA RKPREG; SET RKGEN3; BY MRN EDD DATE;
***********************************************************;
IF FIRST.MRN THEN DO; PREG=0; RETAIN PREG; END;
IF FIRST.EDD THEN DO; PREG=PREG+1; END; *this counts too many pregnancies!;
IF LAST.MRN THEN RETURN;
RUN;
PROC PRINT DATA=RKPREG (FIRSTOBS=1 OBS=2000); VAR MRN LNAME EDD PREG DATE BESTGA; TITLE2 'DATA RKPREG'; RUN;
******************************************************************;
DATA RKPREG2; FORMAT NEWEDD MMDDYY10.;
*******************************************************************;
RETAIN NEWEDD; *not sure where to put retain statement!?!;
SET RKPREG; BY MRN EDD DATE;
IF FIRST.MRN THEN DO; NEWEDD=EDD; RETAIN NEWEDD;
END;
DIFF=ABS(EDD-NEWEDD);
IF 0<DIFF<30 THEN DO; PREG=PREG-1; ***NEWEDD=EDD; END;
*here I'm using a "fuzz factor" of +/-30 days), if I uncomment NEWEDD=EDD, it fails;
IF LAST.MRN THEN RETURN;

PROC PRINT DATA=RKPREG2 (FIRSTOBS=1 OBS=2000); VAR MRN LNAME DIFF EDD NEWEDD PREG DATE BESTGA; TITLE2 'DATA RKPREG2'; RUN;

This code fails on the next pregnancy(ies) because it continues to "retain" the EDD from the first pregnancy and I can' figure out how to "reset" it for each new pregnancy/EDD. If you reset the "wonky" EDDs to the previous (i.e. remove the error term) and make them all exactly the same as the initial value, that would be perfectly fine as well and might be the easiest solution. If I ran that code first, that way the first few lines of code should work fine.

Any coding help would be greatly appreciated.

Pulling my hair!

Many Thanks,

John Owen. MSPH

Univ of Ala at B'ham

Tom · Posted 07-26-2021 10:57 AM

So it looks to me like you are sorting by the wrong date. Use the other date which should be the event date.

Then it should work easily.

First let's clean up your data to eliminate any potential for identifying actual subjects and to make the variable names easier to work with.

proc sort data="c:\downloads\forsas" out=have;
  by mrn date___time code ;
run;

data have;
  length group 8;
  set have ;
  by mrn ;
  group + first.mrn ;
  rename date___time = date group=mrn Code__ICD9_=ICD9CODE;
  drop mrn code ;
  format date___time edd yymmdd10.;
run;

Now we have data like:

Obs    mrn          date           EDD    ICD9CODE

  1     1     2014-04-03    2014-10-19     V28.3
  2     1     2014-04-11    2014-10-19     659.63
  3     1     2014-04-11    2014-10-19     646.93
  4     1     2014-05-09    2014-10-19     659.63
  5     1     2014-05-09    2014-10-19     V23.41
  6     1     2014-05-20    2014-10-19     642.03
  7     1     2014-05-20    2014-10-19     659.63
  8     1     2014-05-20    2014-10-19     649.13
  9     1     2014-05-20    2014-10-19     648.03
 10     1     2014-05-20    2014-10-19     659.63

So now we can make a new EDD variable and while we are at it count the number of them per MRN.


data want;
  set have;
  by mrn date ;
  if first.mrn then do;
    new_edd=edd;
    eddn=1;
  end;
  else do;
    if edd-new_edd > 30 then do;
       new_edd=edd;
       eddn+1;
    end;
  end;
  retain new_edd eddn;
  format new_edd yymmdd10.;
run;

Let's see what we created.

data _null_;
  set want;
  by mrn eddn ;
  if first.eddn then put mrn= eddn= new_edd= ;
run;

Results:

224   data _null_;
225     set want;
226     by mrn eddn ;
227     if first.eddn then put mrn= eddn= new_edd= ;
228   run;

mrn=1 eddn=1 new_edd=2014-10-19
mrn=1 eddn=2 new_edd=2017-02-25
mrn=2 eddn=1 new_edd=2016-08-26
mrn=2 eddn=2 new_edd=2017-04-11
mrn=3 eddn=1 new_edd=2015-12-29
mrn=4 eddn=1 new_edd=2014-06-30
mrn=4 eddn=2 new_edd=2016-03-22
mrn=5 eddn=1 new_edd=2014-06-09
mrn=6 eddn=1 new_edd=2013-09-01
mrn=7 eddn=1 new_edd=2013-01-05
mrn=7 eddn=2 new_edd=2014-08-08
mrn=7 eddn=3 new_edd=2016-08-24
mrn=8 eddn=1 new_edd=2015-10-04
mrn=8 eddn=2 new_edd=2016-03-31
mrn=9 eddn=1 new_edd=2017-10-13
mrn=9 eddn=2 new_edd=2019-04-27
mrn=10 eddn=1 new_edd=2015-11-18
mrn=10 eddn=2 new_edd=2019-01-30

View solution in original post

ballardw · Posted 07-22-2021 06:45 PM

If you have replaced values with dummys, such as D1 for an actual date you adding lots of complexity for absolutely no gain.

Yes patient confidentiality is important but if there is a dummy ID value you gain nothing in security by obscuring dates and remove the most likely tool for building solution.

Is that EDD value actually the literal text "DATE1-1"? If not then show us actual date values. The actual dates could be made up as far as that goes as long as they behave like actual data.

Then show "worked" result that you expect from this exercise.

Because of the order of variables I am practically guessing that what you want for a result is a pregnancy sequence, correct?

If the "Visit details" do not actually contribute information to this process then don't bring it up.

From working a bit with pregnancy data your "the next EDD is great (usually >9 months!)" is a red herring as any number of things that terminate a pregnancy early can mess with this logic.

Also, how good to you think the data entry is to begin with? Do you have any EDD that are prior to an actual Visit date? Or EDD considerably later than 9 months from a visit date? (i.e. actual EDD 17Jul2021 but 17Jul2022 was entered by accident).

I ask because I have some "specimen collection dates" for 2026 that entered in data in 1998.

johnowen · Posted 07-22-2021 07:32 PM

Thanks, BallardW for your interest.

No, there are no dummy values in the SAS data. The actual MRNs and dates are all legit and properly sorted. I showed the Excel file example (with dummies) to demonstrate to other SAS users basically how the data were arranged AND how I wanted the PREG values to appear after an algorithm (which I am unable to craft) is applied. I want each EDD (+/- some fuzz) to represent each PREG #. Biologically, if the EDD changes by several months or more, it has to be a new pregnancy (PREG#). So if you could write an algorithm that simply removes the fuzz and makes all the EDDs the same as the first.EDD, then the first few lines of my code should work. For the vast majority of the patients in the data set, the EDD does not change (no fuzz!) and in these cases my code works perfectly.

The program I showed DOES fix the first pregnancy, but if a patient had >1 pregnancy in the 7 years worth of data and had the EDD glitch in the second, third, 4th, etc.t, it fails.

Hope this makes sense and that you can solve this frustrating puzzle.

Thanks again,

John Owen

ballardw · Posted 07-23-2021 03:20 AM

No actual dates with expected results and I am not going to write or attempt code.

The INTCK function returns number of intervals such as days between two actual SAS date values: DaysDiff = intck('day',firstdatevalue, seconddatevalue). If the values are dates subtraction works but you may be interested in other intervals such as 'week' 'month' or others. Check the documentation for other intervals, shifts and multiples.

Lag1 to LagN function will allow you to use values from a previous record (or up to N records previous): Somevar= lag(var); If no numeral is supplied then Lag1 is implied.

Retain will let you create a new variable and hold the value across iterations of the data step (such as the FIRST EDD value encountered: Retain Newvar <optionally an initial value>. If you decide to Retain a character variable it is a good idea to explicitly define the Length needed before the Retain statement.

Those are the tools likely needed.

johnowen · Posted 07-23-2021 08:44 AM

OK, well the master data set has >450k OBS so I'll be sending you a trimmed version with the issues I've described.

Thanks,

J.O.

ballardw · Posted 07-23-2021 12:22 PM

@johnowen wrote:
OK, well the master data set has >450k OBS so I'll be sending you a trimmed version with the issues I've described.

Thanks,

J.O.

Don't need the whole data set. Just enough to demonstrate typical behavior. Your "posted" example with actual dates instead of things like "date1-1" and "D1" might be enough to get a start.

Tom · Posted 07-23-2021 12:55 PM

and I can't figure out how to "reset" it for each new pregnancy/EDD.

But you know how to reset it when you start a new patient:

IF FIRST.MRN THEN DO; NEWEDD=EDD; END;
RETAIN NEWEDD;

So please describe in detail how it fails.

if I uncomment NEWEDD=EDD, it fails;

Include an example sets of dates that cause the failure and explain what you want it to produce instead.

johnowen · Posted 07-25-2021 08:57 PM

I've been travelling all weekend. Tom, so please forgive the late response.

It fails on the second occurrence because it retains the first EDD value for that MRN, and I can't figure out how to "reset" the EDD to the next (correct) pregnancy-EDD value. When I try to reset it as shown in the comment *.......; program then it doesn't see the "wrong" EDD that is +/- X,Y days compared to the original.

You know, the more I think about the problem, there really isn't an "absolutely correct set in stone" EDD which is why it can be changed in the data base by a provider. Thus to simplify this problem, would you be able to help me figure out how to retain the initial MRN EDD and then if SAS encounters a subsequent EDD < +/- 30 days different it changes that EDD back to the "original" value. That way my code that uses first.EDD will correctly increment the pregnancy #.

Many thanks,

John Owen

johnowen · Posted 07-25-2021 11:50 PM

OK, I have uploaded an .htm output file, a program file and 5 datasets.

The program takes my original data in FORSAS and counts pregnancies (PREG) by first.EDD for each patient (MRN) outputting to FORSAS2.

In the first 2 MRNs it works fine because the EDDs remain constant until they change by > 30 days, certainly indicating a "new" pregnancy that should be correctly counted.

However, in the 3rd case CooFe, you see the EDD changes from 3/29 to 3/30 and PREG is incorrectly incremented.

Then I TRY to fix the wonky EDDs and output to FORSAS3. Because the retained ALTEDD doesn't change as it encounters new EDDs, it corrects the first wonky EDD, but after that it doesn't find wonky EDDs that occur in more future pregnancies. You can see this when I run data=FORSAS3 again by first.EDD after dropping PREG in FORSAS3B. CooFe is fixed but DavJe isn't fixed because the EDD changes from 3/22 to 3/26 AFTER the first pregnancy as ALTEDD is left over from the first pregnancy. In that case PREG should not have been incremented from 2 to 3. JohSh demonstrates the same problem. The referent ALTEDD needs to change after each CORRECT EDD is processed but I can't figure this out.

Hope this makes sense. It is driving me crazy!

P.S. SAS only lets me upload max 5 files so I chose the program and 4 of the data sets, omitting FORSAS3B where I drop PREG for the next run. You can easily generate the output file using the program or just open each data set.

Thanks,

John Owen

Tom · Posted 07-26-2021 10:57 AM

So it looks to me like you are sorting by the wrong date. Use the other date which should be the event date.

Then it should work easily.

First let's clean up your data to eliminate any potential for identifying actual subjects and to make the variable names easier to work with.

proc sort data="c:\downloads\forsas" out=have;
  by mrn date___time code ;
run;

data have;
  length group 8;
  set have ;
  by mrn ;
  group + first.mrn ;
  rename date___time = date group=mrn Code__ICD9_=ICD9CODE;
  drop mrn code ;
  format date___time edd yymmdd10.;
run;

Now we have data like:

Obs    mrn          date           EDD    ICD9CODE

  1     1     2014-04-03    2014-10-19     V28.3
  2     1     2014-04-11    2014-10-19     659.63
  3     1     2014-04-11    2014-10-19     646.93
  4     1     2014-05-09    2014-10-19     659.63
  5     1     2014-05-09    2014-10-19     V23.41
  6     1     2014-05-20    2014-10-19     642.03
  7     1     2014-05-20    2014-10-19     659.63
  8     1     2014-05-20    2014-10-19     649.13
  9     1     2014-05-20    2014-10-19     648.03
 10     1     2014-05-20    2014-10-19     659.63

So now we can make a new EDD variable and while we are at it count the number of them per MRN.


data want;
  set have;
  by mrn date ;
  if first.mrn then do;
    new_edd=edd;
    eddn=1;
  end;
  else do;
    if edd-new_edd > 30 then do;
       new_edd=edd;
       eddn+1;
    end;
  end;
  retain new_edd eddn;
  format new_edd yymmdd10.;
run;

Let's see what we created.

data _null_;
  set want;
  by mrn eddn ;
  if first.eddn then put mrn= eddn= new_edd= ;
run;

Results:

224   data _null_;
225     set want;
226     by mrn eddn ;
227     if first.eddn then put mrn= eddn= new_edd= ;
228   run;

mrn=1 eddn=1 new_edd=2014-10-19
mrn=1 eddn=2 new_edd=2017-02-25
mrn=2 eddn=1 new_edd=2016-08-26
mrn=2 eddn=2 new_edd=2017-04-11
mrn=3 eddn=1 new_edd=2015-12-29
mrn=4 eddn=1 new_edd=2014-06-30
mrn=4 eddn=2 new_edd=2016-03-22
mrn=5 eddn=1 new_edd=2014-06-09
mrn=6 eddn=1 new_edd=2013-09-01
mrn=7 eddn=1 new_edd=2013-01-05
mrn=7 eddn=2 new_edd=2014-08-08
mrn=7 eddn=3 new_edd=2016-08-24
mrn=8 eddn=1 new_edd=2015-10-04
mrn=8 eddn=2 new_edd=2016-03-31
mrn=9 eddn=1 new_edd=2017-10-13
mrn=9 eddn=2 new_edd=2019-04-27
mrn=10 eddn=1 new_edd=2015-11-18
mrn=10 eddn=2 new_edd=2019-01-30

johnowen · Posted 07-26-2021 02:17 PM

Thanks so much, Tom SUPER USER.

I was hoping it was something not too complicated. Sorting by date of visit instead of EDD did the trick!

Best,

John Owen

file sorted by ID, EDD=expected delivery date, visit dates,

Re: file sorted by ID, EDD=expected delivery date, visit dates,

Re: file sorted by ID, EDD=expected delivery date, visit dates,

Re: file sorted by ID, EDD=expected delivery date, visit dates,

Re: file sorted by ID, EDD=expected delivery date, visit dates,

Re: file sorted by ID, EDD=expected delivery date, visit dates,

Re: file sorted by ID, EDD=expected delivery date, visit dates,

Re: file sorted by ID, EDD=expected delivery date, visit dates,

Re: file sorted by ID, EDD=expected delivery date, visit dates,

Re: file sorted by ID, EDD=expected delivery date, visit dates,

Re: file sorted by ID, EDD=expected delivery date, visit dates,

Re: file sorted by ID, EDD=expected delivery date, visit dates,

SAS Innovate 2025: Call for Content

Classroom Training Available!