BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
johnowen
Calcite | Level 5

I am running 9.4 on my laptop. I have a SAS file sorted by patient ID-MRN, her due date=EDD, and one or more clinic visit dates for each pregnancy-EDD; There can be multiple lines of data for each visit date as each OBS for that visit date contains different info. Each EDD identifies a single pregnancy. Unfortunately the data contains some "glitches" whereby the EDD can change by +/- a few days (say <30) over the course of the pregnancy. Thus if I sort by EDD and increment pregnancy number when I encounter a first.EDD it "counts" too many pregnancies. I need code to allow for a small programmable "fuzz factor" where if the change in the next sorted EDD is small compared to the previous, it will not increment the pregnancy #. However if the increment to the next EDD is great (usually >9 months!), it will count the next pregnancy.

This is what I want the output to look like, where X&Y are generally small "errors" <30 or so. The visit dates are sorted in order by the EDD and I just used D1... as crude examples:

ID-MRNEDDPREGNANCY#VISIT DATEVISIT DETAILS
1DATE1-11D1X.....
1DATE1-11D1Y.....
1DATE1-11D1Z......
1DATE1-11D2ETC
1DATE1-1±X1D2ETC
1DATE1-1±X1D2ETC
1DATE1-22D3ETC
1DATE1-22D3ETC
1DATE1-22D4ETC
1DATE1-2±Y2D4ETC
1DATE1-2±Y2D5ETC
2DATE2-11D1ETC
2DATE2-11D1ETC
2DATE2-11D2ETC
2DATE2-11D2ETC
2DATE2-1±X1D3ETC
2DATE2-1±X1D17ETC
2DATE2-1±X1D18ETC
2DATE2-1±X1D19ETC
2DATE2-22D20ETC
2DATE2-22D21ETC
2DATE2-2±Y2D22ETC
2DATE2-2±Y2D23ETC
2DATE2-2±Y2D24ETC

 

This is my current code but it only "fixes" the first occasion of the wonky EDDs.

DATA RKPREG; SET RKGEN3; BY MRN EDD DATE;
***********************************************************;
IF FIRST.MRN THEN DO; PREG=0; RETAIN PREG; END;
IF FIRST.EDD THEN DO; PREG=PREG+1; END; *this counts too many pregnancies!;
IF LAST.MRN THEN RETURN;
RUN;
PROC PRINT DATA=RKPREG (FIRSTOBS=1 OBS=2000); VAR MRN LNAME EDD PREG DATE BESTGA; TITLE2 'DATA RKPREG'; RUN;
******************************************************************;
DATA RKPREG2; FORMAT NEWEDD MMDDYY10.;
*******************************************************************;
RETAIN NEWEDD; *not sure where to put retain statement!?!;
SET RKPREG; BY MRN EDD DATE;
IF FIRST.MRN THEN DO; NEWEDD=EDD; RETAIN NEWEDD;
END;
DIFF=ABS(EDD-NEWEDD);
IF 0<DIFF<30 THEN DO; PREG=PREG-1; ***NEWEDD=EDD; END;
*here I'm using a "fuzz factor" of +/-30 days), if I uncomment NEWEDD=EDD, it fails;
IF LAST.MRN THEN RETURN;

PROC PRINT DATA=RKPREG2 (FIRSTOBS=1 OBS=2000); VAR MRN LNAME DIFF EDD NEWEDD PREG DATE BESTGA; TITLE2 'DATA RKPREG2'; RUN;

 

This code fails on the next pregnancy(ies) because it continues to "retain" the EDD from the first pregnancy and I can' figure out how to "reset" it for each new pregnancy/EDD. If you reset the "wonky" EDDs to the previous (i.e. remove the error term) and make them all exactly the same as the initial value, that would be perfectly fine as well and might be the easiest solution. If I ran that code first, that way the first few lines of code should work fine.

 

Any coding help would be greatly appreciated.

Pulling my hair!

 

Many Thanks,

John Owen. MSPH

Univ of Ala at B'ham

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

So it looks to me like you are sorting by the wrong date.  Use the other date which should be the event date.

Then it should work easily. 

First let's clean up your data to eliminate any potential for identifying actual subjects and to make the variable names easier to work with.

proc sort data="c:\downloads\forsas" out=have;
  by mrn date___time code ;
run;

data have;
  length group 8;
  set have ;
  by mrn ;
  group + first.mrn ;
  rename date___time = date group=mrn Code__ICD9_=ICD9CODE;
  drop mrn code ;
  format date___time edd yymmdd10.;
run;

Now we have data like:

Obs    mrn          date           EDD    ICD9CODE

  1     1     2014-04-03    2014-10-19     V28.3
  2     1     2014-04-11    2014-10-19     659.63
  3     1     2014-04-11    2014-10-19     646.93
  4     1     2014-05-09    2014-10-19     659.63
  5     1     2014-05-09    2014-10-19     V23.41
  6     1     2014-05-20    2014-10-19     642.03
  7     1     2014-05-20    2014-10-19     659.63
  8     1     2014-05-20    2014-10-19     649.13
  9     1     2014-05-20    2014-10-19     648.03
 10     1     2014-05-20    2014-10-19     659.63

So now we can make a new EDD variable and while we are at it count the number of them per MRN.


data want;
  set have;
  by mrn date ;
  if first.mrn then do;
    new_edd=edd;
    eddn=1;
  end;
  else do;
    if edd-new_edd > 30 then do;
       new_edd=edd;
       eddn+1;
    end;
  end;
  retain new_edd eddn;
  format new_edd yymmdd10.;
run;

Let's see what we created.

data _null_;
  set want;
  by mrn eddn ;
  if first.eddn then put mrn= eddn= new_edd= ;
run;

Results:

224   data _null_;
225     set want;
226     by mrn eddn ;
227     if first.eddn then put mrn= eddn= new_edd= ;
228   run;

mrn=1 eddn=1 new_edd=2014-10-19
mrn=1 eddn=2 new_edd=2017-02-25
mrn=2 eddn=1 new_edd=2016-08-26
mrn=2 eddn=2 new_edd=2017-04-11
mrn=3 eddn=1 new_edd=2015-12-29
mrn=4 eddn=1 new_edd=2014-06-30
mrn=4 eddn=2 new_edd=2016-03-22
mrn=5 eddn=1 new_edd=2014-06-09
mrn=6 eddn=1 new_edd=2013-09-01
mrn=7 eddn=1 new_edd=2013-01-05
mrn=7 eddn=2 new_edd=2014-08-08
mrn=7 eddn=3 new_edd=2016-08-24
mrn=8 eddn=1 new_edd=2015-10-04
mrn=8 eddn=2 new_edd=2016-03-31
mrn=9 eddn=1 new_edd=2017-10-13
mrn=9 eddn=2 new_edd=2019-04-27
mrn=10 eddn=1 new_edd=2015-11-18
mrn=10 eddn=2 new_edd=2019-01-30

View solution in original post

10 REPLIES 10
ballardw
Super User

If you have replaced values with dummys, such as D1 for an actual date you adding lots of complexity for absolutely no gain.

Yes patient confidentiality is important but if there is a dummy ID value you gain nothing in security by obscuring dates and remove the most likely tool for building solution.

 

Is that EDD value actually the literal text "DATE1-1"? If not then show us actual date values. The actual dates could be made up as far as that goes as long as they behave like actual data.

 

Then show "worked" result that you expect from this exercise.

 

Because of the order of variables I am practically guessing that what you want for a result is a pregnancy sequence, correct?

 

If the "Visit details" do not actually contribute information to this process then don't bring it up.

 

From working a bit with pregnancy data your "the next EDD is great (usually >9 months!)" is a red herring as any number of things that terminate a pregnancy early can mess with this logic.

 

Also, how good to you think the data entry is to begin with? Do you have any EDD that are prior to an actual Visit date? Or EDD considerably later than 9 months from a visit date? (i.e. actual EDD 17Jul2021 but 17Jul2022 was entered by accident).

I ask because I have some "specimen collection dates" for 2026 that entered in data in 1998.

johnowen
Calcite | Level 5
Thanks, BallardW for your interest.


No, there are no dummy values in the SAS data. The actual MRNs and dates are all legit and properly sorted. I showed the Excel file example (with dummies) to demonstrate to other SAS users basically how the data were arranged AND how I wanted the PREG values to appear after an algorithm (which I am unable to craft) is applied. I want each EDD (+/- some fuzz) to represent each PREG #. Biologically, if the EDD changes by several months or more, it has to be a new pregnancy (PREG#). So if you could write an algorithm that simply removes the fuzz and makes all the EDDs the same as the first.EDD, then the first few lines of my code should work. For the vast majority of the patients in the data set, the EDD does not change (no fuzz!) and in these cases my code works perfectly.


The program I showed DOES fix the first pregnancy, but if a patient had >1 pregnancy in the 7 years worth of data and had the EDD glitch in the second, third, 4th, etc.t, it fails.


Hope this makes sense and that you can solve this frustrating puzzle.


Thanks again,

John Owen
ballardw
Super User

No actual dates with expected results and I am not going to write or attempt code.

 

The INTCK function returns number of intervals such as days between two actual SAS date values: DaysDiff = intck('day',firstdatevalue, seconddatevalue). If the values are dates subtraction works but you may be interested in other intervals such as 'week' 'month' or others. Check the documentation for other intervals, shifts and multiples.

Lag1 to LagN function will allow you to use values from a previous record (or up to N records previous): Somevar= lag(var); If no numeral is supplied then Lag1 is implied.

Retain will let you create a new variable and hold the value across iterations of the data step (such as the FIRST EDD value encountered:  Retain Newvar <optionally an initial value>. If you decide to Retain a character variable it is a good idea to explicitly define the Length needed before the Retain statement.

 

Those are the tools likely needed.

 

johnowen
Calcite | Level 5
OK, well the master data set has >450k OBS so I'll be sending you a trimmed version with the issues I've described.


Thanks,

J.O.
ballardw
Super User

@johnowen wrote:
OK, well the master data set has >450k OBS so I'll be sending you a trimmed version with the issues I've described.


Thanks,

J.O.

Don't need the whole data set. Just enough to demonstrate typical behavior. Your "posted" example with actual dates instead of things like "date1-1" and "D1" might be enough to get a start.

Tom
Super User Tom
Super User

and I can't figure out how to "reset" it for each new pregnancy/EDD.

But you know how to reset it when you start a new patient:

 

IF FIRST.MRN THEN DO; NEWEDD=EDD; END;
RETAIN NEWEDD;

 

So please describe in detail how it fails.

if I uncomment NEWEDD=EDD, it fails;

Include an example sets of dates that cause the failure and explain what you want it to produce instead.

johnowen
Calcite | Level 5
I've been travelling all weekend. Tom, so please forgive the late response.


It fails on the second occurrence because it retains the first EDD value for that MRN, and I can't figure out how to "reset" the EDD to the next (correct) pregnancy-EDD value. When I try to reset it as shown in the comment *.......; program then it doesn't see the "wrong" EDD that is +/- X,Y days compared to the original.


You know, the more I think about the problem, there really isn't an "absolutely correct set in stone" EDD which is why it can be changed in the data base by a provider. Thus to simplify this problem, would you be able to help me figure out how to retain the initial MRN EDD and then if SAS encounters a subsequent EDD < +/- 30 days different it changes that EDD back to the "original" value. That way my code that uses first.EDD will correctly increment the pregnancy #.


Many thanks,

John Owen
johnowen
Calcite | Level 5

OK, I have uploaded an .htm output file, a program file and  5 datasets.

The program takes my original data in FORSAS and counts pregnancies (PREG) by first.EDD for each patient (MRN) outputting to FORSAS2.

In the first 2 MRNs it works fine because the EDDs remain constant until they change by > 30 days, certainly indicating a "new" pregnancy that should be correctly counted.

However, in the 3rd case CooFe, you see the EDD changes from 3/29 to 3/30 and PREG is incorrectly incremented.

Then I TRY to fix the wonky EDDs and output to FORSAS3. Because the retained ALTEDD doesn't change as it encounters new EDDs, it corrects the first wonky EDD, but after that it doesn't find wonky EDDs that occur in more future pregnancies. You can see this when I run data=FORSAS3 again by first.EDD after dropping PREG in FORSAS3B. CooFe is fixed but DavJe isn't fixed because the EDD changes from 3/22 to 3/26 AFTER the first pregnancy as ALTEDD is left over from the first pregnancy. In that case PREG should not have been incremented from 2 to 3. JohSh demonstrates the same problem. The referent ALTEDD needs to change after each CORRECT EDD is processed but I can't figure this out.

 

Hope this makes sense. It is driving me crazy!

 

P.S. SAS only lets me upload max 5 files so I chose the program and 4 of the data sets, omitting FORSAS3B where I drop PREG for the next run. You can easily generate the output file using the program or just open each data set.

 

Thanks,

John Owen

Tom
Super User Tom
Super User

So it looks to me like you are sorting by the wrong date.  Use the other date which should be the event date.

Then it should work easily. 

First let's clean up your data to eliminate any potential for identifying actual subjects and to make the variable names easier to work with.

proc sort data="c:\downloads\forsas" out=have;
  by mrn date___time code ;
run;

data have;
  length group 8;
  set have ;
  by mrn ;
  group + first.mrn ;
  rename date___time = date group=mrn Code__ICD9_=ICD9CODE;
  drop mrn code ;
  format date___time edd yymmdd10.;
run;

Now we have data like:

Obs    mrn          date           EDD    ICD9CODE

  1     1     2014-04-03    2014-10-19     V28.3
  2     1     2014-04-11    2014-10-19     659.63
  3     1     2014-04-11    2014-10-19     646.93
  4     1     2014-05-09    2014-10-19     659.63
  5     1     2014-05-09    2014-10-19     V23.41
  6     1     2014-05-20    2014-10-19     642.03
  7     1     2014-05-20    2014-10-19     659.63
  8     1     2014-05-20    2014-10-19     649.13
  9     1     2014-05-20    2014-10-19     648.03
 10     1     2014-05-20    2014-10-19     659.63

So now we can make a new EDD variable and while we are at it count the number of them per MRN.


data want;
  set have;
  by mrn date ;
  if first.mrn then do;
    new_edd=edd;
    eddn=1;
  end;
  else do;
    if edd-new_edd > 30 then do;
       new_edd=edd;
       eddn+1;
    end;
  end;
  retain new_edd eddn;
  format new_edd yymmdd10.;
run;

Let's see what we created.

data _null_;
  set want;
  by mrn eddn ;
  if first.eddn then put mrn= eddn= new_edd= ;
run;

Results:

224   data _null_;
225     set want;
226     by mrn eddn ;
227     if first.eddn then put mrn= eddn= new_edd= ;
228   run;

mrn=1 eddn=1 new_edd=2014-10-19
mrn=1 eddn=2 new_edd=2017-02-25
mrn=2 eddn=1 new_edd=2016-08-26
mrn=2 eddn=2 new_edd=2017-04-11
mrn=3 eddn=1 new_edd=2015-12-29
mrn=4 eddn=1 new_edd=2014-06-30
mrn=4 eddn=2 new_edd=2016-03-22
mrn=5 eddn=1 new_edd=2014-06-09
mrn=6 eddn=1 new_edd=2013-09-01
mrn=7 eddn=1 new_edd=2013-01-05
mrn=7 eddn=2 new_edd=2014-08-08
mrn=7 eddn=3 new_edd=2016-08-24
mrn=8 eddn=1 new_edd=2015-10-04
mrn=8 eddn=2 new_edd=2016-03-31
mrn=9 eddn=1 new_edd=2017-10-13
mrn=9 eddn=2 new_edd=2019-04-27
mrn=10 eddn=1 new_edd=2015-11-18
mrn=10 eddn=2 new_edd=2019-01-30
johnowen
Calcite | Level 5

Thanks so much, Tom SUPER USER.

I was hoping it was something not too complicated. Sorting by date of visit instead of EDD did the trick!

Best,

John Owen

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 710 views
  • 0 likes
  • 3 in conversation