SAS Data Management

websitelogins94 · Posted 10-01-2019 01:03 PM

Hello All,

Hope all is well. This is my first post since I've reviewed several others but unfortunately was not able to find the right solution to fit my circumstances. I have a data set that requires me to provide: (1) one row per patient id, (2) collapsed (corrected) values from subsequent rows into the earliest patient record. Please see below.

EXAMPLE

Datum id DayFromTxBase CorrectionBas Var1 Var2 Var3 Var4 Var5

1/27/2016 22:06 Z029 -2 ABC 1 2 3 UVW

6/23/2017 15:36 Z029 -2 1 DEF . 2 5 XYZ

9/27/2019 10:37 Z029 -2 1 GHI 3 . 4 XYZ

DESIRED RESULT

Datum id DayFromTxBase CorrectionBas Var1 Var2 Var3 Var4 Var5

1/27/2016 22:06 Z029 -2 GHI 3 2 4 XYZ

I've inherited some code from a colleague that runs an import macro which (1) checks the data set to see if there are corrections, and (2) if so, then make an original and a corrections data set, and (3) update the original values with corrected values. Please see below.

%macro import2b(site=want);
proc import out=want
datafile="\\.."
dbms=xlsx replace ;
sheet= " ";
getnames=yes;
run;

%symdel correct;

/* checks the dataset if there are any corrections*/

proc sql;
create table want as
select max(CORRECTIONBAS) as max_c,CORRECTIONBAS,*
from want ;
QUIT;

data want ;
%let correct=0;
set want;
call symput('correct',max_c);
run;

/* if there are corrections then make a original dataset and a corrections dataset*/

%if &correct=1 %then
%do;

DATA ORIGINAL CORRECTION;
SET want;

if CORRECTIONBAS=. then
output ORIGINAL;

if CORRECTIONBAS=1 then
output CORRECTION;
RUN;

proc sort data=original;
by ID DayFromTxBase;

run;

proc sort data=correction;
by ID DayFromTxBase;
run;

/* Update the original values (in original data)with corrected values (in corrected dataset)*/

DATA want;
UPDATE ORIGINAL CORRECTION;
BY ID DayFromTxBase;

RUN;

proc datasets;
delete ORIGINAL CORRECTION;
run;

%end;

%mend;

import2b (site =want );

The long story short here is that this current program works for some patient ids and does not work (returns more than record) for other patient ids. Also, we've also advised to remove the DayFromTxBase column altogether as this data has been deemed to be unreliable, and thus we will likely devise a solution that sorts the tables by Datum. We've been wracking our brains on trying to figure this out, so hopefully there's someone out there who can help resolve this issue. Thank you.

mkeintz · Posted 10-01-2019 01:35 PM

For each ID, you apparent want the earliest DATE, TIME, but the latest non-missing values for all other variables.

This is relatively easy, using the UPDATE statement (which gets the latest non-missing values for all variable), and then a SET with POINT= option, as in:

data have;
  input DATE mmddyy10. time time5.0   id :$4.    
                DayFromTxBase
                CorrectionBas  Var1 :$3. Var2 Var3 Var4 Var5 :$3.;
  format date date9. time hhmm5.0;
datalines;
1/27/2016 22:06   Z029  -2                          .                     ABC  1      2      3       UVW
6/23/2017 15:36   Z029  -2                          1                     DEF   .      2      5       XYZ
9/27/2019 10:37   Z029  -2                          1                     GHI   3     .       4        XYZ 
run;

data want;
  update have (obs=0) have;
  by ID;
  retain ptr;
  if first.id then ptr=_n_;
  if last.id then set have (Keep=date time) point=ptr;
run;

This program assumes the data are sorted by ID/DATE/TIME.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 10-01-2019 01:35 PM

For each ID, you apparent want the earliest DATE, TIME, but the latest non-missing values for all other variables.

This is relatively easy, using the UPDATE statement (which gets the latest non-missing values for all variable), and then a SET with POINT= option, as in:

data have;
  input DATE mmddyy10. time time5.0   id :$4.    
                DayFromTxBase
                CorrectionBas  Var1 :$3. Var2 Var3 Var4 Var5 :$3.;
  format date date9. time hhmm5.0;
datalines;
1/27/2016 22:06   Z029  -2                          .                     ABC  1      2      3       UVW
6/23/2017 15:36   Z029  -2                          1                     DEF   .      2      5       XYZ
9/27/2019 10:37   Z029  -2                          1                     GHI   3     .       4        XYZ 
run;

data want;
  update have (obs=0) have;
  by ID;
  retain ptr;
  if first.id then ptr=_n_;
  if last.id then set have (Keep=date time) point=ptr;
run;

This program assumes the data are sorted by ID/DATE/TIME.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

websitelogins94 · Posted 10-01-2019 01:57 PM

We tested out this solution, and it works! Many thanks for the quick response!

ballardw · Posted 10-01-2019 01:42 PM

Without actual example data of records that create "duplicate" id output the most likely cause is the step that creates the ORIGINAL data set. If your source data file, the one imported apparently, has more than one record for any given Id that is not marked as correction then you will get one for each such record in the output.

Since these are "patient id" then likely you have case of lapses or starts of difference sequences of services.

Depending on what your actual goal need is you might try something like this at the top of that macro:

proc sql;
   create table work.temp as
   select max(CORRECTIONBAS) as max_c,CORRECTIONBAS,*
   from want 
   order by id,datum;
QUIT;

DATA ORIGINAL CORRECTION;
   SET work.temp (drop = dayfromtxbase);
   by id;
   if first.id  then output ORIGINAL;
   if CORRECTIONBAS=1 then
   output CORRECTION;
RUN;

This assumes that the actual desired single "original" value is 1) the earliest dated and 2) that one is itself not marked as correction.

Other wise you may need to provide a more extensive example of some records with duplicates, the actual desired output for those and any rules regarding how to pick the "correct" original.

You can drop the unwanted variable almost anywhere. I would likely do it in the data split to Original and Correction as above.

CAUTION: I am very leery of any step that starts with Proc Import, especially if the source is a spreadsheet, as things like variable types and lengths frequently change when reading different files.

websitelogins94 · Posted 10-01-2019 03:13 PM

You are correct in your overall observation, and your cautionary statement rings true given beginning a data step with a proc import has been particularly troublesome. I have not yet had the opportunity to test your code, however, given that this a large project that warrants further exploration into alternative solutions, I will do so and provide further feedback. Thanks.

SAS Data Management

Help with returning one row per id by collapsing corrected values from later rows into first row.

Re: Help with returning one row per id by collapsing corrected values from later rows into first row

Re: Help with returning one row per id by collapsing corrected values from later rows into first row

Re: Help with returning one row per id by collapsing corrected values from later rows into first row

Re: Help with returning one row per id by collapsing corrected values from later rows into first row

Re: Help with returning one row per id by collapsing corrected values from later rows into first row

Collapse occurrencies (indexes) present in multiple rows into one sing...

Count of specific value across row

Swap values row wise

Collapsing rows into one row and overwriting a column

Value from a precedent row

Follow Us

What is...

SAS Data Management

Special offer for SAS Communities members

Follow Us

What is...