DATA Step, Macro, Functions and more

Question regarding "By" processing in data step

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 82
Accepted Solution

Question regarding "By" processing in data step

I have a claims database with enrolle ID, unique claim ID, date, and a field called "Dx_string", which I created by concatatening all of the five-digit ICD-9 diagnosis codes on the claim.  It looks like this:

 

ID             CLAIM_ID        DATE           DX_STRING

1               100                 1/1/2015       7804         

1               101                 1/1/2015       30921 39021

1               102                 2/1/2015       30943 902 01920

1               103                 3/1/2015       011

2               104                 4/1/2015       4530

2               105                 5/1/2015       V9090

3               106                 6/1/2015       E884 3092

3               107                 6/1/2015       7039

3               108                 6/1/2015       800 0930 1092  

3               109                 7/1/2015       3094 

 



data have;
input ID CLAIM_ID DATE $ DX_STRING $30.;
datalines;
1 100 1/1/2015 7804
1 101 1/1/2015 30921 39021
1 102 2/1/2015 30943 902 01920
1 103 3/1/2015 011
2 104 4/1/2015 4530
2 105 5/1/2015 V9090
3 106 6/1/2015 E884 3092
3 107 6/1/2015 7039
3 108 6/1/2015 800 0930 1092
3 109 7/1/2015 3094
;

run;

 

My goal is to concatenate all DX_STRING values that occur on the same date into a larger string called DX_STRING_DAY:

 

ID             CLAIM_ID        DATE           DX_STRING            DX_STRING_DAY

1               100                 1/1/2015       7804                        7804 30921 39021

1               101                 1/1/2015       30921 39021           7804 30921 39021

1               102                 2/1/2015       30943 902 01920    30943 902 01920 

1               103                 3/1/2015       011                          011

2               104                 4/1/2015       4530                        4530

2               105                 5/1/2015       V9090                      V9090

3               106                 6/1/2015       E884 3092               E844 3092 7039 800 0930 1092

3               107                 6/1/2015       7039                        E844 3092 7039 800 0930 1092 

3               108                 6/1/2015       800 0930 1092        E844 3092 7039 800 0930 1092 

3               109                 7/1/2015       3094                        3094

 

I can make the variable called DX_STRING_DAY using this code, but then I can't figure out how to make all values of DX_STRING_DAY the same within an individual on a given day.


data want; set have;
by id date;
retain dx_string_day;
if first.date then dx_string_day = dx_string;
if not(first.date) then dx_string_day = catx(" ", dx_string_day, dx_string);
run;

 

Any suggestions?  Thanks.


Accepted Solutions
Solution
‎12-18-2015 02:20 PM
Regular Contributor
Posts: 222

Re: Question regarding "By" processing in data step

Hi..not too sure if this is what you want:

 

data want; set have;

by id date;

retain dx_string_day;

if first.date then dx_string_day = dx_string;

if not(first.date) then dx_string_day = catx(" ", dx_string_day, dx_string);

run;

PROC SORT DATA=WANT;

BY ID DATE;

RUN;

DATA WANT1;

SET WANT;

BY id date;

IF LAST.DATE;

RUN;

PROC SQL;

CREATE TABLE WANT2 AS

SELECT DISTINCT

WANT.ID,

WANT.CLAIM_ID,

WANT.DATE,

WANT.DX_STRING,

WANT1.dx_string_day

FROM WORK.WANT LEFT JOIN WORK.WANT1 ON (WANT.ID = WANT1.ID) AND (WANT.DATE = WANT1.DATE);

QUIT;

View solution in original post


All Replies
Super User
Posts: 17,942

Re: Question regarding "By" processing in data step

You need to add a length to the dx_string_day variable so it's big enough to hold all the values.

length dx_string_day $100.;

You could also use proc transpose, though it would put each diagnosis into a different variable.

In general, when working with dx codes, I've found the long format more useful.
Super User
Super User
Posts: 7,430

Re: Question regarding "By" processing in data step

Totally agree with Reeza, long form is easier to program with.  Take a simple example, you want to get counts of code by subject, long form you could simply use a proc freq, by id.  With the concatenated verision you would have to do looping over to get counts and then sum those.

Frequent Contributor
Posts: 82

Re: Question regarding "By" processing in data step

I completely agree that long form is easier for programming, but in this case I actually need to keep the claims in wide form to mesh with the rest of the program.  Do you have any suggestions?

Solution
‎12-18-2015 02:20 PM
Regular Contributor
Posts: 222

Re: Question regarding "By" processing in data step

Hi..not too sure if this is what you want:

 

data want; set have;

by id date;

retain dx_string_day;

if first.date then dx_string_day = dx_string;

if not(first.date) then dx_string_day = catx(" ", dx_string_day, dx_string);

run;

PROC SORT DATA=WANT;

BY ID DATE;

RUN;

DATA WANT1;

SET WANT;

BY id date;

IF LAST.DATE;

RUN;

PROC SQL;

CREATE TABLE WANT2 AS

SELECT DISTINCT

WANT.ID,

WANT.CLAIM_ID,

WANT.DATE,

WANT.DX_STRING,

WANT1.dx_string_day

FROM WORK.WANT LEFT JOIN WORK.WANT1 ON (WANT.ID = WANT1.ID) AND (WANT.DATE = WANT1.DATE);

QUIT;

Frequent Contributor
Posts: 82

Re: Question regarding "By" processing in data step

Yes, thank you very much.  I was doing something similar except without PROC SQL but I see how this is much more efficient.

Super User
Posts: 9,691

Re: Question regarding "By" processing in data step

data have;
input ID CLAIM_ID DATE $ DX_STRING $30.;
datalines;
1 100 1/1/2015 7804
1 101 1/1/2015 30921 39021
1 102 2/1/2015 30943 902 01920
1 103 3/1/2015 011
2 104 4/1/2015 4530
2 105 5/1/2015 V9090
3 106 6/1/2015 E884 3092
3 107 6/1/2015 7039
3 108 6/1/2015 800 0930 1092
3 109 7/1/2015 3094
;
run;
data want;
 length DX_STRING_DAY $ 200;
 do until(last.date);
  set have;
  by ID DATE ;
  DX_STRING_DAY=catx(' ',DX_STRING_DAY ,DX_STRING );
 end;
  do until(last.date);
  set have;
  by ID DATE ;
  output;
 end;
run;
Frequent Contributor
Posts: 82

Re: Question regarding "By" processing in data step

Thanks for the solution.

 

My claims dataset has over 100 million observations and it's not clear to me how to choose a length for the DX_STRING_DAY variable. It's quite possible that this string could get really long if someone had something like 40 claims in a day.  On the other hand, I don't want to waste valuable hard drive space.  Any strategies on how to manage this situation?  I guess you could use something like a length of $1000 to be overly generous, then trim it later.

 

 

Thanks.

Super User
Posts: 17,942

Re: Question regarding "By" processing in data step

One reason why the data isn't stored in this format...Is there a reason you want it in one variable? A proc transpose would create a variable for each of the diagnosis automatically.
Frequent Contributor
Posts: 82

Re: Question regarding "By" processing in data step

Thanks.  I do understand your point.  I had tried it your way with proc transpose but was running into prolems.

 

Going back to the original structure of the claims (which has four diagnosis codes per claim), I have data like this:

 

data have;
input ID CLAIM_ID DATE $ DX1 $ DX2 $ DX3 $ DX4;
datalines;
1 100 1/1/2015 7804   
1 101 1/1/2015 30921 39021
1 102 2/1/2015 30943 902 01920
1 103 3/1/2015 011
2 104 4/1/2015 4530
2 105 5/1/2015 V9090
3 106 6/1/2015 7039
3 107 6/1/2015 7039
3 108 6/1/2015 E884 0930 1092 0930
3 109 7/1/2015 3094
;

I want to create a dataset that has all of the *unique* diagnosis codes that occurred on claims during the same day for a given individual, as below (I don't care about whether it was DX1, DX2, DX3, or DX4).  This would involve using proc transpose somehow to create a series of variables that I might call "DXSAMEDAY1-DXSAMEDAY5" (in this fake example, five variables would be created but it would in reality be way more).  I've been playing with the syntax of proc transpose and can't get this work, though.  

 

 

I'd appreciate suggestions on how to proceed.  Thanks.

Super User
Super User
Posts: 7,430

Re: Question regarding "By" processing in data step

Your problms, and future ones, all lie in the method of storing your data.  If you normalise your data, and work with the normalised data, you will find your code is simpler to work with.  This doesn't prevent you, if you want a proc report with transposed data, transposing it at that point.

data have;
  infile datalines dlm="," missover;
  input id claim_id date $ dx1 $ dx2 $ dx3 $ dx4 $;
datalines;
1,100,1/1/2015,7804,,   
1,101,1/1/2015,30921,39021,,
1,102,2/1/2015,30943,902,01920,
1,103,3/1/2015,011,,,
2,104,4/1/2015,4530,,,
2,105,5/1/2015,V9090,,,
3,106,6/1/2015,7039,,,
3,107,6/1/2015,7039,,,
3,108,6/1/2015,E884,0930,1092,0930
3,109,7/1/2015,3094,,,
;
run;

/* Normalise the data to have one DX per row */
proc transpose data=have out=want;
  by id claim_id date;
  var dx:;         /* Note the use of dx: which means all with the prefix dx */
run;

/* Sort duplicates and missings out */
proc sort data=want (drop=_name_ rename=(col1=dx)) nodupkey;
  by id claim_id date dx;
  where dx ne "";
run;

You see from the above that finding duplicates, sorting etc. becomes so much easier with the different strcuture.

Super User
Posts: 17,942

Re: Question regarding "By" processing in data step

This is a different question than your original question, please ask new questions in a new thread.
Super User
Posts: 9,691

Re: Question regarding "By" processing in data step

In SAS ,each variable has the same length unlike other Data Base like Oracle have VARCHAR() format.

If you don't want this kind of code, you can create many variables like DX_STRING1 ,DX_STRING2,DX_STRING3, DX_STRING4  ....... to hold these values.

 

 

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 392 views
  • 0 likes
  • 5 in conversation