BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
chuakp
Obsidian | Level 7

I have a claims database with enrolle ID, unique claim ID, date, and a field called "Dx_string", which I created by concatatening all of the five-digit ICD-9 diagnosis codes on the claim.  It looks like this:

 

ID             CLAIM_ID        DATE           DX_STRING

1               100                 1/1/2015       7804         

1               101                 1/1/2015       30921 39021

1               102                 2/1/2015       30943 902 01920

1               103                 3/1/2015       011

2               104                 4/1/2015       4530

2               105                 5/1/2015       V9090

3               106                 6/1/2015       E884 3092

3               107                 6/1/2015       7039

3               108                 6/1/2015       800 0930 1092  

3               109                 7/1/2015       3094 

 



data have;
input ID CLAIM_ID DATE $ DX_STRING $30.;
datalines;
1 100 1/1/2015 7804
1 101 1/1/2015 30921 39021
1 102 2/1/2015 30943 902 01920
1 103 3/1/2015 011
2 104 4/1/2015 4530
2 105 5/1/2015 V9090
3 106 6/1/2015 E884 3092
3 107 6/1/2015 7039
3 108 6/1/2015 800 0930 1092
3 109 7/1/2015 3094
;

run;

 

My goal is to concatenate all DX_STRING values that occur on the same date into a larger string called DX_STRING_DAY:

 

ID             CLAIM_ID        DATE           DX_STRING            DX_STRING_DAY

1               100                 1/1/2015       7804                        7804 30921 39021

1               101                 1/1/2015       30921 39021           7804 30921 39021

1               102                 2/1/2015       30943 902 01920    30943 902 01920 

1               103                 3/1/2015       011                          011

2               104                 4/1/2015       4530                        4530

2               105                 5/1/2015       V9090                      V9090

3               106                 6/1/2015       E884 3092               E844 3092 7039 800 0930 1092

3               107                 6/1/2015       7039                        E844 3092 7039 800 0930 1092 

3               108                 6/1/2015       800 0930 1092        E844 3092 7039 800 0930 1092 

3               109                 7/1/2015       3094                        3094

 

I can make the variable called DX_STRING_DAY using this code, but then I can't figure out how to make all values of DX_STRING_DAY the same within an individual on a given day.


data want; set have;
by id date;
retain dx_string_day;
if first.date then dx_string_day = dx_string;
if not(first.date) then dx_string_day = catx(" ", dx_string_day, dx_string);
run;

 

Any suggestions?  Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
twildone
Pyrite | Level 9

Hi..not too sure if this is what you want:

 

data want; set have;

by id date;

retain dx_string_day;

if first.date then dx_string_day = dx_string;

if not(first.date) then dx_string_day = catx(" ", dx_string_day, dx_string);

run;

PROC SORT DATA=WANT;

BY ID DATE;

RUN;

DATA WANT1;

SET WANT;

BY id date;

IF LAST.DATE;

RUN;

PROC SQL;

CREATE TABLE WANT2 AS

SELECT DISTINCT

WANT.ID,

WANT.CLAIM_ID,

WANT.DATE,

WANT.DX_STRING,

WANT1.dx_string_day

FROM WORK.WANT LEFT JOIN WORK.WANT1 ON (WANT.ID = WANT1.ID) AND (WANT.DATE = WANT1.DATE);

QUIT;

View solution in original post

12 REPLIES 12
Reeza
Super User
You need to add a length to the dx_string_day variable so it's big enough to hold all the values.

length dx_string_day $100.;

You could also use proc transpose, though it would put each diagnosis into a different variable.

In general, when working with dx codes, I've found the long format more useful.
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Totally agree with Reeza, long form is easier to program with.  Take a simple example, you want to get counts of code by subject, long form you could simply use a proc freq, by id.  With the concatenated verision you would have to do looping over to get counts and then sum those.

chuakp
Obsidian | Level 7

I completely agree that long form is easier for programming, but in this case I actually need to keep the claims in wide form to mesh with the rest of the program.  Do you have any suggestions?

twildone
Pyrite | Level 9

Hi..not too sure if this is what you want:

 

data want; set have;

by id date;

retain dx_string_day;

if first.date then dx_string_day = dx_string;

if not(first.date) then dx_string_day = catx(" ", dx_string_day, dx_string);

run;

PROC SORT DATA=WANT;

BY ID DATE;

RUN;

DATA WANT1;

SET WANT;

BY id date;

IF LAST.DATE;

RUN;

PROC SQL;

CREATE TABLE WANT2 AS

SELECT DISTINCT

WANT.ID,

WANT.CLAIM_ID,

WANT.DATE,

WANT.DX_STRING,

WANT1.dx_string_day

FROM WORK.WANT LEFT JOIN WORK.WANT1 ON (WANT.ID = WANT1.ID) AND (WANT.DATE = WANT1.DATE);

QUIT;

chuakp
Obsidian | Level 7

Yes, thank you very much.  I was doing something similar except without PROC SQL but I see how this is much more efficient.

Ksharp
Super User
data have;
input ID CLAIM_ID DATE $ DX_STRING $30.;
datalines;
1 100 1/1/2015 7804
1 101 1/1/2015 30921 39021
1 102 2/1/2015 30943 902 01920
1 103 3/1/2015 011
2 104 4/1/2015 4530
2 105 5/1/2015 V9090
3 106 6/1/2015 E884 3092
3 107 6/1/2015 7039
3 108 6/1/2015 800 0930 1092
3 109 7/1/2015 3094
;
run;
data want;
 length DX_STRING_DAY $ 200;
 do until(last.date);
  set have;
  by ID DATE ;
  DX_STRING_DAY=catx(' ',DX_STRING_DAY ,DX_STRING );
 end;
  do until(last.date);
  set have;
  by ID DATE ;
  output;
 end;
run;
chuakp
Obsidian | Level 7

Thanks for the solution.

 

My claims dataset has over 100 million observations and it's not clear to me how to choose a length for the DX_STRING_DAY variable. It's quite possible that this string could get really long if someone had something like 40 claims in a day.  On the other hand, I don't want to waste valuable hard drive space.  Any strategies on how to manage this situation?  I guess you could use something like a length of $1000 to be overly generous, then trim it later.

 

 

Thanks.

Reeza
Super User
One reason why the data isn't stored in this format...Is there a reason you want it in one variable? A proc transpose would create a variable for each of the diagnosis automatically.
chuakp
Obsidian | Level 7

Thanks.  I do understand your point.  I had tried it your way with proc transpose but was running into prolems.

 

Going back to the original structure of the claims (which has four diagnosis codes per claim), I have data like this:

 

data have;
input ID CLAIM_ID DATE $ DX1 $ DX2 $ DX3 $ DX4;
datalines;
1 100 1/1/2015 7804   
1 101 1/1/2015 30921 39021
1 102 2/1/2015 30943 902 01920
1 103 3/1/2015 011
2 104 4/1/2015 4530
2 105 5/1/2015 V9090
3 106 6/1/2015 7039
3 107 6/1/2015 7039
3 108 6/1/2015 E884 0930 1092 0930
3 109 7/1/2015 3094
;

I want to create a dataset that has all of the *unique* diagnosis codes that occurred on claims during the same day for a given individual, as below (I don't care about whether it was DX1, DX2, DX3, or DX4).  This would involve using proc transpose somehow to create a series of variables that I might call "DXSAMEDAY1-DXSAMEDAY5" (in this fake example, five variables would be created but it would in reality be way more).  I've been playing with the syntax of proc transpose and can't get this work, though.  

 

 

I'd appreciate suggestions on how to proceed.  Thanks.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Your problms, and future ones, all lie in the method of storing your data.  If you normalise your data, and work with the normalised data, you will find your code is simpler to work with.  This doesn't prevent you, if you want a proc report with transposed data, transposing it at that point.

data have;
  infile datalines dlm="," missover;
  input id claim_id date $ dx1 $ dx2 $ dx3 $ dx4 $;
datalines;
1,100,1/1/2015,7804,,   
1,101,1/1/2015,30921,39021,,
1,102,2/1/2015,30943,902,01920,
1,103,3/1/2015,011,,,
2,104,4/1/2015,4530,,,
2,105,5/1/2015,V9090,,,
3,106,6/1/2015,7039,,,
3,107,6/1/2015,7039,,,
3,108,6/1/2015,E884,0930,1092,0930
3,109,7/1/2015,3094,,,
;
run;

/* Normalise the data to have one DX per row */
proc transpose data=have out=want;
  by id claim_id date;
  var dx:;         /* Note the use of dx: which means all with the prefix dx */
run;

/* Sort duplicates and missings out */
proc sort data=want (drop=_name_ rename=(col1=dx)) nodupkey;
  by id claim_id date dx;
  where dx ne "";
run;

You see from the above that finding duplicates, sorting etc. becomes so much easier with the different strcuture.

Reeza
Super User
This is a different question than your original question, please ask new questions in a new thread.
Ksharp
Super User

In SAS ,each variable has the same length unlike other Data Base like Oracle have VARCHAR() format.

If you don't want this kind of code, you can create many variables like DX_STRING1 ,DX_STRING2,DX_STRING3, DX_STRING4  ....... to hold these values.

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 3131 views
  • 0 likes
  • 5 in conversation