Solved: Adding non-duplicate value of a row to a new column

GS2 · Posted 09-22-2020 04:01 PM

Using SAS 9.4

I am trying to get all codes unique by ID and Date into the same row. I have data below to show but basically an ID can have multiple ICD9, ICD10 and/or CPT codes per date. I would like an efficient way to transform the data from long to wide but I need it only to be wide where the ID and Date are the same. As I show below, if there are 2 rows for an ID with the same Date I would like to take the different ICD9/ICD10/CPT code from the row below and move it into a new column so that each unique ID and Date has its own row with all ICD9, ICD10 and CPT listed in wide format.

I have the following data:

data have;
input ID Date Chg_Pri_Diagnosis_Code Chg_Pri_ICD10_Diagnosis_Code Chg_Procedure_Code;
datalines;
1 09222020 1111 2222 3333
1 09222020 1111 2222 4444

2 09222020 1111 2222 3333

2 08222020 1111 2222 4444
;
run;

I want my data to look like this (As you can see if the ID is the same but the Date is different I do not want the data to transform to wide):

data want;
input ID Date Chg_Pri_Diagnosis_Code Chg_Pri_ICD10_Diagnosis_Code Chg_Procedure_Code Chg_Procedure_Code2;
datalines;
1 09222020 1111 2222 3333 4444

2 09222020 1111 2222 3333 .
2 08012000 1111 2222 4444 .
;
run;

Any thoughts of place to start with this would be helpful. Thank you

SK_11 · Posted 09-23-2020 06:27 PM

If it helps

data have;
input ID Date Chg_Pri_Diagnosis_Code Chg_Pri_ICD10_Diagnosis_Code Chg_Procedure_Code;
datalines;
1 09222020 1111 2222 3333
1 09222020 1111 2222 4444
2 09222020 1111 2222 3333
2 08222020 1111 2222 4444
;
run;
Proc sort data=have out=have;
By ID Date Chg_Pri_Diagnosis_Code Chg_Pri_ICD10_Diagnosis_Code;
run;

proc transpose data=have out=want (drop=_NAME_) prefix=Chg_Procedure_Code;
By ID Date Chg_Pri_Diagnosis_Code Chg_Pri_ICD10_Diagnosis_Code;
var Chg_Procedure_Code;
run;

The output

ID Date Chg_Pri_Diagnosis_Code Chg_Pri_ICD10_Diagnosis_Code Chg_Procedure_Code1 Chg_Procedure_Code2
1 9222020 1111 2222 3333 4444
2 8222020 1111 2222 4444 .
2 9222020 1111 2222 3333 .

View solution in original post

GS2 · Posted 09-22-2020 04:48 PM

My thoughts have been to use the following code but I am not capturing what I would like but hopefully the code will give an idea of what I am thinking will work. Thank want

data have;
set all;
array nine icd9_1-icd9_3;
array ten icd10_1-icd10_4;
array cpt cpt1-cpt7;
do i = 1 to 3;*3 is the max number of different codes;
nine{i} = lag(Chg_Pri_Diagnosis_Code);
end;

do i = 1 to 4;*4 is the max number of different codes;
ten{i} = lag(Chg_Pri_ICD10_Diagnosis_Code);
end;

do i = 1 to 7;*7 is the max number of different codes;
cpt{i} = lag(Chg_Procedure_Code);
end;
b

ballardw · Posted 09-22-2020 05:00 PM

Are there any other variables in your have data set other than the ones shown? If so, do they appear in the result?

jimbarbour · Posted 09-23-2020 12:54 AM

@GS2,

Here's one approach; see code below, followed by the results.

This code allows one to have any number of Chg_Procedure_Code columns in the wide version of the dataset. One simply sets the Nbr_of_Codes macro variable to whatever the maximum number of Chg_Procedure_Code columns you desire. In this example, I've modified the data to fit three Chg_Procedure_Code columns.

Notice also that I've introduced a duplicate for ID 2 on Sept. 22, 6666. In our results we can see that there is no duplicate 6666. The Multidata:'N' in conjunction with how the keys are defined removes duplicates. One could also remove the same duplicates in the Sort step.

This wasn't a specified requirement, but the Chg_Procedure_Code columns will have the values in sort order. I.e. for values 1111, 2222, and 3333, the first Chg_Procedure_Code column will have 1111, the second 2222, and the third 3333. This is accomplished by using Ordered:'A' (ascending). If you don't want this behavior, you can just remove this parameter.

%LET	Nbr_of_Codes	3;

DATA	Have;
	FORMAT	ID;
	FORMAT	Date	YYMMDDD10.;
	INFILE	Datalines;
	input 
		ID								$
		Date							:	ANYDTDTE8.
		Chg_Pri_Diagnosis_Code			$ 
		Chg_Pri_ICD10_Diagnosis_Code 	$
		Chg_Procedure_Code				$
		;
Datalines;
1 09222020 1111 2222 3333
1 09222020 1111 2222 4444
2 09222020 1111 2222 1111
2 09222020 1111 2222 6666
2 09222020 1111 2222 6666
2 08222020 1111 2222 7777
7 07222020 1111 2222 6666
7 07222020 1111 2222 7777
7 07222020 1111 2222 8888
;
RUN;

PROC	SORT	DATA=Have;
	BY	ID Date;
RUN;

DATA	Want;
	DROP	_:;
	DROP	Chg_Procedure_Code;

	SET	Have;
		BY	ID	Date;

	ARRAY	Chg_Procedure_Codes	[*]	$8	Chg_Procedure_Code1 - Chg_Procedure_Code&Nbr_of_Codes;

	IF	_N_					=	1	THEN
		DO;
			DECLARE	HASH	H_Pat_Data(ORDERED: 'A', MULTIDATA:'N');
							H_Pat_Data.DEFINEKEY ('ID', 'Date', 'Chg_Procedure_Code');
							H_Pat_Data.DEFINEDONE();
			DECLARE	HITER	HI_Pat_Data('H_Pat_Data');
		END;

	_RC						=	H_Pat_Data.ADD();

	IF	LAST.Date	THEN
		DO;
			_RC				=	HI_Pat_Data.FIRST();
			DO	_i			=	1	TO	&Nbr_of_Codes;
				Chg_Procedure_Codes[_i]	=	Chg_Procedure_Code;
				_RC			=	HI_Pat_Data.NEXT();
				IF	_RC		>	0	THEN
					DO;
						_i	=	&Nbr_of_Codes;
						_RC	=	H_Pat_Data.CLEAR();
					END;
			END;
		END;
	ELSE
		DO;
			DELETE;
		END;
RUN;

Results:

Jim

GS2 · Posted 09-23-2020 08:28 AM

There are other variables but this is just building a data set for review so they are brought over but not part of what is wanted. Thank you

andreas_lds · Posted 09-23-2020 01:15 AM

Having data in wide format can cause difficulties when processing the data. The first difficulty is already shown in the solution provided by @jimbarbour: you need to know how many variables for the Chg_Procedure_Codes are required. Later on you will have to write checks talking into account that some of the Chg_Procedure_Code-variable are missing. So: are you sure, that the wide-format serves the whatever you are doing with the data afterwards better than the long-format?

Something else: What do you expect from the following data (don't know if the combination of values is possible, at all):

ID Date Chg_Pri_Diagnosis_Code Chg_Pri_ICD10_Diagnosis_Code Chg_Procedure_Code
1 09222020 1111 2222 3333
1 09222020 4242 2222 4444

From your description those obs should be combined, forming something like???

GS2 · Posted 09-23-2020 10:00 AM

Yes, wide is how the data needs to be structured.

There may be combinations but this is more gathering how many potential codes are available.

s_lassen · Posted 09-23-2020 04:57 AM

If I understand you correctly, you want to do this with all the 3 variables.

The first requirement is to get the number of elements for each variable:

proc sql noprint;
  select 
    max(n1),
    max(n2),
    max(n3)
  into :n1 trimmed,:n2 trimmed,:n3 trimmed
  from(
    select 
      count(distinct Chg_Pri_Diagnosis_Code) as n1,
      count(distinct Chg_Pri_ICD10_Diagnosis_Code) as n2,
      count(distinct Chg_Procedure_Code) as n3
    from have
    group by ID,Date
    )
  ;
quit;

Then, define a macro to get the data into an array, and use that in a DoW loop:

%macro make_array(var,n);
  %if &n>1 %then %do;
    array _&var(*) 8 &var.1-&var.&n;
    if not whichn(&var,of _&var(*)) then
      _&var(n(of _&var(*))+1)=&var;
    drop &var;
    %end;
%mend;
  
options mprint;
data want;
  do until(last.Date);
    set have;  
    by ID Date notsorted;
    %make_array(Chg_Pri_Diagnosis_Code,&n1);
    %make_array(Chg_Pri_ICD10_Diagnosis_Code,&n2);
    %make_array(Chg_Procedure_Code,&n3);
    end;
run;

You data was not sorted correctly by DATE, so I used the NOTSORTED option. You may want to replace that with a PROC SORT instead.

GS2 · Posted 09-23-2020 08:52 AM

So I believe this method will work based on what I am seeing; However, I did make a mistake with the data. Chg_Pri_ICD10_Diagnosis_Code is actually a character variable, there is a letter in a couple observations. Sorry about that. Is it possible to modify the code to incorporate if 1 variable is a character? Thank you

jimbarbour · Posted 09-23-2020 10:36 AM

@GS2,

The code I have written will accommodate character. I usually just code everything character unless I intend to do calculations or it's a date.

@andreas_lds made a good point that it might be a bit of a pain to have to manually count the number of unique procedure codes. That is easily solved by adding a NODUPKEY parameter to the sort and then inserting a little Data step like the below. The results are I believe what you want and there's no manual step of counting the number of unique procedure codes.

DATA	Have;
	FORMAT	ID;
	FORMAT	Date	YYMMDDD10.;
	INFILE	Datalines;
	input 
		ID								$
		Date							:	ANYDTDTE8.
		Chg_Pri_Diagnosis_Code			$ 
		Chg_Pri_ICD10_Diagnosis_Code 	$
		Chg_Procedure_Code				$
		;
Datalines;
1 09222020 1111 2222 3333
1 09222020 1111 2222 4444
2 09222020 1111 2222 1111
2 09222020 1111 2222 6666
2 09222020 1111 2222 6666
2 08222020 1111 2222 7777
7 07222020 1111 2222 6666
7 07222020 1111 2222 7777
7 07222020 1111 2222 8888
;
RUN;

PROC	SORT	DATA=Have	NODUPKEY;
	BY	ID Date	Chg_Procedure_Code;
RUN;

DATA	_NULL_;
	RETAIN	Max_Procedure_Code_Cnt		0;

	IF	_End_of_Data												THEN
		CALL	SYMPUTX('Nbr_of_Codes', Max_Procedure_Code_Cnt, 'G');

	SET	Have	END						=	_End_of_Data;
		BY	ID Date	Chg_Procedure_Code;

	Procedure_Code_Cnt					+	1;

	IF	LAST.Chg_Procedure_Code										THEN
		IF	Procedure_Code_Cnt			>	Max_Procedure_Code_Cnt	THEN
			DO;
				Max_Procedure_Code_Cnt	=	Procedure_Code_Cnt;
				Procedure_Code_Cnt		=	0;
			END;
RUN;

%PUT	&Nte1  &=Nbr_of_Codes;

DATA	Want;
	DROP	_:;
	DROP	Chg_Procedure_Code;

	SET	Have;
		BY	ID	Date;

	ARRAY	Chg_Procedure_Codes	[*]	$8	Chg_Procedure_Code1 - Chg_Procedure_Code&Nbr_of_Codes;

	IF	_N_					=	1	THEN
		DO;
			DECLARE	HASH	H_Pat_Data(ORDERED: 'A', MULTIDATA:'N');
							H_Pat_Data.DEFINEKEY ('ID', 'Date', 'Chg_Procedure_Code');
							H_Pat_Data.DEFINEDONE();
			DECLARE	HITER	HI_Pat_Data('H_Pat_Data');
		END;

	_RC						=	H_Pat_Data.ADD();

	IF	LAST.Date	THEN
		DO;
			_RC				=	HI_Pat_Data.FIRST();
			DO	_i			=	1	TO	&Nbr_of_Codes;
				Chg_Procedure_Codes[_i]	=	Chg_Procedure_Code;
				_RC			=	HI_Pat_Data.NEXT();
				IF	_RC		>	0	THEN
					DO;
						_i	=	&Nbr_of_Codes;
						_RC	=	H_Pat_Data.CLEAR();
					END;
			END;
		END;
	ELSE
		DO;
			DELETE;
		END;
RUN;

Jim

s_lassen · Posted 09-24-2020 06:11 AM

@GS2 wrote:

So I believe this method will work based on what I am seeing; However, I did make a mistake with the data. Chg_Pri_ICD10_Diagnosis_Code is actually a character variable, there is a letter in a couple observations. Sorry about that. Is it possible to modify the code to incorporate if 1 variable is a character? Thank you

That makes things a bit more complicated. I changed the macro to find the variable type and length of the variable, and define the array (and the WHICH function to call) using that:

%macro make_array(indata,var,n,onerror=abort cancel);
  %if &n>1 %then %do;
    %local dsid varnum vartype vardef whichfunc;
    %let dsid=%sysfunc(open(&indata));
    %if &dsid=0 %then %do;
      %put %qsysfunc(sysmsg());
      &onerror;
      %return;
      %end;
    %let varnum=%sysfunc(varnum(&dsid,&var));
    %if &varnum=0 %then %do;
      %put Variable &var not found on &indata;
      &onerror;
      %return;
      %end;
    %let vartype=%sysfunc(vartype(&dsid,&varnum));
    %if &vartype=N %then %do;
      %let vardef=8;
      %let whichfunc=whichn;
      %end;
    %else %do;
      %let vardef=$%sysfunc(varlen(&dsid,&varnum));
      %let whichfunc=whichc;
      %end;
    %let dsid=%sysfunc(close(&dsid));
    array _&var(*) &vardef &var.1-&var.&n;
    if not &whichfunc(&var,of _&var(*)) then
      _&var(&n-cmiss(of _&var(*))+1)=&var;
    drop &var;
    %end;
%mend;

The macro now has two more parameters: First parameter is the input data set, the last parameter is what to do if there is an error (e.g. the input data does not exist, or the variable is not on the input data set). I set the default to ABORT CANCEL, which will exit the submit block, but not close SAS.

You will now have to call the macro with one more parameter, namely the name of the input:

options mprint;
data want;
  do until(last.Date);
    set have;  
    by ID Date notsorted;
    %make_array(have,Chg_Pri_Diagnosis_Code,&n1);
    %make_array(have,Chg_Pri_ICD10_Diagnosis_Code,&n2);
    %make_array(have,Chg_Procedure_Code,&n3);
    end;
run;

The SQL part is the same as my first answer.

SK_11 · Posted 09-23-2020 06:27 PM

If it helps

data have;
input ID Date Chg_Pri_Diagnosis_Code Chg_Pri_ICD10_Diagnosis_Code Chg_Procedure_Code;
datalines;
1 09222020 1111 2222 3333
1 09222020 1111 2222 4444
2 09222020 1111 2222 3333
2 08222020 1111 2222 4444
;
run;
Proc sort data=have out=have;
By ID Date Chg_Pri_Diagnosis_Code Chg_Pri_ICD10_Diagnosis_Code;
run;

proc transpose data=have out=want (drop=_NAME_) prefix=Chg_Procedure_Code;
By ID Date Chg_Pri_Diagnosis_Code Chg_Pri_ICD10_Diagnosis_Code;
var Chg_Procedure_Code;
run;

The output

ID Date Chg_Pri_Diagnosis_Code Chg_Pri_ICD10_Diagnosis_Code Chg_Procedure_Code1 Chg_Procedure_Code2
1 9222020 1111 2222 3333 4444
2 8222020 1111 2222 4444 .
2 9222020 1111 2222 3333 .

Adding non-duplicate value of a row to a new column

Re: Adding non-duplicate value of a row to a new column

Re: Adding non-duplicate value of a row to a new column

Re: Adding non-duplicate value of a row to a new column

Re: Adding non-duplicate value of a row to a new column

Re: Adding non-duplicate value of a row to a new column

Re: Adding non-duplicate value of a row to a new column

Re: Adding non-duplicate value of a row to a new column

Re: Adding non-duplicate value of a row to a new column

Re: Adding non-duplicate value of a row to a new column

Re: Adding non-duplicate value of a row to a new column

Re: Adding non-duplicate value of a row to a new column

Re: Adding non-duplicate value of a row to a new column

Registration is open

SAS Training: Just a Click Away