Solved: Using Proc SQL to match participants that have specific ICD-10 codes

PunkinSAS08 · Posted 06-08-2023 10:53 PM

Hi! I'm new to SAS and cannot figure out how to create a cohort of participants that have the ICD-10 codes I'm looking for. Here is my code:

proc import datafile='/home/u62039685/DatasetEE2016/PUF_ICD10_DCODE2016.csv'
out=cohort2016
dbms=csv replace;
getnames=yes;
run;

proc import datafile='/home/u62039685/DatasetEE2016/TBI ICD-10.xlsx'
out= icd10codes
dbms= XlSX replace;
Sheet= 'Sheet1';
run;

proc sql;
create table cohort as
select a.*, b.*
from cohort2016 a
inner join icd10codes b
on put (a.inc_key, best10.) = b.diag_code;
select * from cohort;
quit;

Kurt_Bremser · Posted 06-09-2023 12:21 PM

You should start with first getting your data import process right. This means that you do NOT use PROC IMPORT, and do NOT use Excel files.

Save the Excel spreadsheet as a csv file, and then use DATA steps to read both csv files, in which you take full control over variable types and attributes, and how data is read.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Patrick · Posted 06-09-2023 12:28 AM

Without representative sample data nor SAS log nor you actually telling us if you get an error or just not the expected result, it's a bit hard to advise.

One thing I can see:

put (a.inc_key, best10.)

Will create a string of length 10 that is right aligned meaning you will have leading blanks if the number is less than 10 digits. If the string in variable diag_code is left aligned (most likely the case) then the keys wouldn't match.

To get around this use either of below two coding options:

put(b.inc_key,best10. -l)
 
or:

strip(put(b.inc_key,best10.))

PunkinSAS08 · Posted 06-09-2023 11:34 AM

Ah, sorry. Here is the log:

1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

68

69 proc sql;

70 create table cohort as

71 select a.*, b.*

72 from cohort2016 a

73 inner join icd10codes b

74 on put (a.inc_key, best10.-L) = b.diag_code;

NOTE: Table WORK.COHORT created, with 0 rows and 4 columns.

75 select * from cohort;

NOTE: No rows were selected.

76 quit;

NOTE: PROCEDURE SQL used (Total process time):

real time 0.93 seconds

user cpu time 0.91 seconds

system cpu time 0.01 seconds

memory 7241.40k

OS Memory 29872.00k

Timestamp 06/09/2023 03:30:44 PM

Step Count 26 Switch Count 5

Page Faults 0

Page Reclaims 550

Page Swaps 0

Voluntary Context Switches 17

Involuntary Context Switches 1

Block Input Operations 0

Block Output Operations 264

77

78 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

88

**this produces a blank table in the output tab with all columns from both imported tables instead of just the inc_key (participants) with the intended diag_code (ICD-10 codes).

I hope this helps! I added the -L function and there is no difference in output.

ballardw · Posted 06-09-2023 11:45 AM

Time to extract some values from BOTH data sets that you think should match.

Depending on the source of the ICD codes you may have different implementations of some of the conventions. I have seen some users that have replaced decimal points in codes with other characters or removed them entirely, as mentioned LETTERS are standard in many ICD codes and your code shows that you are using a numeric value. So if all of the B data set Diag_code values have letters then none of the A.inc_key will ever match because those from A are numeric.

ballardw · Posted 06-09-2023 12:39 AM

Actual examples of data would be helpful.

This is unlikely to match given my limited experience with ICD-10 codes.

put (a.inc_key, best10.) = b.diag_code;

First problem: Put (a.inc_key, best10.) will generate a 10 character value that is right justified. That means if the numeric value of Inc_key is 123, for example the result is " 123" and comparisons with character values start with the first character. Unless diag_code is pretty strange it is unlikely to start with spaces.

The justification issue can be solved with the -L parmeter in the Put function call:

put(a.inc_key, best10. -L)

But that may not be sufficient. There are many ICD-10 codes that incorporate letters in the code the value. So the Imported data may have a bunch of missing values for the code because of treating it as numeric.

Suggestion: go back to the import step an modify it to examine more than the default 20 rows that Proc Import uses to set properties of variables. The Guessingrows statement lets you specify a number of rows to use to set properties or the keyword Max to use the whole file.

proc import datafile='/home/u62039685/DatasetEE2016/PUF_ICD10_DCODE2016.csv'
out=cohort2016
dbms=csv replace;
getnames=yes;
Guessingrows=max;
run;

Or use documentation from the file source to write a data step with the correct properties for all of the variables.

You can have similar issues with spreadsheets depending on how someone makes them. Typically I save the data from XLSX files to CSV using spreadsheet program options and read the CSV file where I can set properties.

Tom · Posted 06-09-2023 11:53 AM

ICD-10 codes are not numbers. A quick search found this site: https://www.aapc.com/codes/icd-10-codes-range

If you have numbers for ICD-10 codes then something is wrong.

PunkinSAS08 · Posted 06-09-2023 11:57 AM

Hi Tom,

The ICD-10 codes are alphanumeric and do not have any spaces or decimals.

Tom · Posted 06-09-2023 12:22 PM

@PunkinSAS08 wrote:

Hi Tom,

The ICD-10 codes are alphanumeric and do not have any spaces or decimals.

You cannot apply a numeric format like BEST to a character variable. Note that ICD-10 procedure codes also include letters.

Perhaps the code was previously used with ICD-9 codes? Those were frequently stored as numbers since they mainly did not include letters.

ballardw · Posted 06-09-2023 01:11 PM

@PunkinSAS08 wrote:

Hi Tom,

The ICD-10 codes are alphanumeric and do not have any spaces or decimals.

Your attempt to use this code says that statement is not correct:

on put (a.inc_key, best10.) = b.diag_code;

and again here

74 on put (a.inc_key, best10.-L) = b.diag_code;
NOTE: Table WORK.COHORT created, with 0 rows and 4 columns.

If a.inc_key was "alphanumeric", meaning character values allowed, then that Put would generate the following error:

ERROR: Numeric format BEST in PUT function requires a numeric argument.

So a.inc_key is numeric. For the nth time.

You need to reread the source data to make sure the variable is character.

Kurt_Bremser · Posted 06-09-2023 12:21 PM

You should start with first getting your data import process right. This means that you do NOT use PROC IMPORT, and do NOT use Excel files.

Save the Excel spreadsheet as a csv file, and then use DATA steps to read both csv files, in which you take full control over variable types and attributes, and how data is read.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PunkinSAS08 · Posted 06-12-2023 10:58 PM

Thank you! I went back and changed my code from proc import to a data step but I am still having issues with the proc sql step. I keep receiving the error:

ERROR: File WORK.EE2016.DATA does not exist.

ERROR: File WORK.ICD_CODES.DATA does not exist.

but they both exist. Here is my code:

*data step for EE2106 with ICD10 code with decimal removed";
data EE2106;
infile '/home/u62039685/DatasetEE2016/PUF_ICD10_DCODE2016.csv' dsd;
input inc_key ICD10 $;
length ICD10_dcode $10;
ICD10_dcode = compress(icd10,".");
run;

*data step for ICD10 code running list";
data ICD10csv;
infile '/home/u62039685/DatasetEE2016/TBI ICD-10.csv';
input icd10 $;
diag_code = compress (icd10,",");
run;

*merging the two data sets to create cohort";
proc sql;
create table cohort2016 as
select *
from EE2016
where ICDTBI in (select ICD_Code from ICD_Codes);
run;

Kurt_Bremser · Posted 06-13-2023 12:51 AM

When SAS tells you something isn't there, then it is not there, period.

Run all three steps in one sweep and look at the log. If this does not give you a clue, post the complete (code and messages) log text by copy/pasting it into a window opened with this button:

You can reduce multiple identical ERROR messages, if such exist.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PunkinSAS08 · Posted 06-13-2023 04:57 PM

Code:

*data step for EE2106 with ICD10 code with decimal removed";
data EE2106;
infile '/home/u62039685/DatasetEE2016/PUF_ICD10_DCODE2016.csv' dsd;
input inc_key ICD10 $;
length ICD10_dcode $10;
ICD10_dcode = compress(icd10,".");
run;

*data step for ICD10 code running list";
data ICD10csv;
infile '/home/u62039685/DatasetEE2016/TBI ICD-10.csv';
input icd10 $;
diag_code = compress (icd10,",");
run;

*merging the two data sets to create cohort";
proc sql;
create table cohort2016 as
select *
from EE2016
where ICDTBI in (select ICD_Code from ICD_Codes);
run;

1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

68

69 proc sql;

70 create table cohort2016 as

71 select *

72 from EE2016

73 where ICDTBI in (select ICD_Code from ICD_Codes);

ERROR: File WORK.EE2016.DATA does not exist.

ERROR: File WORK.ICD_CODES.DATA does not exist.

NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.

74 run;

NOTE: PROC SQL statements are executed immediately; The RUN statement has no effect.

75

76 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

86

Kurt_Bremser · Posted 06-13-2023 11:31 PM

You only ran the SQL, not the DATA steps. So the datasets in the temporary WORK library won't be there.

As I already said, run all three steps in one submit and post the log, and use the proper window!

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

Re: Using Proc SQL to match participants that have specific ICD-10 codes

SAS Innovate 2025: Call for Content

Classroom Training Available!