BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
PunkinSAS08
Fluorite | Level 6

Hi! I'm new to SAS and cannot figure out how to create a cohort of participants that have the ICD-10 codes I'm looking for. Here is my code: 

 

proc import datafile='/home/u62039685/DatasetEE2016/PUF_ICD10_DCODE2016.csv'
out=cohort2016
dbms=csv replace;
getnames=yes;
run;

proc import datafile='/home/u62039685/DatasetEE2016/TBI ICD-10.xlsx'
out= icd10codes
dbms= XlSX replace;
Sheet= 'Sheet1';
run;

proc sql;
create table cohort as
select a.*, b.*
from cohort2016 a
inner join icd10codes b
on put (a.inc_key, best10.) = b.diag_code;
select * from cohort;
quit;

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

You should start with first getting your data import process right. This means that you do NOT use PROC IMPORT, and do NOT use Excel files.

Save the Excel spreadsheet as a csv file, and then use DATA steps to read both csv files, in which you take full control over variable types and attributes, and how data is read.

View solution in original post

13 REPLIES 13
Patrick
Opal | Level 21

Without representative sample data nor SAS log nor you actually telling us if you get an error or just not the expected result, it's a bit hard to advise. 

 

One thing I can see:

put (a.inc_key, best10.)

Will create a string of length 10 that is right aligned meaning you will have leading blanks if the number is less than 10 digits. If the string in variable diag_code is left aligned (most likely the case) then the keys wouldn't match.

 

To get around this use either of below two coding options:

put(b.inc_key,best10. -l)
 
or:

strip(put(b.inc_key,best10.))

 

PunkinSAS08
Fluorite | Level 6

Ah, sorry. Here is the log:

1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
68
69 proc sql;
70 create table cohort as
71 select a.*, b.*
72 from cohort2016 a
73 inner join icd10codes b
74 on put (a.inc_key, best10.-L) = b.diag_code;
NOTE: Table WORK.COHORT created, with 0 rows and 4 columns.
 
75 select * from cohort;
NOTE: No rows were selected.
76 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.93 seconds
user cpu time 0.91 seconds
system cpu time 0.01 seconds
memory 7241.40k
OS Memory 29872.00k
Timestamp 06/09/2023 03:30:44 PM
Step Count 26 Switch Count 5
Page Faults 0
Page Reclaims 550
Page Swaps 0
Voluntary Context Switches 17
Involuntary Context Switches 1
Block Input Operations 0
Block Output Operations 264
 
 
77
78 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
88
 
**this produces a blank table in the output tab with all columns from both imported tables instead of just the inc_key (participants) with the intended diag_code (ICD-10 codes).
 
I hope this helps! I added the -L function and there is no difference in output.
ballardw
Super User

Time to extract some values from BOTH data sets that you think should match.

Depending on the source of the ICD codes you may have different implementations of some of the conventions. I have seen some users that have replaced decimal points in codes with other characters or removed them entirely, as mentioned LETTERS are standard in many ICD codes and your code shows that you are using a numeric value. So if all of the B data set Diag_code values have letters then none of the A.inc_key will ever match because those from A are numeric.

 

 

ballardw
Super User

Actual examples of data would be helpful.

 

This is unlikely to match given my limited experience with ICD-10 codes.

put (a.inc_key, best10.) = b.diag_code;

First problem: Put (a.inc_key, best10.) will generate a 10 character value that is right justified. That means if the numeric value of Inc_key is 123, for example the result is "       123" and comparisons with character values start with the first character. Unless diag_code is pretty strange it is unlikely to start with spaces.

The justification issue can be solved with the -L parmeter in the Put function call:

put(a.inc_key, best10. -L)

But that may not be sufficient. There are many ICD-10 codes that incorporate letters in the code the value. So the Imported data may have a bunch of missing values for the code because of treating it as numeric.

Suggestion: go back to the import step an modify it to examine more than the default 20 rows that Proc Import uses to set properties of variables. The Guessingrows statement lets you specify a number of rows to use to set properties or the keyword Max to use the whole file.

proc import datafile='/home/u62039685/DatasetEE2016/PUF_ICD10_DCODE2016.csv'
out=cohort2016
dbms=csv replace;
getnames=yes;
Guessingrows=max;
run;

Or use documentation from the file source to write a data step with the correct properties for all of the variables.

You can have similar issues with spreadsheets depending on how someone makes them. Typically I save the data from XLSX files to CSV using spreadsheet program options and read the CSV file where I can set properties.

Tom
Super User Tom
Super User

ICD-10 codes are not numbers.  A quick search found this site:  https://www.aapc.com/codes/icd-10-codes-range

Tom_0-1686325905455.png

 

If you have numbers for ICD-10 codes then something is wrong.

PunkinSAS08
Fluorite | Level 6

Hi Tom,

 

The ICD-10 codes are alphanumeric and do not have any spaces or decimals. 

Tom
Super User Tom
Super User

@PunkinSAS08 wrote:

Hi Tom,

 

The ICD-10 codes are alphanumeric and do not have any spaces or decimals. 


You cannot apply a numeric format like BEST to a character variable.  Note that ICD-10 procedure codes also include letters.

Tom_0-1686327677489.png

 

Perhaps the code was previously used with ICD-9 codes?  Those were frequently stored as numbers since they mainly did not include letters.

 

ballardw
Super User

@PunkinSAS08 wrote:

Hi Tom,

 

The ICD-10 codes are alphanumeric and do not have any spaces or decimals. 


Your attempt to use this code says that statement is not correct:

on put (a.inc_key, best10.) = b.diag_code;

and again here

74 on put (a.inc_key, best10.-L) = b.diag_code;
NOTE: Table WORK.COHORT created, with 0 rows and 4 columns.

If a.inc_key was "alphanumeric", meaning character values allowed, then that Put would generate the following error:

ERROR: Numeric format BEST in PUT function requires a numeric argument.

So a.inc_key is numeric. For the nth time.

You need to reread the source data to make sure the variable is character.

Kurt_Bremser
Super User

You should start with first getting your data import process right. This means that you do NOT use PROC IMPORT, and do NOT use Excel files.

Save the Excel spreadsheet as a csv file, and then use DATA steps to read both csv files, in which you take full control over variable types and attributes, and how data is read.

PunkinSAS08
Fluorite | Level 6

Thank you! I went back and changed my code from proc import to a data step but I am still having issues with the proc sql step. I keep receiving the error: 

ERROR: File WORK.EE2016.DATA does not exist.
ERROR: File WORK.ICD_CODES.DATA does not exist.
but they both exist. Here is my code:
 

*data step for EE2106 with ICD10 code with decimal removed";
data EE2106;
infile '/home/u62039685/DatasetEE2016/PUF_ICD10_DCODE2016.csv' dsd;
input inc_key ICD10 $;
length ICD10_dcode $10;
ICD10_dcode = compress(icd10,".");
run;

*data step for ICD10 code running list";
data ICD10csv;
infile '/home/u62039685/DatasetEE2016/TBI ICD-10.csv';
input icd10 $;
diag_code = compress (icd10,",");
run;

*merging the two data sets to create cohort";
proc sql;
create table cohort2016 as
select *
from EE2016
where ICDTBI in (select ICD_Code from ICD_Codes);
run;

Kurt_Bremser
Super User

When SAS tells you something isn't there, then it is not there, period.

Run all three steps in one sweep and look at the log. If this does not give you a clue, post the complete (code and messages) log text by copy/pasting it into a window opened with this button:

Bildschirmfoto 2020-04-07 um 08.32.59.jpg

You can reduce multiple identical ERROR messages, if such exist.

PunkinSAS08
Fluorite | Level 6

Code: 

*data step for EE2106 with ICD10 code with decimal removed";
data EE2106;
infile '/home/u62039685/DatasetEE2016/PUF_ICD10_DCODE2016.csv' dsd;
input inc_key ICD10 $;
length ICD10_dcode $10;
ICD10_dcode = compress(icd10,".");
run;

*data step for ICD10 code running list";
data ICD10csv;
infile '/home/u62039685/DatasetEE2016/TBI ICD-10.csv';
input icd10 $;
diag_code = compress (icd10,",");
run;

*merging the two data sets to create cohort";
proc sql;
create table cohort2016 as
select *
from EE2016
where ICDTBI in (select ICD_Code from ICD_Codes);
run;

 

 
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
68
69 proc sql;
70 create table cohort2016 as
71 select *
72 from EE2016
73 where ICDTBI in (select ICD_Code from ICD_Codes);
ERROR: File WORK.EE2016.DATA does not exist.
ERROR: File WORK.ICD_CODES.DATA does not exist.
NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.
74 run;
NOTE: PROC SQL statements are executed immediately; The RUN statement has no effect.
75
76 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
86
Kurt_Bremser
Super User

You only ran the SQL, not the DATA steps. So the datasets in the temporary WORK library won't be there.

As I already said, run all three steps in one submit and post the log, and use the proper window!

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 7887 views
  • 2 likes
  • 5 in conversation