Hi! I'm new to SAS and cannot figure out how to create a cohort of participants that have the ICD-10 codes I'm looking for. Here is my code:
proc import datafile='/home/u62039685/DatasetEE2016/PUF_ICD10_DCODE2016.csv'
out=cohort2016
dbms=csv replace;
getnames=yes;
run;
proc import datafile='/home/u62039685/DatasetEE2016/TBI ICD-10.xlsx'
out= icd10codes
dbms= XlSX replace;
Sheet= 'Sheet1';
run;
proc sql;
create table cohort as
select a.*, b.*
from cohort2016 a
inner join icd10codes b
on put (a.inc_key, best10.) = b.diag_code;
select * from cohort;
quit;
You should start with first getting your data import process right. This means that you do NOT use PROC IMPORT, and do NOT use Excel files.
Save the Excel spreadsheet as a csv file, and then use DATA steps to read both csv files, in which you take full control over variable types and attributes, and how data is read.
Without representative sample data nor SAS log nor you actually telling us if you get an error or just not the expected result, it's a bit hard to advise.
One thing I can see:
put (a.inc_key, best10.)
Will create a string of length 10 that is right aligned meaning you will have leading blanks if the number is less than 10 digits. If the string in variable diag_code is left aligned (most likely the case) then the keys wouldn't match.
To get around this use either of below two coding options:
put(b.inc_key,best10. -l) or: strip(put(b.inc_key,best10.))
Ah, sorry. Here is the log:
Time to extract some values from BOTH data sets that you think should match.
Depending on the source of the ICD codes you may have different implementations of some of the conventions. I have seen some users that have replaced decimal points in codes with other characters or removed them entirely, as mentioned LETTERS are standard in many ICD codes and your code shows that you are using a numeric value. So if all of the B data set Diag_code values have letters then none of the A.inc_key will ever match because those from A are numeric.
Actual examples of data would be helpful.
This is unlikely to match given my limited experience with ICD-10 codes.
put (a.inc_key, best10.) = b.diag_code;
First problem: Put (a.inc_key, best10.) will generate a 10 character value that is right justified. That means if the numeric value of Inc_key is 123, for example the result is " 123" and comparisons with character values start with the first character. Unless diag_code is pretty strange it is unlikely to start with spaces.
The justification issue can be solved with the -L parmeter in the Put function call:
put(a.inc_key, best10. -L)
But that may not be sufficient. There are many ICD-10 codes that incorporate letters in the code the value. So the Imported data may have a bunch of missing values for the code because of treating it as numeric.
Suggestion: go back to the import step an modify it to examine more than the default 20 rows that Proc Import uses to set properties of variables. The Guessingrows statement lets you specify a number of rows to use to set properties or the keyword Max to use the whole file.
proc import datafile='/home/u62039685/DatasetEE2016/PUF_ICD10_DCODE2016.csv' out=cohort2016 dbms=csv replace; getnames=yes; Guessingrows=max; run;
Or use documentation from the file source to write a data step with the correct properties for all of the variables.
You can have similar issues with spreadsheets depending on how someone makes them. Typically I save the data from XLSX files to CSV using spreadsheet program options and read the CSV file where I can set properties.
ICD-10 codes are not numbers. A quick search found this site: https://www.aapc.com/codes/icd-10-codes-range
If you have numbers for ICD-10 codes then something is wrong.
Hi Tom,
The ICD-10 codes are alphanumeric and do not have any spaces or decimals.
@PunkinSAS08 wrote:
Hi Tom,
The ICD-10 codes are alphanumeric and do not have any spaces or decimals.
You cannot apply a numeric format like BEST to a character variable. Note that ICD-10 procedure codes also include letters.
Perhaps the code was previously used with ICD-9 codes? Those were frequently stored as numbers since they mainly did not include letters.
@PunkinSAS08 wrote:
Hi Tom,
The ICD-10 codes are alphanumeric and do not have any spaces or decimals.
Your attempt to use this code says that statement is not correct:
on put (a.inc_key, best10.) = b.diag_code;
and again here
74 on put (a.inc_key, best10.-L) = b.diag_code; NOTE: Table WORK.COHORT created, with 0 rows and 4 columns.
If a.inc_key was "alphanumeric", meaning character values allowed, then that Put would generate the following error:
ERROR: Numeric format BEST in PUT function requires a numeric argument.
So a.inc_key is numeric. For the nth time.
You need to reread the source data to make sure the variable is character.
You should start with first getting your data import process right. This means that you do NOT use PROC IMPORT, and do NOT use Excel files.
Save the Excel spreadsheet as a csv file, and then use DATA steps to read both csv files, in which you take full control over variable types and attributes, and how data is read.
Thank you! I went back and changed my code from proc import to a data step but I am still having issues with the proc sql step. I keep receiving the error:
*data step for EE2106 with ICD10 code with decimal removed";
data EE2106;
infile '/home/u62039685/DatasetEE2016/PUF_ICD10_DCODE2016.csv' dsd;
input inc_key ICD10 $;
length ICD10_dcode $10;
ICD10_dcode = compress(icd10,".");
run;
*data step for ICD10 code running list";
data ICD10csv;
infile '/home/u62039685/DatasetEE2016/TBI ICD-10.csv';
input icd10 $;
diag_code = compress (icd10,",");
run;
*merging the two data sets to create cohort";
proc sql;
create table cohort2016 as
select *
from EE2016
where ICDTBI in (select ICD_Code from ICD_Codes);
run;
When SAS tells you something isn't there, then it is not there, period.
Run all three steps in one sweep and look at the log. If this does not give you a clue, post the complete (code and messages) log text by copy/pasting it into a window opened with this button:
You can reduce multiple identical ERROR messages, if such exist.
Code:
*data step for EE2106 with ICD10 code with decimal removed";
data EE2106;
infile '/home/u62039685/DatasetEE2016/PUF_ICD10_DCODE2016.csv' dsd;
input inc_key ICD10 $;
length ICD10_dcode $10;
ICD10_dcode = compress(icd10,".");
run;
*data step for ICD10 code running list";
data ICD10csv;
infile '/home/u62039685/DatasetEE2016/TBI ICD-10.csv';
input icd10 $;
diag_code = compress (icd10,",");
run;
*merging the two data sets to create cohort";
proc sql;
create table cohort2016 as
select *
from EE2016
where ICDTBI in (select ICD_Code from ICD_Codes);
run;
You only ran the SQL, not the DATA steps. So the datasets in the temporary WORK library won't be there.
As I already said, run all three steps in one submit and post the log, and use the proper window!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Ready to level-up your skills? Choose your own adventure.