Solved: Finding the ICD diagnosis for earliest date

mantubiradar19 · Posted 11-18-2020 03:17 AM

Hello everyone,

I have A CSV file containing the list of individuals with their ICD diagnosis on multiple dates. The data looks as below:

ID	ICD VERSION	ICD CODE	DIAGNOSIS DATE
M0058	ICD-9	733.42	19-06-14
M0058	ICD-9	733.42	19-09-14
M0058	ICD-9	733.42	18-12-14
M0058	ICD-9	733.42	03-12-15
M0058	ICD-9	733.42	24-12-15
M0058	ICD-9	733.42	31-12-15

I would like to create a new SAS dataset with the list of observations along with their first ICD diagnosis. For the above case, I would like to retain only "M000000058" "ICD-9" "733.42" "19-06-14"
Thanks in advance

Kurt_Bremser · Posted 11-18-2020 04:13 AM

See?

You showed us what Excel showed you, and that sparked my comment about 2-digit years. Now we see that the file has in fact very sensible date values in ISO format, as it should be.

Moral: never use Excel to look at text files.

Use this code to import that file:

data import;
infile "your_file_name" dlm="," dsd firstobs=2 truncover;
input
  id :$10.
  icd_version :$10.
  icd_code :$8.
  diagnosis_date :yymmdd10.
;
format diagnosis_date yymmdd10.;
run;

If that throws NOTEs about invalid values for diagnosis_date, post one of those messages (including the listing of the raw data line) as you posted the file data. We can then see what we need to do to avoid that NOTE, as you want a clean log (Maxim 25).

I suspect that all the data was imported as character because PROC IMPORT was misled by the quotes around the date values.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 11-18-2020 03:24 AM

Sort by id and date, then run

data want;
set have;
by id;
if first.id;
run;

or (no preceding sort needed, but may perform worse)

proc sql;
create table want as
  select *
  from have
  group by id
  having date = min(date)
;
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mantubiradar19 · Posted 11-18-2020 03:38 AM

Thanks, Kurt, After importing the CSV the "DIAGNOSIS_DATE" has a character type. Do I need to change this date format before sorting?

Kurt_Bremser · Posted 11-18-2020 03:48 AM

@mantubiradar19 wrote:
Thanks, Kurt, After importing the CSV the "DIAGNOSIS_DATE" has a character type. Do I need to change this date format before sorting?

Absolutely. Obvious date values must always be imported as SAS date values to be useful.

Also, anybody who sends you dates with two-digit years needs to be beat over the head with a medium-sized reinforced-concrete pillar. After Y2K, I consider anybody doing this as having an IQ lower than the number of years passed since 2000.

To get help for correctly reading your data, please open the CSV file with an editor (NOT WITH EXCEL!!) and copy/paste the first 10 lines into a window opened with this button:

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mantubiradar19 · Posted 11-18-2020 03:57 AM

"ID","ICD VERSION","ICD CODE","DIAGNOSIS DATE"
"M0058","ICD-9","733.42","2014/06/19"
"M0058","ICD-9","733.42","2014/09/19"
"M0058","ICD-9","733.42","2014/12/18"
"M0058","ICD-9","733.42","2015/12/03"
"M0058","ICD-9","733.42","2015/12/24"
"M0058","ICD-9","733.42","2015/12/31"
"M00331","ICD-10-CM","M81.0","2020/04/03"
"M00331","ICD-10-CM","M81.0","2020/04/07"
"M00331","ICD-10-CM","M81.0","2020/04/10"

Here you go!

Kurt_Bremser · Posted 11-18-2020 04:13 AM

See?

You showed us what Excel showed you, and that sparked my comment about 2-digit years. Now we see that the file has in fact very sensible date values in ISO format, as it should be.

Moral: never use Excel to look at text files.

Use this code to import that file:

data import;
infile "your_file_name" dlm="," dsd firstobs=2 truncover;
input
  id :$10.
  icd_version :$10.
  icd_code :$8.
  diagnosis_date :yymmdd10.
;
format diagnosis_date yymmdd10.;
run;

If that throws NOTEs about invalid values for diagnosis_date, post one of those messages (including the listing of the raw data line) as you posted the file data. We can then see what we need to do to avoid that NOTE, as you want a clean log (Maxim 25).

I suspect that all the data was imported as character because PROC IMPORT was misled by the quotes around the date values.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mantubiradar19 · Posted 11-18-2020 04:23 AM

Thank you so much for your advice Kurt, let me try this

mantubiradar19 · Posted 11-18-2020 04:42 AM

I was able to import the CSV file using your code and sort by id and diagnosis_date to obtain unique observations with the earliest diagnosis date (using Sort by id and date, then run). For your reference, I used the following import code earlier:

%let path=PATH;
options validvarname=v7;

proc import datafile="&path\My_CSV.csv"
out=MY_CSV
dbms=csv;
guessingrows=max;
run;quit;

mantubiradar19 · Posted 11-18-2020 08:08 AM

Can you please tell me what is the function of "dsd firstobs=2 truncover;"?

Kurt_Bremser · Posted 11-18-2020 08:16 AM

DSD means "delimiter sensitive data". It has three effects: quotes around values are removed, delimiters within the quoted string a re disregarded, and two or more delimiters in succession mean that there are missing values

FIRSTOBS= instructs the data step to skip lines at the beginning, FIRSTOBS=2 means skipping a single header line

TRUNCOVER controls what should happen if not enough data is found to fill all columns in the INPUT statement. TRUNCOVER means that any shorter value is still used to fill a column, and any subsequent columns are set to missing.

All these options are described in detail in the documentation of the INFILE Statement.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 11-18-2020 04:17 AM

PS adapt the lengths of the character informats as needed. The lengths I use are guesses.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PeterClemmensen · Posted 11-18-2020 03:26 AM

So you still want as many observations as in the original data?

The DATA to DATA Step Macro
Blog: SASnrd

mantubiradar19 · Posted 11-18-2020 03:35 AM

I want one observation per line with the earliest date for each observation

Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

Re: Finding the ICD diagnosis for earliest date

SAS Innovate 2025: Save the Date