Hello everyone,
I have A CSV file containing the list of individuals with their ICD diagnosis on multiple dates. The data looks as below:
ID | ICD VERSION | ICD CODE | DIAGNOSIS DATE |
M0058 | ICD-9 | 733.42 | 19-06-14 |
M0058 | ICD-9 | 733.42 | 19-09-14 |
M0058 | ICD-9 | 733.42 | 18-12-14 |
M0058 | ICD-9 | 733.42 | 03-12-15 |
M0058 | ICD-9 | 733.42 | 24-12-15 |
M0058 | ICD-9 | 733.42 | 31-12-15 |
I would like to create a new SAS dataset with the list of observations along with their first ICD diagnosis. For the above case, I would like to retain only "M000000058" "ICD-9" "733.42" "19-06-14"
Thanks in advance
See?
You showed us what Excel showed you, and that sparked my comment about 2-digit years. Now we see that the file has in fact very sensible date values in ISO format, as it should be.
Moral: never use Excel to look at text files.
Use this code to import that file:
data import;
infile "your_file_name" dlm="," dsd firstobs=2 truncover;
input
id :$10.
icd_version :$10.
icd_code :$8.
diagnosis_date :yymmdd10.
;
format diagnosis_date yymmdd10.;
run;
If that throws NOTEs about invalid values for diagnosis_date, post one of those messages (including the listing of the raw data line) as you posted the file data. We can then see what we need to do to avoid that NOTE, as you want a clean log (Maxim 25).
I suspect that all the data was imported as character because PROC IMPORT was misled by the quotes around the date values.
Sort by id and date, then run
data want;
set have;
by id;
if first.id;
run;
or (no preceding sort needed, but may perform worse)
proc sql;
create table want as
select *
from have
group by id
having date = min(date)
;
quit;
@mantubiradar19 wrote:
Thanks, Kurt, After importing the CSV the "DIAGNOSIS_DATE" has a character type. Do I need to change this date format before sorting?
Absolutely. Obvious date values must always be imported as SAS date values to be useful.
Also, anybody who sends you dates with two-digit years needs to be beat over the head with a medium-sized reinforced-concrete pillar. After Y2K, I consider anybody doing this as having an IQ lower than the number of years passed since 2000.
To get help for correctly reading your data, please open the CSV file with an editor (NOT WITH EXCEL!!) and copy/paste the first 10 lines into a window opened with this button:
"ID","ICD VERSION","ICD CODE","DIAGNOSIS DATE" "M0058","ICD-9","733.42","2014/06/19" "M0058","ICD-9","733.42","2014/09/19" "M0058","ICD-9","733.42","2014/12/18" "M0058","ICD-9","733.42","2015/12/03" "M0058","ICD-9","733.42","2015/12/24" "M0058","ICD-9","733.42","2015/12/31" "M00331","ICD-10-CM","M81.0","2020/04/03" "M00331","ICD-10-CM","M81.0","2020/04/07" "M00331","ICD-10-CM","M81.0","2020/04/10"
Here you go!
See?
You showed us what Excel showed you, and that sparked my comment about 2-digit years. Now we see that the file has in fact very sensible date values in ISO format, as it should be.
Moral: never use Excel to look at text files.
Use this code to import that file:
data import;
infile "your_file_name" dlm="," dsd firstobs=2 truncover;
input
id :$10.
icd_version :$10.
icd_code :$8.
diagnosis_date :yymmdd10.
;
format diagnosis_date yymmdd10.;
run;
If that throws NOTEs about invalid values for diagnosis_date, post one of those messages (including the listing of the raw data line) as you posted the file data. We can then see what we need to do to avoid that NOTE, as you want a clean log (Maxim 25).
I suspect that all the data was imported as character because PROC IMPORT was misled by the quotes around the date values.
DSD means "delimiter sensitive data". It has three effects: quotes around values are removed, delimiters within the quoted string a re disregarded, and two or more delimiters in succession mean that there are missing values
FIRSTOBS= instructs the data step to skip lines at the beginning, FIRSTOBS=2 means skipping a single header line
TRUNCOVER controls what should happen if not enough data is found to fill all columns in the INPUT statement. TRUNCOVER means that any shorter value is still used to fill a column, and any subsequent columns are set to missing.
All these options are described in detail in the documentation of the INFILE Statement.
PS adapt the lengths of the character informats as needed. The lengths I use are guesses.
So you still want as many observations as in the original data?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.