Hello SAS experts,
I have a huge text file trying to read in with SAS. The first few observations look as below when I open it with excel. The data is separated by }. However, excel truncates to its first million rows which is a fraction of the complete data. I attached csv file of these first few observations to this post, in case. As you will note, ndc_code variable is there for the majority of the data (99%) but not 100% consistent in all rows.
I'll highly appreciate any help. The data is so important to me guys. I'm freaked out to lose or break it before or during my attempts to read it into SAS.
Thanks in advance.
Copy of first few observations. Attached as csv as well.
ID}"trans_id"}"ndc_code"}"start_date" | |||||||
Y96033K}200 | 524 | 406 | 3 | 641}}2005-08-02 00:00:00 | |||
C89950C}200 | 524 | 406 | 7 | 191}}2005-07-15 00:00:00 | |||
C89950C}200 | 524 | 406 | 7 | 194}}2005-07-19 00:00:00 | |||
G57388K}200 | 524 | 406 | 7 | 444}}2005-07-29 00:00:00 | |||
P33846V}200 | 524 | 406 | 11 | 998}}2005-07-29 00:00:00 | |||
X87617C}200 | 524 | 406 | 24 | 406}}2005-08-09 00:00:00 | |||
S88141M}200 | 524 | 406 | 38 | 001}}2005-06-17 00:00:00 | |||
P01353V}200 | 524 | 406 | 41 | 517}}2005-07-23 00:00:00 | |||
Y58550R}200 | 524 | 406 | 55 | 262}"00185011701"}2005-08-24 00:00:00 | |||
N87928G}200 | 524 | 406 | 63 | 646}"00186504054"}2005-08-25 00:00:00 | |||
E11620G}200 | 524 | 406 | 68 | 200}"00024542131"}2005-08-25 00:00:00 | |||
K96751N}200 | 524 | 406 | 69 | 379}}2005-07-14 00:00:00 | |||
W65400Y}200 | 524 | 406 | 74 | 975}"00456321060"}2005-08-25 00:00:00 | |||
W65400Y}200 | 524 | 406 | 75 | 085}"00536375610"}2005-08-25 00:00:00 | |||
S47546N}200 | 524 | 406 | 75 | 269}"00006011731"}2005-08-25 00:00:00 | |||
W65400Y}200 | 524 | 406 | 75 | 358}"00013830304"}2005-08-25 00:00:00 | |||
X45852C}200 | 524 | 406 | 78 | 350}"00088110747"}2005-08-25 00:00:00 | |||
W41407V}200 | 524 | 406 | 83 | 114}"65726023510"}2005-08-25 00:00:00 | |||
E39839F}200 | 524 | 406 | 83 | 373}"00182432906"}2005-08-25 00:00:00 | |||
X94973M}200 | 524 | 406 | 84 | 142}"00023917715"}2005-08-25 00:00:00 |
firstly, i wouldn't modify the csv. Not sure if you were contemplating that, in any case I would make the changes to the data within sas eg x=index(variable,'}') y=substr(variable,x+1) etc. I don't see where ndc is missing, but it wouldn't matter if it was. Are you having problems reading the file or you're just concerned about the format you received it in ie with "}" etc
Thanks for asking.
It just worked out as below:
However, Resulting date 'start_date' looks like: 01JUN05:00:00:00. I don't know yet if it is the best format to read in date with. Please let me know if better suggestion.
data drug;
infile 'data.csv' delimiter='}' missover
dsd lrecl=32767 firstobs=2;
informat ID $7.;
informat second_id $20. ;
informat ndc $11.;
informat start_date anydtdtm40. ;
format ID $7. ;
format second_id $20. ;
format ndc $11. ;
format start_date datetime. ;
input
IF $
second_id $
ndc $
start_date
;
run;
re date format, since you have so many obs you'd want to check whether there are partial dates or spurious dates eg year only. You may want to read it as text and then derive the date within sas. If you have time as part of the start date then you might want to separate out the time component. It appears that you don't have time and thus you may want to use date9. format because the time part is superfluous
Good point. proc freq turned 0.01% of years were '3000'. Bad dates. Thanks Paul.
data have1(compress=yes); set have;
start_date1 = datepart(start_date);
format start_date1 date9.;
start_year=year(start_date1);
run;
proc freq data=have1(compress=yes);
tables start_year;
run;
Sometimes that's a problem with the ANYDTDTM informat rather than the actual data, I would confirm those data points are incorrect by verifying against the actual data first.
@Cruise wrote:
Good point. proc freq turned 0.01% of years were '3000'. Bad dates. Thanks Paul.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.