DATA Step, Macro, Functions and more

Read huge and kinda messy text data into SAS

Reply
Super Contributor
Posts: 374

Read huge and kinda messy text data into SAS

Hello SAS experts,

 

I have a huge text file trying to read in with SAS. The first few observations look as below when I open it with excel. The data is separated by }. However, excel truncates to its first million rows which is a fraction of the complete data.  I attached csv file of these first few observations to this post, in case. As you will note, ndc_code variable is there for the majority of the data (99%) but not 100% consistent in all rows.

 

I'll highly appreciate any help. The data is so important to me guys. I'm freaked out to lose or break it before or during my attempts to read it into SAS.

 

Thanks in advance.

 

Copy of first few observations. Attached as csv as well.

 

ID}"trans_id"}"ndc_code"}"start_date"        
Y96033K}200 524 406 3 641}}2005-08-02 00:00:00  
C89950C}200 524 406 7 191}}2005-07-15 00:00:00  
C89950C}200 524 406 7 194}}2005-07-19 00:00:00  
G57388K}200 524 406 7 444}}2005-07-29 00:00:00  
P33846V}200 524 406 11 998}}2005-07-29 00:00:00  
X87617C}200 524 406 24 406}}2005-08-09 00:00:00  
S88141M}200 524 406 38 001}}2005-06-17 00:00:00  
P01353V}200 524 406 41 517}}2005-07-23 00:00:00  
Y58550R}200 524 406 55 262}"00185011701"}2005-08-24 00:00:00
N87928G}200 524 406 63 646}"00186504054"}2005-08-25 00:00:00
E11620G}200 524 406 68 200}"00024542131"}2005-08-25 00:00:00
K96751N}200 524 406 69 379}}2005-07-14 00:00:00  
W65400Y}200 524 406 74 975}"00456321060"}2005-08-25 00:00:00
W65400Y}200 524 406 75 085}"00536375610"}2005-08-25 00:00:00
S47546N}200 524 406 75 269}"00006011731"}2005-08-25 00:00:00
W65400Y}200 524 406 75 358}"00013830304"}2005-08-25 00:00:00
X45852C}200 524 406 78 350}"00088110747"}2005-08-25 00:00:00
W41407V}200 524 406 83 114}"65726023510"}2005-08-25 00:00:00
E39839F}200 524 406 83 373}"00182432906"}2005-08-25 00:00:00
X94973M}200 524 406 84 142}"00023917715"}2005-08-25 00:00:00

 

 

 

 

Regular Contributor
Posts: 164

Re: Read huge and kinda messy text data into SAS

firstly, i wouldn't modify the csv. Not sure if you were contemplating that, in any case I would make the changes to the data within sas eg x=index(variable,'}') y=substr(variable,x+1) etc. I don't see where ndc is missing, but it wouldn't matter if it was. Are you having problems reading the file or you're just concerned about the format you received it in ie with "}" etc

--------------
blog: papersandprograms.com
Super Contributor
Posts: 374

Re: Read huge and kinda messy text data into SAS

Posted in reply to PaulBrownPhD

Thanks for asking.

 

It just worked out as below:

 

However, Resulting date 'start_date' looks like: 01JUN05:00:00:00. I don't know yet if it is the best format to read in date with. Please let me know if better suggestion.

 

data drug;
infile 'data.csv' delimiter='}' missover
dsd lrecl=32767 firstobs=2;
informat ID $7.;
informat second_id $20. ;
informat ndc $11.;
informat start_date anydtdtm40. ;
format ID $7. ;
format second_id $20. ;
format ndc $11. ;
format start_date datetime. ;
input 
IF $
second_id $
ndc $
start_date
;
run;
Regular Contributor
Posts: 164

Re: Read huge and kinda messy text data into SAS

[ Edited ]

re date format, since you have so many obs you'd want to check whether there are partial dates or spurious dates eg year only. You may want to read it as text and then derive the date within sas. If you have time as part of the start date then you might want to separate out the time component. It appears that you don't have time and thus you may want to use date9. format because the time part is superfluous

--------------
blog: papersandprograms.com
Super Contributor
Posts: 374

Re: Read huge and kinda messy text data into SAS

Posted in reply to PaulBrownPhD

@PaulBrownPhD

Good point. proc freq turned 0.01% of years were '3000'. Bad dates. Thanks Paul.

 

data have1(compress=yes); set have;
start_date1 = datepart(start_date);
format start_date1 date9.;
start_year=year(start_date1);
run;
proc freq data=have1(compress=yes);
tables start_year;
run; 
Super User
Posts: 23,773

Re: Read huge and kinda messy text data into SAS

Sometimes that's a problem with the ANYDTDTM informat rather than the actual data, I would confirm those data points are incorrect by verifying against the actual data first.

 


@Cruise wrote:


Good point. proc freq turned 0.01% of years were '3000'. Bad dates. Thanks Paul.

 

Ask a Question
Discussion stats
  • 5 replies
  • 87 views
  • 5 likes
  • 3 in conversation