BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Informatician
Calcite | Level 5

I am trying to read in the following example raw data (ALL PHI and PII Removed fictitious data only):

|LastName|,|FirstName|,|Gender|,|PatientID|,|DateOfBirth|,|Ethnicity|,|HospitalRegion|,|HospitalFacility|,|HospitalFacilityTelephone|,|TypeOfTransportation|,|AdmissionDate|,|NumberOfHospitalDays|,|ReferringRegion|,|ReferringFacility|,|ReferringStaffMember|,|AdmissionType_Emergency|,|AdmissionType_Inpatient|,|PhysicianDirectedExamResult|,|RequiresFollowup|,|FollowupDate|,|DischargeDiagnosis|,|DischargedToRegion|,|DischargedToFacility|,|DischargeDate|,|ERClaim|,|HospClaim|,|AdmittingDiag|,|DRG|,|ProfClaim|,|AnesClaim|,|SurgClaim|,|AmbClaim|,|PaymentStatus|,|AuthorizationNumber|,|Comments|
|FakeLAstName|,|FakeFirstName|,|M|,|12345678|,|1900-01-01 00:00:00|,|Caucasian|,|Non Facilities|,|General Hospital|,||,|Ambulance|,|2000-01-01 08:30:00|,|0|,|Region|,|N - M|,|DRLastName, DrFirstName|,|Ye|,|No|,|Chest pain
|,|False|,||,|Chest Pain|,|Region|,|N - M|,|2010-01-01 00:00:00|,||,||,|Chest pain|,||,||,||,||,||,|Approved|,|OP0000000000|,||
|FakeLAstName2|,|FakeFirstName2|,|M|,|23456789|,|1900-01-11 00:00:00|,|Caucasian|,|Non Facilities|,|General Hospital|,|00000000|,|Bus|,|2010-01-01 00:00:00|,|00|,|Region|,|xxxx|,|., (Not Specified)|,|Ye|,|No|,|Hypertension due to septic shock
Abdominal pain due to GI Bleed - resolved
C diff colitis- resolved
Acute Kidney injury likely pre-renal- resolved
PUD w iron deficency anemia-improved|,|False|,||,||,|Non Facilities|,|Unknown|,|2000-01-01 00:00:00|,||,||,||,||,||,||,||,||,|Pending|,||,||

 

 

 

I believe the code I am using fails at "ReferringStaffMember" first because the value has an embedded comma. DSD only works with double quotes and not pipes, is there a workaround? I have tried to use the dlmstr option without success. Please note the "&" and "~" in the input statement.

 

My code:

data ERHosp;
	infile "ERHospitalization.txt" DLM=","  firstobs=2 ;
	length LastName $40 FirstName $40 Gender $5 PatientID $8 DateOfBirth $200 Ethnicity $20
	HospitalRegion $200 HospitalFacility $60 HospitalFacilityTelephone $20
	TypeOfTransportation $20 AdmissionDate $20 NumberOfHospitalDays  ReferringRegion $20 
	ReferringFacility $20 ReferringStaffMember $40 AdmissionType_Emergency $20 
	AdmissionType_Inpatient $20 PhysicianDirectedExamResult $1000 RequiresFollowup $20 
	FollowupDate $200 DischargeDiagnosis $200 DischargedToRegion $20 DischargedToFacility $20 
	DischargeDate $20 ERClaim $20 HospClaim $20 AdmittingDiag $200 DRG $20 ProfClaim $20 AnesClaim $20
	SurgClaim $20 AmbClaim $20 PaymentStatus $20 AuthorizationNumber $20 Comments $500;

	input LastName $ FirstName $ Gender $ PatientID $ DateOfBirth $ Ethnicity $
	HospitalRegion $ HospitalFacility $ HospitalFacilityTelephone $
	TypeOfTransportation $ AdmissionDate $ NumberOfHospitalDays  ReferringRegion $ 
	ReferringFacility $  ReferringStaffMember & $40.  AdmissionType_Emergency $ 
	AdmissionType_Inpatient $  PhysicianDirectedExamResult &~$1000. RequiresFollowup  $ 
	FollowupDate $ DischargeDiagnosis $ DischargedToRegion $ DischargedToFacility $ 
	DischargeDate $ ERClaim $ HospClaim $ AdmittingDiag $ DRG $ ProfClaim $ AnesClaim $ 
	SurgClaim $ AmbClaim $ PaymentStatus $ AuthorizationNumber $ Comments $;
run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You need to have each observation on its own line. 

If you are lucky the extra embedded line breaks are done with just CR or just LF and the real end of lines are done with CR and LF (as it normal for Windows text files).

Try adding TERMSTR=CRLF to the INFILE statement.

Keep the TRUNCOVER option. Do not use FLOWOVER.  You could use MISSOVER but then you cannot use formatted input in the INPUT statement because if the last delimited value is shorted than the fixed length informat MISSOVER will throw the whole truncated value away.

 

If TERMSTR=CRLF does not work then you need to fix the file first. You could fix the pipes and the line breaks at the same time.

Something like which will keep track of the number of pipe characters.  It will remove all CR characters. Replace pipe with quote. Replace LF  that are inside the pipes (that are becoming quotes) with pipes and replace LF that are outside of the pipes with real end of lines written using CRLF.

filename original "ERHospitalization.txt";
filename fixed temp;
data _null_;
  infile original lrecl=1 recfm=f ;
  file fixed termstr=crlf;
  input char $char1.;
  numpipe + (char='|');
  if char='0D'x then return;
  else if char='0A'x then do;
    if mod(numpipe,2)=1 then put '|' @;
    else do;
       numpipe=0;
       put;
    end;
  end;
  else if char='|' then put '"' @;
  else if char='"' then put '""' @;
  else put char $char1. @;
run;

Then try reading the fixed version of the file with your data step. 

 

Note there is no need for your complicated INPUT statement. Since you have already defined the variables with your LENGTH statement you can just use a positional variable list like in my example above.  Or if for some reason you wanted to define the variables with the LENGTH statement in a different order than they are in the text file then just list the names. But there is no need to add and $ or ~ or & characters to the INPUT statement.  

View solution in original post

5 REPLIES 5
Tom
Super User Tom
Super User

Why not just convert the pipes to quotes? 

data ERHosp;
  infile "ERHospitalization.txt" DSD DLM=","  firstobs=2 TRUNCOVER;
  input @;
  _infile_=translate(_infile_,'"','|');
  length LastName $40 FirstName $40 Gender $5 PatientID $8 DateOfBirth $200 Ethnicity $20
    HospitalRegion $200 HospitalFacility $60 HospitalFacilityTelephone $20
    TypeOfTransportation $20 AdmissionDate $20 NumberOfHospitalDays  ReferringRegion $20 
    ReferringFacility $20 ReferringStaffMember $40 AdmissionType_Emergency $20 
    AdmissionType_Inpatient $20 PhysicianDirectedExamResult $1000 RequiresFollowup $20 
    FollowupDate $200 DischargeDiagnosis $200 DischargedToRegion $20 DischargedToFacility $20 
    DischargeDate $20 ERClaim $20 HospClaim $20 AdmittingDiag $200 DRG $20 ProfClaim $20 AnesClaim $20
    SurgClaim $20 AmbClaim $20 PaymentStatus $20 AuthorizationNumber $20 Comments $500
  ;
  input LastName -- Comments ;
run;

Do you have any actual quotes?  If so then you might want to change them to pipes and add additional code to convert them back.

...
  _infile_=translate(_infile_.'|"','"|');
...
  array _c _character_;
  do over _c;
    _c=translate(_c,'|"','"|');
  end;
run;

 

Informatician
Calcite | Level 5

Tom,

 

I have the following modification of your code working, mostly. I never knew about the _infile_ variable!!

Now we fail at the embedded pilcrows (paragraphs) in the "PhysicianDirectedExamResult" variable. I copied the raw data into word and used show markup to reveal several pilcrows as shown in the screenshot below. Truncover and missover result in the data flowing into additional obs inappropriately.

 

data ERHosp;
	infile "ERHospitalization.txt" DLM="," dsd firstobs=2 lrecl=500 ;
  input @;
     _infile_=translate(_infile_,"'","|");
  length LastName $40 FirstName $40 Gender $5 PatientID $8 DateOfBirth $200 Ethnicity $20
    HospitalRegion $200 HospitalFacility $60 HospitalFacilityTelephone $20
    TypeOfTransportation $20 AdmissionDate $20 NumberOfHospitalDays  ReferringRegion $20 
    ReferringFacility $20 ReferringStaffMember $40 AdmissionType_Emergency $20 
    AdmissionType_Inpatient $20 PhysicianDirectedExamResult $500 RequiresFollowup $20 
    FollowupDate $200 DischargeDiagnosis $200 DischargedToRegion $20 DischargedToFacility $20 
    DischargeDate $20 ERClaim $20 HospClaim $20 AdmittingDiag $200 DRG $20 ProfClaim $20 AnesClaim $20
    SurgClaim $20 AmbClaim $20 PaymentStatus $20 AuthorizationNumber $20 Comments $500
  ;

	input LastName $ FirstName $ Gender $ PatientID $ DateOfBirth $ Ethnicity $
	HospitalRegion $ HospitalFacility $ HospitalFacilityTelephone $
	TypeOfTransportation $ AdmissionDate $ NumberOfHospitalDays  ReferringRegion $ 
	ReferringFacility $  ReferringStaffMember & $40.  AdmissionType_Emergency $ 
	AdmissionType_Inpatient $  PhysicianDirectedExamResult  ~& $500. RequiresFollowup $ 
	FollowupDate $ DischargeDiagnosis $ DischargedToRegion $ DischargedToFacility $ 
	DischargeDate $ ERClaim $ HospClaim $ AdmittingDiag $ DRG $ ProfClaim $ AnesClaim $ 
	SurgClaim $ AmbClaim $ PaymentStatus $ AuthorizationNumber $ Comments $;

run;

Pilcow Error.PNG

Capture.PNG

Tom
Super User Tom
Super User

You need to have each observation on its own line. 

If you are lucky the extra embedded line breaks are done with just CR or just LF and the real end of lines are done with CR and LF (as it normal for Windows text files).

Try adding TERMSTR=CRLF to the INFILE statement.

Keep the TRUNCOVER option. Do not use FLOWOVER.  You could use MISSOVER but then you cannot use formatted input in the INPUT statement because if the last delimited value is shorted than the fixed length informat MISSOVER will throw the whole truncated value away.

 

If TERMSTR=CRLF does not work then you need to fix the file first. You could fix the pipes and the line breaks at the same time.

Something like which will keep track of the number of pipe characters.  It will remove all CR characters. Replace pipe with quote. Replace LF  that are inside the pipes (that are becoming quotes) with pipes and replace LF that are outside of the pipes with real end of lines written using CRLF.

filename original "ERHospitalization.txt";
filename fixed temp;
data _null_;
  infile original lrecl=1 recfm=f ;
  file fixed termstr=crlf;
  input char $char1.;
  numpipe + (char='|');
  if char='0D'x then return;
  else if char='0A'x then do;
    if mod(numpipe,2)=1 then put '|' @;
    else do;
       numpipe=0;
       put;
    end;
  end;
  else if char='|' then put '"' @;
  else if char='"' then put '""' @;
  else put char $char1. @;
run;

Then try reading the fixed version of the file with your data step. 

 

Note there is no need for your complicated INPUT statement. Since you have already defined the variables with your LENGTH statement you can just use a positional variable list like in my example above.  Or if for some reason you wanted to define the variables with the LENGTH statement in a different order than they are in the text file then just list the names. But there is no need to add and $ or ~ or & characters to the INPUT statement.  

Informatician
Calcite | Level 5

YES!! YES!! YES!! Thank you Tom! Now I get to go try and really understand the code you have provided (I made a few tweaks). I am grateful for the opportunity to learn new things! Thanks again!

ballardw
Super User

Please do yourself a favor and paste the data into a code box opened with either the {I} or "running man" icon. The main message window reformats things such that I would not even bother to write code to read that as the pasted text may vary considerably from your actual file.

 

You may also have issues caused by imbedded line feeds indicated by this stuff:

00:00:00|,|00|,|Region|,|xxxx|,|., (Not Specified)|,|Ye|,|No|,|Hypertension due to septic shock

Abdominal pain due to GI Bleed - resolved
C diff colitis- resolved
Acute Kidney injury likely pre-renal- resolved
PUD w iron deficency anemia-improved|,|False|,||,||,|Non Facilities|,|Unknown|,|2000-01-01 00:00:00|,||,||,||,||,||,||,||,||,|Pending|,||,||

It is very likely that "Abdominal pain " will be treated as the start of a new record and be read into your "Lastname" field.

 

Since this a moderately odd data structure I would start by going to the source and see if they could generate something more like a traditional CSV where character values that contain an imbedded comma would be quoted. While at that I would also request that what ever is causing the line breaks in the middle of field to be replaced with a period and single space.

 

As an aside I am going to guess that someone is assigning 1900-01-01 00:00:00 for missing date time values. And if they are sending you times in data birth that are not actually a time then read that as a date and ignore the time component.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 922 views
  • 0 likes
  • 3 in conversation