ucdcrush
Obsidian | Level 7

Hi all -

 

I have a large text file that looks like this:

 

MSH|askdasldkajs
OBX|asdkjaslkj1239
ORC|asdkljqoi3w4908

MSH|asf98asfaslk
OBX|asd8a7sdoaisyud
NTE|asdasod7as

 

 

I'd like each of those chunks (starting at MSH| and running up to the next MSH|) read into a single (large text) variable, with each chunk as its own record.

 

So basically, read through a text file, when encountering MSH|, dump everything (including the MSH|) into a single variable, then when it runs into the next MSH|, end that first record and start the second record with MSH|, etc. all the way to the end of the file.

 

Can anyone shine light on how to accomplish this? Thank you.

ballardw
Super User


Do you have any idea how long that resulting text variable would have to be, as in character count?

Does your source file actually contain linefeeds and/or carriage returns (end-of-line characters)? Do you want those linefeed or carriage return characters as part of the "chunk"?

 

It would help to post an actual sample of the file, or something with the same sort of structure and content. Post it into a code box opened with the forum's {I} icon, as the main message window will reformat text.

ucdcrush
Obsidian | Level 7

Hi Ballardw,

 

If possible, I'd like the length of that all-encompassing string variable to be 32,767 just to account for longer "chunks". They will not get that long, but will vary in length.

 

The actual data contains line breaks between MSH chunks, but it doesn't matter whether they are retained or not in what is saved to the SAS dataset. Ideally, the solution would not rely on those line breaks, and instead rely on the MSH| characters, as I don't know whether the format of the original large file might change and there not be line breaks between them.

 

I'll try to give a better illustration of what the source file looks like. Please note that the <CR><LF> markers are not literal strings in the file; when I view a source file in Notepad++ and ask it to show all symbols, the <CR> and <LF> symbols appear at the end of each line and at the start of the blank lines. I definitely need to keep whatever symbols/characters are present within the MSH "chunk".

MSH|asdklajsd<CR><LF>
OBX|asdkjasd1923<CR><LF>
ORC|12391283<CR><LF>
NTE|1239182390<CR><LF>
<CR><LF>
MSH|fs8dfiashdfk<CR><LF>
ORC|as8da7s9d8ah<CR><LF>
<CR><LF>
MSH|scfsdf0as9d8f<CR><LF>
OBX|as0d89asdasdjk<CR><LF>

Also, what I'm trying to do here is this:

Take this giant text file and create individual files containing each MSH "chunk". I have some macro code already where I can loop through the SAS table and generate a new file for each record, I just don't know how to generate the SAS dataset containing the file contents.

 

Thanks for your help.

 

Astounding
PROC Star

Here's one approach ...

 

 

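data want;
infile text end=done;
length longvar $32767;
retain longvar;
input @;   /* load the current line into _infile_ */
if _infile_ =: 'MSH|' then do;
   if _n_ > 1 then output;   /* finish the previous chunk */
   longvar = _infile_;       /* start a new chunk */
end;
else longvar = catx('|', longvar, _infile_);
if done then output;   /* write the final chunk */
run;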

Note that I inserted pipes between original text records.  So you won't get:  MSH|abcOBX|abd

Instead, you will get:  MSH|abc|OBX|abd

 

If that's an issue, you can always change the line that does this to:

 

else longvar = cats(longvar, _infile_);

 

It's untested code at this point, so might need a small amount of tweaking.
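
If the line breaks inside each chunk need to survive, one possible variant (also untested) is to join the lines with a carriage return and line feed instead of a pipe. This is only a sketch: it assumes the same TEXT fileref as above, and it relies on CATX skipping blank arguments, so the empty lines between chunks are dropped rather than kept.

```sas
/* Sketch: keep a CR LF pair between the lines of each chunk.        */
/* Assumes a fileref named TEXT already points at the source file.   */
data want;
   infile text end=done;
   length longvar $32767;
   retain longvar;
   input;                              /* load the line into _infile_ */
   if _infile_ =: 'MSH|' then do;
      if _n_ > 1 then output;          /* write the previous chunk    */
      longvar = _infile_;              /* start a new chunk           */
   end;
   /* '0D0A'x is a hex literal for CR followed by LF */
   else longvar = catx('0D0A'x, longvar, _infile_);
   if done then output;                /* write the final chunk       */
run;
```

The embedded CR and LF characters will not be visible when browsing the dataset, but they should be written out as part of the value when longvar is sent to a file.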

ucdcrush
Obsidian | Level 7

Hi Astounding, thank you, I think it's close!

 

It seems that the resulting longvar is a single line, and that the line breaks within the "chunk" are gone. When I paste the resulting string into a text file, it's all on one line, and the next system in line apparently doesn't work without the line breaks that were present in the original file.

 

Is there a way to include those characters (or perhaps a delimiter in their place) within the input step?

 

ucdcrush
Obsidian | Level 7

I'm now experimenting with just bringing in each line as a new record, into the same Longvar variable.

 

Can I loop through that resulting SAS table and assign a "chunk number", as in below?

 

 

 

LONGVAR                             CHUNKNUM

MSH asdas lkajsd                         1
OBX askldjaslkdjas ldk                   1  
                                         1 
MSH asas342sd                            2
OBC ask1234k4sldkj                       2  
                                         2
MSH asaas8df79sd7f                       3
OBR asd89a7s9das                         3 

 

 

Astounding
PROC Star

I added a "|" as a delimiter ... you can easily change that to something else, such as "?":

 

else longvar = catx('?', longvar, _infile_);

 

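But your latest idea is reasonably easy and won't run into length problems. Something along these lines:

 

data want;
infile textfile;
length longvar $100;   /* increase if any line is longer than 100 characters */
input @;
longvar = _infile_;
if longvar =: 'MSH|' then chunknum + 1;   /* sum statement: chunknum is retained across lines */
run;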
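
Since the end goal is one file per chunk, a line-per-record table like this could be written back out with the FILEVAR= option of the FILE statement, which switches output files whenever the named variable changes. This is a rough, untested sketch: the chunk_N.txt file names are made up for illustration, the placeholder fileref OUTDUMMY is arbitrary, and it assumes WANT still has longvar ($100) and chunknum in the original line order.

```sas
/* Sketch: write each chunk of WANT to its own text file.            */
/* The chunk_<n>.txt names are hypothetical; add a directory path    */
/* as needed.                                                        */
data _null_;
   set want;
   length fname $200;
   fname = cats('chunk_', chunknum, '.txt');
   /* FILEVAR= closes the current file and opens a new one each time */
   /* fname changes; OUTDUMMY is only a placeholder fileref.         */
   file outdummy filevar=fname;
   len = lengthn(longvar);             /* 0 for blank lines           */
   put longvar $varying100. len;       /* write without padding       */
run;
```

Any lines that come before the first MSH| would have chunknum 0 and land in chunk_0.txt.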

ucdcrush
Obsidian | Level 7

Thanks for your help Astounding! I ended up doing the "chunking" part after I'd read the data in without that chunk variable. It's working now!
