BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jimbobob
Quartz | Level 8

Stuck trying to figure out how to parse a file that we get from fedwire the general format is in attached myfile.txt  There can be any number of records. Within the records the data elements are separated by a tag enclosed in {} which represent a field name, the data following the tag is associated with that field, up until we reach next tag {}. Each record can have different tags and different number of tags, as in the attached example all three records have the same tags {1100}, {1110},{1120},{1510},{1520},{2000},{3100},{3320},{3400},and {3600}. However you can see record three then jumps to {4200}, whereas records 1and 2 the next tag is {3620}. 

 

I have tried using input @, but struggling with that, stopping after field2, also some of the data within the tag can be variable length, so can't always set the field value to extract exact length of the data, so trying to figure out how to parse by tags {} and capture data.

options ls=max;
filename fedwire "%sysfunc(pathname(WORK))\fedwire.txt";
data have;
  input;
  file fedwire;
  put _infile_;
  datalines;
{1100}30P N{1110}08030815FT03{1120}20210803L1B7832F00013308030815FT03{1510}1000{1520}20210801B1Q8161C001707{2000}002020230000{3100}021000018BANKOBOB*{3320}F9S2108035631900*{3400}121125631AAA BBB CCCC DD EE*{3600}CTP{3620}3*FA6FE938-A6F6-4EE9-A062-6B87FC752339*{3700}SUSD0,00*{3710}USD203000,*{4200}D1602204277*ALOT OF WRITTING STUFF 98203*{5000}D02628651*MR WARREN JOHN BLAHALBH*FLAT 30 135 AAAAAAA WALK*LONDON N1 7RR GB*{5100}D8033779245*COOTTS AND COMP'L. KING DIV*73 MAN STR, EAST HSE, 3RD FL*D*{6000}HELEN BOBS STUFF*ROUTE: 122228444*
{1100}30P N{1110}08031116FT03{1120}20210803L1B7832F00031008031116FT03{1510}1000{1520}20210803B1QGC06C006237{2000}000300899500{3100}021000021NONEOFBUS*{3320}98566273215FS*{3400}121222484AAA BBB CCCC DD EE*{3600}CTP{3620}3*9A294483-5377-4A00-9362-8C95FEF4229F*{3700}SUSD5,00*{3710}USD6000,00*{4100}F124448405*MOREBLAH BLAH BLAH STUFF*{4200}D88040660033579*EDUARDO BLAH BLAH HERNANDEZ, E*STADOS UNIDOS,11111 LILAC ,TEXAS*{4320}SWF OF 21/09/03*{5000}DCLIENTE:13770683*INDUSTY  S A DE C V*BRALES C P*45694 EL SALTO JALISCO MEXICO*{5100}BRGIOMFE33XMT*
{1100}30P S{1110}08031811FT03{1120}20210803L1B7832F00101008031811FT03{1510}1631{1520}20210803QMGAT015001875{2000}000440962083{3100}125108405CANTSEE ME*{3320}F001BB*{3400}121045584AAA BBB CCCC DD EE*{3600}DRB{4200}F125108405*SOMEWHER OVER THE RAINBOW*{4400}D125108405*NOT IN KANSAS*{5000}F125108405*LOST SOMEWHERE*{5400}125108405{6500}DRAW*CREDIT IMMUNITY PALACE*
run;
data want2;
infile fedwire  truncover;
input @"{1100}" field1 :$5. @"{1110}" field2 :$12. @"{1120}" field3 :$34.;
run;


Was trying to use regular expression but can't figure that out either, I think this will work, but not sure how to keep field name  prxparse('/(?<={\d\d\d\d})\w+/')

Any help would be appreciated. Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Seems pretty simple to read.

data mydata;
  infile 'c:\downloads\myfile.txt' dlm='{}' truncover ;
  length row col 8 tag $20 text $200 ;
  row+1;
  do col=1 by 1 until(tag=' ');
    input tag text @;
    if tag ne ' ' then output;
  end;
run;

Results:

Obs    row    col    tag     text

  1     1       1    1100    30P N
  2     1       2    1110    08030815FT03
  3     1       3    1120    20210803L1B7832F00013308030815FT03
  4     1       4    1510    1000
  5     1       5    1520    20210801B1Q8161C001707
  6     1       6    2000    002020230000
  7     1       7    3100    021000018BANKOBOB*
  8     1       8    3320    F9S2108035631900*
  9     1       9    3400    121125631AAA BBB CCCC DD EE*
 10     1      10    3600    CTP
 11     1      11    3620    3*FA6FE938-A6F6-4EE9-A062-6B87FC752339*
 12     1      12    3700    SUSD0,00*
 13     1      13    3710    USD203000,*
 14     1      14    4200    D1602204277*ALOT OF WRITTING STUFF 98203*
 15     1      15    5000    D02628651*MR WARREN JOHN BLAHALBH*FLAT 30 135 AAAAAAA WALK*LONDON N1 7RR GB*
 16     1      16    5100    D8033779245*COOTTS AND COMP'L. KING DIV*73 MAN STR, EAST HSE, 3RD FL*D*
 17     1      17    6000    HELEN BOBS STUFF*ROUTE: 122228444*
 18     2       1    1100    30P N
 19     2       2    1110    08031116FT03
 20     2       3    1120    20210803L1B7832F00031008031116FT03
...

View solution in original post

16 REPLIES 16
ballardw
Super User

The documentation for the file contents is where exactly?

My guess is some of those "fields" have a mixture of date and other stuff.

 

Is there anything that identifies a line, or combination of lines, as a single record?

 

If the fields can have different lengths it really helps to know which ones what rules may be involved for the substrings.

 

jimbobob
Quartz | Level 8

If you really want to read the documentation, here you go: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.445.7645&rep=rep1&type=pdf

 

Yes the data in between the tags can be a mixture, for my purposes if I can get each set of data elements between the tags into its own column of data, then i can parse that data further.

 

Each row is one record that can be parsed out to many fields. Data between each tag {} can be one or more fields of data as well, but im working on baby steps just trying to break out each set of data between the tags to its own column.

ChrisNZ
Tourmaline | Level 20

Like this?

data WANT;
  infile FEDWIRE truncover lrecl=8000;
  input ;
  array FIELD[0:99] $10.;
  array VALUE[0:99] $50.;
  RECORD=_N_;
  do SCAN=0 to 99 ;
    FIELD[SCAN]=scan(_INFILE_,SCAN*2+1,'{}');
    VALUE[SCAN]=scan(_INFILE_,SCAN*2+2,'{}');
    if VALUE[SCAN]=' ' then leave; 
  end;
  output;
 run;

 

jimbobob
Quartz | Level 8

Thanks Chris, I might be able to work with this. Attached is my desired output

ChrisNZ
Tourmaline | Level 20

Glad to know. I'm not touching MS Office files 🙂

jimbobob
Quartz | Level 8
1100 1110 1120 1510 1520 2000 3100 3320 3400 3600 3620 3700 3710 4100 4200 4320 4400 5000 5100 5400 6000 6500
30P N 08030815FT03 20210803L1B7832F00013308030815FT03 1000 20210801B1Q8161C001707 002020230000 021000018BANKOBOB* F9S2108035631900* 121125631AAA BBB CCCC DD EE* CTP 3*FA6FE938-A6F6-4EE9-A062-6B87FC752339* SUSD0,00* USD203000,*   D1602204277*ALOT OF WRITTING STUFF 98203*     D02628651*MR WARREN JOHN BLAHALBH*FLAT 30 135 AAAA D8033779245*COOTTS AND COMP'L. KING DIV*73 MAN STR   HELEN BOBS STUFF*ROUTE: 122228444*  
30P N 08031116FT03 20210803L1B7832F00031008031116FT03 1000 20210803B1QGC06C006237 000300899500 021000021NONEOFBUS* 98566273215FS* 121222484AAA BBB CCCC DD EE* CTP 3*9A294483-5377-4A00-9362-8C95FEF4229F* SUSD5,00* USD6000,00* F124448405*MOREBLAH BLAH BLAH STUFF* D88040660033579*EDUARDO BLAH BLAH HERNANDEZ, E*STA SWF OF 21/09/03*   DCLIENTE:13770683*INDUSTY  S A DE C V*BRALES C P*4 BRGIOMFE33XMT*      
30P S 08031811FT03 20210803L1B7832F00101008031811FT03 1631 20210803QMGAT015001875 000440962083 125108405CANTSEE ME* F001BB* 121045584AAA BBB CCCC DD EE* DRB         F125108405*SOMEWHER OVER THE RAINBOW*   D125108405*NOT IN KANSAS* F125108405*LOST SOMEWHERE*   125108405   DRAW*CREDIT IMMUNITY PALACE*
ChrisNZ
Tourmaline | Level 20

1. How do you establish the list of variable names?

2. The names you show are invalid: the first character is a digit.

jimbobob
Quartz | Level 8

The goal is to rename the numbers to field name, for example 1100 = Message_Disposition, 1110 = Receipt_Time_Stamp, this pdf list out the actuals names that I would rename the variables to https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.445.7645&rep=rep1&type=pdf

 

ballardw
Super User

@jimbobob wrote:

The goal is to rename the numbers to field name, for example 1100 = Message_Disposition, 1110 = Receipt_Time_Stamp, this pdf list out the actuals names that I would rename the variables to https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.445.7645&rep=rep1&type=pdf

 


You might consider a custom format to supply the text "Message disposition" or similar for the value of the Tag, using @Tom's code.

Or create a format so that the formatted value of the Tag variable could be used by Proc Transpose in an ID statement if you had a variable that would indicate which observations were from the same line in the original file.

Example using a format to create new variables using with Proc Transpose:

data example;
   input x $ y $ z $;
datalines;
1 2 abc
4 5 pdq
;

proc format;
value $ycode
'2'='Some_description'
'4'='Other_desc'
'5'='This_description'
;
run;

proc transpose data=example
   out=trans;
by x;
id y;
var z;
format y $ycode.;
run;

Placing a significant chunk of _infile_ into a permanent variable might work for a line identifier if nothing better comes to mind using Tom's code.

 

Tom
Super User Tom
Super User

A unique row identifier is what the ROW variable is for in the selected answer.

jimbobob
Quartz | Level 8

Excellent, thanks Ballard, way better than the route I was taking using a replace macro I found in a SAS paper was working but taking forever to run. this was instant.

Tom
Super User Tom
Super User

Seems pretty simple to read.

data mydata;
  infile 'c:\downloads\myfile.txt' dlm='{}' truncover ;
  length row col 8 tag $20 text $200 ;
  row+1;
  do col=1 by 1 until(tag=' ');
    input tag text @;
    if tag ne ' ' then output;
  end;
run;

Results:

Obs    row    col    tag     text

  1     1       1    1100    30P N
  2     1       2    1110    08030815FT03
  3     1       3    1120    20210803L1B7832F00013308030815FT03
  4     1       4    1510    1000
  5     1       5    1520    20210801B1Q8161C001707
  6     1       6    2000    002020230000
  7     1       7    3100    021000018BANKOBOB*
  8     1       8    3320    F9S2108035631900*
  9     1       9    3400    121125631AAA BBB CCCC DD EE*
 10     1      10    3600    CTP
 11     1      11    3620    3*FA6FE938-A6F6-4EE9-A062-6B87FC752339*
 12     1      12    3700    SUSD0,00*
 13     1      13    3710    USD203000,*
 14     1      14    4200    D1602204277*ALOT OF WRITTING STUFF 98203*
 15     1      15    5000    D02628651*MR WARREN JOHN BLAHALBH*FLAT 30 135 AAAAAAA WALK*LONDON N1 7RR GB*
 16     1      16    5100    D8033779245*COOTTS AND COMP'L. KING DIV*73 MAN STR, EAST HSE, 3RD FL*D*
 17     1      17    6000    HELEN BOBS STUFF*ROUTE: 122228444*
 18     2       1    1100    30P N
 19     2       2    1110    08031116FT03
 20     2       3    1120    20210803L1B7832F00031008031116FT03
...
jimbobob
Quartz | Level 8

Awesome. I love SAS folks. Thank you so much Tom.  Chris thanks as well for your help,

jimbobob
Quartz | Level 8

Hey Tom, ran into a snag with some of my files. I found that in some file the records start with "YFT811 " before starting the rest of the delimited file, so the delimted file really starts at position 9: I found a paper on using a variable for the dlm= option, but I've had no luck figuring out how to do it

 

Capture.PNG

 

Paper I was trying to follow:  https://www.lexjansen.com/nesug/nesug12/cc/cc23.pdf

My Attempted Code:

data x;
  infile "C:\Users\jwalker\Desktop\New folder (2)\FedLine32948_PCBB_FXWIRE_FEDLINE_202107201050.txt" dlm=mydlm truncover;
  length row col 8 tag $32 text $1000;
  row+1;
  mydlm = 'YFT811';
  input v1 $ @;
  do col=1 by 1 until(tag=' ');
  	 mydlm = '{}';
    input tag text @;
    if tag ne ' ' then output;
  end;
run;

I there a way to do this or tell the input to start at position 9?

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 4042 views
  • 0 likes
  • 4 in conversation