BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
sg_kr
Obsidian | Level 7

Hi,

 

I'm trying to import a text file which is unstructured like header at the top and then column names.

i want the dataset with column names listed below

account no

in-date

in-Time

balance

type

code

ref

party

i have attached sample input file and also attached the desired output file as well

 

kindly please help me on this

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Does NOT look unstructured to me.  But it does look like someone accidentally opened it in an editor that replaces spaces with tab characters.  You can use the EXPANDTABS option of the INFILE statement to undo that.

 

Then you can LOOK at the actual file so the structure is clearer.

data _null_;
  infile "C:\Users\abernathyt\Downloads\test.txt" expandtabs ;
  input;
  list;
run;

Now you can see that the data is appearing in fixed column locations on the lines.  

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1                                 Description AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     101  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 137
2         assakndsklfdsfpsdskmcs'kfdskfnsd'vkmsdv;ls vlksnfdks'fjdsoifnsonvs;lnsohfsnskjfhso;fskfjslfks';fljsl
     101  kfjs;lfjslkfj;lksfjslkfjos;kufoeisocnds;ojvnoriecnwiocns 156
3         fhdsiupfhwfnsdpcmdwpofwecmsd'lkvnpsfmspcns[oijsncspncdsvnsnccd 62
4          0
5          0
6             account no          in-date   in-Time       balance  type   code       ref          party  94
7         *****************       *******   *******       ******* ******  ********   ******    ************* 98
8          0
9         0000000000001111        20230222  23:01:01      100.00- debit   Transfer   123456       00-1289-1234
     101  56-00 105
10                                           DR AMT:      100.00  CR AMT:            0.00  Net Amt:
     101      100.00- 111
11                                         32
12        0000000000001111        20230222  23:01:01      100.00-                                 00-1289-1234
     101  56-00 105
13        0000000000001111        20230222  23:01:01      100.00-                                 00-1289-1234
     101  46-04 105
14        0000000000001111        20230222  23:01:01      100.00-                                 00-1289-1234
     101  56-00 105

You just need to do a little work to see which lines to IGNORE.

data test;
  infile "C:\downloads\test.txt" expandtabs truncover firstobs=4;
  input account_no $20. @;
  if (account_no in (' ','account no','*****************') then delete;
  input @25 date yymmdd8. @35 time time8.
        @45 balance trailsgn11. type $ 56-63 code $ 65-74 ref $75-85 part $86-105 
  ;
  format date yymmdd10. time tod8.;
run;

Results

OBS       account_no             date        time    balance    type       code       ref      part

  1    0000000000001111    2023-02-22    23:01:01      -100     debit    Transfer    123456    00-1289-123456-00
  2    0000000000001111    2023-02-22    23:01:01      -100                                    00-1289-123456-00
  3    0000000000001111    2023-02-22    23:01:01      -100                                    00-1289-123446-04
  4    0000000000001111    2023-02-22    23:01:01      -100                                    00-1289-123456-00
  5    0000000000001111    2023-02-22    23:01:01      -100                                    xxxx-xxxx
  6    0000000000001111    2023-02-22    23:01:01      -100     debit    Transfer    123456    00-1289-123456-00
  7    0000000000001111    2023-02-22    23:01:01      -100     debit    Transfer    123456    00-1289-123456-00
  8    0000000000001111    2023-02-22    23:01:01      -100     debit    Transfer    123456    00-1289-123456-00
  9    0000000000001111    2023-02-22    23:01:01      -100                                    00-1289-123456-00
 10    0000000000001111    2023-02-22    23:01:01      -100                                    00-1289-123446-04
 11    0000000000001111    2023-02-22    23:01:01      -100                                    00-1289-123456-00
 12    0000000000001111    2023-02-22    23:01:01      -100                                    xxxx-xxxx
 13    0000000000001111    2023-02-22    23:01:01      -100     debit    Transfer    123456    00-1289-123456-00
 14    0000000000001111    2023-02-22    23:01:01      -100     debit    Transfer    123456    00-1289-123456-00



View solution in original post

16 REPLIES 16
Ksharp
Super User
option validvarname=any;  *Make sure variable name could contain blanks;
filename x 'c:\temp\test.txt' encoding='utf8' termstr=crlf;
data have;
infile x  dlm='09'x dsd truncover;
input @;
if anydigit(_infile_) ne 1 then delete;
input  ('account no'n  'in-date-time'n  balance	 type code_ref party)  (: $40.);
'in-date'n=scan(strip('in-date-time'n),1,' ','m');
'in-time'n=scan(strip('in-date-time'n),-1,' ','m');
code=scan(strip(code_ref),1,' ','m');
ref=scan(strip(code_ref),-1,' ','m');
drop 'in-date-time'n code_ref;
run;
data want;
retain 'account no'n  'in-date'n  'in-time'n  balance	 type code ref party;
set have;
run;
sg_kr
Obsidian | Level 7

Thanks for your response!

I tried to
Execute it but unfortunately i encountered an error like below
also there is an date column included inthe file as well

An error occurred executing the workspace job "Sample Data".
Integration technologies failed to submit the code. [Error] Failed to transcode data from U_UTF8_CE to U_LATIN1_CE encoding because it contained characters which are not supported by your SAS session encoding. Please review your encoding= and locale= SAS system options to ensure that they can accommodate the data that you want to process. A portion of the source string, in hex representation is:
[None] 7305623f3a4: 3b 2a 27 3b 2a 22 3b 2a 2f 3b 71 75 69 74
3b 72 *:***/,quit;rl
[None] 7305623f3b4: 75 6e 3b 0d 0a 4f 50 54 49 4f 4e 53 20 50 41 47 |un;..OPTIONS PAGI
[Error] Some code points did not transcode.

Tom
Super User Tom
Super User

Those HEX codes you showed are just normal characters, nothing that would cause transcoding issues.  In fact they just look like the string that SAS/Studio and/or Enterprise Guide sends after submitting your user written code in the hope of cleaning up any messes your code caused.

 

Why are you reading the file as if it contained UTF-8 characters?  Are you sure it does contain UTF-8 characters?

 

LATIN1 is a single byte encoding so it can only represent 256 characters. There are thousands of UTF-8 characters (since it can use up to 4 bytes for a character) so not all of them can be transcoded to LATIN1.

 

You need to be running SAS with ENCODING=UTF-8 instead of LATIN1 if you want to read non 7bit ASCII characters. 

Tom
Super User Tom
Super User

The file you shared does not have any non 7bit ASCII characters. So it will NOT cause any transcoding errors.

354  data _null_;
355    infile "C:\Users\xxx\Downloads\test (1).txt" expandtabs  ;
356    input;
357    if indexc(_infile_,collate(128,255));
358    list;
359  run;

NOTE: The infile "C:\Users\xxx\Downloads\test (1).txt" is:
      (system-specific pathname),
      (system-specific file attributes)

NOTE: 40 records were read from the infile (system-specific pathname).
      The minimum record length was 0.
      The maximum record length was 156.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

 

Ksharp
Super User
As @Tom said, you need to change your sas encoding to be UTF-8 , it has nothing to do with sas code.

There are some characters in UTF-8 can not display in Latin1 encoding ,therefore you can not input these characters .
Tom
Super User Tom
Super User

Does NOT look unstructured to me.  But it does look like someone accidentally opened it in an editor that replaces spaces with tab characters.  You can use the EXPANDTABS option of the INFILE statement to undo that.

 

Then you can LOOK at the actual file so the structure is clearer.

data _null_;
  infile "C:\Users\abernathyt\Downloads\test.txt" expandtabs ;
  input;
  list;
run;

Now you can see that the data is appearing in fixed column locations on the lines.  

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1                                 Description AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     101  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 137
2         assakndsklfdsfpsdskmcs'kfdskfnsd'vkmsdv;ls vlksnfdks'fjdsoifnsonvs;lnsohfsnskjfhso;fskfjslfks';fljsl
     101  kfjs;lfjslkfj;lksfjslkfjos;kufoeisocnds;ojvnoriecnwiocns 156
3         fhdsiupfhwfnsdpcmdwpofwecmsd'lkvnpsfmspcns[oijsncspncdsvnsnccd 62
4          0
5          0
6             account no          in-date   in-Time       balance  type   code       ref          party  94
7         *****************       *******   *******       ******* ******  ********   ******    ************* 98
8          0
9         0000000000001111        20230222  23:01:01      100.00- debit   Transfer   123456       00-1289-1234
     101  56-00 105
10                                           DR AMT:      100.00  CR AMT:            0.00  Net Amt:
     101      100.00- 111
11                                         32
12        0000000000001111        20230222  23:01:01      100.00-                                 00-1289-1234
     101  56-00 105
13        0000000000001111        20230222  23:01:01      100.00-                                 00-1289-1234
     101  46-04 105
14        0000000000001111        20230222  23:01:01      100.00-                                 00-1289-1234
     101  56-00 105

You just need to do a little work to see which lines to IGNORE.

data test;
  infile "C:\downloads\test.txt" expandtabs truncover firstobs=4;
  input account_no $20. @;
  if (account_no in (' ','account no','*****************') then delete;
  input @25 date yymmdd8. @35 time time8.
        @45 balance trailsgn11. type $ 56-63 code $ 65-74 ref $75-85 part $86-105 
  ;
  format date yymmdd10. time tod8.;
run;

Results

OBS       account_no             date        time    balance    type       code       ref      part

  1    0000000000001111    2023-02-22    23:01:01      -100     debit    Transfer    123456    00-1289-123456-00
  2    0000000000001111    2023-02-22    23:01:01      -100                                    00-1289-123456-00
  3    0000000000001111    2023-02-22    23:01:01      -100                                    00-1289-123446-04
  4    0000000000001111    2023-02-22    23:01:01      -100                                    00-1289-123456-00
  5    0000000000001111    2023-02-22    23:01:01      -100                                    xxxx-xxxx
  6    0000000000001111    2023-02-22    23:01:01      -100     debit    Transfer    123456    00-1289-123456-00
  7    0000000000001111    2023-02-22    23:01:01      -100     debit    Transfer    123456    00-1289-123456-00
  8    0000000000001111    2023-02-22    23:01:01      -100     debit    Transfer    123456    00-1289-123456-00
  9    0000000000001111    2023-02-22    23:01:01      -100                                    00-1289-123456-00
 10    0000000000001111    2023-02-22    23:01:01      -100                                    00-1289-123446-04
 11    0000000000001111    2023-02-22    23:01:01      -100                                    00-1289-123456-00
 12    0000000000001111    2023-02-22    23:01:01      -100                                    xxxx-xxxx
 13    0000000000001111    2023-02-22    23:01:01      -100     debit    Transfer    123456    00-1289-123456-00
 14    0000000000001111    2023-02-22    23:01:01      -100     debit    Transfer    123456    00-1289-123456-00



Ksharp
Super User

@Tom ,

if you are using the test.txt updated by OP, your code would not work.

 

Ksharp_0-1707869897082.png

 

Tom
Super User Tom
Super User

That file as two more variables.  Just adjust.

It also has some minor inconsistency in the position of the fields.  Not sure what caused that, probably whatever changed the spaces into tabs. But there is  enough empty space around the moved fields to make the same method work.  Just increase the width of the informats used.

Ksharp
Super User
"It also has some minor inconsistency in the position of the fields. "

Yes. That is reason why I could not use column input method.
Tom
Super User Tom
Super User

@Ksharp wrote:
"It also has some minor inconsistency in the position of the fields. "

Yes. That is reason why I could not use column input method.

Plus the fact that COLUMN MODE input does not allow using special INFORMATs like YYMMDD or TIME or TRAILSGN that are needed for some of those values. 

You cannot use LIST MODE because the missing values and the delimiter are both represented by spaces.

So that just leaves FORMATTED MODE for those variables.

 

data test;
  infile "C:\downloads\test (1).txt" expandtabs truncover firstobs=4;
/*
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         0000000000001111        20230222 20230222  23:01:01     100.00- debit   Transfer   123456       00-1
     101  289-123456-00 113
2         0000000000001111        20230222  20230222  23:01:01    100.00-                                 00-1
     101  289-123456-00 113
*/
  input account_no $20. @;
  if not (account_no in (' ','account no','*****************'));
  input @25 date yymmdd8. @34 in_date yymmdd10. @44 in_time time10.
        @55 balance trailsgn9. type $ 65-72 code $ 73-82 ref $84-96 part $97-113
  ;
  format date in_date yymmdd10. in_time tod8.;
run;
sg_kr
Obsidian | Level 7

Thanks Tom for the resolution.

Any chance , can we get the deleted items to another dataset . like test data should have all the account numbers data and  another dataset to have data which are deleted in the if statement

 

if (account_no in (' ','account no','*****************') then delete;

 

when i use output, im able to get them but my test data is empty.

Tom
Super User Tom
Super User

The lines with 'account no' or '*****************' in that position do not appear to me to have any other information.  Neither do the lines that are totally empty.

6             account no          in-date   in-Time       balance  type   code       ref          party  94
7         *****************       *******   *******       ******* ******  ********   ******    ************* 98
8          0

What information did you want to keep from the lines that are blank at the beginning and have stuff in other places.  What do those values mean?

10                                           DR AMT:      100.00  CR AMT:            0.00  Net Amt:
     101      100.00- 111

To me those just looked like sub-total lines, Can't you recreate them by adding up the values read from the actual data lines?

If they are important then you will need to make another variable to RETAIN the account number read from the lines above since it is empty on those lines.

sg_kr
Obsidian | Level 7

absoultely agree!! those values doesnt have any sense.

 

but we need to check what all  the data we deleted and thats the reason we want them in another dataset

 

can you make clear on how to use retain here?

 

thank again!!

Tom
Super User Tom
Super User

I don't understand.  

What "DATA" did you delete?

The DELETE statement just ends the data step iteration because it has determined the current line of the file has no data on it.

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 16 replies
  • 7548 views
  • 0 likes
  • 3 in conversation