Hi,
I'm trying to import a text file which is unstructured like header at the top and then column names.
i want the dataset with column names listed below
account no
in-date
in-Time
balance
type
code
ref
party
i have attached sample input file and also attached the desired output file as well
kindly please help me on this
Does NOT look unstructured to me. But it does look like someone accidentally opened it in an editor that replaces spaces with tab characters. You can use the EXPANDTABS option of the INFILE statement to undo that.
Then you can LOOK at the actual file so the structure is clearer.
data _null_;
  infile "C:\Users\abernathyt\Downloads\test.txt" expandtabs ;
  input;
  list;
run;
Now you can see that the data is appearing in fixed column locations on the lines.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1                                 Description AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     101  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 137
2         assakndsklfdsfpsdskmcs'kfdskfnsd'vkmsdv;ls vlksnfdks'fjdsoifnsonvs;lnsohfsnskjfhso;fskfjslfks';fljsl
     101  kfjs;lfjslkfj;lksfjslkfjos;kufoeisocnds;ojvnoriecnwiocns 156
3         fhdsiupfhwfnsdpcmdwpofwecmsd'lkvnpsfmspcns[oijsncspncdsvnsnccd 62
4          0
5          0
6             account no          in-date   in-Time       balance  type   code       ref          party  94
7         *****************       *******   *******       ******* ******  ********   ******    ************* 98
8          0
9         0000000000001111        20230222  23:01:01      100.00- debit   Transfer   123456       00-1289-1234
     101  56-00 105
10                                           DR AMT:      100.00  CR AMT:            0.00  Net Amt:
     101      100.00- 111
11                                         32
12        0000000000001111        20230222  23:01:01      100.00-                                 00-1289-1234
     101  56-00 105
13        0000000000001111        20230222  23:01:01      100.00-                                 00-1289-1234
     101  46-04 105
14        0000000000001111        20230222  23:01:01      100.00-                                 00-1289-1234
     101  56-00 105
You just need to do a little work to see which lines to IGNORE.
data test;
  infile "C:\downloads\test.txt" expandtabs truncover firstobs=4;
  input account_no $20. @;
  if (account_no in (' ','account no','*****************') then delete;
  input @25 date yymmdd8. @35 time time8.
        @45 balance trailsgn11. type $ 56-63 code $ 65-74 ref $75-85 part $86-105 
  ;
  format date yymmdd10. time tod8.;
run;Results
OBS account_no date time balance type code ref part 1 0000000000001111 2023-02-22 23:01:01 -100 debit Transfer 123456 00-1289-123456-00 2 0000000000001111 2023-02-22 23:01:01 -100 00-1289-123456-00 3 0000000000001111 2023-02-22 23:01:01 -100 00-1289-123446-04 4 0000000000001111 2023-02-22 23:01:01 -100 00-1289-123456-00 5 0000000000001111 2023-02-22 23:01:01 -100 xxxx-xxxx 6 0000000000001111 2023-02-22 23:01:01 -100 debit Transfer 123456 00-1289-123456-00 7 0000000000001111 2023-02-22 23:01:01 -100 debit Transfer 123456 00-1289-123456-00 8 0000000000001111 2023-02-22 23:01:01 -100 debit Transfer 123456 00-1289-123456-00 9 0000000000001111 2023-02-22 23:01:01 -100 00-1289-123456-00 10 0000000000001111 2023-02-22 23:01:01 -100 00-1289-123446-04 11 0000000000001111 2023-02-22 23:01:01 -100 00-1289-123456-00 12 0000000000001111 2023-02-22 23:01:01 -100 xxxx-xxxx 13 0000000000001111 2023-02-22 23:01:01 -100 debit Transfer 123456 00-1289-123456-00 14 0000000000001111 2023-02-22 23:01:01 -100 debit Transfer 123456 00-1289-123456-00
option validvarname=any;  *Make sure variable name could contain blanks;
filename x 'c:\temp\test.txt' encoding='utf8' termstr=crlf;
data have;
infile x  dlm='09'x dsd truncover;
input @;
if anydigit(_infile_) ne 1 then delete;
input  ('account no'n  'in-date-time'n  balance	 type code_ref party)  (: $40.);
'in-date'n=scan(strip('in-date-time'n),1,' ','m');
'in-time'n=scan(strip('in-date-time'n),-1,' ','m');
code=scan(strip(code_ref),1,' ','m');
ref=scan(strip(code_ref),-1,' ','m');
drop 'in-date-time'n code_ref;
run;
data want;
retain 'account no'n  'in-date'n  'in-time'n  balance	 type code ref party;
set have;
run;Thanks for your response!
I tried to
Execute it but unfortunately i encountered an error like below
also there is an date column included inthe file as well
An error occurred executing the workspace job "Sample Data".
Integration technologies failed to submit the code. [Error] Failed to transcode data from U_UTF8_CE to U_LATIN1_CE encoding because it contained characters which are not supported by your SAS session encoding. Please review your encoding= and locale= SAS system options to ensure that they can accommodate the data that you want to process. A portion of the source string, in hex representation is:
[None] 7305623f3a4: 3b 2a 27 3b 2a 22 3b 2a 2f 3b 71 75 69 74
3b 72 *:***/,quit;rl
[None] 7305623f3b4: 75 6e 3b 0d 0a 4f 50 54 49 4f 4e 53 20 50 41 47 |un;..OPTIONS PAGI
[Error] Some code points did not transcode.
Those HEX codes you showed are just normal characters, nothing that would cause transcoding issues. In fact they just look like the string that SAS/Studio and/or Enterprise Guide sends after submitting your user written code in the hope of cleaning up any messes your code caused.
Why are you reading the file as if it contained UTF-8 characters? Are you sure it does contain UTF-8 characters?
LATIN1 is a single byte encoding so it can only represent 256 characters. There are thousands of UTF-8 characters (since it can use up to 4 bytes for a character) so not all of them can be transcoded to LATIN1.
You need to be running SAS with ENCODING=UTF-8 instead of LATIN1 if you want to read non 7bit ASCII characters.
The file you shared does not have any non 7bit ASCII characters. So it will NOT cause any transcoding errors.
354  data _null_;
355    infile "C:\Users\xxx\Downloads\test (1).txt" expandtabs  ;
356    input;
357    if indexc(_infile_,collate(128,255));
358    list;
359  run;
NOTE: The infile "C:\Users\xxx\Downloads\test (1).txt" is:
      (system-specific pathname),
      (system-specific file attributes)
NOTE: 40 records were read from the infile (system-specific pathname).
      The minimum record length was 0.
      The maximum record length was 156.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
Does NOT look unstructured to me. But it does look like someone accidentally opened it in an editor that replaces spaces with tab characters. You can use the EXPANDTABS option of the INFILE statement to undo that.
Then you can LOOK at the actual file so the structure is clearer.
data _null_;
  infile "C:\Users\abernathyt\Downloads\test.txt" expandtabs ;
  input;
  list;
run;
Now you can see that the data is appearing in fixed column locations on the lines.
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1                                 Description AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
     101  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 137
2         assakndsklfdsfpsdskmcs'kfdskfnsd'vkmsdv;ls vlksnfdks'fjdsoifnsonvs;lnsohfsnskjfhso;fskfjslfks';fljsl
     101  kfjs;lfjslkfj;lksfjslkfjos;kufoeisocnds;ojvnoriecnwiocns 156
3         fhdsiupfhwfnsdpcmdwpofwecmsd'lkvnpsfmspcns[oijsncspncdsvnsnccd 62
4          0
5          0
6             account no          in-date   in-Time       balance  type   code       ref          party  94
7         *****************       *******   *******       ******* ******  ********   ******    ************* 98
8          0
9         0000000000001111        20230222  23:01:01      100.00- debit   Transfer   123456       00-1289-1234
     101  56-00 105
10                                           DR AMT:      100.00  CR AMT:            0.00  Net Amt:
     101      100.00- 111
11                                         32
12        0000000000001111        20230222  23:01:01      100.00-                                 00-1289-1234
     101  56-00 105
13        0000000000001111        20230222  23:01:01      100.00-                                 00-1289-1234
     101  46-04 105
14        0000000000001111        20230222  23:01:01      100.00-                                 00-1289-1234
     101  56-00 105
You just need to do a little work to see which lines to IGNORE.
data test;
  infile "C:\downloads\test.txt" expandtabs truncover firstobs=4;
  input account_no $20. @;
  if (account_no in (' ','account no','*****************') then delete;
  input @25 date yymmdd8. @35 time time8.
        @45 balance trailsgn11. type $ 56-63 code $ 65-74 ref $75-85 part $86-105 
  ;
  format date yymmdd10. time tod8.;
run;Results
OBS account_no date time balance type code ref part 1 0000000000001111 2023-02-22 23:01:01 -100 debit Transfer 123456 00-1289-123456-00 2 0000000000001111 2023-02-22 23:01:01 -100 00-1289-123456-00 3 0000000000001111 2023-02-22 23:01:01 -100 00-1289-123446-04 4 0000000000001111 2023-02-22 23:01:01 -100 00-1289-123456-00 5 0000000000001111 2023-02-22 23:01:01 -100 xxxx-xxxx 6 0000000000001111 2023-02-22 23:01:01 -100 debit Transfer 123456 00-1289-123456-00 7 0000000000001111 2023-02-22 23:01:01 -100 debit Transfer 123456 00-1289-123456-00 8 0000000000001111 2023-02-22 23:01:01 -100 debit Transfer 123456 00-1289-123456-00 9 0000000000001111 2023-02-22 23:01:01 -100 00-1289-123456-00 10 0000000000001111 2023-02-22 23:01:01 -100 00-1289-123446-04 11 0000000000001111 2023-02-22 23:01:01 -100 00-1289-123456-00 12 0000000000001111 2023-02-22 23:01:01 -100 xxxx-xxxx 13 0000000000001111 2023-02-22 23:01:01 -100 debit Transfer 123456 00-1289-123456-00 14 0000000000001111 2023-02-22 23:01:01 -100 debit Transfer 123456 00-1289-123456-00
That file as two more variables. Just adjust.
It also has some minor inconsistency in the position of the fields. Not sure what caused that, probably whatever changed the spaces into tabs. But there is enough empty space around the moved fields to make the same method work. Just increase the width of the informats used.
@Ksharp wrote:
"It also has some minor inconsistency in the position of the fields. "
Yes. That is reason why I could not use column input method.
Plus the fact that COLUMN MODE input does not allow using special INFORMATs like YYMMDD or TIME or TRAILSGN that are needed for some of those values.
You cannot use LIST MODE because the missing values and the delimiter are both represented by spaces.
So that just leaves FORMATTED MODE for those variables.
data test;
  infile "C:\downloads\test (1).txt" expandtabs truncover firstobs=4;
/*
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         0000000000001111        20230222 20230222  23:01:01     100.00- debit   Transfer   123456       00-1
     101  289-123456-00 113
2         0000000000001111        20230222  20230222  23:01:01    100.00-                                 00-1
     101  289-123456-00 113
*/
  input account_no $20. @;
  if not (account_no in (' ','account no','*****************'));
  input @25 date yymmdd8. @34 in_date yymmdd10. @44 in_time time10.
        @55 balance trailsgn9. type $ 65-72 code $ 73-82 ref $84-96 part $97-113
  ;
  format date in_date yymmdd10. in_time tod8.;
run;
Thanks Tom for the resolution.
Any chance , can we get the deleted items to another dataset . like test data should have all the account numbers data and another dataset to have data which are deleted in the if statement
if (account_no in (' ','account no','*****************') then delete;
when i use output, im able to get them but my test data is empty.
The lines with 'account no' or '*****************' in that position do not appear to me to have any other information. Neither do the lines that are totally empty.
6 account no in-date in-Time balance type code ref party 94 7 ***************** ******* ******* ******* ****** ******** ****** ************* 98 8 0
What information did you want to keep from the lines that are blank at the beginning and have stuff in other places. What do those values mean?
10                                           DR AMT:      100.00  CR AMT:            0.00  Net Amt:
     101      100.00- 111
To me those just looked like sub-total lines, Can't you recreate them by adding up the values read from the actual data lines?
If they are important then you will need to make another variable to RETAIN the account number read from the lines above since it is empty on those lines.
absoultely agree!! those values doesnt have any sense.
but we need to check what all the data we deleted and thats the reason we want them in another dataset
can you make clear on how to use retain here?
thank again!!
I don't understand.
What "DATA" did you delete?
The DELETE statement just ends the data step iteration because it has determined the current line of the file has no data on it.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
