Reading COPYBOOK external file through BASE SAS Code

Abhi1212 · Posted 03-15-2016 05:06 AM

RW9 · Posted 03-15-2016 05:52 AM

Hi,

Well, I have never come across this format before, but doing some searching I found this:

http://docs.oracle.com/cd/B28359_01/gateways.111/b31053/c_examples.htm

Is this what you mean by copybook format, it looks a bit like it. If so I would start with this, where did you get the file - the reason being is it looks like all carriage returns have been removed. This makes reading the file that much more difficult, and indicates that it came from a Unix system - it will make your life much easier if you go back to that original file and transfer it using binary mode to preserve the carriage returns.

Secondly you appear to be missing the specification for the file, example E1 in that link, which details the file structure. Without that your guessing and its going to be quite difficult to process the data, if we had the spec for it, then the code is easier to write, simply take each character at a time for each block, but without it, so for example this line:

FILHDR14-02-2016 03:32:281394 ALH DC 003000252 TRNHDR000000RECHDR211 1ACMT

I assume that the first field is FLHDR, is it always that length, what does it mean, I assume that up to 2016 is a date, and then a time, then ALH. Then there is a big gap to DC, is this a new record, how does it fit in etc? So you see without knowing what the output should look like I am just guessing. If you want to read the file, then you can loop over character by character,

using input a $char.;, and there may be better ways depending on what its supposed to look like.

Abhi1212 · Posted 03-15-2016 06:40 AM

Hi,
Please refer to the attachment for the layout and my query for the data. The record starts from ALH and the delimiter is ^T. Yes as u said there is spaces we need to avoid those.
Then the other issue will be the variable length. In SAS the maximum variable length is 32K if the file exceeds then we will face issues. So I need a solution for that as well.
Every record starts from the 00000000000ALH is the ‘Bank code’ first record and end of record is ‘Creation Date/Time’

RW9 · Posted 03-15-2016 06:50 AM

Looks like @Kurt_Bremser has already provided a working example.

Kurt_Bremser · Posted 03-15-2016 06:13 AM

Try this:

data new;
infile "$HOME/FI2016021301394.dat" dsd truncover recfm=f lrecl=100 end=done;
length a $2000 a1 $100 b $2000;
retain a;
x = index(a,'ACMT');
do while (x = 0);
  input a1 $3000.;
  a = trim(a) !! a1;
  x = index(a,'ACMT');
end;
do while (x > 0);
  b = substr(a,1,x-1);
  output;
  a = substr(a,x+4);
  x = index(a,'ACMT');
end;
keep b;
if done then do;
  b = a;
  output;
end;
run;

data new4a;
set new (firstobs=2);
length c1-c100 $100. ;
array c(100) c1-c100 ;
do i=1 to 100;
  c(i)=scan(b,i,'^T');
end;
keep c1-c23;
run;

The first step reads the data in blocks and builds lines.

The second step discards the first line and dissects the line into columns.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Abhi1212 · Posted 03-15-2016 06:45 AM

Hi, Could you please explain me the code. Because I am very new to sas. It would be great if I get line by line explanation if possible for better understanding

Kurt_Bremser · Posted 03-15-2016 07:08 AM

data new;
infile "$HOME/FI2016021301394.dat" dsd truncover recfm=f lrecl=100 end=done;
* read the file as a binary stream (recfm=f) in 100 byte chunks, and set a marker variable (done) when the end is reached;
* define buffers for real and virtual lines and for the 100 byte chunks;
* $HOME/ is there because I'm running SAS on UNIX;
length a $2000 a1 $100 b $2000;
retain a; * make sure that a is never set to missing when the data step iterates;
x = index(a,'ACMT');
* look if the line separator is present in our global buffer;
* if not, read chunks and append until one appears;
do while (x = 0);
  input a1 $3000.;
  a = trim(a) !! a1;
  x = index(a,'ACMT');
end;
* now, dissect the global buffer until no line separator is found,
output the now correct lines;
do while (x > 0);
  b = substr(a,1,x-1);
  output;
  a = substr(a,x+4);
  x = index(a,'ACMT');
end;
keep b; * only keep the lines in the output, buffers are discarded;
if done then do; * when we have reached the end of the input;
  b = a;
  output;
  * output the final line (no 'ACMT' present);
  * if the input file ends with a 'ACMT', this would create an empty line;
end;
run;

The second data step is basically a copy of yours.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Abhi1212 · Posted 03-15-2016 01:28 PM

Thank you so much superb explanation. However, the data is not fitting into my table structure. Could you please refer to the attachment . Also, if the file exceeds maximum sas variable(32K) we need to find a solution for it. Please help

Kurt_Bremser · Posted 03-15-2016 04:54 PM

I tested my code with your attachment and got the same result your code produced.

The maximum size of 32k for character variables has NO impact on my code, as the size of the biffers is WAY below that.

You could only run into problems if the actual intended line size of the COPYBOOK data exceeds the 32k. With your test data this is not the case.

Run my code on your testdata and show testdata, actual and expected results if iot does not work.

What do you mean by "However, the data is not fitting into my table structure."?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Abhi1212 · Posted 03-16-2016 02:03 AM

Kurt_Bremser · Posted 03-16-2016 03:30 AM

Well, regarding maximum line size, I'd make a fair bet that none of the contributors here has ever encountered such a case in the wild. No database architect that I know would let something like this happen.

If you encounter more than 32k characters for a "line", then you need to extend the algorithm so that it works with a set of several 32k buffers that represent sections of the "line", and keep an additional index for the active section.

But I'd say you cross that bridge once you arrive there, and don't worry about it in the meantime.

Regarding your file structure:

So you know which columns are located where in the line.

In the second datastep, discard splitting the values in a DO loop.

Instead define your columns first with a length statetement:

length
  bank_code $14
  branch_code $9
  finance_id_1 $12
  finance_id_2 $12
  currency $3
  product_type $2
  status $3
  opening_date 6 /* 6 byte are enough for numeric date storage */
  ......
  outstanding_amount 8 /* maximum precision for numbers */
  ......
;

Then assign according to position:

bank_code = scan(b,1,'^T');
branch_code = scan(b,2,'^T');

and so on. Add additional input() function calls for dates and numbers.

Now, you might pack all that into a macro that accepts variable name, position and format information as parameters, and then call that macro repeatedly while iterating through a list (ie a data step with call execute) to automate the process, but that should be reserved for the time after you got it right manually.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Abhi1212 · Posted 03-16-2016 05:03 AM

Copuld you please edit the code and send me. Coz i am not quite clear on macro part. Also let me know what is the difference between your code and mine?

Kurt_Bremser · Posted 03-16-2016 06:10 AM

Example code:

data new;
infile "$HOME/FI2016021301394.dat" dsd truncover recfm=f lrecl=100 end=done;
length a $2000 a1 $100 b $2000;
retain a;
x = index(a,'ACMT');
do while (x = 0);
  input a1 $3000.;
  a = trim(a) !! a1;
  x = index(a,'ACMT');
end;
do while (x > 0);
  b = substr(a,1,x-1);
  output;
  a = substr(a,x+4);
  x = index(a,'ACMT');
end;
keep b;
if done then do;
  b = a;
  output;
end;
run;

data new4a;
set new (firstobs=2);
length
  bank_code $14
  branch_code $9
  finance_id_1 $12
  finance_id_2 $12
  currency $3
  product_type $2
  status $3
  opening_date 8 /* 8 since it is actually datetime */
  outstanding_amount 8 /* maximum precision for numbers */
;
format opening_date datetime19.;
bank_code = scan(b,1,'^T');
branch_code = scan(b,2,'^T');
finance_id_1 = scan(b,3,'^T');
finance_id_2 = scan(b,4,'^T');
currency = scan(b,5,'^T');
product_type = scan(b,6,'^T');
status = scan(b,7,'^T');
opening_date = input(scan(b,8,'^T'),ANYDTDTM19.);
outstanding_amount = input(scan(b,11,'^T'),20.2);
keep
  bank_code
  branch_code
  finance_id_1
  finance_id_2
  currency
  product_type
  status
  opening_date
  outstanding_amount
;
run;

proc print;
run;

From that, you can easily insert code for the rest of the columns.

Once that is working (as it is working here), you can think about automating it with use of the macro language.

But get it working for all columns with simple data step language first.

Output of proc print:

                        branch_                                             product_                                  outstanding_
  Obs    bank_code       code      finance_id_1   finance_id_2   currency     type     status          opening_date      amount

   1    00000000000A   AE0010010   LD1221300033   LD1221300033     AED         LO       CUR      31JUL2012:15:10:00      24128.84 
   2    00000000000A   AE0010005   LD1514700016   LD1514700016     AED         LO       CUR      27MAY2015:11:38:00     127368.05 
   3    00000000000A   AE0010005   LD1516000023   LD1516000023     AED         LO       CUR      09JUN2015:11:15:00     338268.79 
   4    00000000000A   AE0010256   LD1501400080   LD1501400080     AED         LO       CUR      14JAN2015:14:46:00    1980657.11 
   5    00000000000A   AE0010256   LD1509000159   LD1509000159     AED         LO       CUR      31MAR2015:19:56:00     295845.48 
   6    00000000000A   AE0010256   LD1503100199   LD1503100199     AED         LO       CUR      31JAN2015:20:27:00    1452671.04 
   7    00000000000A   AE0010009   LD1602700046   LD1602700046     AED         LO       CUR      27JAN2016:14:24:00     955111.98

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Abhi1212 · Posted 03-16-2016 08:39 AM

Initally you gave length to a1 as $100. Again in input y u gave a1 as $3000? and could you please xplain me what the below code does?
x = index(a,'ACMT');
do while (x = 0);
input a1 $100.;
a = trim(a) !! a1;
x = index(a,'ACMT');
end;
do while (x > 0);
b = substr(a,1,x-1);
output;
a = substr(a,x+4);
x = index(a,'ACMT');
end;

Kurt_Bremser · Posted 03-16-2016 09:05 AM

Congrats, you discovered a typo in my program that didn't cause a problem, but is incorrect nonetheless.

(truncover and lrecl=100 caused only 100 bytes to be read, although I erroneously used $3000.; since a1 only has a length of 100 in the PDV, it didn't matter)

Of course, a1 should be read with a $w. format that corresponds to the lrecl of the infile statement. This is probably the first thing that should be solved with a macro variable.

So it should read

  input a1 $100.;

So. lets skip through the code once again

x = index(a,'ACMT'); * check if the line separator is present in long buffer a;
do while (x = 0); * if not present;
  input a1 $100.; * read one short buffer a1;
  a = trim(a) !! a1; * append to long buffer a;
  x = index(a,'ACMT'); * check again if line separator is present;
end; * repeat;
 * keep in mind that x now already is the position of the line separator;
do while (x > 0); * as long as a line separator is present;
  b = substr(a,1,x-1); * extract line;
  output; * write to dataset;
  a = substr(a,x+4); * remove line from long buffer;
  x = index(a,'ACMT'); * check for additional line separator;
end; * repeat;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code

Re: Reading COPYBOOK external file through BASE SAS Code