03-15-2016 05:52 AM
Well, I have never come across this format before, but doing some searching I found this:
Is this what you mean by copybook format, it looks a bit like it. If so I would start with this, where did you get the file - the reason being is it looks like all carriage returns have been removed. This makes reading the file that much more difficult, and indicates that it came from a Unix system - it will make your life much easier if you go back to that original file and transfer it using binary mode to preserve the carriage returns.
Secondly you appear to be missing the specification for the file, example E1 in that link, which details the file structure. Without that your guessing and its going to be quite difficult to process the data, if we had the spec for it, then the code is easier to write, simply take each character at a time for each block, but without it, so for example this line:
FILHDR14-02-2016 03:32:281394 ALH DC 003000252 TRNHDR000000RECHDR211 1ACMT
I assume that the first field is FLHDR, is it always that length, what does it mean, I assume that up to 2016 is a date, and then a time, then ALH. Then there is a big gap to DC, is this a new record, how does it fit in etc? So you see without knowing what the output should look like I am just guessing. If you want to read the file, then you can loop over character by character,
using input a $char.;, and there may be better ways depending on what its supposed to look like.
03-15-2016 06:40 AM
03-15-2016 06:13 AM
data new; infile "$HOME/FI2016021301394.dat" dsd truncover recfm=f lrecl=100 end=done; length a $2000 a1 $100 b $2000; retain a; x = index(a,'ACMT'); do while (x = 0); input a1 $3000.; a = trim(a) !! a1; x = index(a,'ACMT'); end; do while (x > 0); b = substr(a,1,x-1); output; a = substr(a,x+4); x = index(a,'ACMT'); end; keep b; if done then do; b = a; output; end; run; data new4a; set new (firstobs=2); length c1-c100 $100. ; array c(100) c1-c100 ; do i=1 to 100; c(i)=scan(b,i,'^T'); end; keep c1-c23; run;
The first step reads the data in blocks and builds lines.
The second step discards the first line and dissects the line into columns.
03-15-2016 06:45 AM
03-15-2016 07:08 AM
data new; infile "$HOME/FI2016021301394.dat" dsd truncover recfm=f lrecl=100 end=done; * read the file as a binary stream (recfm=f) in 100 byte chunks, and set a marker variable (done) when the end is reached; * define buffers for real and virtual lines and for the 100 byte chunks;
* $HOME/ is there because I'm running SAS on UNIX; length a $2000 a1 $100 b $2000; retain a; * make sure that a is never set to missing when the data step iterates; x = index(a,'ACMT'); * look if the line separator is present in our global buffer; * if not, read chunks and append until one appears; do while (x = 0); input a1 $3000.; a = trim(a) !! a1; x = index(a,'ACMT'); end; * now, dissect the global buffer until no line separator is found, output the now correct lines; do while (x > 0); b = substr(a,1,x-1); output; a = substr(a,x+4); x = index(a,'ACMT'); end; keep b; * only keep the lines in the output, buffers are discarded; if done then do; * when we have reached the end of the input; b = a; output; * output the final line (no 'ACMT' present);
* if the input file ends with a 'ACMT', this would create an empty line; end; run;
The second data step is basically a copy of yours.
03-15-2016 01:28 PM
Thank you so much superb explanation. However, the data is not fitting into my table structure. Could you please refer to the attachment . Also, if the file exceeds maximum sas variable(32K) we need to find a solution for it. Please help
03-15-2016 04:54 PM
I tested my code with your attachment and got the same result your code produced.
The maximum size of 32k for character variables has NO impact on my code, as the size of the biffers is WAY below that.
You could only run into problems if the actual intended line size of the COPYBOOK data exceeds the 32k. With your test data this is not the case.
Run my code on your testdata and show testdata, actual and expected results if iot does not work.
What do you mean by "However, the data is not fitting into my table structure."?
03-16-2016 03:30 AM
Well, regarding maximum line size, I'd make a fair bet that none of the contributors here has ever encountered such a case in the wild. No database architect that I know would let something like this happen.
If you encounter more than 32k characters for a "line", then you need to extend the algorithm so that it works with a set of several 32k buffers that represent sections of the "line", and keep an additional index for the active section.
But I'd say you cross that bridge once you arrive there, and don't worry about it in the meantime.
Regarding your file structure:
So you know which columns are located where in the line.
In the second datastep, discard splitting the values in a DO loop.
Instead define your columns first with a length statetement:
length bank_code $14 branch_code $9 finance_id_1 $12 finance_id_2 $12 currency $3 product_type $2 status $3 opening_date 6 /* 6 byte are enough for numeric date storage */ ...... outstanding_amount 8 /* maximum precision for numbers */ ...... ;
Then assign according to position:
bank_code = scan(b,1,'^T'); branch_code = scan(b,2,'^T');
and so on. Add additional input() function calls for dates and numbers.
Now, you might pack all that into a macro that accepts variable name, position and format information as parameters, and then call that macro repeatedly while iterating through a list (ie a data step with call execute) to automate the process, but that should be reserved for the time after you got it right manually.
03-16-2016 05:03 AM
03-16-2016 06:10 AM
data new; infile "$HOME/FI2016021301394.dat" dsd truncover recfm=f lrecl=100 end=done; length a $2000 a1 $100 b $2000; retain a; x = index(a,'ACMT'); do while (x = 0); input a1 $3000.; a = trim(a) !! a1; x = index(a,'ACMT'); end; do while (x > 0); b = substr(a,1,x-1); output; a = substr(a,x+4); x = index(a,'ACMT'); end; keep b; if done then do; b = a; output; end; run; data new4a; set new (firstobs=2); length bank_code $14 branch_code $9 finance_id_1 $12 finance_id_2 $12 currency $3 product_type $2 status $3 opening_date 8 /* 8 since it is actually datetime */ outstanding_amount 8 /* maximum precision for numbers */ ;
format opening_date datetime19.;
bank_code = scan(b,1,'^T'); branch_code = scan(b,2,'^T'); finance_id_1 = scan(b,3,'^T'); finance_id_2 = scan(b,4,'^T'); currency = scan(b,5,'^T'); product_type = scan(b,6,'^T'); status = scan(b,7,'^T'); opening_date = input(scan(b,8,'^T'),ANYDTDTM19.); outstanding_amount = input(scan(b,11,'^T'),20.2); keep bank_code branch_code finance_id_1 finance_id_2 currency product_type status opening_date outstanding_amount ; run;
From that, you can easily insert code for the rest of the columns.
Once that is working (as it is working here), you can think about automating it with use of the macro language.
But get it working for all columns with simple data step language first.
Output of proc print:
branch_ product_ outstanding_ Obs bank_code code finance_id_1 finance_id_2 currency type status opening_date amount 1 00000000000A AE0010010 LD1221300033 LD1221300033 AED LO CUR 31JUL2012:15:10:00 24128.84 2 00000000000A AE0010005 LD1514700016 LD1514700016 AED LO CUR 27MAY2015:11:38:00 127368.05 3 00000000000A AE0010005 LD1516000023 LD1516000023 AED LO CUR 09JUN2015:11:15:00 338268.79 4 00000000000A AE0010256 LD1501400080 LD1501400080 AED LO CUR 14JAN2015:14:46:00 1980657.11 5 00000000000A AE0010256 LD1509000159 LD1509000159 AED LO CUR 31MAR2015:19:56:00 295845.48 6 00000000000A AE0010256 LD1503100199 LD1503100199 AED LO CUR 31JAN2015:20:27:00 1452671.04 7 00000000000A AE0010009 LD1602700046 LD1602700046 AED LO CUR 27JAN2016:14:24:00 955111.98
03-16-2016 08:39 AM
03-16-2016 09:05 AM - edited 03-16-2016 09:08 AM
Congrats, you discovered a typo in my program that didn't cause a problem, but is incorrect nonetheless.
(truncover and lrecl=100 caused only 100 bytes to be read, although I erroneously used $3000.; since a1 only has a length of 100 in the PDV, it didn't matter)
Of course, a1 should be read with a $w. format that corresponds to the lrecl of the infile statement. This is probably the first thing that should be solved with a macro variable.
So it should read
input a1 $100.;
So. lets skip through the code once again
x = index(a,'ACMT'); * check if the line separator is present in long buffer a; do while (x = 0); * if not present; input a1 $100.; * read one short buffer a1; a = trim(a) !! a1; * append to long buffer a; x = index(a,'ACMT'); * check again if line separator is present; end; * repeat; * keep in mind that x now already is the position of the line separator; do while (x > 0); * as long as a line separator is present; b = substr(a,1,x-1); * extract line; output; * write to dataset; a = substr(a,x+4); * remove line from long buffer; x = index(a,'ACMT'); * check for additional line separator; end; * repeat;