Hello SAS Community.
I've popped outside my 'ETL DI Studio bubble' and have hit a rather challenging SAS issue. I'm on SAS 9.4M2.
We have a number of files from source which have a specific file specification. The files are named D0149, D0159, D0086 and so on. Each file has its own file spec, BUT, within the file itself there is a header, a footer, and sandwiched in between is the data we need to read in. However, the first 3 characters of each line are a group number, and that number determines the file spec... per record.
For example, file D0086 will look like this:
ZHV|99899|D0086001|D|NEEB|X|NEEB|29990998199924||||TR01| 196|9899999799996|V| 197|L99XXX9999|D| 198|01|29990528000000|65706.0|X|Y|T| ZPT|99899|3||1|29990998199924|
Groups 196 and 197 have 2 fields each, and group 198 has 6.
The next day we might get a similar data file, D0086_2, but with no 198 group; instead a 199 group which has 2 fields.
Another example, D0150:
ZHV|99999|D0150001|M|LBSL|X|NEEB|20179568182352||||TR01| 288|127227343996|20179568|D|x| 289|0158|20179568|Z|| 290|XXXTEST011|||40|F|L&G CL27|LBSL|||||||||||K|20179568|20179568|20200101|||H|20179568| 293|L|C|AI|9.00||9||DE| 293|N|C|AI|9.00||7||| 08A|KXXA 12387|20179568|LBSL| ZPT|99999|6||1|20179568182352|
My thoughts are: the first 3 characters are what really tell us what the file spec is going to be, based on those 3 digits, but only once the file (D0086, D0086_2 or D0150) is read in. So we have to know what files exist before reading them in, and maybe have a control file of every group number and the possible fields with the field lengths?
I have tried a basic read-in, but this still leaves the pipes in the data, and ideally we would need each record all on one line. This is not dynamic yet, but a start.
Would you agree with having the file spec as a control table like this?
FILE,GROUP,FIELD_NAME,LENGTH
D0086,196,var_1,13
D0086,196,var_2,1
D0086,197,var_3,10
D0086,197,var_4,1
D0086,198,var_5,2
D0086,198,var_6,14
D0086,198,var_7,10
D0086,198,var_8,1
D0086,198,var_9,1
D0086,198,var_10,1
How we would dynamically build the code is another thing...
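One way such a control table could drive the code is CALL EXECUTE. Just a sketch: the table name WORK.SPEC and the assumption that every field is character (with LENGTH giving the width) are mine, it only handles one hard-coded input file, and it doesn't yet skip the ZHV/ZPT records.

```sas
/* Hypothetical sketch: generate one "if group='nnn' then input ..." line
   per GROUP from a control table SPEC (columns FILE, GROUP, FIELD_NAME,
   LENGTH), sorted by FILE and GROUP. Header/footer records are NOT yet
   filtered out here. */
data _null_;
   set spec end=done;
   by file group;
   if _n_ = 1 then call execute
      ("data want; infile 'some/file/location/D0086.dat' dlm='|' truncover; input group :$3. @; ");
   if first.group then call execute(cats("if group='", group, "' then input "));
   call execute(cats(field_name) || ' :$' || cats(length) || '. ');
   if last.group then call execute('; ');
   if done then call execute('run;');
run;
```

The generated step only runs after this one finishes, which is what makes CALL EXECUTE suitable for this kind of table-driven code generation.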
I have had an attempt at a very basic read-in, but not with the desired results. How do we omit the header and footer?
Has anyone else ever had data within a file whose formats are spread over multiple rows?
data test;
infile 'some/file/location/D0086.dat' delimiter = '|' dsd missover firstobs = 2;
input group $3. @;
if group ='196' then input @1 group var_1 $13. var_2 $1.;
if group ='197' then input @1 group var_3 $10. var_4 $1.;
if group ='198' then input @1 group var_5 $2. var_6 $14. var_7 $10. ;
run;
any help is appreciated.
Matt
Hi @teelov
You say that the files are named D0086, D0149, D0010 and so on, but later you refer to files as A_1, A_2 and B_1, and you use these names in your control table too.
If the physical names are A_1 etc., then what is D0086? And is D0086 actually the first 5 chars of the third field in the header? I ask because I wonder if the header information could be used to determine the actual file specification.
Can a given record type (196, 197, 198 etc.) exist in more than one file specification? And do you have information available to specify all possible record types within a given specification, and the type/max length for all fields in a record of a given type? I see that you have a type 290 with a lot of empty fields, and it is difficult to write a program to read these fields without knowing more.
The fictional control table I mention above is something that is being made as we speak, for documentation purposes. Group 290 can have around 29 fields populated.
The data specs are open to the public and are found here; an example of a D0150:
https://dtc.mrasco.com/DataFlow.aspx?FlowCounter=0150&FlowVers=1&searchMockFlows=False
@teelov wrote:
The fictional control table I mention above is something that is being made as we speak, for documentation purposes. Group 290 can have around 29 fields populated.
The data specs are open to the public and are found here; an example of a D0150:
https://dtc.mrasco.com/DataFlow.aspx?FlowCounter=0150&FlowVers=1&searchMockFlows=False
First, the link does not go to a specific "flowcounter" for me; I land on a generic start page, so I am not quite sure I end up on the same page. Without an actual text file to read, I do not have the time to try to parse an entire industry's jargon and guess what the file may actually look like in order to provide any modifications to code.
It does appear that the "flows" have some information that should likely be used to create variable labels and the "Group Description" could be used for a custom format for the group to display such text instead of a pretty obscure 290.
I have to guess that things like the J0476 may be your variable names.
The "Flow Structure" part is extremely obscure for an outsider, as you have a block of columns L1-L8 with no apparent explanation, and a "condition" with no way to tell how that "condition" would appear in the actual data.
One approach would be to just read everything into SAS (somewhat blindly), then make a second pass and conditionally rename your variables and output conditionally, or whatever you need to do. If you need to drop the first and last records (header and footer), you could do that in your newly created input dataset.
You could also make assumptions like "There is a maximum of 50 fields per record", and "The longest character string is $100", and read everything in as a character field.
If you have a large sample of data, you could read all your data in, then go back and find the max char length per field, how many fields you need to read in, numeric vs character, etc... Basically do it in small steps first, then combine it back into a single datastep if desired.
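That blind first pass could look something like this (a sketch only, under the assumptions just mentioned: at most 50 fields, nothing longer than $100; the path and dataset names are made up):

```sas
/* First pass: read every record as up to 50 generic $100 character fields.
   Nothing is typed or renamed yet - that happens in a second pass.
   Records whose first field is not all digits (ZHV header, ZPT footer,
   blank lines) are dropped by the subsetting IF. */
data raw;
   infile 'some/file/location/D0086.dat' dlm='|' dsd truncover;
   length group $3 f1-f50 $100;
   input group f1-f50;
   if notdigit(strip(group)) = 0;
run;
```

From RAW you can then profile max lengths, field counts and numeric vs character per group before writing the real, typed input step.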
This provider doesn't make it easy for you!
I'd create a data dictionary with the following elements (close to what Proc Contents would return)
source, group, var_num, var_name, var_type, length, informat, format, label
Required: source, group, var_num, var_name, var_type, length, informat
Above assumes that within a source a variable can only be within a single group. If this is not the case then an additional table would be required.
I guess you get the files delivered into some landing folder and you then create a folder listing which provides you the list of files to be processed.
If so then you've got now already two things:
1. A table with the full data dictionary for all possible source files
2. The list of source files for the day
If you combine above two tables then you have everything that's required to fully and dynamically generate data steps for reading all the source files into SAS tables.
What I would do: always define all possible columns per source file, whether there is data for them on the day or not. This will give you stable table structures - which is what you want for further downstream processing.
As for skipping the header and especially the footer: you're already reading the first column independently. If you only OUTPUT when you've actually got digits only in that first column, then you'll be skipping the header, the footer, and any potentially empty record.
By combining the Data Dictionary (DD) and folder listing table you could dynamically generate code as below:
data test(drop=_group);
infile 'some/file/location/D0086.dat' delimiter = '|' dsd missover firstobs = 2;
attrib
var_1 length=$13 informat=$13.
var_2 length=$1 informat=$1.
var_3 length=$10 informat=$10.
var_4 length=$1 informat=$1.
;
retain _all_;
input _group $3. @;
if notdigit(_group)=0 then
do;
if _group ='196' then input var_1 var_2 ;
else if _group ='197' then input var_3 var_4 ;
else
do;
/**....here some logic for exception handling as a group has been missed...**/
return;
end;
output;
end;
run;
If you want to take above approach but have difficulties with implementing it then I'm happy to help if you provide sample data in the form of tested SAS data steps creating such data.
What would be required:
1. A table for the data dictionary
2. A table for the directory listing
3. A text file attached with a name and data that matches with the data dictionary and the directory listing
Change:
Added a retain statement after the attrib statement to deal with the hierarchical structure as mentioned by @ErikLund_Jensen
Hi @teelov
After a brief look at the documentation, it seems even more complicated. I have a suspicion that a given flow (or source, e.g. D0086) contains a hierarchical structure of groups with the possibility of multiple occurrences, like this:
Have
XYZ D1234
001 Meter A
002 Reading 1
002 Reading 2
002 Reading 3
001 Meter B
002 Reading 1
002 Reading 2
Wanted
D1234 A 1
D1234 A 2
D1234 A 3
D1234 B 1
D1234 B 2
If this is the case, then a file record is not a stand-alone unit, as ID values from one group must be retained to provide reference-ID's for the following groups.
A way of handling this is with this skeleton code, where the file is transformed to a relational structure:
data
meter (keep = Meter_iD ...)
reading (keep = Meter_ID Reading ...);
length Meter_ID $10 ... ;
retain Meter_ID;
infile ..;
input;
if scan(_infile_,1,'|') = '001' then do;
Meter_ID = scan(_infile_,2,'|');
...
output meter;
end;
if scan(_infile_,1,'|') = '002' then do;
Reading = input(scan(_infile_,2,'|'), best.);
...
output reading;
end;
run;
Data_Item | Data_Flow | Data_Item_Group | ItemName | ItemLogicalType | ItemLogicalLen |
3 | 149 | 280 | MPAN Core | CHAR | 13 |
1254 | 149 | 280 | Effective from Settlement Date {MSMTD} | CHAR | 8 |
76 | 149 | 281 | Standard Settlement Configuration Id | CHAR | 4 |
300 | 149 | 281 | Effective from Settlement Date {SCON} | DATE | 8 |
78 | 149 | 778 | Time Pattern Regime | TIME | 14 |
4 | 149 | 1025 | Meter Id (Serial Number) | CHAR | 10 |
10 | 149 | 284 | Meter Register Id | CHAR | 2 |
679 | 149 | 284 | Register Mapping Coefficient | NUM | 1 |
1267 | 149 | 1023 | Metering System Non Settlement Functionality Code | CHAR | 10 |
1268 | 149 | 1023 | Effective From Settlement Date {MSNSFC} | DATE | 8 |
78 | 149 | 1024 | Time Pattern Regime | TIME | 14 |
4 | 149 | 283 | Meter Id (Serial Number) | CHAR | 10 |
10 | 149 | 1026 | Meter Register Id | CHAR | 2 |
679 | 149 | 1026 | Register Mapping Coefficient | NUM | 1 |
Can you please post this dictionary data as a SAS data step which creates a SAS table.
I also need an informat defined. TIME alone doesn't provide sufficient information which informat to use for reading a specific date/datetime/time string.
I like your thinking - I did have an idea but I don't quite know how to execute it yet.
I have finally got over halfway with all the file specs; attached is a complete one, and a test file.
I was thinking something like a macro variable containing var_1 var_2 var_3 and so on... depending on the group.
Thank you for your help, very much appreciated.
Building on what @ballardw posted here how such a code generator could look like for you.
options ps=max ls=max;
data DD;
infile datalines dlm=',' truncover;
input file $ group $ varnum field $ INFORMAT $;
datalines;
D0086,196,1,var_1,$13.
D0086,196,2,var_2,$1.
D0086,197,3,var_3,10.
D0086,197,4,var_4,$1.
D0086,198,5,var_5,date9.
D0086,198,6,var_6,$14.
D0086,198,7,var_7,$10.
D0086,198,8,var_8,$1.
D0086,198,9,var_9,1.
D0086,198,10,var_10,1.
D0087,196,1,var_2,$1.
D0087,197,2,var_3,10.
D0087,197,3,var_4,$1.
D0087,198,4,var_7,$10.
D0087,198,5,var_8,$1.
D0087,198,6,var_9,1.
D0087,198,7,var_10,1.
;
data dirlist;
infile datalines dlm=',' truncover;
input path :$300. ext_file :$30.;
datalines;
/data/landing,D0086_1.dat
/data/landing,D0086_2.dat
/data/landing,D0087.dat
;
proc sql;
create table control as
select
l.*, r.*
from DD l inner join dirlist r
on l.file=upcase(scan(ext_file,1,'._'))
order by path, ext_file, group, varnum
;
quit;
%let target_lib=staging;
filename codegen temp;
data _null_;
file print;
/* file codegen;*/
set control;
by path ext_file group varnum;
if first.ext_file then
do;
put
'data work.' file ';'
;
do until(last._ext_file);
set control(rename=(path=_path ext_file=_ext_file group=_group varnum=_varnum));
by _path _ext_file _group _varnum;
put
' attrib ' field 'informat=' informat ';'
;
end;
put
" retain _all_;" /
" infile '" path +(-1)"/" ext_file +(-1)"' dlm='|' truncover;" /
' input group_no :$3. @;' /
' if notdigit(group_no)=0 then do;' /
' select (group_no);'
;
end;
if first.group then
do;
put
" when('" group +(-1)"') input " @;
;
end;
put field ':' informat @;
if last.group then put ';';
if last.ext_file then do;
put
' otherwise put "WARNING: Unexpected group= " group_no;' /
' end;' /
' output;' /
' end;' /
'run;'
;
put
"proc append base=&target_lib.." file 'data=work.' file ';' /
'run;' /
'proc datasets lib=work nolist nowarn;'/
' delete ' file ';' /
'run;quit;' /
;
end;
run;
data _null_;
file codegen mod;
stop;
run;
%include codegen / source2;
filename codegen clear;
Above will generate code like:
data work.D0086 ;
attrib var_1 informat=$13. ;
attrib var_2 informat=$1. ;
attrib var_3 informat=10. ;
attrib var_4 informat=$1. ;
attrib var_5 informat=date9. ;
attrib var_6 informat=$14. ;
attrib var_7 informat=$10. ;
attrib var_8 informat=$1. ;
attrib var_9 informat=1. ;
attrib var_10 informat=1. ;
retain _all_;
infile '/data/landing/D0086_1.dat' dlm='|' truncover;
input group_no :$3. @;
if notdigit(group_no)=0 then do;
select (group_no);
when('196') input var_1 :$13. var_2 :$1. ;
when('197') input var_3 :10. var_4 :$1. ;
when('198') input var_5 :date9. var_6 :$14. var_7 :$10. var_8 :$1. var_9 :1. var_10 :1. ;
otherwise put "WARNING: Unexpected group= " group_no;
end;
output;
end;
run;
proc append base=staging.D0086 data=work.D0086 ;
run;
proc datasets lib=work nolist nowarn;
delete D0086 ;
run;quit;
If you change the output destination to 'file codegen;' then the %include statement at the end will execute the generated code.
See if this gives some starting point for this project.
data example;
  infile datalines dlm=',' truncover;
  input file $ group $ field $ INFORMAT $;
datalines;
D0086,ZPT,,
D0086,ZHV,,
D0086,196,var_1,$13.
D0086,196,var_2,$1.
D0086,197,var_3,10.
D0086,197,var_4,$1.
D0086,198,var_5,date9.
D0086,198,var_6,$14.
D0086,198,var_7,$10.
D0086,198,var_8,$1.
D0086,198,var_9,1.
D0086,198,var_10,1.
D0087,ZZZ,,
D0087,ZBB,,
D0087,196,var_2,$1.
D0087,197,var_3,10.
D0087,197,var_4,$1.
D0087,198,var_7,$10.
D0087,198,var_8,$1.
D0087,198,var_9,1.
D0087,198,var_10,1.
;

proc sort data=example;
  by file group;
run;

data _null_;
  set example;
  by file group;
  file print;
  targetlib='mylib';
  length str $ 100;
  str = cat('data ',catx('.',targetlib,file),';');
  if first.file then do;
    put str;
    /* an INFILE statement would go here with path and options such as delimiter */
    put "infile 'somepath/file.ext' dlm='|' <other options>;";
    put "input group :$3. @;";
    put "select (group);";
  end;
  if first.group then do;
    str = cat('when (',quote(strip(group)),') input ');
    put str;
  end;
  if not missing(field) then str = catx(' ',field,':',informat);
  else str = '';
  put str;
  if last.group then put ';';
  if last.file then do;
    put 'otherwise put "WARNING: Unexpected group= " group;';
    put 'end;';
    put 'run;';
  end;
run;
Your control could have more information. Note that I used INFORMAT with different SAS supplied informats. The output instead going to FILE Print, used for your review, should be written to a SAS Program file, FILEVAR option would let you write to a separate program file per File value. The example should give you enough of a hint to write a proper INFILE statement. If you need a different DATA name then provide that along with the FILE in the control data. Similarly the output library could be in the control set.
You did not state clearly, at least enough for me to tell, what your internal "header" indicators are, or whether anything needs to be read from them. If they are the ZHV and ZPT records and you don't want to read them, then add them to the control group as above. If you do read values from them, then include them like the other groups.
I actually learned SAS/AF and SAS/FSP for exactly this purpose, with the added kicker of data maintenance for the control data set as "groups" were added/removed, and as the content of some of the "group" values changed over time when instruments were added to or dropped from data logger channels, so the control data had to adjust by date and time for when the changes occurred. I also incorporated code that wrote a report of those changes so I could set the start date for the new stream.
I dealt with over 50 sites and as many as 25 "groups" for some of those sites.