Hello SAS Community.
I've popped outside my 'ETL DI Studio bubble' and have hit a rather challenging SAS issue. I'm on SAS 9.4M2.
We have a number of files from source which have a specific file specification. The files are named D0149, D0159, D0086 and so on. Each file has its own file spec, BUT, within the file itself there is a header, a footer, and sandwiched in between is the data we need to read in. However, the first 3 characters of each line are a group number, and that number determines the file spec... per record.
For example, file D0086 will look like this:
ZHV|99899|D0086001|D|NEEB|X|NEEB|29990998199924||||TR01| 196|9899999799996|V| 197|L99XXX9999|D| 198|01|29990528000000|65706.0|X|Y|T| ZPT|99899|3||1|29990998199924|
Groups 196 and 197 have 2 fields each, and group 198 has 6.
The next day we might get a similar data file, D0086_2, but with no 198 group; instead a 199 group which has 2 fields.
Another example, D0150:
ZHV|99999|D0150001|M|LBSL|X|NEEB|20179568182352||||TR01| 288|127227343996|20179568|D|x| 289|0158|20179568|Z|| 290|XXXTEST011|||40|F|L&G CL27|LBSL|||||||||||K|20179568|20179568|20200101|||H|20179568| 293|L|C|AI|9.00||9||DE| 293|N|C|AI|9.00||7||| 08A|KXXA 12387|20179568|LBSL| ZPT|99999|6||1|20179568182352|
My thoughts are: the first 3 characters are what really tell us what the file spec is going to be, based on those 3 digits, but only once the file (D0086, D0086_2 or D0150) is read in. So we have to know what files exist before reading them in, and maybe have a control file of every group number and the possible fields with the field lengths?
I have tried a basic read-in, but this still leaves the pipes in the data, and ideally we would need each record all on one line. This is not dynamic yet, but a start.
Would you agree with having the file spec as a control table like this?
FILE,GROUP,FIELD_NAME,LENGTH
D0086,196,var_1,13
D0086,196,var_2,1
D0086,197,var_3,10
D0086,197,var_4,1
D0086,198,var_5,2
D0086,198,var_6,14
D0086,198,var_7,10
D0086,198,var_8,1
D0086,198,var_9,1
D0086,198,var_10,1
How we would dynamically build the code is another thing...
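One way such a control table could drive the code is CALL EXECUTE. Just a sketch: the table name WORK.SPEC and the assumption that every field is character (with LENGTH giving the width) are mine, it only handles one hard-coded input file, and it doesn't yet skip the ZHV/ZPT records.

```sas
/* Hypothetical sketch: generate one "if group='nnn' then input ..." line
   per GROUP from a control table SPEC (columns FILE, GROUP, FIELD_NAME,
   LENGTH), sorted by FILE and GROUP. Header/footer records are NOT yet
   filtered out here. */
data _null_;
   set spec end=done;
   by file group;
   if _n_ = 1 then call execute
      ("data want; infile 'some/file/location/D0086.dat' dlm='|' truncover; input group :$3. @; ");
   if first.group then call execute(cats("if group='", group, "' then input "));
   call execute(cats(field_name) || ' :$' || cats(length) || '. ');
   if last.group then call execute('; ');
   if done then call execute('run;');
run;
```

The generated step only runs after this one finishes, which is what makes CALL EXECUTE suitable for this kind of table-driven code generation.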
I have had an attempt at a very basic read-in, but not with the desired results. How do we omit the header and footer?
Has anyone else ever had data within a file whose formats are spread over multiple rows?
data test;
infile 'some/file/location/D0086.dat' delimiter = '|' dsd missover firstobs = 2;
input group $3. @;
if group ='196' then input @1 group var_1 $13. var_2 $1.;
if group ='197' then input @1 group var_3 $10. var_4 $1.;
if group ='198' then input @1 group var_5 $2. var_6 $14. var_7 $10. ;
run;
any help is appreciated.
Matt
Hi @teelov
You say that the files are named D0086, D0149, D0010 and so on, but later you refer to files as A_1, A_2 and B_1, and you use these names in your control table too.
If the physical names are A_1 etc., then what is D0086? And is D0086 actually the first 5 chars of the third field in the header? I ask because I wonder if the header information could be used to determine the actual file specification.
Can a given record type (196, 197, 198 etc.) exist in more than one file specification? And do you have information available to specify all possible record types within a given specification, and the type/max length for all fields in a record of a given type? I see that you have a type 290 with a lot of empty fields, and it is difficult to write a program to read these fields without knowing more.
The fictional control table I mention above is something that is being made as we speak, for documentation purposes. Group 290 can have around 29 fields populated.
The data specs are open to the public and are found here; an example of a D0150:
https://dtc.mrasco.com/DataFlow.aspx?FlowCounter=0150&FlowVers=1&searchMockFlows=False
@teelov wrote:
The fictional control table I mention above is something that is being made as we speak, for documentation purposes. Group 290 can have around 29 fields populated.
The data specs are open to the public and are found here; an example of a D0150:
https://dtc.mrasco.com/DataFlow.aspx?FlowCounter=0150&FlowVers=1&searchMockFlows=False
First, the link does not go to a specific "flowcounter" for me; I land on a generic start page, so I am not quite sure I end up on the same page. Without an actual text file to read, I do not have the time to try to parse an entire industry's jargon and guess what the file may actually look like in order to provide any modifications to code.
It does appear that the "flows" have some information that should likely be used to create variable labels and the "Group Description" could be used for a custom format for the group to display such text instead of a pretty obscure 290.
I have to guess that things like the J0476 may be your variable names.
The "Flow Structure" part is extremely obscure for an outsider, as you have a block of columns L1-L8 with no apparent explanation, and a "condition" with no way to tell how that "condition" would appear in the actual data.
One approach would be to just read everything into SAS (somewhat blindly), then make a second pass and conditionally rename your variables and output conditionally, or whatever you need to do. If you need to drop the first and last records (header and footer), you could do that in your newly created input dataset.
You could also make assumptions like "There is a maximum of 50 fields per record", and "The longest character string is $100", and read everything in as a character field.
If you have a large sample of data, you could read all your data in, then go back and find the max char length per field, how many fields you need to read in, numeric vs character, etc... Basically do it in small steps first, then combine it back into a single datastep if desired.
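That blind first pass could look something like this (a sketch only, under the assumptions just mentioned: at most 50 fields, nothing longer than $100; the path and dataset names are made up):

```sas
/* First pass: read every record as up to 50 generic $100 character fields.
   Nothing is typed or renamed yet - that happens in a second pass.
   Records whose first field is not all digits (ZHV header, ZPT footer,
   blank lines) are dropped by the subsetting IF. */
data raw;
   infile 'some/file/location/D0086.dat' dlm='|' dsd truncover;
   length group $3 f1-f50 $100;
   input group f1-f50;
   if notdigit(strip(group)) = 0;
run;
```

From RAW you can then profile max lengths, field counts and numeric vs character per group before writing the real, typed input step.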
This provider doesn't make it easy for you!
I'd create a data dictionary with the following elements (close to what Proc Contents would return)
source, group, var_num, var_name, var_type, length, informat, format, label
Required: source, group, var_num, var_name, var_type, length, informat
Above assumes that within a source a variable can only be within a single group. If this is not the case then an additional table would be required.
I guess you get the files delivered into some landing folder and you then create a folder listing which provides you the list of files to be processed.
If so then you've got now already two things:
1. A table with the full data dictionary for all possible source files
2. The list of source files for the day
If you combine above two tables then you have everything that's required to fully and dynamically generate data steps for reading all the source files into SAS tables.
What I would do: always define all possible columns per source file, whether there is data for them on the day or not. This will give you stable table structures - which is what you want for further downstream processing.
As for skipping the header and especially the footer: you're already reading the first column independently. If you only OUTPUT when you've actually got digits only in that first column, then you'll be skipping the header, the footer, and any potentially empty record.
By combining the Data Dictionary (DD) and folder listing table you could dynamically generate code as below:
data test(drop=_group);
infile 'some/file/location/D0086.dat' delimiter = '|' dsd missover firstobs = 2;
attrib
var_1 length=$13 informat=$13.
var_2 length=$1 informat=$1.
var_3 length=$10 informat=$10.
var_4 length=$1 informat=$1.
;
retain _all_;
input _group $3. @;
if notdigit(_group)=0 then
do;
if _group ='196' then input var_1 var_2 ;
else if _group ='197' then input var_3 var_4 ;
else
do;
/**....here some logic for exception handling as a group has been missed...**/
return;
end;
output;
end;
run;
If you want to take above approach but have difficulties with implementing it then I'm happy to help if you provide sample data in the form of tested SAS data steps creating such data.
What would be required:
1. A table for the data dictionary
2. A table for the directory listing
3. A text file attached with a name and data that matches with the data dictionary and the directory listing
Change:
Added a retain statement after the attrib statement to deal with the hierarchical structure as mentioned by @ErikLund_Jensen
Hi @teelov
After a brief look at the documentation, it seems even more complicated. I have a suspicion that a given flow (or source, e.g. D0086) contains a hierarchical structure of groups with the possibility of multiple occurrences, like this:
Have
XYZ D1234
001 Meter A
002 Reading 1
002 Reading 2
002 Reading 3
001 Meter B
002 Reading 1
002 Reading 2
Wanted
D1234 A 1
D1234 A 2
D1234 A 3
D1234 B 1
D1234 B 2
If this is the case, then a file record is not a stand-alone unit, as ID values from one group must be retained to provide reference-ID's for the following groups.
A way of handling this is with this skeleton code, where the file is transformed to a relational structure:
data
meter (keep = Meter_iD ...)
reading (keep = Meter_ID Reading ...);
length Meter_ID $10 ... ;
retain Meter_ID;
infile ..;
input;
if scan(_infile_,1,'|') = '001' then do;
Meter_ID = scan(_infile_,2,'|');
...
output meter;
end;
if scan(_infile_,1,'|') = '002' then do;
Reading = input(scan(_infile_,2,'|'), best.);
...
output reading;
end;
run;
Data_Item | Data_Flow | Data_Item_Group | ItemName | ItemLogicalType | ItemLogicalLen |
3 | 149 | 280 | MPAN Core | CHAR | 13 |
1254 | 149 | 280 | Effective from Settlement Date {MSMTD} | CHAR | 8 |
76 | 149 | 281 | Standard Settlement Configuration Id | CHAR | 4 |
300 | 149 | 281 | Effective from Settlement Date {SCON} | DATE | 8 |
78 | 149 | 778 | Time Pattern Regime | TIME | 14 |
4 | 149 | 1025 | Meter Id (Serial Number) | CHAR | 10 |
10 | 149 | 284 | Meter Register Id | CHAR | 2 |
679 | 149 | 284 | Register Mapping Coefficient | NUM | 1 |
1267 | 149 | 1023 | Metering System Non Settlement Functionality Code | CHAR | 10 |
1268 | 149 | 1023 | Effective From Settlement Date {MSNSFC} | DATE | 8 |
78 | 149 | 1024 | Time Pattern Regime | TIME | 14 |
4 | 149 | 283 | Meter Id (Serial Number) | CHAR | 10 |
10 | 149 | 1026 | Meter Register Id | CHAR | 2 |
679 | 149 | 1026 | Register Mapping Coefficient | NUM | 1 |
Can you please post this dictionary data as a SAS data step which creates a SAS table.
I also need an informat defined. TIME alone doesn't provide sufficient information which informat to use for reading a specific date/datetime/time string.
I like your thinking - I did have an idea but I don't quite know how to execute it yet.
I have finally got over halfway with all the file specs; attached is a complete one, and a test file.
I was thinking something like a macro variable containing var_1 var_2 var_3 and so on... depending on the group.
Thank you for your help, very much appreciated.
Building on what @ballardw posted here how such a code generator could look like for you.
options ps=max ls=max;
data DD;
infile datalines dlm=',' truncover;
input file $ group $ varnum field $ INFORMAT $;
datalines;
D0086,196,1,var_1,$13.
D0086,196,2,var_2,$1.
D0086,197,3,var_3,10.
D0086,197,4,var_4,$1.
D0086,198,5,var_5,date9.
D0086,198,6,var_6,$14.
D0086,198,7,var_7,$10.
D0086,198,8,var_8,$1.
D0086,198,9,var_9,1.
D0086,198,10,var_10,1.
D0087,196,1,var_2,$1.
D0087,197,2,var_3,10.
D0087,197,3,var_4,$1.
D0087,198,4,var_7,$10.
D0087,198,5,var_8,$1.
D0087,198,6,var_9,1.
D0087,198,7,var_10,1.
;
data dirlist;
infile datalines dlm=',' truncover;
input path :$300. ext_file :$30.;
datalines;
/data/landing,D0086_1.dat
/data/landing,D0086_2.dat
/data/landing,D0087.dat
;
proc sql;
create table control as
select
l.*, r.*
from DD l inner join dirlist r
on l.file=upcase(scan(ext_file,1,'._'))
order by path, ext_file, group, varnum
;
quit;
%let target_lib=staging;
filename codegen temp;
data _null_;
file print;
/* file codegen;*/
set control;
by path ext_file group varnum;
if first.ext_file then
do;
put
'data work.' file ';'
;
do until(last._ext_file);
set control(rename=(path=_path ext_file=_ext_file group=_group varnum=_varnum));
by _path _ext_file _group _varnum;
put
' attrib ' field 'informat=' informat ';'
;
end;
put
" retain _all_;" /
" infile '" path +(-1)"/" ext_file +(-1)"' dlm='|' truncover;" /
' input group_no :$3. @;' /
' if notdigit(group_no)=0 then do;' /
' select (group_no);'
;
end;
if first.group then
do;
put
" when('" group +(-1)"') input " @;
;
end;
put field ':' informat @;
if last.group then put ';';
if last.ext_file then do;
put
' otherwise put "WARNING: Unexpected group= " group_no;' /
' end;' /
' output;' /
' end;' /
'run;'
;
put
"proc append base=&target_lib.." file 'data=work.' file ';' /
'run;' /
'proc datasets lib=work nolist nowarn;'/
' delete ' file ';' /
'run;quit;' /
;
end;
run;
data _null_;
file codegen mod;
stop;
run;
%include codegen / source2;
filename codegen clear;
Above will generate code like:
data work.D0086 ;
attrib var_1 informat=$13. ;
attrib var_2 informat=$1. ;
attrib var_3 informat=10. ;
attrib var_4 informat=$1. ;
attrib var_5 informat=date9. ;
attrib var_6 informat=$14. ;
attrib var_7 informat=$10. ;
attrib var_8 informat=$1. ;
attrib var_9 informat=1. ;
attrib var_10 informat=1. ;
retain _all_;
infile '/data/landing/D0086_1.dat' dlm='|' truncover;
input group_no :$3. @;
if notdigit(group_no)=0 then do;
select (group_no);
when('196') input var_1 :$13. var_2 :$1. ;
when('197') input var_3 :10. var_4 :$1. ;
when('198') input var_5 :date9. var_6 :$14. var_7 :$10. var_8 :$1. var_9 :1. var_10 :1. ;
otherwise put "WARNING: Unexpected group= " group_no;
end;
output;
end;
run;
proc append base=staging.D0086 data=work.D0086 ;
run;
proc datasets lib=work nolist nowarn;
delete D0086 ;
run;quit;
If you change the output destination to 'file codegen;' then the %include statement at the end will execute the generated code.
See if this gives some starting point for this project.
data example;
  infile datalines dlm=',' truncover;
  input file $ group $ field $ INFORMAT $;
datalines;
D0086,ZPT,,
D0086,ZHV,,
D0086,196,var_1,$13.
D0086,196,var_2,$1.
D0086,197,var_3,10.
D0086,197,var_4,$1.
D0086,198,var_5,date9.
D0086,198,var_6,$14.
D0086,198,var_7,$10.
D0086,198,var_8,$1.
D0086,198,var_9,1.
D0086,198,var_10,1.
D0087,ZZZ,,
D0087,ZBB,,
D0087,196,var_2,$1.
D0087,197,var_3,10.
D0087,197,var_4,$1.
D0087,198,var_7,$10.
D0087,198,var_8,$1.
D0087,198,var_9,1.
D0087,198,var_10,1.
;

proc sort data=example;
  by file group;
run;

data _null_;
  set example;
  by file group;
  file print;
  targetlib='mylib';
  length str $ 100;
  str = cat('data ',catx('.',targetlib,file),';');
  if first.file then do;
    put str;
    /* an INFILE statement would go here with path and options such as delimiter */
    put "infile 'somepath/file.ext' dlm='|' <other options>;";
    put "input group :$3. @;";
    put "select (group);";
  end;
  if first.group then do;
    str = cat('when (',quote(strip(group)),') input ');
    put str;
  end;
  if not missing(field) then str = catx(' ',field,':',informat);
  else str = '';
  put str;
  if last.group then put ';';
  if last.file then do;
    put 'otherwise put "WARNING: Unexpected group= " group;';
    put 'end;';
    put 'run;';
  end;
run;
Your control could have more information. Note that I used INFORMAT with different SAS supplied informats. The output instead going to FILE Print, used for your review, should be written to a SAS Program file, FILEVAR option would let you write to a separate program file per File value. The example should give you enough of a hint to write a proper INFILE statement. If you need a different DATA name then provide that along with the FILE in the control data. Similarly the output library could be in the control set.
You did not state clearly, at least enough for me to tell, what your internal "header" indicators are, or whether anything needs to be read from them. If they are the ZHV and ZPT records and you don't want to read them, then add them to the control group as above. If you do read values from them, then include them like the other groups.
I actually learned SAS/AF and SAS/FSP for exactly this purpose, with the added kicker of data maintenance for the control data set as "groups" were added/removed, and as the content of some of the "group" values changed over time when instruments were added to or dropped from data logger channels, so the control data had to adjust by date and time for when the changes occurred. I also incorporated code that wrote a report of those changes so I could set the start date for the new stream.
I dealt with over 50 sites and as many as 25 "groups" for some of those sites.