BookmarkSubscribeRSS Feed
teelov
Quartz | Level 8

Hello SAS Community.

I've popped outside my 'ETL DI Studio bubble' to have hit a rather challenging but tricky SAS issue. i'm on SAS 9.4M2

 

We have a number of file from source which have a specific file specification. The files are named D0149,D0159,D0086 and so on. Each file has its own file spec, BUT, within the file itself, there are a header, a footer, and sandwiched in between is the data we need to read in. however, the first 3 characters of each line are numbered, and that number determines the file spec... per record.

 

for example file D0086 will look like this:

 

ZHV|99899|D0086001|D|NEEB|X|NEEB|29990998199924||||TR01|
196|9899999799996|V|
197|L99XXX9999|D|
198|01|29990528000000|65706.0|X|Y|T|
ZPT|99899|3||1|29990998199924|

 

Group 196 and 197 has 2 fields and group 198 has 6..


The next day we might get similar data file D0086_2, but no 198 group, instead a 199 group which has 2 fields.

 

another example D0150

 

ZHV|99999|D0150001|M|LBSL|X|NEEB|20179568182352||||TR01|
288|127227343996|20179568|D|x|
289|0158|20179568|Z||
290|XXXTEST011|||40|F|L&G CL27|LBSL|||||||||||K|20179568|20179568|20200101|||H|20179568|
293|L|C|AI|9.00||9||DE|
293|N|C|AI|9.00||7|||
08A|KXXA 12387|20179568|LBSL|
ZPT|99999|6||1|20179568182352|

 

my thoughts are, the first 3 characters is what really tells us what the file spec is going to based on the 3 numbers, but only once the file (D0086, D0086_2 or D0150) is read in, so we have to know what files exist before reading them in and maybe have a control file of every group number and the possible fields with the field lengths?

i have tried a basic read in, but this still leaves the pipes in the data, and ideally would need it all on one line. this is not dynamic yet but a start.

 

would you agree with having the file spec as a control table like

 

FILE,GROUP,FIELD_NAME,LENGTH

D0086,196,var_1,13,
D0086,196,var_2,1,
D0086,197,var_3,10,
D0086,197,var_4,1,
D0086,198,var_5,2,
D0086,198,var_6,14,
D0086,198,var_7,10,
D0086,198,var_8,1,
D0086,198,var_9,1,
D0086,198,var_10,1

 

how we would dynamically build the code is another thing...

 

having had an attempt at the very basic read in but not with the desired results. how do we omit the header and footer?

has anyone else ever had data within a file which formats are spread over multiple rows?

 

data test;
infile 'some/file/location/D0086.dat'  delimiter = '|' dsd missover firstobs = 2;
input group $3. @;
if group ='196' then input @1 group var_1 $13. var_2 $1.;
if group ='197' then input @1 group var_3 $10. var_4 $1.;
if group ='198' then input @1 group var_5 $2. var_6 $14.  var_7 $10. ;
run;

any help is appreciated.

 

Matt

20 REPLIES 20
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @teelov 

 

You say that the files are named D0086, D0149, D0010 and so on, but later you refer to files as A_1, A_2 and B_1, and you use these names in your control table too.

 

If the physical names are A_1 etc, then what is D0086? And is D0086 actually the first 5 chars of the third field in the header. I ask because I wonder if the header information could be used to determine the actual file specification.

 

Can a given record type (196, 197, 198 etc) exist in more than one file specification? and do you have information available to specify all possible record types within a given specification and the type/max length for all fields in a record of a given type. - I see that you have a type 290 with a lot of empty fields, and it is difficult to write a program to read these fields without knowing more.

teelov
Quartz | Level 8
amended, hopefully it reads better, there were meant to be the same thing. the actual files are D####

yes, the groups can exist in many different D#### files, but the groups will always have static fields for that group
teelov
Quartz | Level 8

in the fictional control table i mention above, this is something that is being made as we speak, for documentation purposes. group 290 can have around 29 fields populated.

the data specs are open to the public and are found here, an example of a D0150

 

https://dtc.mrasco.com/DataFlow.aspx?FlowCounter=0150&FlowVers=1&searchMockFlows=False

ballardw
Super User

@teelov wrote:

in the fictional control table i mention above, this is something that is being made as we speak, for documentation purposes. group 290 can have around 29 fields populated.

the data specs are open to the public and are found here, an example of a D0150

 

https://dtc.mrasco.com/DataFlow.aspx?FlowCounter=0150&FlowVers=1&searchMockFlows=False


First the link does not go to a specific "flowcounter" for me. I land on a generic start page. So I am not quite sure I end up on the same page. Without an actual text file to read I do not have the time to try to parse an entire industries jargon and guess what the file may actually look like to provide any modifications to code.

 

It does appear that the "flows" have some information that should likely be used to create variable labels and the "Group Description" could be used for a custom format for the group to display such text instead of a pretty obscure 290.

 

I have to guess that things like the J0476 may be your variable names.

 

The "Flow Structure" part is extremely obscure for an outsider as you have a block of columns L1 -L8 with no apparent explanation; a "condition" but no way to tell how that "condition" would appear in the actual data

 

noling
SAS Employee

One approach would be to just read everything into SAS (somewhat blindly), then make a second pass and conditionally rename your variables and output conditionally, or whatever you need to do. If you need to drop the first and last records (header and footer), you could do that in your newly created input dataset.

 

You could also make assumptions like "There is a maximum of 50 fields per record", and "The longest character string is $100", and read everything in as a character field. 

 

If you have a large sample of data, you could read all your data in, then go back and find the max char length per field, how many fields you need to read in, numeric vs character, etc... Basically do it in small steps first, then combine it back into a single datastep if desired.

 


Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF

View now: on-demand content for SAS users

Patrick
Opal | Level 21

@teelov 

This provider doesn't make it easy for you!

I'd create a data dictionary with the following elements (close to what Proc Contents would return)

source, group, var_num, var_name, var_type, length, informat, format, label

Required: source, group, var_num, var_name, var_type, length, informat

 

Above assumes that within a source a variable can only be within a single group. If this is not the case then an additional table would be required.

 

I guess you get the files delivered into some landing folder and you then create a folder listing which provides you the list of files to be processed.

 

If so then you've got now already two things:

1. A table with the full data dictionary for all possible source files

2. The list of source files for the day

 

If you combine above two tables then you have everything that's required to fully and dynamically generate data steps for reading all the source files into SAS tables.

 

What I would do: Always define all possible columns per source file whether there is now data for it on the day or not. This will give you stable table structures  - which is what you want for further downstream processing.

 

As for skipping the header and especially the footer: You're already reading the first column independently. If you only OUTPUT if you've actually got digits only in the first column then you'll be skipping header and footer and any potentially empty record.

 

By combining the Data Dictionary (DD) and folder listing table you could dynamically generate code as below:

data test(drop=_group);
  infile 'some/file/location/D0086.dat'  delimiter = '|' dsd missover firstobs = 2;
  attrib
    var1 length=$13 informat=$13.
    var2 length=$1  informat=$1.
    var3 length=$10 informat=$10.
    var4 length=$1  informat=$1.
    ;
  retain _all_;
input _group $3. @; if notdigit(_group)=0 then do; if _group ='196' then input var_1 var_2 ; else if _group ='197' then input var_3 var_4 ; else do; /**....here some logic for exception handling as a group has been missed...**/ return; end; output; end; run;

 

If you want to take above approach but have difficulties with implementing it then I'm happy to help if you provide sample data in the form of tested SAS data steps creating such data.

What would be required:

1. A table for the data dictionary

2. A table for the directory listing

3. A text file attached with a name and data that matches with the data dictionary and the directory listing

 

Change:

Added a retain statement after the attrib statement to deal with the hierarchical structure as mentioned by @ErikLund_Jensen 

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @teelov 

 

After a brief look at the documentation, it seems even more complicated. I have a suspicion that a given flow (or source, eg. D0086) contains a hierarchial structure of groups with the possibility of multiple occurrences like this:

 

Have 
XYZ D1234
001 Meter A 
002 Reading 1
002 Reading 2
002 Reading 3
001 Meter B
002 Reading 1
002 Reading 2

 

Wanted 
D1234 A 1
D1234 A 2
D1234 A 3
D1234 B 1
D1234 B 2

 

If this is the case, then a file record is not a stand-alone unit, as ID values from one group must be retained to provide reference-ID's for the following groups. 

 

A way of handling this is with this skeleton code, where the file is transformed to a relational structure:

 

data 
	meter (keep = Meter_iD ...)
	reading (keep = Meter_ID Rreading ...);
	length Meter_ID $10 ... ;
	retain Meter_ID;
	infile ..;
	input;
	if scan(_infile_,1,'|') = '001' then do;
		Meter_ID = scan(_infile_,2,'|');
		...
		output meter;
	end;
	if scan(_infile_,1,'|') = '002' then do;
		Reading = input(scan(_infile_,2,'|');
		...
		output reading;
	end;
run;
teelov
Quartz | Level 8
i very much like your approach here. here is a example of a 149 structure

teelov
Quartz | Level 8
Data_ItemData_FlowData_Item_GroupItemNameItemLogicalTypeItemLogicalLen
3149280MPAN CoreCHAR13
1254149280Effective from Settlement Date {MSMTD}CHAR8
76149281Standard Settlement Configuration IdCHAR4
300149281Effective from Settlement Date {SCON}DATE8
78149778Time Pattern RegimeTIME14
41491025Meter Id (Serial Number)CHAR10
10149284Meter Register IdCHAR2
679149284Register Mapping CoefficientNUM1
12671491023Metering System Non Settlement Functionality CodeCHAR10
12681491023Effective From Settlement Date {MSNSFC}DATE8
781491024Time Pattern RegimeTIME14
4149283Meter Id (Serial Number)CHAR10
101491026Meter Register IdCHAR2
6791491026Register Mapping CoefficientNUM1
Patrick
Opal | Level 21

@teelov 

Can you please post this dictionary data as a SAS data step which creates a SAS table.

I also need an informat defined. TIME alone doesn't provide sufficient information which informat to use for reading a specific date/datetime/time string.

teelov
Quartz | Level 8

i like your thinking - i did have an idea but i don't quite know how to execute it yet.

 

i have finally go over half way with all the file specs, attached is a complete one, and a test file.

i was thinking something like a macro variable containing the var_1 var_2 var_3 and so on..depending on the group

thank you for your help, very much appreciated.

Patrick
Opal | Level 21

Building on what @ballardw posted here how such a code generator could look like for you.

 

options ps=max ls=max;

data DD;
   infile datalines dlm=',' truncover;
   input file $ group $ varnum field $ INFORMAT $;
datalines;
D0086,196,1,var_1,$13.
D0086,196,2,var_2,$1.
D0086,197,3,var_3,10.
D0086,197,4,var_4,$1.
D0086,198,5,var_5,date9.
D0086,198,6,var_6,$14.
D0086,198,7,var_7,$10.
D0086,198,8,var_8,$1.
D0086,198,9,var_9,1.,
D0086,198,10,var_10,1.
D0087,196,1,var_2,$.1
D0087,197,2,var_3,10.
D0087,197,3,var_4,$1.
D0087,198,4,var_7,$10.
D0087,198,5,var_8,$1.
D0087,198,6,var_9,1.,
D0087,198,7,var_10,1.
;

data dirlist;
  infile datalines dlm=',' truncover;
  input path :$300. ext_file :$30.;
  datalines;
/data/landing,D0086_1.dat
/data/landing,D0086_2.dat
/data/landing,D0087.dat
;

proc sql;
  create table control as
    select 
      l.*, r.*
    from DD l inner join dirlist r
      on l.file=upcase(scan(ext_file,1,'._'))
    order by path, ext_file, group, varnum
    ;
quit;

%let target_lib=staging;
filename codegen temp;
data _null_;
  file print;
/*  file codegen;*/
  set control;
  by path ext_file group varnum;

  if first.ext_file then 
  do;
    put 
      'data work.' file ';' 
      ;
    do until(last._ext_file);
      set control(rename=(path=_path ext_file=_ext_file group=_group varnum=_varnum));
      by _path _ext_file _group _varnum;
      put
        '  attrib ' field 'informat=' informat ';'
        ; 
    end;

    put
      "  retain _all_;" /
      "  infile '" path +(-1)"/" ext_file +(-1)"' dlm='|' truncover;" /
      '  input group_no :$3. @;' /
      '  if notdigit(group)=0 then do;' /
      '    select (group_no);'
      ;
  end;

  if first.group then 
  do;
    put
      "      when('" group +(-1)"') input " @;
      ;
  end;

  put field ':' informat @;
  if last.group then put ';';

  if last.ext_file then do;
    put 
      '      otherwise put "WARNING: Unexpected group= " group;' /
      '    end;' /
      '    output;' /
      '  end;' /
      'run;' 
      ;
    put
      "proc append base=&target_lib.." file 'data=work.' file ';' /
      'run;' /
      'proc datasets lib=work nolist nowarn;'/
      '  delete ' file ';' /
      'run;quit;' /
      ;
  end;
run;

data _null_;
  file codegen mod;
  stop;
run;
%include codegen / source2;
filename codegen clear;

 

Above will generate code like:

 

data work.D0086 ;
  attrib var_1 informat=$13. ;
  attrib var_2 informat=$1. ;
  attrib var_3 informat=10. ;
  attrib var_4 informat=$1. ;
  attrib var_5 informat=date9. ;
  attrib var_6 informat=$14. ;
  attrib var_7 informat=$10. ;
  attrib var_8 informat=$1. ;
  attrib var_9 informat=1. ;
  attrib var_10 informat=1. ;
  retain _all_;
  infile '/data/landing/D0086_1.dat' dlm='|' truncover;
  input group_no :$3. @;
  if notdigit(group)=0 then do;
    select (group_no);
      when('196') input var_10 :1. var_2 :$1. ;
      when('197') input var_3 :10. var_4 :$1. ;
      when('198') input var_5 :date9. var_6 :$14. var_7 :$10. var_8 :$1. var_9 :1. var_10 :1. ;
      otherwise put "WARNING: Unexpected group= " group;
    end;
    output;
  end;
run;
proc append base=staging.D0086 data=work.D0086 ;
run;
proc datasets lib=work nolist nowarn;
  delete D0086 ;
run;quit;

 

If you change the output destination to file codegen; then the %include statement at the end will execute the generated code.

teelov
Quartz | Level 8
thank you so much. I'm going to take today and the weekend to look into this. i've never worked with data in this form before and its indutry data so i'm discovering more and more about it the more i start to read it in, the more i notice different scenarios i've not catered for
ballardw
Super User

See if this gives some starting point for this project.

 

data example;
   infile datalines dlm=',' truncover;
   input file $ group $ field $ INFORMAT $;
datalines;
D0086,ZPT,,
D0086,ZHV,,
D0086,196,var_1,$13.
D0086,196,var_2,$.1
D0086,197,var_3,10.
D0086,197,var_4,$1.
D0086,198,var_5,date9.
D0086,198,var_6,$14.
D0086,198,var_7,$10.
D0086,198,var_8,$1.
D0086,198,var_9,1.,
D0086,198,var_10,1.
D0087,ZZZ,,
D0087,ZBB,,
D0087,196,var_2,$.1
D0087,197,var_3,10.
D0087,197,var_4,$1.
D0087,198,var_7,$10.
D0087,198,var_8,$1.
D0087,198,var_9,1.,
D0087,198,var_10,1.
;

proc sort data=example;
   by file group;
run;

data _null_;
   set example;
   by file group;
   file print;
   targetlib='mylib';
   length str $ 100;
   str = cat('data ',catx('.',targetlib,file),';');
   if first.file then do;
      put str;
      /* an INFILE statement would go here with
         path and options such as delimiter
      */
      put "infile 'somepath/file.ext' dlm='|' <other options>;";
      put "input group :$3. @;";
      put "select (group);";
   end;
   if first.group then do;
      str = cat('when (',quote(strip(group)),') input ');
      put str;
   end;
   if not missing(field) then str = catx(' ',field,':',informat);
   else str ='';
   put str;
   if last.group then put ';';
   if last.file then do;
      put 'otherwise put "WARNING: Unexpected group= " group;';
      put 'end;';
      put 'run;';
   end;

run;

 

Your control could have more information. Note that I used INFORMAT with different SAS supplied informats. The output instead going to FILE Print, used for your review, should be written to a SAS Program file, FILEVAR option would let you write to a separate program file per File value. The example should give you enough of a hint to write a proper INFILE statement. If you need a different DATA name then provide that along with the FILE in the control data. Similarly the output library could be in the control set.

 

You did not clearly, at least enough for me to tell what your internal "header" indicator are or if anything needs to be read from them. If they were the ZHV and ZPT and you don't want to read them then add them to the control group as above. If you read values then include as the other groups.

 

I actually learned SAS/AF and SAS/FSP for exactly this purpose with the added kicker of adding in data maintenance for the control data set as "groups" were added/removed and the  content of some of the "group" values changed over time as instruments were added/dropped from data logger channels and had to adjust by date and time for when the changes occurred. Actually I incorporated code that wrote a report for those changes so I could set the start date for the new stream.

I dealt with over 50 sites and as many as 25 "groups" for some of those sites.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 20 replies
  • 4702 views
  • 6 likes
  • 5 in conversation