Hi there,
I was looking for a way to verify that a file is "correctly" imported into SAS. I have a SAS program that uses data steps (with INFILE) to import data from a .txt or a .csv file. However, the program has been running for a long time now, and I was thinking today: since many of these files are produced (i.e. dumped) automatically by servers, if a new column were added to a file, how would I become aware of it?
Since the SAS program specifies the position, field name, and format of each column, how could I set up a "flag" that says: "Hey, there are more columns in the file than you specified in the INFILE"?
TL;DR: Is there a way to get the structure of the file (e.g. number of headers and lines) before importing it?
(I don't know if the GUESSINGROWS=1 option is appropriate in this case.)
I found this post, but I don't understand if it could be a solution to my question.
As always, thanks in advance for your kindness.
Could you just read the first row (with the headers) and count the number of columns?
@ChrisNZ wrote:
Could you just read the first row (with the headers) and count the number of columns?
Gosh. I was thinking so far "outside the box" that your reply made me realize how easy it could be. My bad (I'm new to SAS).
Is there a way to tell SAS to only read (INFILE) the first two rows?
EDIT: I found that I could use the OBS= option (e.g. OBS=2). Sorry for the inconvenience, @ChrisNZ!
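For anyone finding this later, a minimal sketch of reading only the first two rows with the OBS= option on the INFILE statement (the fileref F is an assumption, pointing at the CSV file):

data t;
infile f obs=2 dlm=',' truncover;
input;
* echo each of the two rows to the log ;
put _infile_;
run;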
Even one row is enough.
Pseudo code:
data T;
infile F dlm=',';
input;
call symputx('NB_COLS', countw(_INFILE_,',') );
stop;
run;
Will try it for sure tomorrow morning.
What is this option?
countw
Look up this function, it counts words.
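A quick sketch of what COUNTW does on a header line (the sample header here is made up for illustration):

data _null_;
header = 'name,sex,age,height,weight';
* count the comma-delimited words in the string ;
n = countw(header, ',');
put n=; * writes n=5 to the log ;
run;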
Various approaches:
filename temp temp;
* create csv ;
proc export data=sashelp.class file=temp dbms=csv;
run;
* check csv ;
data _null_;
infile temp;
input;
put _infile_;
run;
* option 1: count delimiters in the header ;
* this assumes there is no edge case of a quoted delimiter in the header ;
data _null_;
infile temp;
input;
call symputx('num_cols',countc(_infile_,',')+1,'G');
stop;
run;
%put &=num_cols;
* option 2: import the header and count columns ;
proc import file=temp out=class dbms=csv replace;
getnames=yes;
run;
data _null_;
if 0 then set class;
length varname $32;
do until (varname eq 'varname');
num_cols+1;
call vnext(varname);
end;
call symputx('num_cols',num_cols-1,'G');
stop;
run;
%put &=num_cols;
* option 3: dictionary tables ;
* this can have a performance issue if you have a lot of allocated librefs, esp. RDBMS ;
proc sql noprint;
select count(0) into :num_cols trimmed
from dictionary.columns
where upcase(libname)='WORK' and upcase(memname)='CLASS';
quit;
%put &=num_cols;
* option 4: proc contents ;
proc contents data=class out=columns (keep=name) noprint;
run;
%let num_cols=%nobs(columns);
%put &=num_cols;
What you do with &num_cols in your downstream code is up to you.
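For instance, one hypothetical use (the expected column count of 5 is made up for illustration) would be to compare &num_cols against what the program expects and write a warning to the log:

%let expected_cols=5;
%macro check_cols;
%if &num_cols ne &expected_cols %then
%put WARNING: file has &num_cols columns but this program expects &expected_cols;
%mend check_cols;
%check_cols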
I once had to do something similar, where I had to read a delimited file created by a web service. The structure could be changed by the customer via a web interface.
In that scenario, there was a max number of columns available from the web service, along with known attributes. But the customer could choose a subset, as well as reorder the columns.
I created a metadata table in the code via datalines (the metadata was a bit like proc contents output). I used approach #1 to get the list of column names from the header, as well as the order of the columns (a sort column). I then inner joined with the metadata to get the known column attributes and the correct order from the header itself, then generated the correct attrib statements and code to import the dynamic data.
HTH...
@ScottBass wrote:
I once had to do something similar, where I had to read a delimited file created by a web service. The structure could be changed by the customer via a web interface.
In that scenario, there was a max number of columns available from the web service, along with known attributes. But the customer could choose a subset, as well as reorder the columns.
I created a metadata table in the code via datalines (the metadata was a bit like proc contents output). I used approach #1 to get the list of column names from the header, as well as the order of the columns (a sort column). I then inner joined with the metadata to get the known column attributes and the correct order from the header itself, then generated the correct attrib statements and code to import the dynamic data.
HTH...
Something like this:
data class;
* reorder columns ;
format sex name weight height age;
set sashelp.class;
run;
* create metadata (pretend this was via datalines) ;
proc contents data=sashelp.class out=metadata (keep=name type length label) noprint;
run;
* create a CSV with a subset of the columns ;
data subset / view=subset;
set class;
keep name age sex;
run;
filename temp temp;
proc export data=subset file=temp dbms=csv replace;
run;
* get the structure of this csv file ;
data columns;
length name $32;
infile temp;
input;
do i=1 to countc(_infile_,',')+1;
name=scan(_infile_,i,',');
sort+1;
output;
end;
keep name sort;
stop;
run;
proc sql;
create table attribs as
select c.name, m.type, m.length, m.label
from metadata m
join columns c
on upcase(m.name)=upcase(c.name)
order by c.sort;
quit;
%macro attribs;
%let label=%trim(&label);
attrib &name
length=%if (&type eq 2) %then $&length; %else &length;
label="&label"
;
%mend;
data final;
%loop_control(control=attribs,mname=attribs);
infile temp dlm=',' dsd firstobs=2;
input (_all_) (:);
run;
https://github.com/scottbass/SAS/blob/master/Macro/loop_control.sas
@ScottBass wrote:
I will keep this in my bookmarks for future usage. Very useful.
I just opened some files dumped by the server and realized that there are no headers. I took a look at the code (written by someone else) and realized that the field names are user-defined. Since the files do not have any headers, do you think I could use the length of the first line as an indicator of the "columns"?
For example, the code for loading those files looks like:
data file1 ;
infile 'myfile.txt' MISSOVER LRECL = 153 ;
input
@1 Field1 10.
@11 Field2 $15.
@26 Field3 $1.
@27 Field4 $1.
@28 Field5 $2.
@30 Field6 $6.
@36 Field7 $60.
@96 Field8 $2.
@98 Field9 9.
@107 Field10 13.
@120 Field11 11.
@131 Field12 12.
@143 Field13 $10.
@153 Field14 $1.
@154 Field15 $3.
;
run ;
Thanks for the help!
For that kind of structure you will probably want to check that the length of the lines is appropriate. For example your posted code is truncating the input lines at 153 bytes and then trying to read data up to byte number 156.
Also check key variables to make sure they have valid values. For example FIELD14 might only allow values of 'A' or 'B'. If something else appears in byte number 153 of the line then it is a good indication that the file is not right.
PS: Never use the antiquated MISSOVER option when using formatted input. If your input line had only 155 bytes, your INPUT of FIELD15 would be missing even if bytes 154 and 155 were not blanks. Use the more robust TRUNCOVER option instead.
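A sketch of both checks combined. The expected record length of 156 and the valid FIELD14 values 'A'/'B' are assumptions for illustration, and only a few of the fields are shown:

data file1;
infile 'myfile.txt' truncover lrecl=200 length=reclen;
* read a record and hold it so we can inspect its length first ;
input @;
* flag records that are not the expected length: a changed layout is likely ;
if reclen ne 156 then put 'WARNING: unexpected record length ' reclen= ' at line ' _n_=;
input
@1 Field1 10.
@153 Field14 $1.
@154 Field15 $3.
;
* sanity-check a key field against its known domain of values ;
if Field14 not in ('A','B') then put 'WARNING: invalid value ' Field14= ' at line ' _n_=;
run;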
There is an INFILE option that will set a variable with the length of the line. You can also test the _INFILE_ automatic variable.
infile 'myfile.txt' length=line_length1 ;
input @;
line_length2=length(_infile_);
line_length3=lengthn(_infile_);
The difference between the two methods is that the first one counts any trailing spaces on the line. The difference between the LENGTH() and LENGTHN() functions shows up when the line contains only blanks: LENGTH() will return 1 and LENGTHN() will return 0.
@Tom wrote:
Hi @Tom ,
I tried to do this :
data t;
infile 'myfile.txt' truncover;
input @;
run;
And it returns no observations, but the OBS column seems to count the total number of lines.
I tried your code, and it gives me 155 in both columns for every line.
However, I was wondering what the usage of this is:
length=line_length1 ;
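As I understand it, LENGTH= on the INFILE statement names a variable that SAS sets to the length of the current input record each time INPUT reads a line. That variable is not written to the output data set automatically, so you copy it into a regular variable if you want to keep it. A minimal sketch (the file path is assumed):

data lengths;
infile 'myfile.txt' length=reclen truncover;
* read a record and hold it; reclen now holds its length ;
input @;
* copy the LENGTH= value into a variable that is written to the data set ;
line_length = reclen;
keep line_length;
run;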