Solved: Share: Rewrite the IMPORT procedure (partly)

Slash · Posted 07-24-2018 02:09 AM

Background:

Recently, I got a task to read the text data to SAS automatically. And the text data is transferred via FTP from other application. The users select the data they want on the application, then click the "send" button, and the data will be send to the FTP. So, I write a SAS program to scan the FTP and find the text data, then download it to the local machine. Because the data is selected by different user for different purpose, the text data format is not same. So, I can only use the IMPORT procedure to read the data to SAS data set.

Problem:

1. Some ID variables are composed of more than 15 numbers, for example: 986509856125436590. The IMPORT procedure recognizes it as a numeric variable cause its value is all digits. And yet the numeric variable can not store a such big number, it will be distorted.

2. Some code values are fixed length, and composed of digits. for example: 001, 018, 023. The IMPORT procedure also recognizes it as a numeric variable, which is not we expected.

3. The IMPORT procedure doesn't support double characters delimiter, it will generate extra variables.

Solution in a general way:

1. For problem 1 and 2, I can ask the application guy to add quotation marks to the character values.

2. For problem 3, just use another data step to drop the extra variables.

My Solution:

As we know, when the IMPORT procedure reads the delimited file, actually the data is read by data step. The job of IMPORT procedure is to generate the data step code. So, Why not generate it by myself?

So, I write a macro to do this. Here is the result:

the text data:

Result of IMPORT procedure:

Result of import macro:

proc import datafile='D:\test.txt' out=test_import dbms=dlm replace;
    delimiter='||';
    getnames=yes;
run;


%custom_import_data(%str(D:\test.txt),%str(||),test_my_import,get_names=yes);

And Here is the macro code, it does solve my problem for now, I'm not sure it can import any kind of delimited file correctly. Maybe there is a bug I didn't notice, and I'm sure there is a way to improve the performance. So, welcome to update it.

%macro custom_import_data(file_path,              /*path of delimited file.*/
                           file_dlm,               /*delimiter, can be more than two characters.*/
                           out_dsn,                /*name of output SAS data set.*/
                           file_encoding=wlatin1,  /*encoding of delimitered file.*/
                           get_names=no,           /*whether generates variable names from the first row. */
                           last_col_dlm=no,        /*whether the line is end with the delimiter.*/
                           guessing_rows=1000);   /*the number of rows to scan.*/

%local i;

filename read "&file_path"  encoding="&file_encoding";
data _null_;
    infile read obs=1;
    input;
    %if %lowcase(&last_col_dlm)=no %then var_num=count(_infile_,"&file_dlm")+1;
    %else var_num=count(_infile_,"&file_dlm");
    ;
    call symput('var_num',compress(var_num));
    %if %lowcase(&get_names)=yes %then %do;
        do i=1 to var_num;		
            call symput(cats('var',i),strip(scan(_infile_,i,"&file_dlm")));
        end;
    %end;
run;
%put The number of variables: &var_num;

data work._null_;
    infile read _infile_=line 
        %if %lowcase(&get_names)=no %then firstobs=1;
        %else firstobs=2;
        end=last obs=&guessing_rows;
    
    array var_type {&var_num} $ 1;
    array var_length {&var_num};
    array var_informat {&var_num} $ 20;
    array var_format {&var_num} $ 20;

    length value $ 1000;
    retain var_: p_date1 p_date2 p_date3 p_time p_datetime1 p_datetime2 p_num1 p_num2;
    if _n_=1 then do;
        p_date1=prxparse('/^\d{4}(\\|-|\/)\d{1,2}(\\|-|\/)\d{1,2}$/');
        p_date2=prxparse('/^\d{1,2}(\\|-|\/)\d{1,2}(\\|-|\/)\d{4}$/');
        p_date3=prxparse('/^\d{1,2}\D{3}\d{4}$/');
        p_time=prxparse('/^\d{1,2}:\d{1,2}:\d{1,2}$/');
        p_datetime1=prxparse('/^\d{4}(\\|-|\/)\d{1,2}(\\|-|\/)\d{1,2}(:|\s)*\d{1,2}:\d{1,2}:\d{1,2}$/');
        p_datetime2=prxparse('/^\d{1,2}\S{3}\d{4}(:|\s)*\d{1,2}:\d{1,2}:\d{1,2}$/');
        p_num1=prxparse('/^-?[1-9]\d{0,10}(\.\d+)?$/');
        p_num2=prxparse('/^-?0(\.\d+)?$/');
    end;

    input;
    do i=1 to &var_num;
        value=scan(line,i,"&file_dlm");
		if var_type{i}='$' then do;
			len=length(strip(value));

			if var_type{i}='n' then do;
				var_type{i}='$';
				var_length{i}=len;
                var_informat{i}=cats('$',len,'.');
                var_format{i}=cats('$',len,'.');
			end;
			else do;
	            var_type{i}='$';
	            if len>var_length{i} then do;
	                var_length{i}=len;
	                var_informat{i}=cats('$',len,'.');
	                var_format{i}=cats('$',len,'.');
	            end;
			end;
		end;
		else do;
	        if prxmatch(p_date1,strip(value))>0 then do;
	            var_type{i}='n';
	            var_length{i}=8;
	            var_informat{i}='yymmdd10.';
	            var_format{i}='yymmdd10.';
	        end;
	        else if prxmatch(p_date2,strip(value))>0 then do;
	            var_type{i}='n';
	            var_length{i}=8;
	            var_informat{i}='mmddyy10.';
	            var_format{i}='mmddyy10.';
	        end;
	        else if prxmatch(p_date3,strip(value))>0 then do;
	            var_type{i}='n';
	            var_length{i}=8;
	            var_informat{i}='date9.';
	            var_format{i}='date9.';
	        end;
	        else if prxmatch(p_time,strip(value))>0 then do;
	            var_type{i}='n';
	            var_length{i}=8;
	            var_informat{i}='time9.';
	            var_format{i}='time9.';
	        end;
	        else if prxmatch(p_datetime1,strip(value))>0 or prxmatch(p_datetime2,strip(value))>0 then do;
	            var_type{i}='n';
	            var_length{i}=8;
	            var_informat{i}='anydtdtm40.';
	            var_format{i}='datetime19.';
	        end;
	        else if prxmatch(p_num1,strip(value))>0 or prxmatch(p_num2,strip(value))>0 or strip(value)='' then do;
	            var_type{i}='n';
	            var_length{i}=8;
	            var_informat{i}='best12.';
	            var_format{i}='best12.';
	        end;
	        else do;
				len=length(strip(value));

				if var_type{i}='n' then do;
					var_type{i}='$';
					var_length{i}=len;
	                var_informat{i}=cats('$',len,'.');
	                var_format{i}=cats('$',len,'.');
				end;
				else do;
		            var_type{i}='$';
		            if len>var_length{i} then do;
		                var_length{i}=len;
		                var_informat{i}=cats('$',len,'.');
		                var_format{i}=cats('$',len,'.');
		            end;
				end;
	        end;
		end;
		
    end;
    if last or _n_=&guessing_rows then do;
        output;
        do i=1 to &var_num;
            call symput(cats('var_type',i),strip(var_type{i}));
            call symput(cats('var_length',i),strip(var_length{i}));
            call symput(cats('var_informat',i),strip(var_informat{i}));
            call symput(cats('var_format',i),strip(var_format{i}));
        end;
    end;
    keep var_:;
run;

data &out_dsn;
    infile read dlmstr="&file_dlm" dsd 
        %if %lowcase(&get_names)=no %then firstobs=1;
        %else firstobs=2;
        ;
    length
        %do i=1 %to &var_num;			
			%if &&var_type&i=n %then %let var_type&i=;
            %if %lowcase(&get_names)=yes %then "&&var&i"n &&var_type&i &&var_length&i;
            %else var&i &&var_type&i &&var_length&i;
        %end;
        ;
    format
        %do i=1 %to &var_num;
            %if %lowcase(&get_names)=yes %then "&&var&i"n &&var_format&i;
            %else var&i &&var_format&i;
        %end;
        ;
    input
        %do i=1 %to &var_num;
            %if %lowcase(&get_names)=yes %then "&&var&i"n : &&var_informat&i;
            %else var&i : &&var_informat&i;
        %end;
        ;
run;

%mend custom_import_data;

Tom · Posted 07-24-2018 10:53 AM

How exactly are they generating the files? Can you improve that process to also produce metadata that can be used to drive your reading of the files? The biggest problem with a simple delimited text file is that the only metadata it has is the header row.

One problem you seem to have is that it appears you are trying to use a two byte sequence '||' as delimiter between fields. SAS infile statement does not support treating a string as the delimiter option. Can you change the process that is generating the files to use a single character as the delimiter? If not you might need to pre-process the files to convert them into properly formatted delimited files. NOTE: There is a DLMSTR= option on the INFILE statement you can use instead of the DLM= option.

If they are using consistent headers for the columns then follow the advice of @ballardw and maintain your own metadata to drive the process. For example you might have a metadata table like this:

data fields ;
  length name $32 type_length $6 informat format $41 label $256 ;
  infile cards dsd dlm='|' truncover ;
  input name -- label;
cards;
SerialNO|$20|||Serial Number (digit string)
product_id|$5|||Product Identifier
date1|8|anydtdte.|yymmdd10.|First Date
datetime1|8|anydtdtm.|datetime20.|First Time Stamp
;

You can then read just the header line from the new file and merge the names with the field metadata. Now you can use the field metadata to write the data step to read the file.

data file_fields ;
  infile 'myfile.txt' dsd obs=1 dlmstr='||' ;
  varnum +1;
  input name :$32. @@ ;
run;

proc sql ;
  create table file_metadata as 
    select a.varnum,b.* 
    from file_fields a 
    left join fields b
    on upcase(a.name) = upcase(b.name)
    order by a.varnum
  ;
quit;

filename code temp;
data _null_;
  set file_metadata end=eof;
  if _n_=1 then first_name=name ;
  retain first_name ;
  put 'attrib ' name 'length=' type_length @ ;
  if not missing(informat) then put informat= @;
  if not missing(format) then put format= @;
  if not missing(label) then put label= :$quote. @;
  put ';' ;
  if eof then put 'input ' first_name '-- ' name ';' ;
run;

data want ;
  infile 'myfile.txt' dsd firstobs=2 dlmstr='||' truncover ;
%include code / source2 ;
run;

View solution in original post

RW9 · Posted 07-24-2018 03:56 AM

Why?? Write a datastep import program to read it correctly. From your proc import, in the log you will see the datastep which is created by the guessing procedure proc import. As that is not accurate, take that generated code and apply the correct import algorithms. Its seems like you have written all that to guess what the data looks like, where is your data import specification, signed off and agreed by both parties? Without this they could post you a picture of their computer with the data on screen, how will you code that? Just madness.

Specification/Documentation->Agreement->Build->LifeCycle

Or if you prefer:

Failure or spiraling resource costs.

Slash · Posted 07-24-2018 04:39 AM

The real problem is:

All my work is background, the program runs every 10 minutes, and the user access the data from the Client machine through EG. So, they can not modify the read code. There are many users send data to SAS platform, and I don't have time to correct the code every time.

RW9 · Posted 07-24-2018 04:57 AM

"So, they can not modify the read code. There are many users send data to SAS platform, and I don't have time to correct the code every time." - even more reason to have agreements in place. Either you can set yourself up to handle anything thats comes in - which will cost you a lot of time and fail most of it, or you can fix an agreement in between that users should conform to. Just a simple thing, if they send the data with a * delimiter, what then, are you gogin to program to find what the delimiter is or are you going to agree up front that it will always be a pipe? This is the basis for any data transfer, Document/Spec->Agreement->Development->Lifecycle.

ballardw · Posted 07-24-2018 10:29 AM

Does that serial number variable occur in all of your data sets? Many of your data sets? Some?

If your users are extracting data from known sources then perhaps you need a data base of the variable names (column heading values in the text files), and properties such as character/numeric, length, informat and format that should be associated. Assuming you don't have multiple variables with the same name, i.e. SerialNo it would be relatively easy to use that database/ dataset of descriptors and parse the header line to do what you accomplish.

If the other application is "nice" it might even have an existing file or source that you could use to get the descriptors and related information from.

Tom · Posted 07-24-2018 10:53 AM

How exactly are they generating the files? Can you improve that process to also produce metadata that can be used to drive your reading of the files? The biggest problem with a simple delimited text file is that the only metadata it has is the header row.

One problem you seem to have is that it appears you are trying to use a two byte sequence '||' as delimiter between fields. SAS infile statement does not support treating a string as the delimiter option. Can you change the process that is generating the files to use a single character as the delimiter? If not you might need to pre-process the files to convert them into properly formatted delimited files. NOTE: There is a DLMSTR= option on the INFILE statement you can use instead of the DLM= option.

If they are using consistent headers for the columns then follow the advice of @ballardw and maintain your own metadata to drive the process. For example you might have a metadata table like this:

data fields ;
  length name $32 type_length $6 informat format $41 label $256 ;
  infile cards dsd dlm='|' truncover ;
  input name -- label;
cards;
SerialNO|$20|||Serial Number (digit string)
product_id|$5|||Product Identifier
date1|8|anydtdte.|yymmdd10.|First Date
datetime1|8|anydtdtm.|datetime20.|First Time Stamp
;

You can then read just the header line from the new file and merge the names with the field metadata. Now you can use the field metadata to write the data step to read the file.

data file_fields ;
  infile 'myfile.txt' dsd obs=1 dlmstr='||' ;
  varnum +1;
  input name :$32. @@ ;
run;

proc sql ;
  create table file_metadata as 
    select a.varnum,b.* 
    from file_fields a 
    left join fields b
    on upcase(a.name) = upcase(b.name)
    order by a.varnum
  ;
quit;

filename code temp;
data _null_;
  set file_metadata end=eof;
  if _n_=1 then first_name=name ;
  retain first_name ;
  put 'attrib ' name 'length=' type_length @ ;
  if not missing(informat) then put informat= @;
  if not missing(format) then put format= @;
  if not missing(label) then put label= :$quote. @;
  put ';' ;
  if eof then put 'input ' first_name '-- ' name ';' ;
run;

data want ;
  infile 'myfile.txt' dsd firstobs=2 dlmstr='||' truncover ;
%include code / source2 ;
run;

Share: Rewrite the IMPORT procedure (partly)

Re: Share: Rewrite the IMPORT procedure (partly)

Re: Share: Rewrite the IMPORT procedure (partly)

Re: Share: Rewrite the IMPORT procedure (partly)

Re: Share: Rewrite the IMPORT procedure (partly)

Re: Share: Rewrite the IMPORT procedure (partly)

Re: Share: Rewrite the IMPORT procedure (partly)

Classroom Training Available!