DATA Step, Macro, Functions and more

Converting Raw Files with Dynamically Calculated Layout

Reply
Trusted Advisor
Posts: 1,300

Converting Raw Files with Dynamically Calculated Layout

Problem:

I receive a file from outside client that is a extract from mysql dumped to text.  As the database grows the format of the file is bound to alter based of the data input into the clients survey application.  Because of the boundless nature of the input a dynamic fixed length reading step seems to be the most productionable way to convert this data.

In other words - The input column will not change in any way from day-to-day however the length of the fields will change as data is collected (to be the length of the longest entity of the column).  As you can see there are delimiters between the columns however there is no restriction to the data entered so the delimiter cannot be necesarily relied upon from a row-to-row basis.

Desired Outcome:

Properly converted sas data to later be utilized for reporting purposes.

Obslist_nameoriginal_filenameemail_addrmemberjobnojob_nameaequestionchoicecomment
108232011survey_list_20110823.csvthis_is@anemailaddress.comMEM1FT93This is a jobs namehjsimpsonHow satisfied were you with the accuracy of your order?4
208232011survey_list_20110823.csvthis_is@anemailaddress.comMEM1FT93This is a jobs namehjsimpsonHow satisfied were you with the timeliness of your order?3This is some sort of random comment that the user has decided to leave in response to this question.
308232011survey_list_20110823.csvthis_is@anemailaddress.comMEM1FT93This is a jobs namehjsimpsonHow satisfied were you with you account manager?4
408232011survey_list_20110823.csvthis_is@anemailaddress.comMEM1FT93This is a jobs namehjsimpsonHow satisfied were you with the products and services Company XX offers?3This is another example of a random comment.  This is is quite long so yeah, I dunno, what do you think about it?  I was just starting to think to myself that this is just about long enough.  Okay, well mayke just a little bit longer then....  Hmm, okay.
508232011survey_list_20110823.csvthis_is@anemailaddress.comMEM1FT93This is a jobs namehjsimpsonHow satisfied were you with the overall experience of working with Company XX?4The comments could potentially contain any characters they wanted so I am not going to be able to use the | character to delimit them.

My Solution and Purpose for Posting:

This is my conclusion to the problem at hand, I thought of a few others but this is the one I ended up writing.  It works as expected, just looking to see what others might come up with.  Enjoy Smiley Happy

data input;

infile cards truncover;

input data $512.;

array _end[10] _end1-_end10;

if _n_=1 then

  do _i=1 to dim(_end);

   retain _end:;

   if _i=1 then _end[_i]=find(data,'+',2);

   else _end[_i]=find(data,'+',_end[_i-1]+1);

  end;

if substr(data,1,1)='+' then delete;

if _n_=2 then delete;

length list_name original_filename email_addr member jobno job_name ae question choice comment $256;

array _var[10] list_name original_filename email_addr member jobno job_name ae question choice comment;

do _j=1 to dim(_var);

  if _j=1 then _var[_j]=strip(substr(data,2,_end[_j]-2));

  else _var[_j]=strip(substr(data,_end[_j-1]+2,_end[_j]-_end[_j-1]-2));

end;

drop data _:;

cards;

+-----------+--------------------------+----------------------------+--------+-------+---------------------+-----------+--------------------------------------------------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

| list_name | original_filename        | email_addr                 | member | jobno | name                | ae        | content                                                                        | choice | comment                                                                                                                                                                                                                                                        |

+-----------+--------------------------+----------------------------+--------+-------+---------------------+-----------+--------------------------------------------------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with the accuracy of your order?                        |      4 |                                                                                                                                                                                                                                                                |

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with the timeliness of your order?                      |      3 | This is some sort of random comment that the user has decided to leave in response to this question.                                                                                                                                                           |

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with you account manager?                               |      4 |                                                                                                                                                                                                                                                                |

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with the products and services Company XX offers?       |      3 | This is another example of a random comment.  This is is quite long so yeah, I dunno, what do you think about it?  I was just starting to think to myself that this is just about long enough.  Okay, well mayke just a little bit longer then....  Hmm, okay. |

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with the overall experience of working with Company XX? |      4 | The comments could potentially contain any characters they wanted so I am not going to be able to use the | character to delimit them.                                                                                                                         |

+-----------+--------------------------+----------------------------+--------+-------+---------------------+-----------+--------------------------------------------------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

;

run;

Super User
Posts: 9,682

Converting Raw Files with Dynamically Calculated Layout

Do you try to use proc import ?

Ksharp

Trusted Advisor
Posts: 1,300

Converting Raw Files with Dynamically Calculated Layout

Since my real data contains over 100 million rows and about every 20 many rows there is just a erronious line of data with just the + and - marks I would rather save the time having to reread the file after a proc import to remove those lines.  So i guess the answer is no, I did not try using proc import, it may work, I'm not well very well versed with the procedure.  Can it automatically determine layouts on fixed length files?  I would think not?  How would it interpret the text fields that contain the faux-delimiter pipes?

PROC Star
Posts: 7,363

Converting Raw Files with Dynamically Calculated Layout

I liked Ksharp's suggestion, thus imported the file using proc import, then pressed function key 4 to retrieve the resulting code, modified it and came up with the following:

data want (drop=deleteme);

  infile 'C:\Art\fetext.txt' delimiter = '|'

   MISSOVER DSD lrecl=32767 firstobs=2 ;

  informat deleteme $1. ;

  informat list_name best32. ;

  informat original_filename $24. ;

  informat email_addr $26. ;

  informat member $4. ;

  informat jobno $4. ;

  informat name $19. ;

  informat ae $9. ;

  informat content $87. ;

  informat choice best32. ;

  informat comment $330. ;

  format list_name best12. ;

  format original_filename $24. ;

  format email_addr $26. ;

  format member $4. ;

  format jobno $4. ;

  format name $19. ;

  format ae $9. ;

  format content $87. ;

  format choice best12. ;

  format comment $330. ;

    input

                deleteme $

                list_name

                original_filename $

                email_addr $

                member $

                jobno $

                name $

                ae $

                content $

                choice

                comment $

    ;

  if not missing(list_name);

run;

While you may not want or need all of the formats and informats, the resulting file appears to be quite clean.

Trusted Advisor
Posts: 1,300

Converting Raw Files with Dynamically Calculated Layout

Using the code above I end up truncating the data in the last comment field which contains a '|' the same as the delimiter being used.

PROC Star
Posts: 7,363

Converting Raw Files with Dynamically Calculated Layout

data want (drop=deletemSmiley Happy;

  infile 'C:\Art\fetext.txt' delimiter = '|'

   TRUNCOVER DSD lrecl=32767 firstobs=2 ;

  informat deleteme $1. ;

  informat list_name best32. ;

  informat original_filename $24. ;

  informat email_addr $26. ;

  informat member $4. ;

  informat jobno $4. ;

  informat name $19. ;

  informat ae $9. ;

  informat content $87. ;

  informat choice best32. ;

  informat comment $330. ;

  informat deleteme2 $330. ;

  informat deleteme3 $330. ;

  informat deleteme4 $330. ;

  format list_name best12. ;

  format original_filename $24. ;

  format email_addr $26. ;

  format member $4. ;

  format jobno $4. ;

  format name $19. ;

  format ae $9. ;

  format content $87. ;

  format choice best12. ;

  format comment $330. ;

    input

                deleteme $

                list_name

                original_filename $

                email_addr $

                member $

                jobno $

                name $

                ae $

                content $

                choice

                comment $

                deleteme2 $

                deleteme3 $

                deleteme4 $

    ;

  comment=catx( "!",comment,deleteme2,deleteme3,deleteme4);

  if not missing(list_name);

run;

Respected Advisor
Posts: 3,777

Re: Converting Raw Files with Dynamically Calculated Layout

Looks to me like all the info needed to define start and end column numbers to read the data with column input is contained in line 1 of the file.  You can use code gen to create an input statement.  Or you can write something that uses start and LEN(end-start+1) with $varying.  Bottom line trying to use delimited input when the data contain unquoted delimiters is going to be messy.

filename FT15F001 temp lrecl=1024;

data fields;

   infile FT15F001 obs=1;

   c=1;

   input;

   do field=1 by 1;

      c = find(_infile_,'+',c);

      if c eq 0 then leave;

      output;

      c = c + 1;

      end;     

   stop;

   parmcards4;

+-----------+--------------------------+----------------------------+--------+-------+---------------------+-----------+--------------------------------------------------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

| list_name | original_filename        | email_addr                 | member | jobno | name                | ae        | content                                                                        | choice | comment                                                                                                                                                                                                                                                        |

+-----------+--------------------------+----------------------------+--------+-------+---------------------+-----------+--------------------------------------------------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with the accuracy of your order?                        |      4 |                                                                                                                                                                                                                                                                |

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with the timeliness of your order?                      |      3 | This is some sort of random comment that the user has decided to leave in response to this question.                                                                                                                                                           |

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with you account manager?                               |      4 |                                                                                                                                                                                                                                                                |

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with the products and services Company XX offers?       |      3 | This is another example of a random comment.  This is is quite long so yeah, I dunno, what do you think about it?  I was just starting to think to myself that this is just about long enough.  Okay, well mayke just a little bit longer then....  Hmm, okay. |

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with the overall experience of working with Company XX? |      4 | The comments could potentially contain any characters they wanted so I am not going to be able to use the | character to delimit them.                                                                                                                         |

+-----------+--------------------------+----------------------------+--------+-------+---------------------+-----------+--------------------------------------------------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

;;;;

   run;

proc print;

   run;

data info(keep=field start end);

   set fields end=eof;

   if not eof then set fields(firstobs=2 keep=c rename=c=c2);

   start = c + 1;

   end   = c2- 1;

   if eof then delete;

   run;

proc print;

   run;

PROC Star
Posts: 7,363

Re: Converting Raw Files with Dynamically Calculated Layout

DN,

Nice approach .. as always!  But why stop there?  How about:

filename FT15F001 temp lrecl=1024;

filename incfile temp;

data fields;

   infile FT15F001 obs=2;

   file incfile;

   array cc(20);

   retain cc: field;

   length fname $32;

   length fmt $10;

   c=1;

   input;

   if _n_ eq 1 then do;

     do field=1 to 20 by 1;

      c = find(_infile_,'+',c);

      if c eq 0 then leave;

            cc(field)=c+1;

      output;

      c = c + 1;

     end;

   end;

   else do i=1 to field-2;

     if i eq 1 then put 'input ';

     fname=strip(substr(_infile_,cc(i)+1,cc(i+1)-cc(i)-2));

     put fname @;

     put ' $ ' @;

     fmt=catt(cc(i)+1,'-',cc(i+1)-2);

     put fmt;

   end;

   if _n_ eq 2 then put ';';

   parmcards4;

+-----------+--------------------------+----------------------------+--------+-------+---------------------+-----------+--------------------------------------------------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

| list_name | original_filename        | email_addr                 | member | jobno | name                | ae        | content                                                                        | choice | comment                                                                                                                                                                                                                                                        |

+-----------+--------------------------+----------------------------+--------+-------+---------------------+-----------+--------------------------------------------------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with the accuracy of your order?                        |      4 |                                                                                                                                                                                                                                                                |

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with the timeliness of your order?                      |      3 | This is some sort of random comment that the user has decided to leave in response to this question.                                                                                                                                                           |

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with you account manager?                               |      4 |                                                                                                                                                                                                                                                                |

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with the products and services Company XX offers?       |      3 | This is another example of a random comment.  This is is quite long so yeah, I dunno, what do you think about it?  I was just starting to think to myself that this is just about long enough.  Okay, well mayke just a little bit longer then....  Hmm, okay. |

| 08232011  | survey_list_20110823.csv | this_is@anemailaddress.com | MEM1   | FT93  | This is a jobs name | hjsimpson | How satisfied were you with the overall experience of working with Company XX? |      4 | The comments could potentially contain any characters they wanted so I am not going to be able to use the | character to delimit them.                                                                                                                         |

+-----------+--------------------------+----------------------------+--------+-------+---------------------+-----------+--------------------------------------------------------------------------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

;;;;

   run;

data want;

  infile FT15F001 firstobs=4;

  %include incfile;

  if substr(list_name,1,1) ne "-";

run;

Super User
Posts: 9,682

Re: Converting Raw Files with Dynamically Calculated Layout

Art'T Why not use & modify to neglect the first delimiter in comment?

data want (drop=deleteme);

  infile 'C:\Art\fetext.txt' delimiter = '|'

   MISSOVER DSD lrecl=32767 firstobs=2 ;

  informat deleteme $1. ;

  informat list_name best32. ;

  informat original_filename $24. ;

  informat email_addr $26. ;

  informat member $4. ;

  informat jobno $4. ;

  informat name $19. ;

  informat ae $9. ;

  informat content $87. ;

  informat choice best32. ;

  format list_name best12. ;

  format original_filename $24. ;

  format email_addr $26. ;

  format member $4. ;

  format jobno $4. ;

  format name $19. ;

  format ae $9. ;

  format content $87. ;

  format choice best12. ;

    input

                deleteme $

                list_name

                original_filename $

                email_addr $

                member $

                jobno $

                name $

                ae $

                content $

                choice

                comment & $330.

    ;

  if not missing(list_name);

run;

Ksharp

Ask a Question
Discussion stats
  • 8 replies
  • 276 views
  • 0 likes
  • 4 in conversation