BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
serge68
Calcite | Level 5

I'm parsing data from a txt file and formatting into a csv. Part of that, is to combine multiple observations into a single row. There is no common identifier between the two rows, so I'm trying to use the difference in the number of fields present(https://communities.sas.com/t5/General-SAS-Programming/Merging-observations-into-a-single-one/m-p/41... )

 

Some of the lines I'm looking to output, have some of their fields missing(those prefixed with NODE or INT in my code). Which is fine, as I don't need them for the final csv file, but I can't seem to work around how to handle that missing data.

 

If both NODE and the INT fields are missing, it returns data, however if just NODE is missing, that line doesn't get returned.

 

Its also truncating the leading spaces caused by these missing fields, throwing my other fields out of whack. I've tried using $CHAR132; in my length statement, but get his error –

   ERROR 391-185: Expecting a variable length specification.                   

   ERROR 202-322: The option or parameter is not recognized and will be ignored.

 

 

Ultimately my output csv should look like –

Dataset,  HeldNAME, HeldNUM, HeldDate, HeldTIME, HeldJESID, HeldFROM, HeldFOR, HolderNAME, HolderJESID, HolderAT        

 

With the Dataset value being the third value(i.e. the ??? in this example) on the line that begins with "DATASET  NAME:   ???"

 

This is where I'm at thus far -

 

data File1;                                                    
retain dataset;                                                
infile IFILE missover;                                         
length NODE IntDATE IntFROM IntFOR     
       HeldNAME HeldNUM HeldDate HeldTIME HeldJESID            
       HeldFROM HeldFOR HeldF HeldS HeldC                      
       HolderNAME HolderT HolderJESID HolderH HolderAT        
       $132;                                                   
                                                                
input NODE -- HolderAT;                                        
if find(IntDATE,':','i') ge 1 OR                               
   NODE = '--NODE--'          OR                               
   NODE = 'NAME  '            OR                                
   NODE = '--------'          then delete;                     
 
if missing(IntFOR) then do;         
    dataset = IntFROM;              
    end;                            
else output;                        
                                     
                                    
proc print;      

                    

 

 

Attached is a sample of my input file.

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

I think you're going about this in a difficult way.  There should be no need to assign a length or informat of $132 to every variable. In the example below, I've assigned an INFORMAT specific to each variable.

 

And you can use a naked INPUT statement to populate the _INFILE_ automatic variable.  It has a trailing @, for re-reading purposes.  Items 1 and 2 below is  a bit like your  logic, but the rest basically does conditional reading of subgroups of variables.

 

  1. if the line is blank, or characters 2 through 9 contain '  NAME  ','--NODE--', or '--------' then delete the input record
  2. If the input line has the string "DATASET NAME:" then input @ 'DATASET NAME:' DSNAME;   Then delete the input record
  3. If it otherwise has non-blanks in characters 1-9, read NODE
  4. If it has non-blanks in characters 1-20, read INTDATE,  INTFROM, INTFOR
  5. Then read all the HELD... and HOLDER... variables.

 

data want;
  infile 'c:\temp\t.txt' firstobs=2 truncover;

  informat dsname $40.  node $8. intdate $5. intfrom hhmmss8. intfor stimer6.
       HeldNAME $8. HeldNUM 5.0 HeldDate $5. HeldTIME time8.0 HeldJESID $8.           
       HeldFROM time8.0 HeldFOR stimer6. HeldF HeldS HeldC $1.                      
       HolderNAME $8. HolderT $1. HolderJESID $8. HolderH $1. HolderAT time8.0
  ;
  retain dsname  -- intfor;
  input @;
  if _infile_=' ' then delete;
  else if substr(_infile_,2,8) in ('--NODE--','  NAME  ','--------') then delete;

  else if _infile_=: ' DATASET NAME:' then do;
   call missing (of dsname--intfor);
   input @'DATASET NAME:' dsname $40.;
   delete;
  end;

  if not (_infile_=:'         ') then input @2 node @;
  if not (_infile_=:'                    ') then input intdate intfrom intfor @;
  input heldname--holderat;

  format intfrom time8.0 intfor mmss5.0
         heldtime time8.0 heldfrom time8.0 heldfor mmss5.0
         holderat time8.0;
run;

 

 

Notes:

  1. I use trailing @ signs, to hold the _INFILE_ variable for re-reading.  This allows the conditional reading of variables, depending on the structure of the data line.
  2. Because an obs is constructed from a variable number of lines, there is a retain statement for all the variables that are not read in with every data line (i.e.  dsname--intfor).
  3. Because of the retain statement, there is a call missing(of dsname--intfor) every time a new dataset name is encountered.

 

If you really want a CSV file, then set up a FILE statement and a PUT statement at the bottom of the data step. Also you can change DATA WANT to DATA _NULL_.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

7 REPLIES 7
Shmuel
Garnet | Level 18

Considering your note:

I've tried using $CHAR132; in my length statement ...

pay attention, you shall define:

length var_name  $132;
format var_name $132. ;
/* informat var_name $132. ; ??? */

As to your code, you can read just first 8 charachers to test  it and continue according to result:

 

data File1; 
  retain dataset;
  infile IFILE missover;                                         
  length NODE IntDATE IntFROM IntFOR     
       HeldNAME HeldNUM HeldDate HeldTIME HeldJESID            
       HeldFROM HeldFOR HeldF HeldS HeldC                      
       HolderNAME HolderT HolderJESID HolderH HolderAT        
       $132; 

   input @2 test $8.  @;  /* read first 8 characters and keep stay on line */
      
  if test  in ('--NODE--'  'NAME  ' '--------') then input; /* skip line */

  if test = 'DATASET ' then 
     input   @14 dataset $132. ;    
  else do;
     input @1 test $1. @2 NODE -- HolderAT;
     output;
  end;                                     
  drop test;
run;                   

 You may need to refine my code, as I have not tested it.

serge68
Calcite | Level 5

Some progress, thanks. Though its is picking up lines it shouldn't and I can't see why.

 

This statement seems to not work for the '  NAME  '. Its not skipping this line.

 

if test  in ('--NODE--'  '  NAME  ' '--------') then input; /* skip line */

 

I've tried playing around with different variations of the spaces within it, and yet it always gets included when it shouldn't.

 

 

 Thanks

Shmuel
Garnet | Level 18

Maybe you need change line to:

input @1 test $8.  @;     /* instead @2 */

The text will be left justified.

and change line:

if test  in ('--NODE--'  'NAME    ' '--------') then input; /* skip line */

where 'NAME    ' - is left justified, up to 8 characters.

 

You need also retain NODE INTDATE INTFROM INTFOR variables as they exist only on first data line.

 

 

mkeintz
PROC Star

I think you're going about this in a difficult way.  There should be no need to assign a length or informat of $132 to every variable. In the example below, I've assigned an INFORMAT specific to each variable.

 

And you can use a naked INPUT statement to populate the _INFILE_ automatic variable.  It has a trailing @, for re-reading purposes.  Items 1 and 2 below is  a bit like your  logic, but the rest basically does conditional reading of subgroups of variables.

 

  1. if the line is blank, or characters 2 through 9 contain '  NAME  ','--NODE--', or '--------' then delete the input record
  2. If the input line has the string "DATASET NAME:" then input @ 'DATASET NAME:' DSNAME;   Then delete the input record
  3. If it otherwise has non-blanks in characters 1-9, read NODE
  4. If it has non-blanks in characters 1-20, read INTDATE,  INTFROM, INTFOR
  5. Then read all the HELD... and HOLDER... variables.

 

data want;
  infile 'c:\temp\t.txt' firstobs=2 truncover;

  informat dsname $40.  node $8. intdate $5. intfrom hhmmss8. intfor stimer6.
       HeldNAME $8. HeldNUM 5.0 HeldDate $5. HeldTIME time8.0 HeldJESID $8.           
       HeldFROM time8.0 HeldFOR stimer6. HeldF HeldS HeldC $1.                      
       HolderNAME $8. HolderT $1. HolderJESID $8. HolderH $1. HolderAT time8.0
  ;
  retain dsname  -- intfor;
  input @;
  if _infile_=' ' then delete;
  else if substr(_infile_,2,8) in ('--NODE--','  NAME  ','--------') then delete;

  else if _infile_=: ' DATASET NAME:' then do;
   call missing (of dsname--intfor);
   input @'DATASET NAME:' dsname $40.;
   delete;
  end;

  if not (_infile_=:'         ') then input @2 node @;
  if not (_infile_=:'                    ') then input intdate intfrom intfor @;
  input heldname--holderat;

  format intfrom time8.0 intfor mmss5.0
         heldtime time8.0 heldfrom time8.0 heldfor mmss5.0
         holderat time8.0;
run;

 

 

Notes:

  1. I use trailing @ signs, to hold the _INFILE_ variable for re-reading.  This allows the conditional reading of variables, depending on the structure of the data line.
  2. Because an obs is constructed from a variable number of lines, there is a retain statement for all the variables that are not read in with every data line (i.e.  dsname--intfor).
  3. Because of the retain statement, there is a call missing(of dsname--intfor) every time a new dataset name is encountered.

 

If you really want a CSV file, then set up a FILE statement and a PUT statement at the bottom of the data step. Also you can change DATA WANT to DATA _NULL_.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
serge68
Calcite | Level 5

This is great. It pretty much gives me what I'm after. Just one thing I didn't account for in the example I provided. There are instances where only the HOLDER values are present(i.e. there could be multiple HOLDER values for a single HELD value).

 

These are getting displayed with incorrect values. Not sure how I can account for that?

 

I've attached a new sample input file which reflects this(Contentions02.txt). Doing a find on JAST0064 takes you to a line I'm referring to.

 

 

 

Thanks….

mkeintz
PROC Star

Apply the same principles I described earlier.

 

  1. Because variables HELDNAME--HELDC are sometime inherited from the prior line, they need to be RETAINed.
  2. If columns 35-95 are not blank, then input heldname--heldc
  3. And the last input statement now should only include the HOLDER... variables.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
serge68
Calcite | Level 5

Great. This is exactly what I was after. Appreciate all your assistance.

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 900 views
  • 0 likes
  • 3 in conversation