Merging Obervations

Accepted Solution Solved
Reply
Contributor
Posts: 20
Accepted Solution

Merging Obervations

I'm parsing data from a txt file and formatting into a csv. Part of that, is to combine multiple observations into a single row. There is no common identifier between the two rows, so I'm trying to use the difference in the number of fields present(https://communities.sas.com/t5/General-SAS-Programming/Merging-observations-into-a-single-one/m-p/41... )

 

Some of the lines I'm looking to output, have some of their fields missing(those prefixed with NODE or INT in my code). Which is fine, as I don't need them for the final csv file, but I can't seem to work around how to handle that missing data.

 

If both NODE and the INT fields are missing, it returns data, however if just NODE is missing, that line doesn't get returned.

 

Its also truncating the leading spaces caused by these missing fields, throwing my other fields out of whack. I've tried using $CHAR132; in my length statement, but get his error –

   ERROR 391-185: Expecting a variable length specification.                   

   ERROR 202-322: The option or parameter is not recognized and will be ignored.

 

 

Ultimately my output csv should look like –

Dataset,  HeldNAME, HeldNUM, HeldDate, HeldTIME, HeldJESID, HeldFROM, HeldFOR, HolderNAME, HolderJESID, HolderAT        

 

With the Dataset value being the third value(i.e. the ??? in this example) on the line that begins with "DATASET  NAME:   ???"

 

This is where I'm at thus far -

 

data File1;                                                    
retain dataset;                                                
infile IFILE missover;                                         
length NODE IntDATE IntFROM IntFOR     
       HeldNAME HeldNUM HeldDate HeldTIME HeldJESID            
       HeldFROM HeldFOR HeldF HeldS HeldC                      
       HolderNAME HolderT HolderJESID HolderH HolderAT        
       $132;                                                   
                                                                
input NODE -- HolderAT;                                        
if find(IntDATE,':','i') ge 1 OR                               
   NODE = '--NODE--'          OR                               
   NODE = 'NAME  '            OR                                
   NODE = '--------'          then delete;                     
 
if missing(IntFOR) then do;         
    dataset = IntFROM;              
    end;                            
else output;                        
                                     
                                    
proc print;      

                    

 

 

Attached is a sample of my input file.


Accepted Solutions
Solution
‎11-13-2017 10:53 AM
Trusted Advisor
Posts: 1,139

Re: Merging Obervations

[ Edited ]

I think you're going about this in a difficult way.  There should be no need to assign a length or informat of $132 to every variable. In the example below, I've assigned an INFORMAT specific to each variable.

 

And you can use a naked INPUT statement to populate the _INFILE_ automatic variable.  It has a trailing @, for re-reading purposes.  Items 1 and 2 below is  a bit like your  logic, but the rest basically does conditional reading of subgroups of variables.

 

  1. if the line is blank, or characters 2 through 9 contain '  NAME  ','--NODE--', or '--------' then delete the input record
  2. If the input line has the string "DATASET NAME:" then input @ 'DATASET NAME:' DSNAME;   Then delete the input record
  3. If it otherwise has non-blanks in characters 1-9, read NODE
  4. If it has non-blanks in characters 1-20, read INTDATE,  INTFROM, INTFOR
  5. Then read all the HELD... and HOLDER... variables.

 

data want;
  infile 'c:\temp\t.txt' firstobs=2 truncover;

  informat dsname $40.  node $8. intdate $5. intfrom hhmmss8. intfor stimer6.
       HeldNAME $8. HeldNUM 5.0 HeldDate $5. HeldTIME time8.0 HeldJESID $8.           
       HeldFROM time8.0 HeldFOR stimer6. HeldF HeldS HeldC $1.                      
       HolderNAME $8. HolderT $1. HolderJESID $8. HolderH $1. HolderAT time8.0
  ;
  retain dsname  -- intfor;
  input @;
  if _infile_=' ' then delete;
  else if substr(_infile_,2,8) in ('--NODE--','  NAME  ','--------') then delete;

  else if _infile_=: ' DATASET NAME:' then do;
   call missing (of dsname--intfor);
   input @'DATASET NAME:' dsname $40.;
   delete;
  end;

  if not (_infile_=:'         ') then input @2 node @;
  if not (_infile_=:'                    ') then input intdate intfrom intfor @;
  input heldname--holderat;

  format intfrom time8.0 intfor mmss5.0
         heldtime time8.0 heldfrom time8.0 heldfor mmss5.0
         holderat time8.0;
run;

 

 

Notes:

  1. I use trailing @ signs, to hold the _INFILE_ variable for re-reading.  This allows the conditional reading of variables, depending on the structure of the data line.
  2. Because an obs is constructed from a variable number of lines, there is a retain statement for all the variables that are not read in with every data line (i.e.  dsname--intfor).
  3. Because of the retain statement, there is a call missing(of dsname--intfor) every time a new dataset name is encountered.

 

If you really want a CSV file, then set up a FILE statement and a PUT statement at the bottom of the data step. Also you can change DATA WANT to DATA _NULL_.

View solution in original post


All Replies
Trusted Advisor
Posts: 1,683

Re: Merging Obervations

[ Edited ]

Considering your note:

I've tried using $CHAR132; in my length statement ...

pay attention, you shall define:

length var_name  $132;
format var_name $132. ;
/* informat var_name $132. ; ??? */

As to your code, you can read just first 8 charachers to test  it and continue according to result:

 

data File1; 
  retain dataset;
  infile IFILE missover;                                         
  length NODE IntDATE IntFROM IntFOR     
       HeldNAME HeldNUM HeldDate HeldTIME HeldJESID            
       HeldFROM HeldFOR HeldF HeldS HeldC                      
       HolderNAME HolderT HolderJESID HolderH HolderAT        
       $132; 

   input @2 test $8.  @;  /* read first 8 characters and keep stay on line */
      
  if test  in ('--NODE--'  'NAME  ' '--------') then input; /* skip line */

  if test = 'DATASET ' then 
     input   @14 dataset $132. ;    
  else do;
     input @1 test $1. @2 NODE -- HolderAT;
     output;
  end;                                     
  drop test;
run;                   

 You may need to refine my code, as I have not tested it.

Contributor
Posts: 20

Re: Merging Obervations

Some progress, thanks. Though its is picking up lines it shouldn't and I can't see why.

 

This statement seems to not work for the '  NAME  '. Its not skipping this line.

 

if test  in ('--NODE--'  '  NAME  ' '--------') then input; /* skip line */

 

I've tried playing around with different variations of the spaces within it, and yet it always gets included when it shouldn't.

 

 

 Thanks

Trusted Advisor
Posts: 1,683

Re: Merging Obervations

Maybe you need change line to:

input @1 test $8.  @;     /* instead @2 */

The text will be left justified.

and change line:

if test  in ('--NODE--'  'NAME    ' '--------') then input; /* skip line */

where 'NAME    ' - is left justified, up to 8 characters.

 

You need also retain NODE INTDATE INTFROM INTFOR variables as they exist only on first data line.

 

 

Solution
‎11-13-2017 10:53 AM
Trusted Advisor
Posts: 1,139

Re: Merging Obervations

[ Edited ]

I think you're going about this in a difficult way.  There should be no need to assign a length or informat of $132 to every variable. In the example below, I've assigned an INFORMAT specific to each variable.

 

And you can use a naked INPUT statement to populate the _INFILE_ automatic variable.  It has a trailing @, for re-reading purposes.  Items 1 and 2 below is  a bit like your  logic, but the rest basically does conditional reading of subgroups of variables.

 

  1. if the line is blank, or characters 2 through 9 contain '  NAME  ','--NODE--', or '--------' then delete the input record
  2. If the input line has the string "DATASET NAME:" then input @ 'DATASET NAME:' DSNAME;   Then delete the input record
  3. If it otherwise has non-blanks in characters 1-9, read NODE
  4. If it has non-blanks in characters 1-20, read INTDATE,  INTFROM, INTFOR
  5. Then read all the HELD... and HOLDER... variables.

 

data want;
  infile 'c:\temp\t.txt' firstobs=2 truncover;

  informat dsname $40.  node $8. intdate $5. intfrom hhmmss8. intfor stimer6.
       HeldNAME $8. HeldNUM 5.0 HeldDate $5. HeldTIME time8.0 HeldJESID $8.           
       HeldFROM time8.0 HeldFOR stimer6. HeldF HeldS HeldC $1.                      
       HolderNAME $8. HolderT $1. HolderJESID $8. HolderH $1. HolderAT time8.0
  ;
  retain dsname  -- intfor;
  input @;
  if _infile_=' ' then delete;
  else if substr(_infile_,2,8) in ('--NODE--','  NAME  ','--------') then delete;

  else if _infile_=: ' DATASET NAME:' then do;
   call missing (of dsname--intfor);
   input @'DATASET NAME:' dsname $40.;
   delete;
  end;

  if not (_infile_=:'         ') then input @2 node @;
  if not (_infile_=:'                    ') then input intdate intfrom intfor @;
  input heldname--holderat;

  format intfrom time8.0 intfor mmss5.0
         heldtime time8.0 heldfrom time8.0 heldfor mmss5.0
         holderat time8.0;
run;

 

 

Notes:

  1. I use trailing @ signs, to hold the _INFILE_ variable for re-reading.  This allows the conditional reading of variables, depending on the structure of the data line.
  2. Because an obs is constructed from a variable number of lines, there is a retain statement for all the variables that are not read in with every data line (i.e.  dsname--intfor).
  3. Because of the retain statement, there is a call missing(of dsname--intfor) every time a new dataset name is encountered.

 

If you really want a CSV file, then set up a FILE statement and a PUT statement at the bottom of the data step. Also you can change DATA WANT to DATA _NULL_.

Contributor
Posts: 20

Re: Merging Obervations

This is great. It pretty much gives me what I'm after. Just one thing I didn't account for in the example I provided. There are instances where only the HOLDER values are present(i.e. there could be multiple HOLDER values for a single HELD value).

 

These are getting displayed with incorrect values. Not sure how I can account for that?

 

I've attached a new sample input file which reflects this(Contentions02.txt). Doing a find on JAST0064 takes you to a line I'm referring to.

 

 

 

Thanks….

Trusted Advisor
Posts: 1,139

Re: Merging Obervations

Apply the same principles I described earlier.

 

  1. Because variables HELDNAME--HELDC are sometime inherited from the prior line, they need to be RETAINed.
  2. If columns 35-95 are not blank, then input heldname--heldc
  3. And the last input statement now should only include the HOLDER... variables.
Contributor
Posts: 20

Re: Merging Obervations

Great. This is exactly what I was after. Appreciate all your assistance.

 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 180 views
  • 0 likes
  • 3 in conversation