Re: Parsing column names and corresponding rows from a messy text file

emssas · Posted 10-19-2021 06:21 PM

I am a relatively new SAS programmer who is using SAS Enterprise Guide to import a messy text file that contains unstructured data. I need help creating a data step that splits the data into appropriate columns and corresponding rows of data. The actual file contains many more columns and over 1 million rows of data. The good news is that I only need 15 columns of data. The bad news is that some of the columns and rows are listed in another column's row values.

Raw Data file:

6=FIX.3.3; 5=1057; 25=8; 24=798; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:38:16.753; 
24=802; 5=1079; 25=8; 6=FIX.3.3; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:39:50.053; 115=XPAS; 
6=FIX.3.3; 5=1095; 25=8; 24=803; 55=RBCC_MID_PROD; 56=TOP_MID_PROD; 33=N; 52=20210915-12:39:50.139; 115=XNAS; 1133=G; 
24=837; 5=1192; 25=8; 95=stiev; 6=FIX.3.3; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:56:23.691; 115=XNYS; 1133=G; 8101=CARE; 
6=FIX.3.3; 5=1189; 25=8; 95=mwonka; 24=844; 55=RBCC_TOP_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:59:36.825; 115=XPAS; 1133=G; 8101=CARE; 8115=AOZ;

SAS Code:

proc import
    file="...\SAS_Sample.log"
    out=messy_data
    dbms=tab
    replace;
    delimiter=";";
run;

data clean_data;
    set messy_data;
    **Need help here**

Here is the SAS output:

Here is what I need the SAS data to look like:

Any help would be greatly appreciated!!!

ballardw · Posted 10-19-2021 08:45 PM

Proc Import as you have discovered will not read this.

Data step code will be needed. Unfortunately your values have things that are not valid variable names, i.e. 6=. 6 is not a valid name or else we could use an uncommon option called Named input.

You can create variables that you would have to name and reference everwhere as "6"n but I strongly suggest not doing so. One mistyped quote and you can generate a lot of errors; a missing n or a space between the quote and the n and another set of errors.

Do you have any documentation for this file? Such as lengths of variables, just how many of those 6= key values there are? Is any of those keys going to present on every single row?

emssas · Posted 10-20-2021 01:34 PM

Super User Ballard,

Thanks for the response. There are over 1 million of the "6=" variables in the original data set. All of the variable names begin with "X=" where X is an integer (number). I only need to use 15 specific variables (columns) for my analysis. The tricky part is that not all columns are populated for every row (transaction). I am thinking the scan() function may need to be utilized.

The length and values for each variable differs, but it is fine to use character values for all columns as the numbers have meaning behind them via a different metadata file.

Tom · Posted 10-19-2021 11:53 PM

Not hard to read. Making sense of it would require a better description of what it means.

data tall;
  infile csv dsd dlm=';' truncover;
  length row col number 8 string $40 ;
  row+1;
  do col=1 by 1 ;
    input string @;
    if string=' ' then leave;
    number=input(scan(string,1,'='),32.);
    string=scan(string,2,'=');
    output;
   end;
run;

Partial results

Obs    row    col    number    string

  1     1       1        6     FIX.3.3
  2     1       2        5     1057
  3     1       3       25     8
  4     1       4       24     798
  5     1       5       55     RBCC_MID_PROD
  6     1       6       56     TOP_MUD_PROD
  7     1       7       33     N
  8     1       8       52     20210915-12:38:16.753
  9     2       1       24     802
 10     2       2        5     1079
 11     2       3       25     8
 12     2       4        6     FIX.3.3
 13     2       5       55     RBCC_MID_PROD
 14     2       6       56     TOP_MUD_PROD
 15     2       7       33     N
 16     2       8       52     20210915-12:39:50.053
 17     2       9      115     XPAS

What is the meaning of the rows?

Does the order on the line have any meaning?

What is the meaning of the numbers in front of the equal signs?
What is the meaning of the text after the equal signs?

Do you want to reduce it back to the same number of rows?

proc sort; by row number col ; run;
proc transpose data=tall out=wide(drop=_name_) prefix=VAR;
  by row;
  id number;
  var string;
run;
proc print data=wide;
run;

Results

Obs  row  VAR5   VAR6    VAR24  VAR25  VAR33          VAR52              VAR55         VAR56      VAR115  VAR1133  VAR95   VAR8101  VAR8115

 1    1   1057  FIX.3.3   798     8      N    20210915-12:38:16.753  RBCC_MID_PROD  TOP_MUD_PROD
 2    2   1079  FIX.3.3   802     8      N    20210915-12:39:50.053  RBCC_MID_PROD  TOP_MUD_PROD   XPAS
 3    3   1095  FIX.3.3   803     8      N    20210915-12:39:50.139  RBCC_MID_PROD  TOP_MID_PROD   XNAS      G
 4    4   1192  FIX.3.3   837     8      N    20210915-12:56:23.691  RBCC_MID_PROD  TOP_MUD_PROD   XNYS      G     stiev    CARE
 5    5   1189  FIX.3.3   844     8      N    20210915-12:59:36.825  RBCC_TOP_PROD  TOP_MUD_PROD   XPAS      G     mwonka   CARE      AOZ

emssas · Posted 10-20-2021 03:33 PM

Super User Tom,

Thank you for the response. Here are the answers to your questions:

What is the meaning of the rows?

For example, "6=FIX3.3" should read 6 as the column name and FIX3.3 as the row value.

Does the order on the line have any meaning?

Order does not have any meaning.

What is the meaning of the numbers in front of the equal signs?

The meaning of the numbers in front of the equal signs is merely the column name.

What is the meaning of the text after the equal signs?

The meaning of the text after the equal signs is the row value.

Do you want to reduce it back to the same number of rows?

Yes, I would like to reduce it back to the same number of rows.

Thanks for your help!

Ksharp · Posted 10-20-2021 07:46 AM

options validvarname=any;
data want;
infile cards dlm='  ;' truncover;
input 
'6'n= $100. 
'5'n= $100.
'25'n= $100.
'95'n= $100.
'24'n= $100.
'55'n= $100.
'56'n= $100.
'33'n= $100.
'52'n= $100.
'115'n= $100.
'1133'n= $100.
'8101'n= $100.
'8115'n= $100.
;
cards4;
6=FIX.3.3; 5=1057; 25=8; 24=798; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:38:16.753; 
24=802; 5=1079; 25=8; 6=FIX.3.3; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:39:50.053; 115=XPAS; 
6=FIX.3.3; 5=1095; 25=8; 24=803; 55=RBCC_MID_PROD; 56=TOP_MID_PROD; 33=N; 52=20210915-12:39:50.139; 115=XNAS; 1133=G; 
24=837; 5=1192; 25=8; 95=stiev; 6=FIX.3.3; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:56:23.691; 115=XNYS; 1133=G; 8101=CARE; 
6=FIX.3.3; 5=1189; 25=8; 95=mwonka; 24=844; 55=RBCC_TOP_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:59:36.825; 115=XPAS; 1133=G; 8101=CARE; 8115=AOZ; 
;;;;

emssas · Posted 10-20-2021 03:36 PM

Super User Ksharp,

Thank you for the response. Your code is great. However, the actual data set contains over 1 million rows of data. Therefore, manually inputting cards or datalines is not feasible. Do you have any suggestions for me?

Thanks for your help!

Tom · Posted 10-20-2021 03:50 PM

DATALINES (also known as CARDS) are used here to share examples of data to demonstrate the code.

To read from an existing text file use the INFILE statement and replace the DATALINES; statement and the lines of data with a RUN: statement instead.

data want;
   infile 'myfile.txt' .... ;
   ...
run;

If the following things are true:

None of the values contain either the = or the ; character
None of the values are totally empty (only spaces)

Then you can read the NAME/VALUE pairs directly by using both = and ; as delmiters.

Read the values into a tall structure with NAME as one variable and VALUE as another.

You can even add an IF statement to only output the NAME values you want to use in this dataset.

data tall;
  infile 'myfile.xt' dlm='=;' truncover ;
  length row 8 name $8 value $50 ;
  row + 1;
  do until(name=' ');
    input name value @ ;
    if name in ('5','6','234') then output;
  end;
run;

Now you can use TRANSPOSE to convert it back into one observation (if you need that).

proc transpose data=tall out=wide(drop=_name_) prefix=var;
  by row;
  id name;
  var value;
run;

You might want to put use a WHERE statement in the PROC TRANSPOSE step to do the filtering of variables instead of using the IF/THEN in the first step. Then you can keep all of the data in the TALL dataset and create different subsets from it for other analyses later without having to go back to the raw text file.

Ksharp · Posted 10-21-2021 07:47 AM

As @Tom point out using INFILE statement to replace CARDS .
If you need all the column name ,you need go through this TXT file and using a macro variable to hold them .

Kurt_Bremser · Posted 10-21-2021 08:12 AM

Don't use non-standard names that force you to always use name literals; an erroneous omitting of the ''n can lead to funny results
Since you need to parse the input anyway, take the opportunity to set correct attributes of variables immediately

data want;
infile datalines dlm=';' truncover ;
length
  item $100
  name $5
  value $95
  var5 $50
  var6 $50
  var52 8
;
format
  var52 e8601dt23.3
;
input item @;
do while (item ne " ");
  name = scan(item,1,"=");
  value = scan(item,2,"=");
  select (name);
    when ("5") var5 = value;
    when ("6") var6 = value;
    when ("52") var52 = dhms(input(scan(value,1,"-"),yymmdd8.),0,0,input(scan(value,2,"-"),time12.));
    otherwise;
  end;
  input item @;
end;
drop item name value;
datalines4;
6=FIX.3.3; 5=1057; 25=8; 24=798; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:38:16.753; 
24=802; 5=1079; 25=8; 6=FIX.3.3; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:39:50.053; 115=XPAS; 
6=FIX.3.3; 5=1095; 25=8; 24=803; 55=RBCC_MID_PROD; 56=TOP_MID_PROD; 33=N; 52=20210915-12:39:50.139; 115=XNAS; 1133=G; 
24=837; 5=1192; 25=8; 95=stiev; 6=FIX.3.3; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:56:23.691; 115=XNYS; 1133=G; 8101=CARE; 
6=FIX.3.3; 5=1189; 25=8; 95=mwonka; 24=844; 55=RBCC_TOP_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:59:36.825; 115=XPAS; 1133=G; 8101=CARE; 8115=AOZ;
;;;;

Remove the DATALINES block, and replace the DATALINES keyword in the INFILE statement with the path to your file.

Add additional definitions (LENGTH, FORMAT) and entries in the SELECT as needed.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

emssas · Posted 10-28-2021 10:52 AM

Kurt,

@Kurt_Bremser Thanks for the reply. I used your code, which worked very well on my small dataset. However, when I run the code on a text (.log) file that contains 10,500+ rows of data, I get an empty data set. Here is the code:

data fixerdata;
infile '\\My Documents\prod.log' dlm=';' truncover;
length
    item $100
    name $5
    value $95
    var8001 $20
    var17 $30
    var11106 $2
    var54 $4
    var9713 $6
    var60 8
    var32 $4
    var120 $4
    var31 6.11
    var15 $4
   var8147 $100
   var8100 $15
   var55 $5
   var65 $10
;
format
   var60 e8601dt25
;
input item @;
do while (item ne " ");
    name = scan(item,1,"=");
    value = scan(item,2,"=");
    select (name);
        when ("8001") var8001 = value;
        when ("17") var17 = value;
        when ("11106") var11106 = value;
        when ("54") var54 = value;
        when ("9713") var9713 = value;
        when ("60") var60 = dhms(input(scan(value,1,"-"),yymmdd8.),0,0,input(scan(value,2,"-"),time12.));
        when ("32") var32 = value;
        when ("120") var120 = value;
        when ("31") var31 = value;
        when ("15") var15 = value;
        when ("8147") var8147 = value;
        when ("8100") var8100 = value;
        when ("55") var55 = value;
        when ("65") var65 = value;
        otherwise;
    end;
    input item @;
end;
drop item name value;
run;

After I run the above code, I get this output in SAS Enterprise Guide 7.1:

NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR

NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
      63:22   
NOTE: Variable e8601dt25 is uninitialized.
NOTE: The infile '\PROD.log' is:
      Filename=\\PROD.log,
      RECFM=V,LRECL=32767,File Size (bytes)=10750433,
      Last Modified=25Oct2021:11:11:42,
      Create Time=27Oct2021:14:23:26

NOTE: 10567 records were read from the infile '\\My Documents\PROD.log'.
      The minimum record length was 90.
      The maximum record length was 2055.
NOTE: The data set WORK.FIXERDATA has 10567 observations and 14 variables.
NOTE: DATA statement used (Total process time):
      real time           0.05 seconds
      cpu time            0.04 seconds

Please advise!

Tom · Posted 10-28-2021 11:00 AM

The SAS log will show more information. It should show the number of lines read from the text file and the length of the longest and shortest. From your output looks like there is only one line and it did not contain any of those key names.

emssas · Posted 10-28-2021 11:11 AM

Tom,

Here is the log file info:

NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
      63:22   
NOTE: Variable e8601dt25 is uninitialized.
NOTE: The infile '\\\My Documents\PROD.log' is:
      Filename=\\My Documents\PROD.log,
      RECFM=V,LRECL=32767,File Size (bytes)=10750433,
      Last Modified=25Oct2021:11:11:42,
      Create Time=27Oct2021:14:23:26

NOTE: 10567 records were read from the infile '\\\My Documents\PROD.log'.
      The minimum record length was 90.
      The maximum record length was 2055.
NOTE: The data set WORK.FIXERDATA has 10567 observations and 14 variables.
NOTE: DATA statement used (Total process time):
      real time           0.05 seconds
      cpu time            0.04 seconds

Kurt_Bremser · Posted 10-28-2021 11:45 AM

Modify the code for testing:

data fixerdata;
infile '\\My Documents\prod.log' dlm=';' truncover obs=10;
length
    item $100
    name $5
    value $95
    var8001 $20
    var17 $30
    var11106 $2
    var54 $4
    var9713 $6
    var60 8
    var32 $4
    var120 $4
    var31 6.11
    var15 $4
   var8147 $100
   var8100 $15
   var55 $5
   var65 $10
;
format
   var60 e8601dt25.
;
input item @;
put _infile_;
do while (item ne " ");
    name = scan(item,1,"=");
    value = scan(item,2,"=");
    select (name);
        when ("8001") var8001 = value;
        when ("17") var17 = value;
        when ("11106") var11106 = value;
        when ("54") var54 = value;
        when ("9713") var9713 = value;
        when ("60") var60 = dhms(input(scan(value,1,"-"),yymmdd8.),0,0,input(scan(value,2,"-"),time12.));
        when ("32") var32 = value;
        when ("120") var120 = value;
        when ("31") var31 = value;
        when ("15") var15 = value;
        when ("8147") var8147 = value;
        when ("8100") var8100 = value;
        when ("55") var55 = value;
        when ("65") var65 = value;
        otherwise;
    end;
    input item @;
end;
drop item name value;
run;

This will put your first 10 lines to the log.

Also note the dot I added in the FORMAT statement. The missing dot was the cause for the "uninitialized" message.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

emssas · Posted 10-28-2021 12:25 PM

Kurt,

Thanks. I ran the modified code and still am seeing blank values in the 10 rows. Here is the log file with no errors and no warnings. Here are the six Notes from the log:

NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR

NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
      63:22   

NOTE: The infile '\My Documents\PROD.log' is:
      Filename=\\My Documents\PROD.log,
      RECFM=V,LRECL=32767,File Size (bytes)=10750433,
      Last Modified=25Oct2021:11:11:42,
      Create Time=27Oct2021:14:23:26

NOTE: 10 records were read from the infile '\\My Documents\PROD.log'.
      The minimum record length was 90.
      The maximum record length was 119.

NOTE: The data set WORK.FIXERDATA has 10 observations and 14 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.03 seconds

Here is a sample of what one line of input data (.log) file looks like in Notepad:

15=USD 17=00004885118TRTR1.1.1 31=779.530000 32=100 54=1 55=LML 60=20210414-14:47:46 65= 120=USD 8001=EXEC_FEW 8100= 8147=Xxxxxxxx.Xxxxxxx@XXXX.XXX 9713=815215 11106=40

In the input data, the 65= and 8100= have blank values and need to populate in the SAS output data as having blank values.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!