I am a relatively new SAS programmer who is using SAS Enterprise Guide to import a messy text file that contains unstructured data. I need help creating a data step that splits the data into appropriate columns and corresponding rows of data. The actual file contains many more columns and over 1 million rows of data. The good news is that I only need 15 columns of data. The bad news is that some of the columns and rows are listed in another column's row values.
Raw Data file:
6=FIX.3.3; 5=1057; 25=8; 24=798; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:38:16.753; 24=802; 5=1079; 25=8; 6=FIX.3.3; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:39:50.053; 115=XPAS; 6=FIX.3.3; 5=1095; 25=8; 24=803; 55=RBCC_MID_PROD; 56=TOP_MID_PROD; 33=N; 52=20210915-12:39:50.139; 115=XNAS; 1133=G; 24=837; 5=1192; 25=8; 95=stiev; 6=FIX.3.3; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:56:23.691; 115=XNYS; 1133=G; 8101=CARE; 6=FIX.3.3; 5=1189; 25=8; 95=mwonka; 24=844; 55=RBCC_TOP_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:59:36.825; 115=XPAS; 1133=G; 8101=CARE; 8115=AOZ;
SAS Code:
proc import
file="...\SAS_Sample.log"
out=messy_data
dbms=tab
replace;
delimiter=";";
run;
data clean_data;
set messy_data;
**Need help here**
Here is the SAS output:
Here is what I need the SAS data to look like:
Any help would be greatly appreciated!!!
Proc Import as you have discovered will not read this.
Data step code will be needed. Unfortunately your values have things that are not valid variable names, i.e. 6=. 6 is not a valid name or else we could use an uncommon option called Named input.
You can create variables that you would have to name and reference everwhere as "6"n but I strongly suggest not doing so. One mistyped quote and you can generate a lot of errors; a missing n or a space between the quote and the n and another set of errors.
Do you have any documentation for this file? Such as lengths of variables, just how many of those 6= key values there are? Is any of those keys going to present on every single row?
Super User Ballard,
Thanks for the response. There are over 1 million of the "6=" variables in the original data set. All of the variable names begin with "X=" where X is an integer (number). I only need to use 15 specific variables (columns) for my analysis. The tricky part is that not all columns are populated for every row (transaction). I am thinking the scan() function may need to be utilized.
The length and values for each variable differs, but it is fine to use character values for all columns as the numbers have meaning behind them via a different metadata file.
Not hard to read. Making sense of it would require a better description of what it means.
data tall;
infile csv dsd dlm=';' truncover;
length row col number 8 string $40 ;
row+1;
do col=1 by 1 ;
input string @;
if string=' ' then leave;
number=input(scan(string,1,'='),32.);
string=scan(string,2,'=');
output;
end;
run;
Partial results
Obs row col number string 1 1 1 6 FIX.3.3 2 1 2 5 1057 3 1 3 25 8 4 1 4 24 798 5 1 5 55 RBCC_MID_PROD 6 1 6 56 TOP_MUD_PROD 7 1 7 33 N 8 1 8 52 20210915-12:38:16.753 9 2 1 24 802 10 2 2 5 1079 11 2 3 25 8 12 2 4 6 FIX.3.3 13 2 5 55 RBCC_MID_PROD 14 2 6 56 TOP_MUD_PROD 15 2 7 33 N 16 2 8 52 20210915-12:39:50.053 17 2 9 115 XPAS
What is the meaning of the rows?
Does the order on the line have any meaning?
What is the meaning of the numbers in front of the equal signs?
What is the meaning of the text after the equal signs?
Do you want to reduce it back to the same number of rows?
proc sort; by row number col ; run;
proc transpose data=tall out=wide(drop=_name_) prefix=VAR;
by row;
id number;
var string;
run;
proc print data=wide;
run;
Results
Obs row VAR5 VAR6 VAR24 VAR25 VAR33 VAR52 VAR55 VAR56 VAR115 VAR1133 VAR95 VAR8101 VAR8115 1 1 1057 FIX.3.3 798 8 N 20210915-12:38:16.753 RBCC_MID_PROD TOP_MUD_PROD 2 2 1079 FIX.3.3 802 8 N 20210915-12:39:50.053 RBCC_MID_PROD TOP_MUD_PROD XPAS 3 3 1095 FIX.3.3 803 8 N 20210915-12:39:50.139 RBCC_MID_PROD TOP_MID_PROD XNAS G 4 4 1192 FIX.3.3 837 8 N 20210915-12:56:23.691 RBCC_MID_PROD TOP_MUD_PROD XNYS G stiev CARE 5 5 1189 FIX.3.3 844 8 N 20210915-12:59:36.825 RBCC_TOP_PROD TOP_MUD_PROD XPAS G mwonka CARE AOZ
Super User Tom,
Thank you for the response. Here are the answers to your questions:
What is the meaning of the rows?
For example, "6=FIX3.3" should read 6 as the column name and FIX3.3 as the row value.
Does the order on the line have any meaning?
Order does not have any meaning.
What is the meaning of the numbers in front of the equal signs?
The meaning of the numbers in front of the equal signs is merely the column name.
What is the meaning of the text after the equal signs?
The meaning of the text after the equal signs is the row value.
Do you want to reduce it back to the same number of rows?
Yes, I would like to reduce it back to the same number of rows.
Thanks for your help!
options validvarname=any; data want; infile cards dlm=' ;' truncover; input '6'n= $100. '5'n= $100. '25'n= $100. '95'n= $100. '24'n= $100. '55'n= $100. '56'n= $100. '33'n= $100. '52'n= $100. '115'n= $100. '1133'n= $100. '8101'n= $100. '8115'n= $100. ; cards4; 6=FIX.3.3; 5=1057; 25=8; 24=798; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:38:16.753; 24=802; 5=1079; 25=8; 6=FIX.3.3; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:39:50.053; 115=XPAS; 6=FIX.3.3; 5=1095; 25=8; 24=803; 55=RBCC_MID_PROD; 56=TOP_MID_PROD; 33=N; 52=20210915-12:39:50.139; 115=XNAS; 1133=G; 24=837; 5=1192; 25=8; 95=stiev; 6=FIX.3.3; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:56:23.691; 115=XNYS; 1133=G; 8101=CARE; 6=FIX.3.3; 5=1189; 25=8; 95=mwonka; 24=844; 55=RBCC_TOP_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:59:36.825; 115=XPAS; 1133=G; 8101=CARE; 8115=AOZ; ;;;;
Super User Ksharp,
Thank you for the response. Your code is great. However, the actual data set contains over 1 million rows of data. Therefore, manually inputting cards or datalines is not feasible. Do you have any suggestions for me?
Thanks for your help!
DATALINES (also known as CARDS) are used here to share examples of data to demonstrate the code.
To read from an existing text file use the INFILE statement and replace the DATALINES; statement and the lines of data with a RUN: statement instead.
data want;
infile 'myfile.txt' .... ;
...
run;
If the following things are true:
Then you can read the NAME/VALUE pairs directly by using both = and ; as delmiters.
Read the values into a tall structure with NAME as one variable and VALUE as another.
You can even add an IF statement to only output the NAME values you want to use in this dataset.
data tall;
infile 'myfile.xt' dlm='=;' truncover ;
length row 8 name $8 value $50 ;
row + 1;
do until(name=' ');
input name value @ ;
if name in ('5','6','234') then output;
end;
run;
Now you can use TRANSPOSE to convert it back into one observation (if you need that).
proc transpose data=tall out=wide(drop=_name_) prefix=var;
by row;
id name;
var value;
run;
You might want to put use a WHERE statement in the PROC TRANSPOSE step to do the filtering of variables instead of using the IF/THEN in the first step. Then you can keep all of the data in the TALL dataset and create different subsets from it for other analyses later without having to go back to the raw text file.
data want;
infile datalines dlm=';' truncover ;
length
item $100
name $5
value $95
var5 $50
var6 $50
var52 8
;
format
var52 e8601dt23.3
;
input item @;
do while (item ne " ");
name = scan(item,1,"=");
value = scan(item,2,"=");
select (name);
when ("5") var5 = value;
when ("6") var6 = value;
when ("52") var52 = dhms(input(scan(value,1,"-"),yymmdd8.),0,0,input(scan(value,2,"-"),time12.));
otherwise;
end;
input item @;
end;
drop item name value;
datalines4;
6=FIX.3.3; 5=1057; 25=8; 24=798; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:38:16.753;
24=802; 5=1079; 25=8; 6=FIX.3.3; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:39:50.053; 115=XPAS;
6=FIX.3.3; 5=1095; 25=8; 24=803; 55=RBCC_MID_PROD; 56=TOP_MID_PROD; 33=N; 52=20210915-12:39:50.139; 115=XNAS; 1133=G;
24=837; 5=1192; 25=8; 95=stiev; 6=FIX.3.3; 55=RBCC_MID_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:56:23.691; 115=XNYS; 1133=G; 8101=CARE;
6=FIX.3.3; 5=1189; 25=8; 95=mwonka; 24=844; 55=RBCC_TOP_PROD; 56=TOP_MUD_PROD; 33=N; 52=20210915-12:59:36.825; 115=XPAS; 1133=G; 8101=CARE; 8115=AOZ;
;;;;
Remove the DATALINES block, and replace the DATALINES keyword in the INFILE statement with the path to your file.
Add additional definitions (LENGTH, FORMAT) and entries in the SELECT as needed.
Kurt,
@Kurt_Bremser Thanks for the reply. I used your code, which worked very well on my small dataset. However, when I run the code on a text (.log) file that contains 10,500+ rows of data, I get an empty data set. Here is the code:
data fixerdata;
infile '\\My Documents\prod.log' dlm=';' truncover;
length
item $100
name $5
value $95
var8001 $20
var17 $30
var11106 $2
var54 $4
var9713 $6
var60 8
var32 $4
var120 $4
var31 6.11
var15 $4
var8147 $100
var8100 $15
var55 $5
var65 $10
;
format
var60 e8601dt25
;
input item @;
do while (item ne " ");
name = scan(item,1,"=");
value = scan(item,2,"=");
select (name);
when ("8001") var8001 = value;
when ("17") var17 = value;
when ("11106") var11106 = value;
when ("54") var54 = value;
when ("9713") var9713 = value;
when ("60") var60 = dhms(input(scan(value,1,"-"),yymmdd8.),0,0,input(scan(value,2,"-"),time12.));
when ("32") var32 = value;
when ("120") var120 = value;
when ("31") var31 = value;
when ("15") var15 = value;
when ("8147") var8147 = value;
when ("8100") var8100 = value;
when ("55") var55 = value;
when ("65") var65 = value;
otherwise;
end;
input item @;
end;
drop item name value;
run;
After I run the above code, I get this output in SAS Enterprise Guide 7.1:
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
63:22
NOTE: Variable e8601dt25 is uninitialized.
NOTE: The infile '\PROD.log' is:
Filename=\\PROD.log,
RECFM=V,LRECL=32767,File Size (bytes)=10750433,
Last Modified=25Oct2021:11:11:42,
Create Time=27Oct2021:14:23:26
NOTE: 10567 records were read from the infile '\\My Documents\PROD.log'.
The minimum record length was 90.
The maximum record length was 2055.
NOTE: The data set WORK.FIXERDATA has 10567 observations and 14 variables.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.04 seconds
Please advise!
The SAS log will show more information. It should show the number of lines read from the text file and the length of the longest and shortest. From your output looks like there is only one line and it did not contain any of those key names.
Tom,
Here is the log file info:
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
63:22
NOTE: Variable e8601dt25 is uninitialized.
NOTE: The infile '\\\My Documents\PROD.log' is:
Filename=\\My Documents\PROD.log,
RECFM=V,LRECL=32767,File Size (bytes)=10750433,
Last Modified=25Oct2021:11:11:42,
Create Time=27Oct2021:14:23:26
NOTE: 10567 records were read from the infile '\\\My Documents\PROD.log'.
The minimum record length was 90.
The maximum record length was 2055.
NOTE: The data set WORK.FIXERDATA has 10567 observations and 14 variables.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.04 seconds
Modify the code for testing:
data fixerdata;
infile '\\My Documents\prod.log' dlm=';' truncover obs=10;
length
item $100
name $5
value $95
var8001 $20
var17 $30
var11106 $2
var54 $4
var9713 $6
var60 8
var32 $4
var120 $4
var31 6.11
var15 $4
var8147 $100
var8100 $15
var55 $5
var65 $10
;
format
var60 e8601dt25.
;
input item @;
put _infile_;
do while (item ne " ");
name = scan(item,1,"=");
value = scan(item,2,"=");
select (name);
when ("8001") var8001 = value;
when ("17") var17 = value;
when ("11106") var11106 = value;
when ("54") var54 = value;
when ("9713") var9713 = value;
when ("60") var60 = dhms(input(scan(value,1,"-"),yymmdd8.),0,0,input(scan(value,2,"-"),time12.));
when ("32") var32 = value;
when ("120") var120 = value;
when ("31") var31 = value;
when ("15") var15 = value;
when ("8147") var8147 = value;
when ("8100") var8100 = value;
when ("55") var55 = value;
when ("65") var65 = value;
otherwise;
end;
input item @;
end;
drop item name value;
run;
This will put your first 10 lines to the log.
Also note the dot I added in the FORMAT statement. The missing dot was the cause for the "uninitialized" message.
Kurt,
Thanks. I ran the modified code and still am seeing blank values in the 10 rows. Here is the log file with no errors and no warnings. Here are the six Notes from the log:
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
63:22
NOTE: The infile '\My Documents\PROD.log' is:
Filename=\\My Documents\PROD.log,
RECFM=V,LRECL=32767,File Size (bytes)=10750433,
Last Modified=25Oct2021:11:11:42,
Create Time=27Oct2021:14:23:26
NOTE: 10 records were read from the infile '\\My Documents\PROD.log'.
The minimum record length was 90.
The maximum record length was 119.
NOTE: The data set WORK.FIXERDATA has 10 observations and 14 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
Here is a sample of what one line of input data (.log) file looks like in Notepad:
15=USD 17=00004885118TRTR1.1.1 31=779.530000 32=100 54=1 55=LML 60=20210414-14:47:46 65= 120=USD 8001=EXEC_FEW 8100= 8147=Xxxxxxxx.Xxxxxxx@XXXX.XXX 9713=815215 11106=40
In the input data, the 65= and 8100= have blank values and need to populate in the SAS output data as having blank values.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.