deleted_user
Not applicable
Hi

I've got data that lines up perfectly. The file I'm accessing is over 100 GB, and I'm extracting daily data from it (this may turn into weekly, but anyhow).

It's an error report I'm producing for applications; however, I've hit a wall on the accuracy of the reports. I've found the correct column I need but am unable to identify it!

Example Data:

Header1,Header2,Header3
data1,data2,data3,data4
data1,data2,data3
data1,data2,data3,data4

The header row doesn't have the same number of columns as the actual data in the file.

How can I add a Header4 next to Header3?
I was thinking of a SUBSTR scan for Header3 and replacing it with 'Header3,Header4' so the delimiter makes it another column?

Not sure if this will work, as so far it's failed.

The end result will be a PROC FREQ report showing the number of errors against the number of transactions.
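Roughly something like this (the dataset and variable names here are just placeholders for illustration):

[pre]
proc freq data=work.errors;    /* work.errors is a hypothetical dataset */
   tables error_code / nocum;  /* one row per error code with its count */
   title 'Errors per transactions';
run;
[/pre]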

Anyone have any ideas? (I'm using a mainframe, if that helps!)

Message was edited by: JohnR
6 REPLIES
deleted_user
Not applicable
Are you sure data1,data2,data3,data4 isn't really "data1","data2, data3", "data4" ?

I have a situation where a data field has embedded column delimiters and even carriage returns in it. This makes the original data file appear to have a variable number of fields, but it actually doesn't.

Another thing: how often does header1,header2,header3 occur? Is it simply the first observation (record) in the file, or are the header rows intermingled with the data?

I have another situation where the first 5 fields in a record (observation) are "headers", but there is a variable number of fields with variable formats following. One of the header fields can be, and is, used to identify the type of the record; so, we parse that field to identify the record type and then output that type of record to its own dataset.

example code:

[pre]
data type1 (keep= ... )
     type2 (keep= ... )
     type3 (keep= ... )
     ;

   attrib field1 length= informat= format= label= ;
   attrib field2 length= informat= format= label= ;
   ...

   infile input_data delimiter='7F'x dsd truncover;

   /* read the 5 "header" fields and hold the line with a trailing @ */
   input field1 field2 field3 field4 field5 @ ;

   /* the record type is embedded in field3 */
   record_type = scan(field3, 1, " ");
   record_type = scan(record_type, 2, "/");

   select (record_type);
      when ('type_one')   link type_one;
      when ('type_two')   link type_two;
      when ('type_three') link type_three;
      otherwise;
   end;
   return;

   type_one:
      attrib t1f1 ... ;
      attrib t1f2 ... ;
      ...
      input t1f1 t1f2 t1f3 ... ;
      output type1;
      return;

   type_two:
      attrib t2f1 ... ;
      attrib t2f2 ... ;
      attrib t2f3 ... ;
      input t2f1 t2f2 t2f3 ... ;
      output type2;
      return;

   type_three:
      attrib t3f1 ... ;
      attrib t3f2 ... ;
      attrib t3f3 ... ;
      input t3f1 t3f2 t3f3 ... ;
      output type3;
      return;

run;
[/pre]
deleted_user
Not applicable
Hi Chuck

Thanks for your help. I'll have a closer look at the data and see if I can work it out.

There are actually about 22 headers with 23 data fields, and some of the data is extensive, so I can't use an exact column length.

The last data field is the data I want, so I'll work on that.

I'll work with what you've given me and see how I go.

Cheers
Cynthia_sas
SAS Super FREQ
Hi:
You might be able to use either PROC IMPORT or a DATA step program to read your data.

For example, let's say that you had the following CSV file stored in the location c:\temp\wronghdr.csv and it looked like:
[pre]
"name","v1","v2","v3"
"alan",11,12,13,14
"bob",21,22,,24
"carl",,32,33,34
"dave",41,42,43
"ed",51,52,53,54
[/pre]

Note how there are only 4 "headers" or variable names on line 1, but there are sometimes 5 data items on a line. Note also, how the data starts on line 2; and how at least 2 of the rows have values for all 5 columns, but some of the rows do NOT have values for all 5 columns.

If you did use PROC IMPORT to read the above file, one possible code method would be:
[pre]
proc import datafile="C:\temp\wronghdr.csv"
out=work.diffexamp
dbms=csv
replace;
datarow=2;
getnames=no;
run;
[/pre]
and the output would be:
[pre]
what happens with proc import

Obs VAR1 VAR2 VAR3 VAR4 VAR5

1 alan 11 12 13 14
2 bob 21 22 . 24
3 carl . 32 33 34
4 dave 41 42 43 .
5 ed 51 52 53 54
[/pre]

But, what if you REALLY want the NAME column to be NAME and not VAR1 -- you can see how PROC IMPORT names the variables when you skip over line 1.

For the most control over naming variables, especially in a tricky situation like yours, DATA step program code does give you the most control.

SAS has 4 different ways to read "flat files" or "text files" into a SAS dataset -- using DATA step program code. The code methods all involve the INPUT statement and if you look in the documentation for a topic entitled, "Statements: INPUT Statement", you will see a description of how to use each type:
Column Input
List Input
Formatted Input
Named Input

You can use whatever variable names you want if you use an INPUT statement. In addition to using only one INPUT method, you can also mix different types of INPUT statements in the same program. If you write your own program, you can skip over line 1 (presumably the one with the headers). It won't matter that the data has 22 headers on line 1 but 23 data fields, because in the INPUT statement you get to name the variables/columns yourself; if you KNOW that there are 23 data fields, you can list 23 variable names on your INPUT statement. Then, if a single row does not have all 23 values, you can use MISSOVER or TRUNCOVER (whichever is appropriate to your type of INPUT statement) to set the remaining variables to missing.

A SAS DATA step program to skip over line 1 and start looking for the first observation on line 2 in order to read the above data file and use variable names of your own choosing would be:
[pre]
data My_Col_Names;
infile 'c:\temp\wronghdr.csv' missover dsd firstobs=2;
input name $ var1 var2 var3 var4;
run;

proc print data=My_Col_Names;
title 'Correct column names no matter what was in line 1';
run;
[/pre]

and the result of the PROC PRINT would look like (note how the variable names from the INPUT statement were used, since line 1 was skipped over -- using the FIRSTOBS=2 option):
[pre]
Correct column names no matter what was in line 1

Obs name var1 var2 var3 var4

1 alan 11 12 13 14
2 bob 21 22 . 24
3 carl . 32 33 34
4 dave 41 42 43 .
5 ed 51 52 53 54
[/pre]

Your data may not look like what's above in the example of "wronghdr.csv", but in SAS there is usually a way to read just about ANY kind of file. You may not be able to use the EG automated wizards, but with code, you can read almost any text file into SAS format.

So, this is a perfect example of how you could get absolutely on-target help from Tech Support. They could take a look at your data file (if you sent them a sample) and make recommendations about the most appropriate form of INPUT statement to use.

cynthia
deleted_user
Not applicable
Thanks for your help, Cynthia!

Unfortunately, we're unable to use PROC IMPORT (not sure why; it may be disabled on the mainframe).

The INFILE statement works great; however, I'm unable to get at my data.

The only way I can grab my file is by using the SET statement.

I've tried doing the basic:

DATA newdata;
   SET INDN.COL;
run;

infile 'newdata' missover dsd firstobs=2;
input name $ var1 var2 var3 var4;
run;

proc print data=newdata;
run;

However, another unfortunate thing is that my .COL (SAS data) file is over 100 gigabytes... I can't do the above step, as the file is too large to copy into a temporary dataset.

If I add a statement such as if date eq '17507' then output;
it then does output, but the next step ignores the VAR I'm putting in, as the SET statement only identifies 34 vars and cuts the rest of the data out of the temp dataset.

I'm still working on it and will keep you posted 🙂

Cheers
John
Cynthia_sas
SAS Super FREQ
John:
If your data are in a SAS dataset, then the INFILE/INPUT approach is entirely WRONG and PROC IMPORT is entirely WRONG. So ignore just about everything in my previous post.

If your data are in a SAS dataset, then the SET statement is the right way to go.
One thing you can try to see what's up with your file is run a PROC CONTENTS to see the specific list of column names. Even if you have a big file, PROC CONTENTS is not reading the file directly, but only the descriptor portion of the file to produce host-specific information, data-set information (sort order, index info, etc) and a list of variables:
[pre]
proc contents data=INDN.COL;
title 'what is this file';
run;
[/pre]

I don't understand what you mean by this statement: "It then does output but then the next step ignores the VAR i'm putting in as the SET statement only identifies 34 vars and cuts the rest of data out for the temp dataset." What does your LIBNAME statement look like for INDN? What are the data set attributes for INDN -- do you have a LIBNAME statement, or is INDN a JCL DD statement? The SAS companion for your operating system (z/OS or MVS/TSO) should have information on how you can distinguish a SAS library dataset on the mainframe from a flat file on the mainframe.

If your big file is a VSAM file or a DB2 table or a SoftwareAG Adabas table (or Oracle, or Sybase, or ...) and you want to read it with SAS, then you have to make sure that you are using the SAS/Access product to directly read that kind of file type or you are using the correct syntax to treat the 3rd party file as though it were a SAS dataset.
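For illustration, a minimal sketch of reading such a table through a SAS/Access LIBNAME engine. The engine name, connection options, table name, and filter below are all hypothetical -- your site's setup will differ:

[pre]
/* hypothetical: DB2 engine with made-up connection options */
libname bigdb db2 ssid=DB2P authid=MYID;

data work.subset;
   set bigdb.mytable;      /* read the table as though it were a SAS dataset */
   where error_flag = 1;   /* subset in a WHERE so only matching rows are read */
run;
[/pre]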

Try this experiment (which should work on the mainframe, too) to get an idea of what SAS dataset information looks like:
[pre]
proc contents data=sashelp.class;
title 'SASHELP.CLASS BEFORE adding variable';
run;

proc print data=sashelp.class noobs;
title 'Print all obs in SASHELP.CLASS';
run;

data work.newdata;
set sashelp.class;
age_in_5_yrs = age + 5;
if sex = 'M' then projected_height = height * 1.05;
else if sex = 'F' then projected_height = height * 1.025;
run;

proc contents data=work.newdata;
title 'Compare this Contents listing with the first one to see the new variables';
run;

proc print data=work.newdata;
title 'Compare this report to the previous report';
run;
[/pre]

If you run the above code, you will see that in the "before" print of the SAS dataset, the variables are NAME, SEX, AGE, HEIGHT, WEIGHT. In the second dataset -- WORK.NEWDATA, the variables are those 5, plus AGE_IN_5_YRS and PROJECTED_HEIGHT. Because SASHELP.CLASS is already a SAS dataset, I can use it in a SET statement. The new file WORK.NEWDATA will be based on SASHELP.CLASS, but I am adding 2 new variables or columns to the new file -- AGE_IN_5_YRS is calculated based on AGE + 5 and PROJECTED_HEIGHT is set based on gender -- figuring that boys will grow more than girls over 5 years -- it really doesn't matter -- the variables are really arbitrary. What is important to note is that I can READ SASHELP.CLASS using a SET statement. INFILE/INPUT statements would be completely irrelevant in this context.

I apologize for misunderstanding the nature of your data. PROC IMPORT and INFILE/INPUT are not techniques you should use to read a SAS dataset or any file which can be read through the LIBNAME engine for a product (like Oracle, DB2, VSAM, etc). You only have 2 choices ... if your file is a "sequential" file then you have to parse and read it with INFILE/INPUT statements; if your file is a SAS dataset or a table that can be read with a SAS/Access product (and you have that SAS/Access product), then you should be able to use a SET statement. You would NOT mix SET and INFILE/INPUT statements to read a SAS dataset.

I think you should contact Tech Support for more in-depth help.

cynthia
deleted_user
Not applicable
Thank you for all your help, Cynthia!

I've located the problem of all my frustration!!

We have two major data sets:
log data (all the data from the environments) and a SAS dataset.
Someone on our database team runs their script, which copies data from the log data to the SAS dataset for us all to use.

I've located his script and found that, for the variable 'Pathname' (the variable I wanted all the data for), he basically deletes anything after a ','.

Therefore, no matter WHAT I was doing, I wasn't going to get my data 🙂

I've got permission, and I'm now going to just use my flat files, so I can now use the INFILE/INPUT statements, and am doing so successfully.

My raw data is TAB delimited, so DLM='05'X works a charm 🙂 ('05'x is the tab character in EBCDIC on the mainframe.)
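A minimal sketch of the kind of INFILE/INPUT step this describes. The fileref and variable names are placeholders; '05'x is the EBCDIC tab, so on an ASCII system you would use '09'x instead:

[pre]
data work.errors;
   /* rawdd is a hypothetical fileref / JCL DD name for the flat file */
   infile rawdd dlm='05'x dsd missover firstobs=2;
   /* placeholder variable names -- the real file has many more fields */
   input header1 :$20. header2 :$20. pathname :$200.;
run;
[/pre]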

Thank you again
