daisy6
Quartz | Level 8

Hi SAS experts,

 

I need to import multiple CSV files with different headers into SAS. For example,

the first CSV file has 13 rows of header information: the variable names begin on the 14th row, and the data follow. The variable names of the second CSV file begin on the 15th row, the variable names of the third CSV file begin on the 13th row, etc.

 

There are more than 45 variables in each CSV file and the order of the variables is not the same, but in every file the first variable name is "Item Type". The CSV files look like the following:

 

Export author: Wendy Smith
Export date: 11/29/201717 07:46:44
Data Source name: "Department"
Item Bank name: "Shoes"
Query name: "Shoes 2017"
Query type: Item Characteristics: Collection
Query Item type: Style

Visual query specific information:
Selected test versions:
Version name: "1.00" Shoes name: "S10316456MT_1" Modified: 01/06/2016
Version name: "1.00" Shoes name: "S10316456MT_2" Modified: 01/06/2016
Version name: "1.00" Shoes name: "S10316567MT_3" Modified: 01/12/2016

 

"Item Type"|"Item ID"|"AA"|" BB"|"CC"|"DD"|"EE"|"FF"|"GG"|"Date"|"Owner"|"Paper Type"|"HH"|"II"|"Rev. Order"|"TT"|Type"|"Source"|"Status"|"Translation Date"|"Translator"|"UDA 2 - 2016"|"UDA 3"|"UDA 3"|"Item Recorded"|"Response"|" Weight"|"Statistical History Name"|"Language"|" Start"|" End"|"Group"|" Bench"|"Level"|"N"|" Value"|"Measure"|"KK"

data are the following......................

 

I don't want to delete the headers manually. How can I do this? Thanks for any suggestions.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User

If the files are not too large then it is probably easiest to just copy the file to a new file without the variable number of top lines.  Then you can just let PROC IMPORT attempt to make some sense out of these strange headers and try to guess what type of data each column contains.

 

So if your existing file is named myfile.dat, then here is a data step that will copy it to a temporary file.

filename copy temp;
data _null_;
  infile 'myfile.dat' ;
  file copy ;
  /* on the first pass, skip lines until the header row is found */
  if _n_ = 1 then do until (found) ;
    input @;
    if upcase(_infile_) =: '"ITEM TYPE"' then found=1;
    else input;
  end;
  /* copy the current line: the header row on the first pass, then each data line */
  input;
  put _infile_;
run;

You can then write a simple PROC IMPORT step to read from the temporary file COPY.
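
For example, the PROC IMPORT step might look something like this (the OUT= data set name and the GUESSINGROWS value are only placeholders):

proc import datafile=copy out=want dbms=dlm replace;
  delimiter='|';
  getnames=yes;
  guessingrows=32767;
run;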

 

 

View solution in original post

8 REPLIES 8
ballardw
Super User

If that is an example from an actual file, you have more problems.

If I understand that

"Item Type"|"Item ID"|"AA"|" BB"|"CC"|"DD"|"EE"|"FF"|"GG"|"Date"|"Owner"|"Paper Type"|"HH"|"II"|"Rev. Order"|"TT"|Type"|"Source"|"Status"|"Translation Date"|"Translator"|"UDA 2 - 2016"|"UDA 3"|"UDA 3"|"Item Recorded"|"Response"|" Weight"|"Statistical History Name"|"Language"|" Start"|" End"|"Group"|" Bench"|"Level"|"N"|" Value"|"Measure"|"KK"

is supposed to be the actual data column header row, please look at the text starting here in the middle of that string:

 

"TT"|Type"|"Source"|"Status"|"Translation Date"|

Notice there is no " before Type. That means that if you have another missing ", there is a possibility of creating something like

 

|"EE"|"FF"|GG"|"Date"|Owner"|

which results in the | between Date and Owner NOT being treated as a delimiter and will cause a number of data issues in the resulting file.

 

 

Here is an example of reading inline data, finding the "header row", writing the header and data out to another file and then importing the result back.

 

data _null_;
   retain startflag 0;
   file "C:\junk.txt";
   input;
   /* set the flag at the header row, then copy it and every following line */
   if _infile_ =:'"Item Type"' then startflag=1;
   if startflag then put _infile_;
datalines;
Export author: Wendy Smith
Export date: 11/29/201717 07:46:44
Data Source name: "Department"
Item Bank name: "Shoes"
Query name: "Shoes 2017"
Query type: Item Characteristics: Collection
Query Item type: Style

Visual query specific information:
Selected test versions:
Version name: "1.00" Shoes name: "S10316456MT_1" Modified: 01/06/2016
Version name: "1.00" Shoes name: "S10316456MT_2" Modified: 01/06/2016
Version name: "1.00" Shoes name: "S10316567MT_3" Modified: 01/12/2016

 

"Item Type"|"Item ID"|"AA"|" BB"|"CC"|"DD"|"EE"|"FF"|"GG"|"Date"|"Owner"|"Paper Type"|"HH"|"II"|"Rev. Order"|"TT"|Type"|"Source"|"Status"|"Translation Date"|"Translator"|"UDA 2 - 2016"|"UDA 3"|"UDA 3"|"Item Recorded"|"Response"|" Weight"|"Statistical History Name"|"Language"|" Start"|" End"|"Group"|" Bench"|"Level"|"N"|" Value"|"Measure"|"KK"
"Item Type"|"Item ID"|"AA"|" BB"|"CC"|"DD"|"EE"|"FF"|"GG"|"Date"|"Owner"|"Paper Type"|"HH"|"II"|"Rev. Order"|"TT"|Type"|"Source"|"Status"|"Translation Date"|"Translator"|"UDA 2 - 2016"|"UDA 3"|"UDA 3"|"Item Recorded"|"Response"|" Weight"|"Statistical History Name"|"Language"|" Start"|" End"|"Group"|" Bench"|"Level"|"N"|" Value"|"Measure"|"KK"
"Item Type"|"Item ID"|"AA"|" BB"|"CC"|"DD"|"EE"|"FF"|"GG"|"Date"|"Owner"|"Paper Type"|"HH"|"II"|"Rev. Order"|"TT"|Type"|"Source"|"Status"|"Translation Date"|"Translator"|"UDA 2 - 2016"|"UDA 3"|"UDA 3"|"Item Recorded"|"Response"|" Weight"|"Statistical History Name"|"Language"|" Start"|" End"|"Group"|" Bench"|"Level"|"N"|" Value"|"Measure"|"KK"
;
run;

proc import  datafile="C:\junk.txt" 
     out=work.example
     dbms=DLM;
     datarow=2;
     Delimiter='|';
     guessingrows = max;
run;

There are lots of examples on this site of building lists of files for processing.

 

I would recommend creating a separate text file (the FILE target in the data step) so you can check on things like that header row issue. The PROC IMPORT should then reference a different data set in the OUT= option for each file read, as in the sketch below.
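
For example, a small macro could wrap the copy and import steps so that each file gets its own output data set. This is only a sketch: the macro name %import_one and the file paths are made up.

%macro import_one(infile=, out=);
   /* copy the file from the header row down to a temporary file */
   filename copy temp;
   data _null_;
      retain startflag 0;
      infile "&infile." lrecl=1000000;
      file copy lrecl=1000000;
      input;
      if _infile_ =: '"Item Type"' then startflag=1;
      if startflag then put _infile_;
   run;
   /* import the trimmed copy into its own data set */
   proc import datafile=copy out=&out. dbms=dlm replace;
      delimiter='|';
      getnames=yes;
      datarow=2;
      guessingrows=32767;
   run;
   filename copy clear;
%mend import_one;

%import_one(infile=c:\data\file1.txt, out=work.file1)
%import_one(infile=c:\data\file2.txt, out=work.file2)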

 

Note that your variable names are going to be ugly in places, with underscores leading, trailing, or in the middle, because of the spaces.

I duplicated the header row to have something to read, as you did not provide any example data.

Tom
Super User

In general it is easiest to read delimited files using a data step.  That way you have complete control over the definition of the variables and what informat (if any) to use and what format (if any) to attach to the variables. So just copy the header row.

 

"Item Type"|"Item ID"|"AA"|" BB"|"CC"|"DD"|"EE"|"FF"|"GG"|"Date"|"Owner"|"Paper Type"|"HH"|"II"|"Rev. Order"|"TT"|Type"|"Source"|"Status"|"Translation Date"|"Translator"|"UDA 2 - 2016"|"UDA 3"|"UDA 3"|"Item Recorded"|"Response"|" Weight"|"Statistical History Name"|"Language"|" Start"|" End"|"Group"|" Bench"|"Level"|"N"|" Value"|"Measure"|"KK"

Change it into a LENGTH statement to define your variables. You can clean up the names to be unique and valid SAS names. I have no idea what those variables are, but you should, and you will definitely have a better idea than PROC IMPORT, since it can only guess from what it sees in this particular example set of data.

 

 

length 
  Item_Type $10 
  Item_ID $12 
  AA BB CC DD EE FF GG 8
  Date 8 
  Owner $20 
  Paper_Type $10
  HH II 8
  Rev_Order  $10
  TT 8
  Type $10
  Source $20
  Status $10
  Translation_Date  8
  Translator $30
  UDA2_2016 8
  UDA3 8
  UDA3x 8
  Item_Recorded  $30
  Response 8
  Weight 8
  Statistical_History_Name $30
  Language $10
  Start 8
  End 8
  Group $10 
  Bench $10
  Level 8
  N 8
  Value 8 
  Measure 8
  KK 8
;

 

 

If any of the variables, like the dates, need an INFORMAT or FORMAT attached, then add an INFORMAT and/or FORMAT statement. Then add an INPUT statement.  So your data step will look something like this for your example.

 


 

data want ;
  infile 'myfile.csv' dsd dlm='|' truncover firstobs=14 ;
  length ...... ;
  informat date Translation_Date mmddyy.;
  format date Translation_Date date9. ;
  input Item_Type -- KK ;
run;

If you just have three files, then just count where the first data line is and adjust the FIRSTOBS= option on the INFILE statement.  Otherwise, perhaps you can just let the data step figure out where the header row is by looking at the beginning of each line.

data want ;
  infile 'myfile.csv' dsd dlm='|' truncover ;
  length ...... ;
  informat date Translation_Date mmddyy.;
  format date Translation_Date date9. ;
  /* on the first pass, read ahead until the header row is reached */
  if _n_=1 then do until (lowcase(item_type)='item type');
    input item_type ;
  end;
  input Item_Type -- KK ;
run;

 

daisy6
Quartz | Level 8
Thanks for the suggestion. But the variable names change and each file may not have the same number of variables.
RW9
Diamond | Level 26

That file is not a CSV file.  It has nothing, in any way, shape, or form, to do with a CSV file.  A CSV file is a standard file format which is defined as:

A plain text file containing one optional row of headers, then subsequent rows of data items, separated by commas.

 

Nothing in your file conforms to that; it is not a CSV file.

 

Therefore, since we are now talking about a general, non-specific file type, you need to write code to match the structure of that file.  If you get this data from a third party, they should provide a document describing the structure.  From that document you can create a data step import program which, following the rules in the document, reads in the data and parses it into a usable format.  If you only want the delimited data at the end, then do something along these lines:

data inter;
  infile "somedata.dlm";
  length varname varresult $2000;
  input;
  retain read 0;
  /* start at the header row and keep reading from there on */
  if index(_infile_,"Item Type")>0 or read then do;
    read=1;
    /* walk the delimited list and output each adjacent pair of items */
    do i=1 to countw(_infile_,"|") by 2;
      varname=scan(_infile_,i,"|");
      varresult=scan(_infile_,i+1,"|");
      output;  /* without this explicit OUTPUT, only the last pair per line would be kept */
    end;
  end;
run;

This will give you a dataset of varname and varresult, which are the pairs in the delimited list.  You can then process this dataset further, for instance by transposing the data, as in the sketch below.  The same code can be used on different files; however, get it working for one file first.
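
A sketch of that transpose step. It assumes the step above is extended with a row identifier (say, row=_n_; just before the OUTPUT statement) and that the varname values are unique within each row, which the duplicated "UDA 3" header suggests is not guaranteed here:

proc sort data=inter;
  by row;
run;

proc transpose data=inter out=wide(drop=_name_);
  by row;
  id varname;
  var varresult;
run;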

 

However, make sure you go back to whoever asked you to do this and say that, due to the data being in a poor format, it will take X amount of extra resources to handle it.  Me, I would send it back to the source and tell them to use a standard file format rather than making something up themselves; it would make everyone's life easier.

daisy6
Quartz | Level 8

Thanks for all the replies. I need to make a clarification here. I do not want this information, which is always in the first 13-15 rows of each file:


Export author: Wendy Smith
Export date: 11/29/201717 07:46:44
Data Source name: "Department"
Item Bank name: "Shoes"
Query name: "Shoes 2017"
Query type: Item Characteristics: Collection
Query Item type: Style

Visual query specific information:
Selected test versions:
Version name: "1.00" Shoes name: "S10316456MT_1" Modified: 01/06/2016
Version name: "1.00" Shoes name: "S10316456MT_2" Modified: 01/06/2016
Version name: "1.00" Shoes name: "S10316567MT_3" Modified: 01/12/201

 

I only care about the variable names and the following data:

These are the variable names, beginning with "Item Type". Sometimes the order of the variable names changes, and each file does not always have the same number of variables.

 

"Item Type"|"Item ID"|"AA"|" BB"|"CC"|"DD"|"EE"|"FF"|"GG"|"Date"|"Owner"|"Paper Type"|"HH"|"II"|"Rev. Order"|"TT"|Type"|"Source"|"Status"|"Translation Date"|"Translator"|"UDA 2 - 2016"|"UDA 3"|"UDA 3"|"Item Recorded"|"Response"|" Weight"|"Statistical History Name"|"Language"|" Start"|" End"|"Group"|" Bench"|"Level"|"N"|" Value"|"Measure"|"KK"

 

The data are as follows:

"Shoestype"|"I33573"|"New Value"|"1"|"0"|"New Value"|"New Value"|"none"|"None"|"2015_10"|"New Value"|"New Value"|"default"|"default"|"New Value"|"scored"|"New Value"|"Approved"|"New Value"|"New Value"|" "|" "|"none"|"TRUE"|"C"|1.00|"S1 Shoes Sept 2015"|"ENU"|"3/1/2014"|"8/9/2014"|""|"TRUE"|"S1 - Shoes"|"369.000000"|"0.540000"|"0.400000"|"1.210000"

 

"Shoestype"|"I33681"|"New Value"|"1"|"0"|"New Value"|"New Value"|"none"|"None"|"2015_10"|"New Value"|"New Value"|"default"|"default"|"New Value"|"scored"|"New Value"|"Approved"|"New Value"|"New Value"|" "|" "|"none"|"TRUE"|"C"|1.00|"S1 Shoes Sept 2011"|"ENU"|"8/10/2009"|"8/31/2011"|""|"TRUE"|"S1 - Shoes"|"1039.000000"|"0.950000"|"0.210000"|"-1.740000"

 

"Shoestype"|"I33687"|"New Value"|"1"|"0"|"New Value"|"New Value"|"none"|"None"|"2015_10"|"New Value"|"New Value"|"default"|"default"|"New Value"|"scored"|"New Value"|"Approved"|"New Value"|"New Value"|" "|" "|"none"|"TRUE"|"B"|1.00|"S1 Shoes Sept 2015"|"ENU"|"3/1/2014"|"8/9/2014"|""|"TRUE"|"S1 - Shoes"|"360.000000"|"0.880000"|"0.200000"|"-0.770000"

Tom
Super User

If the files are not too large then it is probably easiest to just copy the file to a new file without the variable number of top lines.  Then you can just let PROC IMPORT attempt to make some sense out of these strange headers and try to guess what type of data each column contains.

 

So if your existing file is named myfile.dat, then here is a data step that will copy it to a temporary file.

filename copy temp;
data _null_;
  infile 'myfile.dat' ;
  file copy ;
  /* on the first pass, skip lines until the header row is found */
  if _n_ = 1 then do until (found) ;
    input @;
    if upcase(_infile_) =: '"ITEM TYPE"' then found=1;
    else input;
  end;
  /* copy the current line: the header row on the first pass, then each data line */
  input;
  put _infile_;
run;

You can then write a simple PROC IMPORT step to read from the temporary file COPY.

 

 

daisy6
Quartz | Level 8

Hello Tom,

Thanks for the solution. However, I did not get all the data and variable names when I used the code you provided.  Here is my code:

filename copy temp;
data _null_;
  infile "&bdir.\&inf." ;
  file copy ;
  if _n_ = 1 then do until (found) ;
    input @;
    if upcase(_infile_) =: '"ITEM TYPE"' then found=1;
    else input;
  end;
  input;
  put _infile_;
run;

 

proc import datafile=copy out=want dbms=csv replace;
  delimiter="|";
  getnames=yes;
  datarow=2;
  guessingrows=32767;
run;

 

The dataset WANT is missing some of the variables and data. The variables and data with the names VAR28-VAR35 in the file I attached are very important.  When I checked the copy.dat file, it was also truncated, missing those variables and data too.

 

 

Tom
Super User
The notes in the SAS log should show the min and max record lengths read and written.
Try adding LRECL= settings to your INFILE and FILE statements. You should be able to just set it to something really large without any problem, say 10 million: LRECL=10000000.
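
For example, your copy step with the larger record lengths added would look something like this:

filename copy temp;
data _null_;
  infile "&bdir.\&inf." lrecl=10000000;
  file copy lrecl=10000000;
  if _n_ = 1 then do until (found) ;
    input @;
    if upcase(_infile_) =: '"ITEM TYPE"' then found=1;
    else input;
  end;
  input;
  put _infile_;
run;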

