Reading in multiple csv files...column names appearing in SAS dataset

Jack1 · Posted 06-13-2016 10:25 AM

I have written SAS code to import three .csv files into a SAS dataset. These csv files contain the exact same column names.

I have coded my program so the first row is read as the column name (firstobs=2). However when the program goes to the second and third csv files the first row of these are being read in as records.

Here is the code:

data work.firm_count;
format Ref_Date $7. GEO $16. 'Geographical Classification'n $2. SIZE $21. Description $110. Vector $9. Coordinate $8. Value comma9. ;

infile "c:\user\homareau\PredictiveModels\FirmCounts\Firm_count_NM_*.csv" firstobs=2 dlm=',' Missover DSD;

informat Ref_Date $7. GEO $16. 'Geographical Classification'n $2. SIZE $21. Description $110. Vector $9. Coordinate $8. Value comma9. ;

input Ref_Date GEO 'Geographical Classification'n SIZE Description Vector Coordinate Value;
label Ref_Date = 'Date'
GEO = 'Prov_Name'
'Geographical Classification'n = 'Prov_Code'
SIZE = 'Firm_Size'
Description = 'NAICS_Description'
Vector = 'CANSIM_V_Code'
Coordinate = 'Coordinate'
Value = 'Count';
drop Vector Coordinate;
run;

My question: What modificaiton do I need to make to this code so that don't end up with column names ending up in my dataset? I thought "firstobs" would have done the trick?

Thanks

Jack

FreelanceReinh · Posted 06-13-2016 11:39 AM

Hi Jack,

Last month I saw a clever solution of this problem (using a combination of FIRSTOBS= and EOV= options) in this forum in a post written by ballardw.

RW9 · Posted 06-13-2016 11:42 AM

Sorry, not seeing why you are mixing things up:

input Ref_Date GEO 'Geographical Classification'n SIZE Description Vector Coordinate Value;

Why do you specify 'Geographicl Classification'n ? This doesn't make sense in this context. The ''n is for select long names from a database for instance, your specifying a name for an input statement?

Also check your file, I can't see the file, so make sure the data does start on row two. This in cludes having a return at the end of each row - if they all appear in one row firstobs will not work.

data work.firm_count;
  infile  "c:\user\homareau\PredictiveModels\FirmCounts\Firm_count_NM_*.csv" firstobs=2 dlm=',' missover dsd;
  format    ref_date $7. 
            geo $16. 
            classification $2. 
            size $21. 
            description $110. 
            vector $9. 
            coordinate $8. 
            value comma9.;
  informat  ref_date $7. 
            geo $16. 
            classification $2. 
            size $21. 
            description $110. 
            vector $9. 
            coordinate $8. 
            value comma9.;
  input     ref_date $
            geo $
            classification $
            size $
            description $
            vector $
            coordinate $
            value;
  label     ref_date='Date'
            geo='Prov_Name'
            classification='Prov_Code'
            size='Firm_Size'
            description='NAICS_Description'
            vector='CANSIM_V_Code'
            coordinate='Coordinate'
            value='Count';
  drop vector coordinate;
run;

Tom · Posted 06-13-2016 12:34 PM

Here is a simplified version of the technique that was used in the linked article.

Use the EOV option to name a flag variable that will let you detect when a new file starts.
Use the FIRSTOBS=2 option to skip the first record since the EOV flag will not be set until you start a NEW file.
Read the line and hold it with the trailing @ to allow SAS to be able to set the EOV flag.
When it is the beginning of a new file then use another INPUT statement to throw away the line.
Reset the EOV variable to false because SAS will not reset it for you.

Then the rest of the data step is the same.

data want ;
  infile "....*.csv" firstobs=2 dsd truncover eov=eov ;
  input @;
  if eov then input;
  eov=0;
  input .... ;
run;

Note also the use of the TRUNCOVER option instead of MISSOVER option. This will make writing your INPUT statement easier since you do not need to worry about reading past the end of the line. With MISSOVER if you ask SAS to read 10 characters and only 5 more exist on the line then you get a missing value, but with TRUNCOVER you get the 5 characters that do exist.

Note this will work with wildcards in the INFILE statement or when using a FILENAME statement that references multiple files.

filename indata
('file1.txt'
,'file2.txt'
,'file3.txt'
);

data want ;
   infile indata eov=eov ....

Jack1 · Posted 06-13-2016 01:47 PM

Hi Tom,

I made the modification you suggested and I couldn't get it to work.

Here is my original SAS code:

data work.firm_count;
format Ref_Date $7. GEO $16. 'Geographical Classification'n $2. SIZE $21. Description $110. Vector $9. Coordinate $8. Value comma9. ;
infile "c:\user\homareau\PredictiveModels\FirmCounts\Firm_count_NM_*.csv" firstobs=2 dlm=',' missover dsd ;
informat Ref_Date $7. GEO $16. 'Geographical Classification'n $2. SIZE $21. Description $110. Vector $9. Coordinate $8. Value comma9. ;
input Ref_Date GEO 'Geographical Classification'n SIZE Description Vector Coordinate Value;
label Ref_Date = 'Date'
      GEO = 'Prov_Name'
      'Geographical Classification'n = 'Prov_Code'
      SIZE = 'Firm_Size'
      Description = 'NAICS_Description'
      Vector = 'CANSIM_V_Code'
      Coordinate = 'Coordinate'
      Value = 'Count';
drop Vector Coordinate;
run;

How would the code above need adjusting to make it work so its not reading in headers from other sheets?

Thanks..

Jack

Tom · Posted 06-13-2016 02:01 PM

Doesn't look like you made any modifications?

Any way try this.

data work.firm_count;
format Ref_Date $7. GEO $16. 'Geographical Classification'n $2. SIZE $21. Description $110. Vector $9. Coordinate $8. Value comma9. ;
informat Ref_Date $7. GEO $16. 'Geographical Classification'n $2. SIZE $21. Description $110. Vector $9. Coordinate $8. Value comma9. ;

infile "c:\user\homareau\PredictiveModels\FirmCounts\Firm_count_NM_*.csv" firstobs=2 dsd truncover eov=eov  ;

input @;
if eov then input;
eov=0;

input Ref_Date GEO 'Geographical Classification'n SIZE Description Vector Coordinate Value;
label Ref_Date = 'Date'
      GEO = 'Prov_Name'
      'Geographical Classification'n = 'Prov_Code'
      SIZE = 'Firm_Size'
      Description = 'NAICS_Description'
      Vector = 'CANSIM_V_Code'
      Coordinate = 'Coordinate'
      Value = 'Count';
drop Vector Coordinate;
run;

Jack1 · Posted 06-13-2016 02:05 PM

OK Tom, I didn't enter all the code correctly. I ran it as you suggested and it works great! Thanks so much...

Jack

Reading in multiple csv files...column names appearing in SAS dataset

Re: Reading in multiple csv files...column names appearing in SAS dataset

Re: Reading in multiple csv files...column names appearing in SAS dataset

Re: Reading in multiple csv files...column names appearing in SAS dataset

Re: Reading in multiple csv files...column names appearing in SAS dataset

Re: Reading in multiple csv files...column names appearing in SAS dataset

Re: Reading in multiple csv files...column names appearing in SAS dataset

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!