
I'm going to be the recipient of a stream of CSV files transferred daily.  The CSV file is obscenely wide: it has more than 1,000 variables, and the first row, which lists the variable names, is ~35,000 characters long.

 

Looks like PROC IMPORT can only read variable names up to character 32,767.  So if I try:

 

filename foo "&root/Data/In/mycsv.csv" lrecl=50000 ;

proc import
  datafile=foo
  out=want
  dbms=csv
  replace ;
  guessingrows=max;
run;

PROC IMPORT only reads the variable names up to column 32,767.  So one variable name is truncated, and after that the variables are given default names like VAR1000-VAR1200 or whatever.

 

Is there any way to tell PROC IMPORT to read past 32,767 characters when it reads in the variable names?

 

I was hoping to avoid writing the data step code to import the data, as the number of variables in each file will vary from day to day (unfortunately : ).  I know the full set of 1,200 variables that could be in the file.  But some days the file might have 1,198 variables, or 1,195 variables, just because... fun stuff.

 

So now I'm thinking that to read each file I will need to do something like read the first row of variable names as data, then use that to generate a DATA step to do the importing.  Which could be feasible, but seems like a hassle.
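Something like this is what I have in mind for step one, reading the header tokens as data (untested sketch, reusing the fileref from above):

data header_names;
  infile foo dsd obs=1;   /* first record only; DSD implies comma-delimited */
  length name $32;        /* SAS names max out at 32 characters             */
  varnum + 1;             /* column position                                */
  input name @@;          /* hold the line, one token per iteration         */
run;

A DATA step has no trouble reading past 32,767 characters as long as the LRECL is big enough, so the limit really does seem to be in PROC IMPORT's handling of the name row.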

 

On SAS 9.4M6 on Linux.

 

 

ballardw
Super User

<Rant on>

Varying the number of variables in a data set that should be the same is amateur work at best.

With that start, one of the things I would be tempted to check is whether the column headings/variables are even spelled consistently. If not, then you have a lot of work to use this mess. At that point I typically bring the subject up to management and share how much time (= $$$$) is spent changing code every day. An hour a day translates to roughly 250 hours per year. Multiply by your salary and include that as a minimum wasted budget item.

 

Any sort of recurring data should be in the same structure and only changed per an agreement between source and users.

<Rant off>

 

I would be tempted to store the list of expected variables in a data set, then first read the header line of the data (via the _infile_ variable).

Then find the position at which each of the words in the data set appears in the file, and write the INPUT statement from that information.

Likely I would make a basic data step with the informats, labels, formats and such; code that might look like:

data fromstupidfile;
   infile "that source file" dlm=',' dsd lrecl=1000000;
   informat thisvar $10. thatvar $25. thedate yymmdd10.
      <continue for all of the 1,000+ variables, INCLUDING the ones that may not be there>
   ;
   format <list of variables and formats as needed>;
   label <list of the variables>;
   %include "<path to file>\input.sas";
run;

The code to build that include file could go something like this:

data varlist;
   length varname $32. ;
   input varname;
datalines;
Name 
sex
age
height
weight
sometimesthere
;

data headerrow;
   headerrow="sex,name,age,weight,height";
run;

proc sql;
   create table helper as
   select a.varname, findw(b.headerrow,strip(a.varname),',','ie') as inputorder
   from varlist as a, headerrow as b
   ;
quit;

/* use the helper to write an INPUT statement */
proc sort data=helper;
   by inputorder;
run;

data _null_;
   file "<path to file>\input.sas";
   set helper end=last;
   if _n_=1 then put "input";
   if inputorder>0 then put varname;
   /* end the INPUT statement */
   if last then put ";";
run;

The Varlist set is just an example. How you make it is up to you.

The Headerrow data set could be made with a data step reading the file with OBS=1: an INPUT; statement and an assignment like Headerrow=_infile_;.
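A minimal sketch of that step, reusing the lrecl from above. Note that a character variable tops out at 32,767 bytes, so a ~35,000-character header line would be truncated here; reading it token by token (as elsewhere in this thread) avoids that:

data headerrow;
   infile "that source file" lrecl=1000000 obs=1;
   length headerrow $32767;   /* max length for a character variable */
   input;                     /* load the record into _infile_       */
   headerrow = _infile_;      /* may truncate a very long header     */
run;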

 

Caution: If the spelling isn't consistent you are hosed. This does not address spaces in the column headings, so you may need two variables in the Varlist data set: the variable name you want and the expected text. The FINDW would use the expected text. The options used in FINDW are I for "ignore case", to handle some of the possible spelling issues, and E to report the word's position in word count instead of character position.

 

PROC IMPORT would be very wrong for daily files because each time you read a file the variable lengths, if not the types, can change depending on the content of the file. Plus, if you have values that look numeric, such as account numbers with leading zeros, then PROC IMPORT will treat them as numbers, losing the leading zeros.

 

A data step means that your data sets will have the same variables (albeit some missing), the same types, and the same lengths. So when you go to combine the daily data sets for any purpose, you do not have to start "fixing" things to allow that.

 

Quentin
Super User

Thanks @ballardw.  Yes, it's a poor data generating process.  It is actually a program generating the data, so the column names won't have spelling errors.  But it's a poor data generating process which results in "optional" columns.

 

In this case the data is all character, so if PROC IMPORT would work I'd be okay, at least for some rapid prototyping.  I could import the file, force all the vars to be $200 or whatever (it's not a lot of rows), rip it into pieces to normalize it a bit, and shove it into a database with some varchar columns.

 

But instead, I think you're right: I'm going to have to bite the bullet and spend more time creating some metadata to drive the import process.  I was going to do that anyway, as lots of the columns from the CSV will have to be renamed, etc.  But I was hoping I could just start with PROC IMPORT for now for some quick loads / prototyping.

Tom
Super User
(Accepted Solution)

If you want to read them all as strings, that is also easy.

data names;
  infile 'myfile.txt' dsd obs=1 ;
  varnum + 1;
  input name :$32. @@ ;          /* one header token per iteration */
run;

filename code temp;

data _null_;
  file code lrecl=70;
  put 'input (' @;
  do while (not eof);
    set names end=eof;
    put name @ ;
  end;
  put ') (:$200.); ' ;           /* read every column as $200 */
  stop;
run;

data want;
  infile 'myfile.txt' dsd truncover firstobs=2;
  %include code / source2;
run;
Quentin
Super User

@Tom If the column names aren't all valid var names, how would you adapt that?  My first thought would be to transpose the header row and then transpose it again, to have SAS generate the legal names for me.

 

That's a nice %include solution.  I always think macro, but here I think your old school FILE PUT code generation might be better, especially given the long lists involved.

Tom
Super User

You can either make your own rules for generating valid names or just generate generic names and use the headers as labels.

If you really have a list of all possible headers then you can calculate the names you want to use for those headers once and store them in a dataset.  Then either add a step to join the headers you read back against that dataset, or use a format (or an informat) to convert each header string into a name string.
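A minimal sketch of the format route, assuming a lookup data set (here called LOOKUP, with hypothetical columns HEADER and SASNAME) pairing each possible header with the name you want:

data cntlin;
   set lookup;                       /* one row per possible header       */
   retain fmtname '$hdrname' type 'C';
   start = header;                   /* header text as it appears in file */
   label = sasname;                  /* the valid SAS name to use         */
   keep fmtname type start label;
run;

proc format cntlin=cntlin;           /* builds the $HDRNAME. format       */
run;

Then, while reading the header row: name = put(strip(header_token), $hdrname.);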

Quentin
Super User

Sounds good.  I think maybe I'll read in the InsanelyLongColumnNamesWithCrazyCharactersInThem from the first row of the file like you suggested.

 

Then I'll hash those values using CRC32.  So at that point I've got a lookup table with arbitrary variable names, and the InsanelyLong names.

 

Then I'll INPUT the data using the arbitrary variable names.  After that it should not be too painful to calculate my own shortened pretty names and do the renames.
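Roughly like this, I think (sketch; assumes the HASHING function in 9.4M6 supports 'crc32', and that NAMES is the data set from the obs=1 header read):

data lookup;
   set names;                        /* one row per long header value */
   length shortname $32;
   /* 'crc32' yields an 8-character hex digest */
   shortname = cats('V_', hashing('crc32', strip(name)));
run;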

 

Thanks again.

Tom
Super User

If you are reading everything as character then just read the original file in a "tall" structure.

data
  names(drop=row rename=(value=header))
  values
;
  infile 'myfile.txt' dsd length=ll column=cc truncover ;
  row + 1;
  do col = 1 by 1 until (cc > ll);   /* walk across the delimited fields */
    input value :$200. @;
    if row = 1 then output names;    /* header row                       */
    else output values;              /* everything else, one row per cell */
  end;
  input;                             /* release the held line            */
run;
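If you need to pair the values back up with their headers, the two output datasets join on COL; a quick sketch:

proc sql;
   create table tall as
   select v.row, n.header, v.value
   from values as v
        left join names as n
          on v.col = n.col
   order by v.row, v.col;
quit;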
ballardw
Super User

@Quentin wrote:

@Tom If the column names aren't all valid var names, how would you adapt that?  My first thought would be to transpose the header row and then transpose it again, to have SAS generate the legal names for me.

 

That's a nice %include solution.  I always think macro, but here I think your old school FILE PUT code generation might be better, especially given the long lists involved.


One way, going along with my previous code, is to place the (sort-of-expected) text that should appear in the header row in a data set alongside the name you would like to use.

data varlist;
   infile datalines dlm='|' dsd;
   length varname $32 textinfile $100;
   input varname textinfile;
datalines;
Name|"Customer name"
sex|"Client sex"
age|"age at procedure"
height|"Client height (in)"
weight|"client weight (Lbs)"
sometimesthere|"RAndom text here"
;

And as I mentioned earlier, when parsing the header row information use the Textinfile variable instead of the name to get the position. I'm too lazy at this point to make another header row example to work with. Should be a trivial exercise from the provided example.

 

Another element you could add to the data step that writes the %include file: if the input order value is 0 for a required variable, halt everything so you don't assume that a value is actually there. This might happen when someone does yet another undocumented program change so that the variable that should be "Customer Name" gets changed to "Name of Customer" or "Customer Full Name" or other nonsense.
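A sketch of that check, assuming a Required flag (a hypothetical 0/1 variable) has been added to the helper data:

%let headers_ok = 1;

data _null_;
   set helper;
   if required and inputorder = 0 then do;
      put 'ERROR: expected column ' varname 'is missing from the header row.';
      call symputx('headers_ok', '0');   /* test this before running the %include */
   end;
run;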

 

I used to have a database of metadata (actually a SAS/AF application to help with data entry and such) used to read instrument data coming from over 50 sites with as many as 50 channels of data and hundreds of instrument readings per channel. The scientists were testing new instruments, so I had to parse data all the time to see if new channels or new fields in existing channels were added. In the MIDDLE of a file, because the file had multiple sets of instruments reporting at different intervals: daily, hourly, half-hour, 10 or 15 minutes, and event occurrence.

So my programs, which ran every day (eventually), told me when unexpected stuff occurred so I could sleuth the point in the file where it happened, update my database with the new metadata, and have it write my new program for that site. Some of those programs approached 50,000 lines of code. (Lots of short lines, as that was easier than trying to combine elements to make shorter code lines.)

And sometimes I had to look at the diagnostics and go to the scientists and ask what they did, because an instrument that had been reporting, for example, values in the 10 to 50 range suddenly started reporting 1000 to 3000. (Swapped instrument but reused a channel identification for the old one.)

Quentin
Super User

Thanks, that's pretty close to the setting I'm in as well.

 

I think I've got the team to agree that once we settle on the set of columns for v1, it will stay the same until v2.  So I'll just ignore any column names that aren't in my approved metadata that defines what is expected.  Maybe I'll make a list of all the columns that I'm ignoring, so I know what to add in the future versions.

 

Talking with various stakeholders, yes, the scientists who are always tinkering with their toys that generate the data are the ones that will make this work 'interesting.'

 

My other thought was just to store every value in a single table with RecordId, Name, Value, just so all the data is in the database somehow.  Then for 'surprise' columns that don't have a defined home in a database table, at least the values are captured somewhere, for someone who wants to search for them.

ballardw
Super User

One thing with reading CSV is that you will need to insert some sort of placeholder variable in the INPUT if you are skipping columns.

In a situation like that I typically define a single variable with a catchy name like Dropvar with a $1. informat.

Then my input statement might look like:

input
   thisvar
   thatvar
   dropvar
   dropvar
   keepvar
;

And then have a

drop dropvar;

 

This would be fairly easy to implement using the input order variable: any place there is a gap in the sequence, insert Dropvar as the name(s) of the variable(s) to read. Routine and tedious programming: keep track of the last order value, then use a loop to insert one or more Dropvars based on the size of the gap, as in the sketch below.
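A rough sketch of that generator, building on the Helper data set from before (assumes Dropvar already gets its $1. informat in the INFORMAT statement):

proc sort data=helper(where=(inputorder > 0)) out=found;
   by inputorder;
run;

data _null_;
   file "<path to file>\input.sas";
   set found end=last;
   retain lastorder 0;
   if _n_ = 1 then put 'input';
   /* pad any gap in the column sequence with the throwaway variable */
   do _i = lastorder + 1 to inputorder - 1;
      put '   dropvar';
   end;
   put '   ' varname;
   lastorder = inputorder;
   if last then put ';' / 'drop dropvar;';
run;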

Quentin
Super User

Thanks.  It's coming together in my head.

 

I'll build a lookup table which matches BadVarName to GoodVarName, as suggested.  It will probably also have informats and other useful information.

 

Then when I read a CSV file, I will read in the first row which is the bad names.  Then I'll translate the bad names into good names (with a format or whatever), and read in the data with an INPUT statement using the good names.  If there is a bad name which is not in my lookup table, I'll input it as _DropMe, as you suggest.

 

This seems more straightforward than it did a couple of days ago, so thanks again for your help.  I still wish PROC IMPORT could have magically done it for me.  But in the long run, I'll definitely be better off with a data step that I write than with a generated PROC IMPORT data step.  So it'll be worth the effort.

 

I should be okay, as long as I don't find out that there is a limit on the length of an input statement. : )

 

Thanks,

-Q.

Tom
Super User

If the column headers are valid variables names and are consistent between versions then using a DATA step to read the file will be much easier and reliable than PROC IMPORT.

 

First create code (or an actual dataset) for ALL possible variables.

data template;
  length var1 $10 var2 8 ;
  informat var2 yymmdd10. ;
  format var2 date9. ;
  stop;            /* zero observations; attributes only */
run;

Then use one data step to get the list of names from the header row of the CSV file.

data todays_variables;
   infile 'todays_file.txt' dsd obs=1;
   input name :$32. @@;
run;

Now use that data to write your program to read the actual data lines.

filename code temp;
data _null_;
  file code lrecl=75 ;
  put 'input ' @ ;
  do while (not eof1);
    set todays_variables end=eof1;
    put name @ ;
  end;
  put ';' / 'keep ' @ ;
  do while (not eof2);
    set todays_variables end=eof2;   /* second pass builds the KEEP list */
    put name @ ;
  end;
  put ';' ;
  stop;
run;

data want;
  if 0 then set template;   /* copy lengths, informats, formats from template */
  infile 'todays_file.txt' dsd truncover firstobs=2;
  %include code / source2;
run;

If you are ok with the resulting dataset including ALL of the possible variables even when they are not in this version of the text file then skip generating the KEEP statement.

Quentin
Super User

Thanks @Tom.  That's a helpful start.  The variable names aren't all legal SAS names, so I'll have to play with that a bit.

 

It's funny, the first sample data I got for this was XLSX files.  And I usually hate XLSX and prefer CSV.  But reading the XLSX files with a data step and the XLSX engine was a piece of cake.  It read all 1,200 variables, automatically converted the column names to legal names, and kept the original column names as variable labels.
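For reference, that XLSX read was about this simple (sheet name hypothetical; uses the XLSX libname engine from SAS/ACCESS to PC Files):

libname xl xlsx "&root/Data/In/myfile.xlsx";

data want;
   set xl.'Sheet1'n;   /* engine legalizes names, keeps originals as labels */
run;

libname xl clear;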

 

I still hate reading Excel, but this is the first time I feel like reading Excel might have been easier.  Even though reading Excel always feels riskier than it's worth. 🙂

