Re: reading a variable list using INFILE

PaigeMiller · Posted 07-22-2019 12:54 PM

So you can't tell us how one example winds up with 5 subsets and another example winds up with 100 subsets? There must be a process.

But you can query a library to find out how many data sets exist, and obtain their names. Is that what you need?

--
Paige Miller

taupirho · Posted 07-22-2019 01:00 PM

Well if I was to spit up a 100 line dataset into subsets of 10 lines each ,
then split a 1000 line data set into subsets of ten lines each. That would
give me different numbers of subsets would it not?

PaigeMiller · Posted 07-22-2019 01:12 PM

So, I'm not clear on your terminology, now that you bring up the 100 line dataset.

Exactly what are we talking about? Data sets? Variable names? Rows of a data set? Something else?

But this has gone on long enough, I'm still not clear about what the problem is, we're not making progress; and I request you start from the beginning, explain the entire problem IN DETAIL with examples of what the inputs are to this desired SAS program, and what the outputs are. Please use the proper terminology as well. Do not use "subset" as I don't know what that means in this context.

--
Paige Miller

ballardw · Posted 07-22-2019 04:23 PM

@taupirho wrote:
Well if I was to spit up a 100 line dataset into subsets of 10 lines each ,
then split a 1000 line data set into subsets of ten lines each. That would
give me different numbers of subsets would it not?

If I wanted to split a bunch of data sets into, roughly, the same number of records this is the approach I would take. My example uses 100 records each. If you want a different number of lines then replace all of the 100 values in the code below with your number. Irregular numbers may be possible but you would need a different variable for each set for managing the multiple both in the SQL and the data step do loop.

proc sql;
   create table tableinfo as
   select libname, memname, nobs, ceil( nobs/100) as sets
   from dictionary.tables
   where libname='LIB' and memtype='DATA';
quit;

data _null_;   
   set tableinfo;
   file print;
   length str $ 200.;
   do i=1 to sets;
      newmemname=cats(memname,i);
      fobs= (i-1)*100+1;
      lobs = i*100;
      str = catx(' ','data', newmemname,';');
      put str;
      str = catx(' ','set', catx('.',libname,memname),'firstobs=',fobs,'obs=',lobs,';');
      put str;
      str='run;';
      put str;
   end;
run;

The proc sql gets the names of the data sets, the number of observations and calculates how many sets of 100 lines would be needed.

The LIBNAME stored in the dictionary.columns is upper case so consider that when putting your name in the code. You may be able to select desired set names to process but you haven't provided any rules so none suggested.

You did not indicate whether the new data sets went to the same library or not. I would likely recommend a different library. In which case this line

str = catx(' ','data', newmemname,';');

should be

str = catx(' ','data', catxI('.','newlib',newmemname,';');

The example writes code to the results window that you could copy to the editor and run or save for later. The File statement could be modified to write to a SAS program file. Or use Call Execute(str); instead of Put str; to submit lines for execution.

CAVEATS: If you have long data set names adding digits could create invalid names longer than 32 characters. Up to you to manage that case. If you have data sets with millions of lines a split by a small value then you increase the chance of this happening.

If you have data sets whose names end in digit already then you run a chance of creating duplicate names. If is possible you may want to consider using

      newmemname=cats(memname,'_',i);

remember that each additional character is more likely to get to the 32 character name limit.

Tom · Posted 07-22-2019 06:51 PM

What code are you trying to run that is running out of memory? Perhaps we can fix that code instead of worrying about splitting the data? You might want to start a new thread explaining what analysis you are doing that needs so much memory.

Do you really have large datasets with hundreds of variables? Are you sure that analyzing just some of the variables at a time will help? Do you really need to split the data to do that? Why not just pass a list of the variables to use to the analysis program?

Or do you really have large datasets with so many observations? Are you sure that analyzing just some of the observations at a time will help? Do you really need to split the data to do that? Why not just pass in a WHERE condition (or FIRSTOBS=/OBS= combination) to process just some of the observations? Do you have a natural grouping variable (STATE, COUNTRY, STUDYID) that you could use to split the data into smaller parts? Is the data already sorted by that variable(s)? Could you just add BY group processing to your analysis to make it use less memory?