Solved: Re: Split SAS dataset by columns

rfarmenta · Posted 06-15-2017 11:30 AM

I have a SAS dataset with 3000 variables that I need to split into 3 datasets with 10000 variables each. I need each dataset to have the studyid in the dataset so they can remerged later. Is there a way to do this without listing varibales in a keep statement? I've done this to subset observations but not varibales.

art297 · Posted 06-15-2017 01:51 PM

The first datastep was only to create some test data for me to use as a test. Nothing about it is needed for the rest of the code.

The only lines you have to change in the three sql calls are:

        where libname eq upcase("WORK") and
              memname eq upcase("have")

You have to change them if your dataset isn't in the WORK library and/or isn't called HAVE.

Similarly, the the datastep, at the end, you have to change:

set have;

to reflect the one or two level name of your actual dataset (e.g., set sashelp.class; );

Art, CEO, AnalystFinder.com

View solution in original post

Astounding · Posted 06-15-2017 11:55 AM

Not really. Macro language can automate the construction of the lists of variable names (so you don't have to type them out), but they still have to be named. Here's a quick and dirty version that will give you two data sets (not three):

data want_char (keep=studyid _character_)

want_num (keep=studyid _numeric_);

set have;

run;

Let me amend this ... if you are willing to do some preliminary work. For example:

proc contents data=have noprint out=_contents_;

run;

proc print data=_contents_;

where varnum in (1, 10000, 10001, 20000, 20001, 30000);

var name varnum;

run;

That will give you the names of the variables that are the 10,000th, 20,000th, etc. You may have to adjust the numbers here to find the right variable names. But once you find them, you can plug them in here:

data one (keep=studyid varname1--varname10000)

two (keep=studyid varname10001--varname20000)

three (keep=studyid varname20001--varname30000);

set have;

run;

The double dash defines a range of variables, using the order in which they were created in the incoming data set.

art297 · Posted 06-15-2017 12:34 PM

Here is one way that is less manual and more generalizable:

/* create some test data */
data have (drop=i);
  studyid='AA,AA,BB';
  array allMyVars_ {3020} 3. (3020*1);
  do i=1 to 100;
    output;
  end;
  stop;
run;

proc sql noprint;
  select name into :names1
    separated by ' '
      from dictionary.columns
        where libname eq upcase("WORK") and
              memname eq upcase("have")
          having upcase(name) eq 'STUDYID' or
              varnum le max(varnum)/3 
  ;
  
  select name into :names3
    separated by ' '
      from dictionary.columns
        where libname eq upcase("WORK") and
              memname eq upcase("have")
          having upcase(name) eq 'STUDYID' or
              varnum ge 2*(max(varnum)/3)
  ;
  select name into :names2
    separated by ' '
      from dictionary.columns
        where libname eq upcase("WORK") and
              memname eq upcase("have")
          having upcase(name) eq 'STUDYID' or
              (max(varnum)/3)<varnum< (2*(max(varnum)/3))    
  ;
quit;

data want1 (keep=&names1.) want2 (keep=&names2.) want3 (keep=&names3.);
  set have;
run;

Art, CEO, AnalystFinder.com

rfarmenta · Posted 06-15-2017 01:40 PM

Thank you for the suggestions/code.

Art-the code works with the test data but I am not sure how to include my own dataset as I do not code much using Proc SQL. Do I have to use the first data step, it doesn't seem like it is necessary except for the array with allmyvars. I tried to set that data step to my data but that gives me weird numbers.

Thank you again!

art297 · Posted 06-15-2017 01:51 PM

The first datastep was only to create some test data for me to use as a test. Nothing about it is needed for the rest of the code.

The only lines you have to change in the three sql calls are:

        where libname eq upcase("WORK") and
              memname eq upcase("have")

You have to change them if your dataset isn't in the WORK library and/or isn't called HAVE.

Similarly, the the datastep, at the end, you have to change:

set have;

to reflect the one or two level name of your actual dataset (e.g., set sashelp.class; );

Art, CEO, AnalystFinder.com

Tom · Posted 06-15-2017 03:18 PM

@art297 To improve performance of the query against DICTIONARY.COLUMNS do not use functions in the WHERE clause because then it cannot take advantage of the index and not bother to search the metadata of all visible datasets.

You can use macro function %UPCASE() instead.

        where libname eq %upcase("WORK") 
          and memname eq %upcase("have")

rfarmenta · Posted 06-15-2017 03:22 PM

Thank you all. This worked great. I can modify to use the where clause as well. Thank you!

art297 · Posted 06-15-2017 03:44 PM

@Tom: Didn't make any visible difference in the present case since the time for that step was all of .01 seconds, but useful to know. Would the same recommendation apply to the upcase in the having clause?

@rfarmenta: A slight change to the conditions in the code will cause a slightly better split. i.e.:

proc sql noprint;
  select name into :names1
    separated by ' '
      from dictionary.columns
        where libname eq %upcase("WORK") and
              memname eq %upcase("have")
          having upcase(name) eq 'STUDYID' or
              varnum le max(varnum)/3 
  ;
  
  select name into :names3
    separated by ' '
      from dictionary.columns
        where libname eq %upcase("WORK") and
              memname eq %upcase("have")
          having upcase(name) eq 'STUDYID' or
              varnum gt 2*(max(varnum)/3)
  ;
  select name into :names2
    separated by ' '
      from dictionary.columns
        where libname eq %upcase("WORK") and
              memname eq %upcase("have")
          having upcase(name) eq 'STUDYID' or
              (max(varnum)/3)<varnum<= (2*(max(varnum)/3))    
  ;
quit;

data want1 (keep=&names1.) want2 (keep=&names2.) want3 (keep=&names3.);
  set have;
run;

Art, CEO, AnalystFinder.com

Tom · Posted 06-15-2017 04:44 PM

You cannot use %upcase(name) since that will just uppercase the variable's name and not its value.

You need the upcase(name) since the metadata will store the variable name in mixed case.

In the past I have seen it make a query that takes just fractions of a second when the WHERE clause causes SAS to skip opening datasets and libraries go to taking 20,30 or more seconds. The time hit will happen if you have librefs that point to locations with many datasets or worse that are pointing to external databases. If the WHERE clause can eliminate those without SAS having to open every dataset or query every remote database it can save a lot of time. And if some of you "tables" are actually views then the views might even need to be run.

It does look like with SAS 9.4 that SAS can now understand that upcase('work') is just a constant and so that is why you see no effect. I suspect that SAS is optimizing the query to eliminate the UPCASE() function. I am not sure when this improvement was made.

Note that if you use a DATA step to access SASHELP.VCOLUMN view instead of using PROC SQL to access either SASHELP.VCOLUMN or DICTIONARY.COLUMNS then SAS will not be able to use the WHERE condition to skip opening the datasets and libraries. The WHERE clause is not passed through the view to the query against DICTIONARY.COLUMNS, but instead is applied to the results pulled from DICTIONARY.COLUMNS.

Example:

110  proc sql noprint;
111    select count(distinct memname),count(*)
112      into :nds trimmed, :nvars trimmed
113    from dictionary.columns
114    where libname=upcase('WORK')
115    ;
116  %put &=nds &=nvars;
NDS=1 NVARS=8
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.06 seconds
      cpu time            0.00 seconds


117  proc sql noprint;
118    select count(distinct memname),count(*)
119      into :nds trimmed, :nvars trimmed
120    from sashelp.vcolumn /* dictionary.columns */
121    where libname='WORK'
122    ;
123  %put &=nds &=nvars;
NDS=1 NVARS=8
124  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.11 seconds
      cpu time            0.01 seconds


125
126  data _null_;
127    set sashelp.vcolumn;
128    where libname='WORK';
129  run;

ERROR: Teradata connection: The UserId, Password or Account is invalid.
INFO: Data file SASUSER.TERADATA.DATA is in a format that is native to another host, or the file
encoding does not match the session encoding. Cross Environment Data Access will be used, which
might require additional CPU resources and might reduce performance.
NOTE: There were 8 observations read from the data set SASHELP.VCOLUMN.
      WHERE libname='WORK';
NOTE: DATA statement used (Total process time):
      real time           4.41 seconds
      cpu time            1.62 seconds

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away