Solved: Re: Splitting big dataset into smaller dataset

TXSASneophyte · Posted 04-05-2016 03:51 PM

Hi Everybody!

I recently came across acode online that works fantastically well in splitting a big dataset into smaller datasets. However, I was wondering if someone could help me modify it so the smaller datasets are more aligned with what I want.

Right now, the code (see below) splits the big dataset based on one variable (a character variable). What I want to do is split the big dataset by two different variables, the current variable in the code but also by a second variable (not currently in code).

For example, let's assume I have a dataset for the entire United States for six years (2009-2015). I want to split that dataset per state per year (Texas2012).

Here is the code I am using:

%LET TABLE=PQIEX.STELIGENROLL_KEYVAR;
%LET COLUMN=SDA;
PROC SQL NOPRINT;
/* build a mini program for each value */
/* create a table with valid chars from data value */
SELECT DISTINCT
CAT("SDA.PROFILE_",COMPRESS(&COLUMN.,,'KAD'),"(WHERE=(&COLUMN.=",QUOTE(&COLUMN.),"))")
INTO :ALLSTEPS SEPARATED BY ''
FROM &TABLE.;
QUIT;
%PUT NOTE:&ALLSTEPS.;

DATA SDA.&ALLSTEPS.;
SET &TABLE;
RUN;

Some additional information that might be helpful:

1) SAS Version 9.2

2) Library name: PQIEX

3) Dataset: STELIGENROLL_KEYVAR

4) First Splitting Variable: SDA (character variable with 13 different values)

5) Second Splitting Variable (the one I am having trouble incorporating into the code): FFYEAR (numeric variable for the years 2009-2015)

I am fairly new with SAS and just have no clue how to modify this code in order to split the dataset by two variables. My mind starts getting confused on how to treat the numeric variable and how to incorporate it into the name of the new datasets. If y'all can please help me (and have patience with my ignorance), I would greatly appreciate it!

Muchas Gracias

AhmedAl_Attar · Posted 04-05-2016 05:24 PM

Try the code below.

I used sashelp.prdsal2 --> it has fields similar to your case (year, state). The code will perform single data read with multiple writes.

Hope it helps,

Ahmed

proc sql noprint;
   select distinct cats('WORK.',translate(STRIP(state),'_',' '),'_',year)
           ,cat ('WHEN (year=',year,' and state="',STRIP(state),'") OUTPUT WORK.',translate(STRIP(state),'_',' '),'_',year,';')
   into   :g_outputNames separated by ' '
           ,:g_conditions separated by ' '
   from   sashelp.prdsal2
   order by year;
quit;
/*
%put g_outputNames = %superq(g_outputNames);
%put g_conditions=%superq(g_conditions);
*/

DATA &g_outputNames;
   SET sashelp.prdsal2;
   SELECT;
       &g_conditions
       OTHERWISE;
   END;
RUN;

View solution in original post

Julie4435637 · Posted 04-05-2016 04:06 PM

I don't have a lot of time now to look at your code but maybe the cats function could be useful here.

select distinct(cats(state, year))

where ....

TXSASneophyte · Posted 04-05-2016 04:27 PM

I tried doing this but my attemps end in failure. I don't fully grasp how to modify the current code and how it handles the cat function to include the second variable.

Reeza · Posted 04-05-2016 04:08 PM

Why do you want to split your dataset?

How many records in your big dataset vs each smaller ones (on average).

This isn't usually a recommended step, there are a lot of ways to use data within SAS without splltting it up.

TXSASneophyte · Posted 04-05-2016 04:29 PM

My big dataset has over 300 million observations; my computer takes a long time to do basic things with such a massive dataset. I also need to run specific analysis based on the observations that are in specific region and for specific years. I cannot give you an average of how many observations would be allocated into each smaller dataset since the regions and years vary considerably with their observations.

LinusH · Posted 04-05-2016 04:45 PM

A combined index on year and state would probably give a perform boost (if that's a normal query).
Also, use SPDE instead of the BASE engine could speed up IO.
Investing in faster disks and more RAM, depending in your common query pattern of course. HW is cheap, man power isn't.

Data never sleeps

Reeza · Posted 04-05-2016 04:58 PM

Ok...that seems somewhat reasonable 🙂

Here's a blog that goes over splitting a data set.

http://blogs.sas.com/content/sasdummy/2015/01/26/how-to-split-one-data-set-into-many/

One thing to keep in mind, if you have two variables that you're splliting on you can add the two variables you need to design your naming structure of your new datasets with some thought so that you can easily query them later on. If you're likely to use year more often or state most often then that should come first in your naming convention.

I would think with 300 million splitting by only one variable would be enough to obtain efficient results. Perhaps split by one and index by the other? My suggestion would be split by year and then index by state.

Look into using call execute to do this, instead of a macro or in combination with a macro.

I'll try and post some code tonight if someone else hasn't already answered it 🙂

TXSASneophyte · Posted 04-05-2016 05:13 PM

Hi Reeza,

You found the original blog I found the code in! I actually reached out to the individual asking for help and he directed me to posting the question on here.

I really want to split my dataset by region and year because there is a good chance I'll be performing different analyses on them. I understand your point of trying to keep the dataset together as much as possible, but considering all the different moving parts in my project, having a separate dataset per region per year is a lot more efficient.

I do appreciate your help getting this resolved!

Reeza · Posted 04-05-2016 05:17 PM

So what's your naming convention for the files?

AhmedAl_Attar · Posted 04-05-2016 05:24 PM

Try the code below.

I used sashelp.prdsal2 --> it has fields similar to your case (year, state). The code will perform single data read with multiple writes.

Hope it helps,

Ahmed

proc sql noprint;
   select distinct cats('WORK.',translate(STRIP(state),'_',' '),'_',year)
           ,cat ('WHEN (year=',year,' and state="',STRIP(state),'") OUTPUT WORK.',translate(STRIP(state),'_',' '),'_',year,';')
   into   :g_outputNames separated by ' '
           ,:g_conditions separated by ' '
   from   sashelp.prdsal2
   order by year;
quit;
/*
%put g_outputNames = %superq(g_outputNames);
%put g_conditions=%superq(g_conditions);
*/

DATA &g_outputNames;
   SET sashelp.prdsal2;
   SELECT;
       &g_conditions
       OTHERWISE;
   END;
RUN;

ChrisNZ · Posted 04-05-2016 06:57 PM

If you really must split, use the code @AhmedAl_Attar provided.

You'll create all your tables in one go rather than reading your large table over and over.

To decide wether to split, you have to weigh:

- the time and storage wasted in splitting the original table

versus

- the time used to create an index and the potential additional time processing subsets of the original table directly (as compared to processing smaller tables)

I also agree with @LinusH that SPDE is worth a try. Its indexes are very fast, and its binary compression very effective.

Maybe duplicating and splitting is the answer. Or maybe not.

High-Performance SAS Coding - Third Edition

Kurt_Bremser · Posted 04-06-2016 03:47 AM

I'd try to get the list of distinct state/year combinations via a data step and proc sort nodupkey. select distinct in proc sql can be very resource and time consuming with big datasets.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

TXSASneophyte · Posted 04-06-2016 10:43 AM

The code worked perfectly! Thanks for the help Ahmed 🙂

Registration is open

SAS Training: Just a Click Away