topic Re: Duplication in parallel in SAS Programming

Duplication in parallel

Planck — Thu, 30 Mar 2017 14:51:31 GMT

Hi,

I have a table A with a lot of column (something like 70) including this three ones:

* ID

* DURATION

* DATE

I want to duplicate every row according to the value of DURATION, and recalculate DATE (one month more evry time), and then recalculate every other columns according to the new DATE value.

data B;
set A;

do i = 1 to DURATION;
DATE = INTNX('month', DATE, +1 , 'e');
[...];
output;
end;
run;

The problem is, the number of rows in A is 60K

DURATION is often very high, so the number of rows in B at the end of the duplication is something like 35M

Vrey long, and very huge volume. I would like to make it parallel. Divide in 6 part (or more if necessary) of 10K rows and run them in parallel, then only do a simple union.

Do you know how I can do this in SAS?

Thanks

Re: Duplication in parallel

art297 — Thu, 30 Mar 2017 14:56:01 GMT

You haven't said what you are going to do with the file after expanding it.

For most SAS procedures you can probably avoid expanding the file and simply use DURATION as a WEIGHT variable.

Art, CEO, AnalystFinder.com

Re: Duplication in parallel

Planck — Thu, 30 Mar 2017 15:30:52 GMT

Thanks for the answer.

But I don't understand what do you mean by WEIGHT variables. Can you be more explicit?

After, with the table B, I need to extract specific rows to create a table C, and then export in Excel the table C...

Re: Duplication in parallel

art297 — Thu, 30 Mar 2017 16:03:21 GMT

May or may not be relevant to your needs. Here is a link that explains WEIGHT and FREQ statements in SAS procedures: http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473731.htm

Thus, for a proc freq, it would count each record as representing n (i.e., weight or freq) records without forcing you to build the large file you were describing.

Are you really planning on exporting a 60 million row file to Excel?

Art, CEO, AnalystFinder.com

Re: Duplication in parallel

Planck — Thu, 30 Mar 2017 16:34:47 GMT

You are kidding not 60 millions ^^.

My first table A contains 60.000 rows. My table B contains 35 milions rows, and my table C is back to 60.000 rows: my table C will extract from B only a specific calculated DATE.

So I will export in Excel only 60.000 rows (table C)

But I can't understand the link between WEIGHT/FREQ and my needs. It is more for statical needs, me I just need to transform data...

Re: Duplication in parallel

art297 — Thu, 30 Mar 2017 16:52:30 GMT

Agree weight or freq probably wouldn't be relevant. However, couldn't something like the following serve as an alternative to creating the large file?

data have;
  informat date date9.;
  input ID DURATION DATE;
  cards;
1 10 1jan2016
2 3 5mar2016
3 15 8sep2016
;

data want;
  set have (where=('01jan2017'd between date and intnx('month', date, duration-1 )));
run;

Art, CEO, AnalystFinder.com

Re: Duplication in parallel

Planck — Thu, 30 Mar 2017 17:02:05 GMT

Unfortunately not! 😞

I didn't go into details of all 70 columns in my explanation, but among them some are calculated in a complex way and I truely need to create the large table B, before reducing for table C.

The question is not about finding an alternative, but to accept it and find a solution parallelising it, because... I have no choice!

Re: Duplication in parallel

ballardw — Thu, 30 Mar 2017 17:13:14 GMT

@Planck wrote:

Unfortunately not! 😞

I didn't go into details of all 70 columns in my explanation, but among them some are calculated in a complex way and I truely need to create the large table B, before reducing for table C.

The question is not about finding an alternative, but to accept it and find a solution parallelising it, because... I have no choice!

Are you sure you can't post a little example data and then where things go after that to the final table C?

I have seen lots of processes that were either written before new procedures were available in SAS or were a direct translation from another data system and did not use SAS tools well. A frequent example are data steps that go through lots of loops and retaining variables to calculate common statistics such as mean and standard deviation that could be done with four or five lines of code using Proc Means or summary. Or forcing data into a specific layout because "that's the way the report looks" instead of using the SAS report procedures such as Proc Report or Tabulate.