Solved: Re: Split data in to quarters

skallamp · Posted 04-27-2012 01:11 AM

Hi,

Can you please help me to split the file in to quarters according to filldat (column F in the attached).

For instance if the date is 01/25/2011 it sould be treated as Q1_2011 and if the date is 12/25/2010 it should be Q4_2010.

I have attached similar code for split to months, but not sure on how to do for Quarters.

Thanks,

+Sathyan

Ksharp · Posted 04-27-2012 03:29 AM

How about:

proc import datafile='d:\software\SampleFile.xls' out=have dbms=excel replace;
getnames=yes;run;
data have(drop=_f);
 set have(rename=(filldat=_f));
 filldat=input(_f,mmddyy10.);
 format filldat mmddyy10.;
run;
proc sort data=have;by filldat;run;
proc sql noprint;
 select quote(name) into : list separated by ','
  from dictionary.columns
   where libname='WORK' and memname='HAVE';
quit; 
data _null_;
declare hash ha(multidata:'Y');
 ha.definekey('filldat');
 ha.definedata(&list);
 ha.definedone();
 do until(last.filldat);
  set have;
  by filldat groupformat;
  format filldat yyq.;
  ha.add();
 end;
 ha.output(dataset: cats('Q',qtr(filldat),'_',year(filldat)));
run;

Ksharp

View solution in original post

manojinpec · Posted 04-27-2012 01:21 AM

There is function tonget the quarter of year and then it can be concatenated with correspodning year.

shivas · Posted 04-27-2012 02:06 AM

Hi,

Try this..

	data want;
	input date;
	informat date mmddyy10.;
	format date date9.;
	year=year(date);
	qtr=qtr(date);
	DateValue=compress('Q'\|\|qtr\|\|'_'\|\|year);
	cards;
	01/25/2011

12/25/2010

	;
	run;

Thanks,

Shiva

Ksharp · Posted 04-27-2012 03:29 AM

How about:

proc import datafile='d:\software\SampleFile.xls' out=have dbms=excel replace;
getnames=yes;run;
data have(drop=_f);
 set have(rename=(filldat=_f));
 filldat=input(_f,mmddyy10.);
 format filldat mmddyy10.;
run;
proc sort data=have;by filldat;run;
proc sql noprint;
 select quote(name) into : list separated by ','
  from dictionary.columns
   where libname='WORK' and memname='HAVE';
quit; 
data _null_;
declare hash ha(multidata:'Y');
 ha.definekey('filldat');
 ha.definedata(&list);
 ha.definedone();
 do until(last.filldat);
  set have;
  by filldat groupformat;
  format filldat yyq.;
  ha.add();
 end;
 ha.output(dataset: cats('Q',qtr(filldat),'_',year(filldat)));
run;

Ksharp

Patrick · Posted 04-27-2012 04:13 AM

Or this way?

proc sql noprint;
select distinct
    cats('want_',put(filldat,yyq6.)) ,
    cat(' ',"when('",put(filldat,yyq6.),"') output ",cats('want_',put(filldat,yyq6.),';'))
      into :Data_DsList separated by ' ' ,
           :Out_DsList separated by ' '
from work.SampleFile
;
quit;

data &Data_DsList;
set work.SampleFile;
select(put(filldat,yyq6.));
&Out_DsList
otherwise;
end;
run;

Ksharp · Posted 04-27-2012 04:24 AM

Patrick,

I want know, yours and mine ,which code is more efficient .

Ksharp

Patrick · Posted 04-27-2012 04:28 AM

Mine of course! It uses less characters :smileylaugh:

...and more serious: I believe the Proc Sort to be quite costly.

Ksharp · Posted 04-27-2012 05:07 AM

No. I do not think so.

You used PUT() for every observations ,which is very low efficient.

I bet mine is faster than yours.

Hope OP can give us some feedback.

Ksharp

Patrick · Posted 04-27-2012 05:46 AM

You're writing the data 3 times to disk - even worse :smileyshocked:

Patrick · Posted 04-27-2012 06:25 AM

O.K.... you win. Unless I feed less rows to the SQL with the put statements as done below.

But then: You could replace your Proc Sort with a SQL sort creating a view and I assume we would be more or less even.

Good to be reminded: I'm going to use functions next time.

16         data work.SampleFile;
17           set work.SampleFile;
18           do _i=1 to 20;
19             output;
20           end;
21         run;

NOTE: There were 56720 observations read from the data set WORK.SAMPLEFILE.
NOTE: The data set WORK.SAMPLEFILE has 1134400 observations and 41 variables.
NOTE: DATA statement used (Total process time):
      real time           0.67 seconds
      user cpu time       0.23 seconds
      system cpu time     0.42 seconds
      memory              30629.37k
      OS Memory           37732.00k
      Timestamp           27/04/2012 08:19:20 PM

22
23         data have;
24          set work.SampleFile;
25          format filldat mmddyy10.;
26         run;

NOTE: There were 1134400 observations read from the data set WORK.SAMPLEFILE.
NOTE: The data set WORK.HAVE has 1134400 observations and 41 variables.
NOTE: DATA statement used (Total process time):
      real time           1.07 seconds
      user cpu time       0.26 seconds
      system cpu time     0.78 seconds
      memory              30629.37k
      OS Memory           37732.00k
      Timestamp           27/04/2012 08:19:21 PM

2                                         The SAS System              20:18 Friday, April 27, 2012

27
28         options fullstimer;
29         /* Patrick */
30         proc sql noprint;
31           create view vSampleFile as select distinct filldat from SampleFile;
NOTE: SQL view WORK.VSAMPLEFILE has been defined.
32           select distinct
33             cats('want_',put(filldat,yyq6.)) ,
34             cat(' ',"when('",put(filldat,yyq6.),"') output ",cats('want_',put(filldat,yyq6.),';
34       ! '))
35               into :Data_DsList separated by ' ' ,
36                    :Out_DsList separated by ' '
37           from work.vSampleFile
38           ;
39         quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.96 seconds
      user cpu time       0.56 seconds
      system cpu time     0.15 seconds
      memory              46209.40k
      OS Memory           53564.00k
      Timestamp           27/04/2012 08:19:22 PM

40
41         data &Data_DsList;
42           set work.SampleFile;
43           select(put(filldat,yyq6.));
44             &Out_DsList
45             otherwise;
46           end;
47         run;

NOTE: There were 1134400 observations read from the data set WORK.SAMPLEFILE.
NOTE: The data set WORK.WANT_2011Q2 has 255600 observations and 41 variables.
NOTE: The data set WORK.WANT_2011Q3 has 283200 observations and 41 variables.
NOTE: The data set WORK.WANT_2011Q4 has 300000 observations and 41 variables.
NOTE: The data set WORK.WANT_2012Q1 has 295600 observations and 41 variables.
NOTE: DATA statement used (Total process time):
      real time           5.78 seconds
      user cpu time       0.53 seconds
      system cpu time     0.60 seconds
      memory              46209.40k
      OS Memory           53564.00k
      Timestamp           27/04/2012 08:19:28 PM

48
49
50         /* Ksharp */
51         proc sort data=have;by filldat;run;

3 The SAS System 20:18 Friday, April 27, 2012

NOTE: There were 1134400 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE has 1134400 observations and 41 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           9.96 seconds
      user cpu time       1.37 seconds
      system cpu time     1.70 seconds
      memory              268894.25k
      OS Memory           276588.00k
      Timestamp           27/04/2012 08:19:38 PM

52         proc sql noprint;
53          select quote(name) into : list separated by ','
54           from dictionary.columns
55            where libname='WORK' and memname='HAVE';
56         quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.05 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              268894.25k
      OS Memory           276588.00k
      Timestamp           27/04/2012 08:19:39 PM

57         data _null_;
58         declare hash ha(multidata:'Y');
59          ha.definekey('filldat');
60          ha.definedata(&list);
61          ha.definedone();
62          do until(last.filldat);
63           set have;
64           by filldat groupformat;
65           format filldat yyq.;
66           ha.add();
67          end;
68          ha.output(dataset: cats('Q',qtr(filldat),'_',year(filldat)));
69         run;

NOTE: The data set WORK.Q2_2011 has 255600 observations and 41 variables.
NOTE: The data set WORK.Q3_2011 has 283200 observations and 41 variables.
NOTE: The data set WORK.Q4_2011 has 300000 observations and 41 variables.
NOTE: The data set WORK.Q1_2012 has 295600 observations and 41 variables.
NOTE: There were 1134400 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
      real time           6.34 seconds
      user cpu time       1.65 seconds
      system cpu time     0.79 seconds
      memory              549040.51k
      OS Memory           557128.00k
      Timestamp           27/04/2012 08:19:46 PM

4 The SAS System 20:18 Friday, April 27, 2012

70
71
72         GOPTIONS NOACCESSIBLE;
73         %LET _CLIENTTASKLABEL=;
74         %LET _CLIENTPROJECTPATH=;
75         %LET _CLIENTPROJECTNAME=;
76         %LET _SASPROGRAMFILE=;
77
78         ;*';*";*/;quit;run;
79         ODS _ALL_ CLOSE;
80
81
82         QUIT; RUN;
83

Ksharp · Posted 04-27-2012 10:53 PM

Hi.Patrick,

I guess OP's dataset might have already been sorted before.

My ' proc sort ' maybe doesn't need any more.

Just Guess.

Ksharp

Peter_C · Posted 04-28-2012 12:26 PM

if you want high effficiency, collect the data in memory - either array or hash table indexed by put( filldat, qtr_format. )

and write the array at end-of-file.

For an array you would need to know the range (min/max dates) but a hash table solution would need only the single pass. (For an array solution, you could presume "outer" bounds like 1970-q1 to 2100-q1 without taking a lot of memory, but of course, the lower bound might need to be much earlier for data like date-of-birth).

Another high-efficiency alternate solution writes to a separate table for each distinct quarter, in one data step (needing no sort as there will only be a relatively small number of quarters).

Patrick · Posted 04-28-2012 08:24 PM

Thanks Peter.

This little "contest" between Ksharp and I wasn't really serious. My code was never designed for performance and I don't believe that for the OP's real problem performance is importants. I liked the discussion with Ksharp - and even when not really serious there is always something to learn and take out of it.

PGStats · Posted 04-29-2012 12:11 AM

Patrick, Ksharp, your skilled discussion is very stimulating. There is a lot to be learned in trying to sort out such performance issues. It set me doing my own "testing" along the same approach as Patrick's. I compared a simple SQL solution inspired by your comments with the Hash-based program proposed by Ksharp. The SQL queries go as follows:

proc sql noprint;

create table qtr as select fillDate,

cats('Q',qtr(fillDate),'_',year(fillDate)) as qtr length=7

from (select distinct fillDate from have)

order by fillDate;

select distinct cat("create table ", qtr,

" as select * from have natural join qtr where qtr=", quote(qtr))

into :sqlCmd separated by ';'

from qtr;

&sqlCmd.;

drop table qtr;

quit;

My tests seem to indicate that when the data is already sorted by date, the two approaches take about the same time. But when the order is random, the hash-based approach takes about twice as long. In both cases (sorted or not), hash takes more than 10 times the memory that SQL does. The log is attached.

PG

Patrick · Posted 04-29-2012 07:43 AM

I wouldn't have expected a pure SQL solution to perform that good. Interesting.

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away