BookmarkSubscribeRSS Feed
ciro
Quartz | Level 8

Hi,

i have a big dataset (about 40 millions obs) concerning some items.

For each item I have a starting date and an end date (if the item has expired,otherwise it is missing).

the data set is as following;


data have;
id=1; start='1FEB2015'd; end=.;output;
id=2; start='1FEB2015'd; end='31JUL2015'd;output;
id=3; start='1SEP2015'd; end='31OCT2015'd;output;

format start end date9.;
run;

 

I need a dataset with a record per each day, say from 1 Jan 2015 to 31 dec 2016.

for each day I need information on three variables (number of starts, number of ends and number of active items).

 

As the dataset is very big, is there any other way other than creating a flag (1/0)  for each day and each of the three variables (that is over 2000 variables)  and then summarize by days?

I imagine I can partition the problem but what is a good way? 

Any suggestion on the procedure to write, the coding to create this kind of variables,and how to summarize is much appreciated. 

 

 

10 REPLIES 10
PaigeMiller
Diamond | Level 26

data have;
id=1; start='1FEB2015'd; end=.;output;
id=2; start='1FEB2015'd; end='31JUL2015'd;output;
id=3; start='1SEP2015'd; end='31OCT2015'd;output;

format start end date9.;
run;

 

I need a dataset with a record per each day, say from 1 Jan 2015 to 31 dec 2016.


How does the sentence "I need a dataset with a record per each day, say from 1 Jan 2015 to 31 dec 2016" relate to the data provided above, where 31DEC2016 is not mentioned?

--
Paige Miller
ballardw
Super User

You need to provide details on just how you need to compare Start and End with those two other values.

 

You have a number of cases to potentially address

lowerlimit < start < end < upperlimit    <= obviously complicated by end Missing

lowerlimit < start   < upperlimit    <= doesn't use end at all

lowerlimit < end< upperlimit    <= doesn't use start at all but the end missing is an issue

 

So describe in detail which of these you want and how the End=missing is to be treated

 

 

PhilC
Rhodochrosite | Level 12

I hear @PaigeMiller and @ballardw , something about this is unclear.  So lets start by saying why this code is good or bad.

 


data temp (Keep=date start	end);
  set have;
  /*do date="01Jan2015"d to "31dec2016"d; */

  *no,no,no. to keep the data sane let's do this... ;
  do date="01Feb2015"d,"31Jul2015"d,"01Aug2015"d,
          "31Aug2015"d,"01Sep2015"d,"31Oct2015"d,
          "01Nov2015"d;
    if start<=date  and (date<=end or missing(end)) then output;
  end;
  format date date9.;
run;

PROC SQL;
   CREATE TABLE want_ish AS  
   SELECT DISTINCT date, 
            (COUNT(start)) AS COUNT_of_start, 
            (COUNT(end)) AS COUNT_of_end, 
            (COUNT(start)) AS 'number active'n
      FROM TEMP 
      GROUP BY date;
QUIT;

Number active means number of starts.  It's to unclear, to me, what you're counting.

 

 

date COUNT_of_start COUNT_of_end number_active
01Feb2015 2 1 2
31Jul2015 2 1 2
01Aug2015 1 0 1
31Aug2015 1 0 1
01Sep2015 2 1 2
31Oct2015 2 1 2
01Nov2015 1 0 1

 

ciro
Quartz | Level 8

Let me try to be more clear.

 

In the original data, for each date from 1JAN2015 to 31DEC2016 I could calculate three variables like:

n_start=(date=start);

n_end=(date=end);

n_active=(start le date le coalesce(end,'31DEC2099'd));

 

and then:

1.summarize all variables(=365*2*3) over the 40millions observations.

2. transpose the resulting dataset in order to have a wanted dataset with (365*2*3) records identified by the date of the day and three variables 

I was wondering whether there were other ways to do this and which is the quickest way to obtain the result.

I hope it is clearer.

Thank you very much in advance for any suggestion.

 

 

 

 

 

ballardw
Super User

@ciro wrote:

Let me try to be more clear.

 

In the original data, for each date from 1JAN2015 to 31DEC2016 I could calculate three variables like:

n_start=(date=start);

n_end=(date=end);

n_active=(start le date le coalesce(end,'31DEC2099'd));

 

and then:

1.summarize all variables(=365*2*3) over the 40millions observations.

2. transpose the resulting dataset in order to have a wanted dataset with (365*2*3) records identified by the date of the day and three variables 

I was wondering whether there were other ways to do this and which is the quickest way to obtain the result.

I hope it is clearer.

Thank you very much in advance for any suggestion.

 


Example. Actual data, actual find parameters, expected results

The way you phrase this is VERY different than your initial question where you were listing starts and ends. This

n_start=(date=start);

Will create boolean values, 1 or 0. Is that what you expect for n_start?

 

And if your 365 in the number of variables means you have a different variable for every day of the year then ugly ugly ugly data model for many things.

 

ciro
Quartz | Level 8

the resulting dataset that I want is a dataset of aggregate data obtained by summing up the boolean values created in the micro data. it is like the one built by Philc .

 

my issue is how to build it starting from a very large dataset with an efficient procedure.

hope this helps clarify.

ciro
Quartz | Level 8

the following seems to work (on a smaller dataset).

 

data have2;
set have;
do day='1JAN2015'd to '31DEC2016'd;
n_start=(day=start);
n_end=(day=end);
n_active=(start le day le coalesce(end,'31DEC2099'd));
end;
if n_active=1 then output;*to write on the disk occuping less space;
end;
run;


proc means data=have2 nway noprint;
class day;
var n_start n_end n_active;
output out=want sum=;
run;

 

on 1000000 obs it takes about 5-6 minutes.

however the disk space occupied seems a lot. Maybe I should partition the problem via a macro loop or something like that. any advice on this route? 

 

Kurt_Bremser
Super User

Try this:

data have;
id=1; start='1FEB2015'd; end=.;output;
id=2; start='1FEB2015'd; end='31JUL2015'd;output;
id=3; start='1SEP2015'd; end='31OCT2015'd;output;
format start end yymmdd10.;
run;

data template;
format day yymmdd10.;
do day = '01jan2015'd to '31dec2016'd;
  output;
end;
run;

data active;
set have;
active = 1;
end = coalesce(end,'31dec2016'd);
do day = start to end;
  output;
end;
keep day active;
run;

proc summary data=active nway;
class day;
var active;
output out=actives (drop=_:) sum()=;
run;

proc summary data=have nway;
class start;
var id;
output out=starts (drop=_: rename=(id=starts start=day)) n()=;
run;

proc summary data=have nway;
class end;
var id;
output out=ends (drop=_: rename=(id=ends end=day)) n()=;
run;

data want;
merge
  template
  starts
  ends
  actives
;
by day;
run;

The longest step should be the SUMMARY of dataset active.

PaigeMiller
Diamond | Level 26

@ciro wrote:

Let me try to be more clear.

 

In the original data, for each date from 1JAN2015 to 31DEC2016 I could calculate three variables like:

n_start=(date=start);

n_end=(date=end);

n_active=(start le date le coalesce(end,'31DEC2099'd));

 

and then:

1.summarize all variables(=365*2*3) over the 40millions observations.

2. transpose the resulting dataset in order to have a wanted dataset with (365*2*3) records identified by the date of the day and three variables 

I was wondering whether there were other ways to do this and which is the quickest way to obtain the result.

I hope it is clearer.

 


Does not answer my question. Why are the dates being chosen to be 1JAN2015 to 31DEC2016, when you clearly presented data with other dates? And so because of this, the code you present is essentially meaningless.

--
Paige Miller
PhilC
Rhodochrosite | Level 12

Then I would change my solution to:

data temp (Keep=date n_start n_end n_active );
  set have;
  /*do date="01Jan2015"d to "31dec2016"d;*/

  *but... just to keep the data sane let do this... ;
  do date="01Feb2015"d,"31Jul2015"d,"01Aug2015"d,
          "31Aug2015"d,"01Sep2015"d,"31Oct2015"d,
          "01Nov2015"d;
    if start<=date  and (date<=end or missing(end)) then do
       n_start=(date=start);
       n_end=(date=end);
       n_active=1;
       output;
    end;
  end;
  format date date9.;
run;

PROC SQL;
   CREATE TABLE want_ish AS 
   SELECT DISTINCT date, 
            (SUM(n_start)) AS COUNT_of_start, 
            (Sum(n_end)) AS COUNT_of_end, 
            (SUM(n_active)) AS COUNT_of_active
      FROM TEMP 
      GROUP BY date;
QUIT;

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 2099 views
  • 4 likes
  • 5 in conversation