create daily data from large file

ciro · Posted 01-19-2021 11:06 AM

Hi,

i have a big dataset (about 40 millions obs) concerning some items.

For each item I have a starting date and an end date (if the item has expired,otherwise it is missing).

the data set is as following;

data have;
id=1; start='1FEB2015'd; end=.;output;
id=2; start='1FEB2015'd; end='31JUL2015'd;output;
id=3; start='1SEP2015'd; end='31OCT2015'd;output;

format start end date9.;
run;

I need a dataset with a record per each day, say from 1 Jan 2015 to 31 dec 2016.

for each day I need information on three variables (number of starts, number of ends and number of active items).

As the dataset is very big, is there any other way other than creating a flag (1/0) for each day and each of the three variables (that is over 2000 variables) and then summarize by days?

I imagine I can partition the problem but what is a good way?

Any suggestion on the procedure to write, the coding to create this kind of variables,and how to summarize is much appreciated.

PaigeMiller · Posted 01-19-2021 11:49 AM

data have;
id=1; start='1FEB2015'd; end=.;output;
id=2; start='1FEB2015'd; end='31JUL2015'd;output;
id=3; start='1SEP2015'd; end='31OCT2015'd;output;

format start end date9.;
run;

I need a dataset with a record per each day, say from 1 Jan 2015 to 31 dec 2016.

How does the sentence "I need a dataset with a record per each day, say from 1 Jan 2015 to 31 dec 2016" relate to the data provided above, where 31DEC2016 is not mentioned?

--
Paige Miller

ballardw · Posted 01-19-2021 12:58 PM

You need to provide details on just how you need to compare Start and End with those two other values.

You have a number of cases to potentially address

lowerlimit < start < end < upperlimit <= obviously complicated by end Missing

lowerlimit < start < upperlimit <= doesn't use end at all

lowerlimit < end< upperlimit <= doesn't use start at all but the end missing is an issue

So describe in detail which of these you want and how the End=missing is to be treated

PhilC · Posted 01-19-2021 03:23 PM

I hear @PaigeMiller and @ballardw , something about this is unclear. So lets start by saying why this code is good or bad.


data temp (Keep=date start	end);
  set have;
  /*do date="01Jan2015"d to "31dec2016"d; */

  *no,no,no. to keep the data sane let's do this... ;
  do date="01Feb2015"d,"31Jul2015"d,"01Aug2015"d,
          "31Aug2015"d,"01Sep2015"d,"31Oct2015"d,
          "01Nov2015"d;
    if start<=date  and (date<=end or missing(end)) then output;
  end;
  format date date9.;
run;

PROC SQL;
   CREATE TABLE want_ish AS  
   SELECT DISTINCT date, 
            (COUNT(start)) AS COUNT_of_start, 
            (COUNT(end)) AS COUNT_of_end, 
            (COUNT(start)) AS 'number active'n
      FROM TEMP 
      GROUP BY date;
QUIT;

Number active means number of starts. It's to unclear, to me, what you're counting.

date	COUNT_of_start	COUNT_of_end	number_active
01Feb2015	2	1	2
31Jul2015	2	1	2
01Aug2015	1	0	1
31Aug2015	1	0	1
01Sep2015	2	1	2
31Oct2015	2	1	2
01Nov2015	1	0	1

ciro · Posted 01-19-2021 05:31 PM

Let me try to be more clear.

In the original data, for each date from 1JAN2015 to 31DEC2016 I could calculate three variables like:

n_start=(date=start);

n_end=(date=end);

n_active=(start le date le coalesce(end,'31DEC2099'd));

and then:

1.summarize all variables(=365*2*3) over the 40millions observations.

2. transpose the resulting dataset in order to have a wanted dataset with (365*2*3) records identified by the date of the day and three variables

I was wondering whether there were other ways to do this and which is the quickest way to obtain the result.

I hope it is clearer.

Thank you very much in advance for any suggestion.

ballardw · Posted 01-20-2021 02:08 AM

@ciro wrote:

Let me try to be more clear.

In the original data, for each date from 1JAN2015 to 31DEC2016 I could calculate three variables like:

n_start=(date=start);

n_end=(date=end);

n_active=(start le date le coalesce(end,'31DEC2099'd));

and then:

1.summarize all variables(=365*2*3) over the 40millions observations.

2. transpose the resulting dataset in order to have a wanted dataset with (365*2*3) records identified by the date of the day and three variables

I was wondering whether there were other ways to do this and which is the quickest way to obtain the result.

I hope it is clearer.

Thank you very much in advance for any suggestion.

Example. Actual data, actual find parameters, expected results

The way you phrase this is VERY different than your initial question where you were listing starts and ends. This

n_start=(date=start);

Will create boolean values, 1 or 0. Is that what you expect for n_start?

And if your 365 in the number of variables means you have a different variable for every day of the year then ugly ugly ugly data model for many things.

ciro · Posted 01-20-2021 04:16 AM

the resulting dataset that I want is a dataset of aggregate data obtained by summing up the boolean values created in the micro data. it is like the one built by Philc .

my issue is how to build it starting from a very large dataset with an efficient procedure.

hope this helps clarify.

ciro · Posted 01-20-2021 05:46 AM

the following seems to work (on a smaller dataset).

data have2;
set have;
do day='1JAN2015'd to '31DEC2016'd;
n_start=(day=start);
n_end=(day=end);
n_active=(start le day le coalesce(end,'31DEC2099'd));
end;
if n_active=1 then output;*to write on the disk occuping less space;
end;
run;

proc means data=have2 nway noprint;
class day;
var n_start n_end n_active;
output out=want sum=;
run;

on 1000000 obs it takes about 5-6 minutes.

however the disk space occupied seems a lot. Maybe I should partition the problem via a macro loop or something like that. any advice on this route?

Kurt_Bremser · Posted 01-20-2021 07:29 AM

Try this:

data have;
id=1; start='1FEB2015'd; end=.;output;
id=2; start='1FEB2015'd; end='31JUL2015'd;output;
id=3; start='1SEP2015'd; end='31OCT2015'd;output;
format start end yymmdd10.;
run;

data template;
format day yymmdd10.;
do day = '01jan2015'd to '31dec2016'd;
  output;
end;
run;

data active;
set have;
active = 1;
end = coalesce(end,'31dec2016'd);
do day = start to end;
  output;
end;
keep day active;
run;

proc summary data=active nway;
class day;
var active;
output out=actives (drop=_:) sum()=;
run;

proc summary data=have nway;
class start;
var id;
output out=starts (drop=_: rename=(id=starts start=day)) n()=;
run;

proc summary data=have nway;
class end;
var id;
output out=ends (drop=_: rename=(id=ends end=day)) n()=;
run;

data want;
merge
  template
  starts
  ends
  actives
;
by day;
run;

The longest step should be the SUMMARY of dataset active.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PaigeMiller · Posted 01-20-2021 06:36 AM

@ciro wrote:

Let me try to be more clear.

In the original data, for each date from 1JAN2015 to 31DEC2016 I could calculate three variables like:

n_start=(date=start);

n_end=(date=end);

n_active=(start le date le coalesce(end,'31DEC2099'd));

and then:

1.summarize all variables(=365*2*3) over the 40millions observations.

2. transpose the resulting dataset in order to have a wanted dataset with (365*2*3) records identified by the date of the day and three variables

I was wondering whether there were other ways to do this and which is the quickest way to obtain the result.

I hope it is clearer.

Does not answer my question. Why are the dates being chosen to be 1JAN2015 to 31DEC2016, when you clearly presented data with other dates? And so because of this, the code you present is essentially meaningless.

--
Paige Miller

PhilC · Posted 01-20-2021 08:43 AM

Then I would change my solution to:

data temp (Keep=date n_start n_end n_active );
  set have;
  /*do date="01Jan2015"d to "31dec2016"d;*/

  *but... just to keep the data sane let do this... ;
  do date="01Feb2015"d,"31Jul2015"d,"01Aug2015"d,
          "31Aug2015"d,"01Sep2015"d,"31Oct2015"d,
          "01Nov2015"d;
    if start<=date  and (date<=end or missing(end)) then do
       n_start=(date=start);
       n_end=(date=end);
       n_active=1;
       output;
    end;
  end;
  format date date9.;
run;

PROC SQL;
   CREATE TABLE want_ish AS 
   SELECT DISTINCT date, 
            (SUM(n_start)) AS COUNT_of_start, 
            (Sum(n_end)) AS COUNT_of_end, 
            (SUM(n_active)) AS COUNT_of_active
      FROM TEMP 
      GROUP BY date;
QUIT;

create daily data from large file

Re: create daily data from large file

Re: create daily data from large file

Re: create daily data from large file

Re: create daily data from large file

Re: create daily data from large file

Re: create daily data from large file

Re: create daily data from large file

Re: create daily data from large file

Re: create daily data from large file

Re: create daily data from large file

Registration is open

Call for Content EXTENDED

Registration is open

Call for Content EXTENDED

SAS Training: Just a Click Away