Desktop productivity for business analysts and programmers

how to sort and filter large amount of data with less memory resources consumption

Reply
Frequent Contributor
Posts: 87

how to sort and filter large amount of data with less memory resources consumption

suppose I have a data which is quite large both in volums and variables.

 

All i want to do one simple process...

 

Filter unique dates in ascending order (from date time stamp) for each of the variables in a group, say V1.

 

To help understand the above, first let look at some sample data (subset for only two vars TPMC and DateTimeStamp).

 

TPMC Date
7101 01Jan2016 15:37
1011 03Jan2016 01:23
1001 02Jan2016 06:44
2012 01Jan2016 18:54
3012 01Jan2016 12:23
3101 02Jan2016 15:37
5011 01Jan2016 01:23
6001 02Jan2016 06:44
1012 03Jan2016 18:54
6012 02Jan2016 12:23

 

Now what I want is to look like something as below:

 

TPMCDate
100102Jan201606:44
101103Jan201601:23
101203Jan201618:54
201201Jan201618:54
301201Jan201612:23
310102Jan201615:37
501101Jan201601:23
600102Jan201606:44
601202Jan201612:23
710101Jan201615:37

 

 

Now if the dataset is small it can be done using proc sql in no time, but if dataset is huge (to the tune of 200 million), time consumed is very large.

 

Is there any other method to run the process without consuming the much resources?

Esteemed Advisor
Posts: 6,698

Re: how to sort and filter large amount of data with less memory resources consumption

If you need sorted data, you have to sort. Period.

If you only want two variables, do

proc sort
  data=have (keep=tpmc date)
  out=want
;
by tpmc /*date*/;
run;

You will have to read the big dataset sequentially once (no way around that), but 200 million * 12 bytes isn't that much anymore, so the sort should proceed quite quickly after the initial read.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Respected Advisor
Posts: 4,997

Re: how to sort and filter large amount of data with less memory resources consumption

I can give you an experiment to try, but there is no guarantee that it will help.

 

Some sorting algorithms contain a component (when estimating CPU time to complete) that is proportional to the square of the number of observations.  It MIGHT help to split up the data in an intelligent way.  For example:

 

data subset1 subset2 subset3 subset4 subset5;

set have;

if TPMC in : ('0', '1') then output subset1;

else if TPMC in : ('2', '3') then output subset2;

else if TPMC in : ('4', '5') then output subset3;

else if TPMC in : ('6', '7') then output subset4;

else output subset5;

run;

 

Then try sorting one of the subsets:

 

proc sort data=subset3;

by TPMC date;

run;

 

(This example assumes that TPMC is character.)

 

Examine the CPU time, and estimate the CPU time to complete all necessary steps as 2 x the DATA step time, plus 5 x the SORT time.

Ask a Question
Discussion stats
  • 2 replies
  • 191 views
  • 3 likes
  • 3 in conversation