11-19-2016 04:37 AM
suppose I have a data which is quite large both in volums and variables.
All i want to do one simple process...
Filter unique dates in ascending order (from date time stamp) for each of the variables in a group, say V1.
To help understand the above, first let look at some sample data (subset for only two vars TPMC and DateTimeStamp).
7101 01Jan2016 15:37
1011 03Jan2016 01:23
1001 02Jan2016 06:44
2012 01Jan2016 18:54
3012 01Jan2016 12:23
3101 02Jan2016 15:37
5011 01Jan2016 01:23
6001 02Jan2016 06:44
1012 03Jan2016 18:54
6012 02Jan2016 12:23
Now what I want is to look like something as below:
Now if the dataset is small it can be done using proc sql in no time, but if dataset is huge (to the tune of 200 million), time consumed is very large.
Is there any other method to run the process without consuming the much resources?
11-19-2016 04:54 AM
If you need sorted data, you have to sort. Period.
If you only want two variables, do
proc sort data=have (keep=tpmc date) out=want ; by tpmc /*date*/; run;
You will have to read the big dataset sequentially once (no way around that), but 200 million * 12 bytes isn't that much anymore, so the sort should proceed quite quickly after the initial read.
11-19-2016 11:20 AM
I can give you an experiment to try, but there is no guarantee that it will help.
Some sorting algorithms contain a component (when estimating CPU time to complete) that is proportional to the square of the number of observations. It MIGHT help to split up the data in an intelligent way. For example:
data subset1 subset2 subset3 subset4 subset5;
if TPMC in : ('0', '1') then output subset1;
else if TPMC in : ('2', '3') then output subset2;
else if TPMC in : ('4', '5') then output subset3;
else if TPMC in : ('6', '7') then output subset4;
else output subset5;
Then try sorting one of the subsets:
proc sort data=subset3;
by TPMC date;
(This example assumes that TPMC is character.)
Examine the CPU time, and estimate the CPU time to complete all necessary steps as 2 x the DATA step time, plus 5 x the SORT time.