BookmarkSubscribeRSS Feed
imanojkumar1
Quartz | Level 8

suppose I have a data which is quite large both in volums and variables.

 

All i want to do one simple process...

 

Filter unique dates in ascending order (from date time stamp) for each of the variables in a group, say V1.

 

To help understand the above, first let look at some sample data (subset for only two vars TPMC and DateTimeStamp).

 

TPMC Date
7101 01Jan2016 15:37
1011 03Jan2016 01:23
1001 02Jan2016 06:44
2012 01Jan2016 18:54
3012 01Jan2016 12:23
3101 02Jan2016 15:37
5011 01Jan2016 01:23
6001 02Jan2016 06:44
1012 03Jan2016 18:54
6012 02Jan2016 12:23

 

Now what I want is to look like something as below:

 

TPMCDate
100102Jan201606:44
101103Jan201601:23
101203Jan201618:54
201201Jan201618:54
301201Jan201612:23
310102Jan201615:37
501101Jan201601:23
600102Jan201606:44
601202Jan201612:23
710101Jan201615:37

 

 

Now if the dataset is small it can be done using proc sql in no time, but if dataset is huge (to the tune of 200 million), time consumed is very large.

 

Is there any other method to run the process without consuming the much resources?

2 REPLIES 2
Kurt_Bremser
Super User

If you need sorted data, you have to sort. Period.

If you only want two variables, do

proc sort
  data=have (keep=tpmc date)
  out=want
;
by tpmc /*date*/;
run;

You will have to read the big dataset sequentially once (no way around that), but 200 million * 12 bytes isn't that much anymore, so the sort should proceed quite quickly after the initial read.

Astounding
PROC Star

I can give you an experiment to try, but there is no guarantee that it will help.

 

Some sorting algorithms contain a component (when estimating CPU time to complete) that is proportional to the square of the number of observations.  It MIGHT help to split up the data in an intelligent way.  For example:

 

data subset1 subset2 subset3 subset4 subset5;

set have;

if TPMC in : ('0', '1') then output subset1;

else if TPMC in : ('2', '3') then output subset2;

else if TPMC in : ('4', '5') then output subset3;

else if TPMC in : ('6', '7') then output subset4;

else output subset5;

run;

 

Then try sorting one of the subsets:

 

proc sort data=subset3;

by TPMC date;

run;

 

(This example assumes that TPMC is character.)

 

Examine the CPU time, and estimate the CPU time to complete all necessary steps as 2 x the DATA step time, plus 5 x the SORT time.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 2262 views
  • 3 likes
  • 3 in conversation