turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- BI
- /
- Enterprise Guide
- /
- how to sort and filter large amount of data with l...

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-19-2016 04:37 AM

suppose I have a data which is quite large both in volums and variables.

All i want to do one simple process...

Filter unique dates in ascending order (from date time stamp) for each of the variables in a group, say V1.

To help understand the above, first let look at some sample data (subset for only two vars TPMC and DateTimeStamp).

TPMC Date

7101 01Jan2016 15:37

1011 03Jan2016 01:23

1001 02Jan2016 06:44

2012 01Jan2016 18:54

3012 01Jan2016 12:23

3101 02Jan2016 15:37

5011 01Jan2016 01:23

6001 02Jan2016 06:44

1012 03Jan2016 18:54

6012 02Jan2016 12:23

Now what I want is to look like something as below:

TPMC | Date |

1001 | 02Jan201606:44 |

1011 | 03Jan201601:23 |

1012 | 03Jan201618:54 |

2012 | 01Jan201618:54 |

3012 | 01Jan201612:23 |

3101 | 02Jan201615:37 |

5011 | 01Jan201601:23 |

6001 | 02Jan201606:44 |

6012 | 02Jan201612:23 |

7101 | 01Jan201615:37 |

Now if the dataset is small it can be done using **proc sql **in no time, but if dataset is huge (to the tune of 200 million), time consumed is very large.

Is there any other method to run the process without consuming the much resources?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-19-2016 04:54 AM

If you need sorted data, you have to sort. Period.

If you only want two variables, do

```
proc sort
data=have (keep=tpmc date)
out=want
;
by tpmc /*date*/;
run;
```

You will have to read the big dataset sequentially once (no way around that), but 200 million * 12 bytes isn't that much anymore, so the sort should proceed quite quickly after the initial read.

---------------------------------------------------------------------------------------------

Maxims of Maximally Efficient SAS Programmers

Maxims of Maximally Efficient SAS Programmers

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-19-2016 11:20 AM

I can give you an experiment to try, but there is no guarantee that it will help.

Some sorting algorithms contain a component (when estimating CPU time to complete) that is proportional to the square of the number of observations. It MIGHT help to split up the data in an intelligent way. For example:

data subset1 subset2 subset3 subset4 subset5;

set have;

if TPMC in : ('0', '1') then output subset1;

else if TPMC in : ('2', '3') then output subset2;

else if TPMC in : ('4', '5') then output subset3;

else if TPMC in : ('6', '7') then output subset4;

else output subset5;

run;

Then try sorting one of the subsets:

proc sort data=subset3;

by TPMC date;

run;

(This example assumes that TPMC is character.)

Examine the CPU time, and estimate the CPU time to complete all necessary steps as 2 x the DATA step time, plus 5 x the SORT time.