Hi All,
@hashman Frequently I find that limited memory prevents the hash object from running, but the data can still be processed through PROC SORT.
I have a very large dataset, and I'm trying to reduce its size by subsetting on the date the data was imported. The date variable is a string such as the following:
2018-04-03 06:40:28.738
I need to subset the dataset by date. If the dataset were smaller, I would only need a WHERE statement such as:
where trim(substr(date,7,4)) = "4-03";
But this takes a long time and sometimes crashes, not to mention that converting the original date variable to a datetime value was even slower or nearly impossible. Creating a new date-only variable on the raw dataset was also practically impossible. (If I could do that easily, I would not need to subset by date.)
I can create a hash object loaded from a dataset that contains only the date, but the original date variable holds date + time, so the keys will not match.
Below is a sample dataset with only the date variable:
data have;
input date $23.;
cards;
2018-04-03 03:44:18.728
2018-04-03 07:40:02.221
2018-04-03 09:20:20.135
2018-04-03 14:50:11.752
2018-04-03 02:42:17.005
2018-04-05 01:22:20.264
2018-04-05 04:45:49.402
2018-04-06 04:09:50.710
2018-04-07 04:12:31.623
2018-04-11 04:11:01.528
;
I have already tried dropping variables, but it did not help much, since the dataset is also very long (many observations).
I appreciate any help!
Part of the reason your WHERE filter is taking so long is that you are looking at a substring within a string variable, as in
where trim(substr(date,7,4)) = "4-03"
You would be better off converting your string into numeric variables, namely a numeric date and a numeric time.
Then your where filter can be a straight equality
where mydate='03apr2018'd;
which is likely to be much faster.
Or if you want April 3 for many years, you could use
where mydate in ('03apr2018'd,'03apr2019'd,.....);
Here's how to create such variables:
data have;
input mydate yymmdd10. +1 mytime time12.3;
format mydate date9. mytime time12.3;
cards;
2018-04-03 03:44:18.728
2018-04-03 07:40:02.221
2018-04-03 09:20:20.135
2018-04-03 14:50:11.752
2018-04-03 02:42:17.005
2018-04-05 01:22:20.264
2018-04-05 04:45:49.402
2018-04-06 04:09:50.710
2018-04-07 04:12:31.623
2018-04-11 04:11:01.528
run;
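Since your real data already sits in a SAS dataset with the character variable (called date in your sample), the equivalent one-time conversion of the existing dataset could look roughly like this (just a sketch; the name have_num is made up):
data have_num;
set have;
* split the character value into a numeric date and a numeric time ;
mydate = input(substr(date, 1, 10), yymmdd10.);
mytime = input(substr(date, 12), time12.);
format mydate date9. mytime time12.3;
run;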
So I would consider making variables more friendly to where-filters before you choose to break up the data set.
In addition, you could create an index on the variable, which would likely further speed up where-filter performance.
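For example (a sketch that assumes the numeric mydate from the conversion above):
proc datasets library=work nolist;
modify have_num;
index create mydate;
run;
quit;
A filter such as where mydate = '03apr2018'd; can then use the index, which pays off when the subset is small relative to the whole dataset.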
If you are just subsetting a dataset on disk, then I don't think a hash table will help: all the data has to be read off disk anyway to see whether it satisfies your selection criteria, and that is the slowest part of any processing.
What do you want to do with the data once it is subset? This is probably the key to deciding which processing technique to use. If you want it sorted and/or deduplicated, or you need some other DATA step processing, then a hash could be useful.
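For instance, if you also needed to keep just one record per calendar day while subsetting, a hash sketch could look like this (the dedup requirement and the name april_daily are my own assumptions):
data april_daily;
length day $10;
if _n_ = 1 then do;
dcl hash seen();
seen.definekey('day');
seen.definedone();
end;
set have;
where date like '2018-04%';
day = substr(date, 1, 10);
* keep only the first record seen for each day ;
if seen.check() ne 0;
rc = seen.add();
drop rc day;
run;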
"But this is taking a long time and sometime crashes, not to mention putting a datetime format on the original date variable were even slower or nearly impossible."
Given how simple your WHERE clause is, I'm asking myself whether the real problem isn't just reading this source table at all. How big is it actually, in GB? And are you reading the data over a slow or unstable network connection?
For subsetting using the string: why not also use the year? I'd expect this to perform better than using substr() first.
data want;
set have(where=(date like '2018-04-03%'));
run;
You could also try PROC APPEND:
proc datasets lib=work nolist nowarn;
delete want;
run;
append base=want data=have(where=(date like '2018-04-03%'));
run;
quit;
proc print;
run;
What you are asking for is easy to do by assigning an appropriate expression to the KEY argument tag when the CHECK method is called, for example:
data have ;
input @1 date $23. x y :$1. z :$3. ;
cards ;
2018-04-03 03:44:18.728 1 A A01
2018-04-03 07:40:02.221 2 B B02
2018-05-03 09:20:20.135 3 C C03
2018-06-03 14:50:11.752 4 D D04
2018-07-03 02:42:17.005 5 E E05
2018-08-05 01:22:20.264 6 F F06
2018-01-06 04:45:49.402 7 G G07
2018-11-06 04:09:50.710 8 H H08
2018-07-07 04:12:31.623 9 I I09
2018-12-11 04:11:01.528 10 J J10
;
run ;
data dates ;
input m_yy $4. ;
cards ;
4-03
1-06
;
run ;
data want (drop = m_yy) ;
if _n_ = 1 then do ;
if 0 then set dates ;
dcl hash h (dataset: "dates") ;
h.definekey ("m_yy") ;
h.definedone () ;
end ;
set have ;
if h.check (key: put (substr (date, 7), $4.)) = 0 ;
run ;
However, from what you've said, I suspect this is not the real snag. If you run into a problem just applying a WHERE clause to the input, hash code akin to the above isn't going to help you much. Most likely, your input has a "tail" of satellite variables, much more numerous than the X Y Z I've included for the sake of a demo; and though WHERE moves only the records that qualify from the buffer to the PDV, it does not relieve the I/O burden created by the satellites enough. With hash code like the above, things are even worse, because every record gets moved from the buffer into the PDV before the unwanted ones are discarded by the subsetting IF.
Hence, the strategy needs to be adjusted depending on the nature of your input data. Just to outline two extreme scenarios, suppose that the records you end up selecting constitute:
1) a tiny fraction of the input, or
2) nearly all of the input.
In both cases, you have very few records to either (1) filter in or (2) filter out. And in both cases, it makes sense to first identify those records by observation number with as little computer-resource pain as possible and then (1) employ some tactic to get only the records you need or (2) mark those you don't want as deleted. Either way, at this stage it makes sense to drop all the satellite variables, read in only the key, and apply the subsetting criteria to it, so that you end up with a list of record IDs (i.e. the observation numbers) you either (1) want or (2) don't want.
Let's first look at #1:
data want (drop = m_yy) ;
* hash to lookup m_yy ;
dcl hash h (dataset: "dates") ;
h.definekey ("m_yy") ;
h.definedone () ;
dcl hash r () ;
* hash to store filtered-in RIDs ;
r.definekey ("rid") ;
r.definedone () ;
dcl hiter ir ("r") ;
* find needed RIDs ;
do rid = 1 by 1 until (lr) ;
* KEEP is critical in SET below ;
set have (keep = date) end = lr ;
if h.check (key: put (substr (date, 7), $4.)) = 0 then r.add() ;
end ;
* select only records with RIDs in hash R from HAVE ;
do while (ir.next() = 0) ;
set have point = rid ;
output ;
end ;
stop ;
set dates ;
run ;
In the extreme scenario #2, you don't want any record with M_YY in file DATES:
data discard (keep = rid) ;
if _n_ = 1 then do ;
if 0 then set dates ;
dcl hash h (dataset: "dates") ;
h.definekey ("m_yy") ;
h.definedone () ;
end ;
* KEEP is critical in SET below ;
set have (keep = date) ;
* find UNneeded RIDs ;
if h.check (key: put (substr (date, 7), $4.)) = 0 ;
rid = _n_ ;
run ;
data have ;
set discard ;
modify have point = rid ;
remove ;
run ;
In this case, you just make a list of the unwanted RIDs in the first step (which reads nothing but the key) and, in the second step, use that list to mark the respective records in the data set HAVE itself as "deleted". This way, you (a) still have never read anything from HAVE except the key and (b) have never written out a huge data set with all the satellites and only a few records discarded. You have merely marked the unwanted records as "deleted" in HAVE. So, in your program downstream you just read the data set HAVE, and all the records marked for deletion will be automatically ignored.
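As a small illustration of that last point (just a sketch): PROC CONTENTS reports such records under "Deleted Observations", and any downstream step reads HAVE as usual:
proc contents data = have ;
run ;
* records marked for deletion are skipped automatically ;
data downstream ;
set have ;
run ;
The disk space is reclaimed only when HAVE is eventually rewritten, e.g. by PROC SORT or a DATA step copy.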
Of course, there are other scenarios in between these two extremes. But you should be getting the drift. When you deal with voluminous data, it's not always clear-cut, and one has to be inventive; it's an art as much as a science. At times, to engineer a successful ETL, one needs to do a distribution analysis on the keys only first and then write a dynamic program smart enough to choose a subsetting tactic based on that distribution.
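A minimal sketch of such a key-only distribution analysis, reusing the M_YY key from the demo above:
* profile the key only: how many records per M_YY? ;
data keys (keep = m_yy) ;
set have (keep = date) ;
m_yy = put (substr (date, 7), $4.) ;
run ;
proc freq data = keys order = freq ;
tables m_yy ;
run ;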
Kind regards
Paul D.
Though the file is huge, try:
data small;
set have(keep=date);
seq = _N_;
keep seq date;
run;
data obs2keep(keep=seq);
set small;
if not (substr(date, .......)) then seq = .;
run;
data want(drop=seq);
merge obs2keep have;
if seq;
run;
This way, in the final step you still have to read HAVE in its entirety with all its variables, while the OP says this is what causes the job to either run forever or crash.
I would like to quote a few sentences:
@lydiawawa wrote:
But this takes a long time and sometimes crashes, not to mention
that converting the original date variable to a datetime value was even slower or nearly impossible.
Creating a new date-only variable on the raw dataset was also practically impossible.
(If I could do that easily, I would not need to subset by date.)
and
Frequently I find that limited memory prevents the hash object from running, but the data can still be processed through PROC SORT.
If the sort can be done, I understand that:
1) There is no problem reading the whole file.
2) There is enough space to hold a small file.
@hashman, according to the above, my suggested code reads the huge file twice;
I believe it can run fast enough without crashing.
It seems to me that @Kurt_Bremser is right about the possible reasons for a crash. Worth a try.
Start by reducing your dataset size considerably: convert the overlong string to a datetime, or to separate date and time variables. A date can be stored with a length of 4, and so can a time (if you do not have fractions of a second!); a datetime can be stored in one 8-byte number (with precision down to microseconds!).
Since proc sort still works, creating an index on your variables will also work.
If you need to subset for specific days in a year, create a new month/day string from your date (it will only need 4 bytes as well) and create an index on that. Indexes work well when you use them to retrieve a small subset.
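A rough sketch of that one-time conversion, assuming the character variable is named date as in the sample (have_slim and md are names I made up):
data have_slim;
length mydate 4 md $4;
set have;
mydate = input(substr(date, 1, 10), yymmdd10.);
mytime = input(substr(date, 12), time12.); * fractions of a second are present, so leave mytime at length 8 ;
md = cats(put(month(mydate), z2.), put(day(mydate), z2.)); * e.g. 0403 for April 3 ;
format mydate date9. mytime time12.3;
drop date;
run;
proc datasets library=work nolist;
modify have_slim;
index create md;
run;
quit;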
If a where condition crashes, you have other problems (WHERE processes each observation individually, so it does not need more memory than the size of one observation). Either you run out of space writing the subset, or you run into problems reading the input caused by a bad storage setup (network share, unreliable connections). Always keep the datasets you work with on local disks or on disks attached through a reliable SAN.
What are the symptoms of your crashes? Please supply the respective log.
@lydiawawa wrote:
The crash is usually caused by the former: running out of space writing the subset, and it depends on how many people are running the program at once. Currently it is not crashing, but the average run time of subsetting the dataset by date (through an indexed WHERE) is around 9 to 10 minutes.
"Create a new month/day string": creating a new variable requires formatting the value first, and it seems the time to create and index it is about the same as just indexing the original variable.
How big is your dataset, anyway? Either give us the filesize in GB, or the number of observations and the observation size.
Creating a new variable that speeds up an operation is a one-time process upfront; since you seem to run the subset repeatedly, performance gained will outweigh the effort spent in the long run. But first we need to know your situation in depth.
I just ran a test with a fake dataset:
data have;
format date yymmddd10.;
string = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"; * add some meat;
do i = 1 to 10000000;
date = rand('integer','01jan1960'd,today());
datestring = put(date,yymmddd10.);
md = cats(put(month(date),z2.),put(day(date),z2.));
output;
end;
run;
and created a subset twice:
data want1;
set have;
where substr(datestring,7,4) = '4-03';
run;
data want2;
set have;
where md = '0403';
run;
The run times of the two data steps were 32 and 39 seconds, so the difference is not that big and could be caused by simultaneous load on the server.
Then I created indexes:
proc datasets library=work;
modify have;
index create datestring;
index create md;
run;
quit;
and reran the two subsets, and found no real difference to before (both runs now ~35 seconds). If the probability of a certain value appearing in any given dataset page is high, the index will not speed up the process, and will often slow it down. Only when the index prevents reading a substantial number of dataset pages will it really help.
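One quick way to see in the log whether an index is actually chosen (want3 is just a scratch name):
options msglevel=i; * SAS then writes INFO notes about index selection for WHERE clauses ;
data want3;
set have;
where md = '0403';
run;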
The dataset size is about 407833 million bytes. Page size: 65536
400 Gigabytes? Then the 10 minutes are in fact very fast: 407,833 million bytes is roughly 408 GB, and 408 GB in 600 seconds works out to about 0.68 GB/s, i.e. you read 1 GB in about 1.5 seconds.