Edoedoedo
Pyrite | Level 9

Hi,

 

I have a huge dataset (let's name it HUGE). HUGE has:

- 500 million rows

- a composite unique primary key, let's name it (K1,K2) (two columns belong to the key)

- some other columns in addition to the key

 

Next I have a small dataset (let's name it SMALL). SMALL has:

- 200k rows

- exactly the same composite unique primary key (K1,K2)

- no more columns in addition to the key.

 

I.e.:

- HUGE: K1 | K2 | FIELD1 | ANOTHERFIELD2 | SOMEOTHERFIELD3 | ETC, UPK(K1,K2)

- SMALL: K1 | K2, UPK(K1,K2)

 

I need to delete from HUGE all the rows whose key appears in SMALL, so that after this process HUGE and SMALL do not share any common key.

 

I first tried PROC SQL with DELETE FROM, but I couldn't work out the right syntax... On a DBMS I'd have written:

DELETE FROM HUGE WHERE (K1,K2) IN (SELECT K1,K2 FROM SMALL)

but SAS does not allow this syntax.

 

Next I tried with this:

data HUGE;
  merge HUGE(in=in1) SMALL(in=in2);
  by K1 K2;
  if (in1 and in2) then delete;
run;

 

This works, but it takes almost a day to complete due to HUGE's size.

 

What would you recommend?

 

Thanks a lot

Regards

35 REPLIES
KachiM
Rhodochrosite | Level 12

Try a hash solution. SMALL goes into a hash table, and HUGE is processed one record at a time; any matching record is dropped.

 

data want;
  if _n_ = 1 then do;
    if 0 then set SMALL;              /* define K1/K2 host variables */
    declare hash h(dataset:'SMALL');  /* load the small key list     */
    h.definekey('K1', 'K2');
    h.definedone();
  end;
  set HUGE;
  if h.find() ^= 0;                   /* keep only rows not in SMALL */
run;

 

PGStats
Opal | Level 21

It's hard to know beforehand what will be optimized. Try:

 

proc sql;
  delete from HUGE as H
  where exists (select K1, K2 from SMALL where K1=H.K1 and K2=H.K2);
quit;
PG
LinusH
Tourmaline | Level 20
Hey @KachiM, just because hash methods can be quick in some situations due to in-memory techniques doesn't mean they are suitable for every problem.
The trick is not to access the whole table, that's why we invented indexes.
I blame @Ksharp for this overuse of hashing 😆
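For illustration, a composite index on the key can be built up front so that WHERE- or KEY-based deletes can avoid a full table scan. A minimal sketch (assuming the key columns are K1 and K2, the data set lives in WORK, and the index name K12 is an arbitrary choice):

proc datasets library=work nolist;
  modify HUGE;
  index create K12=(K1 K2);  /* composite index on the primary key */
quit;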
Data never sleeps
KachiM
Rhodochrosite | Level 12

Hi LinusH:

 

I do not understand your comment.

 

To me, choosing a hash, an array, or any other SAS technique hinges on minimizing run time, memory, or both. Since the data types of K1 and K2 were not known, I used a hash object; if they were numeric, I would have solved the problem with an array.

 

I would suggest that the superiority of a method must be empirically verified before offering comments.

 

Regards

PGStats
Opal | Level 21

It's not always about minimizing "run-time or memory or both" but often about minimizing data-transfer times. A hash solution will always require transferring whole table(s) into memory, and that can take a long time, which is what @LinusH was talking about. Optimal procedures are the best compromise within the constraints set by CPU, memory, and network resources.

PG
KachiM
Rhodochrosite | Level 12

I should have mentioned I/O time in addition to run time.

Ksharp
Super User
@LinusH
I enjoy using hash tables. They seem to be my last resort when I face a tough question and don't know what else to do.
I think indexes should be retired.   *^_^*


LinusH
Tourmaline | Level 20
Usually in a data warehouse you don't delete any records; you just mark certain records as no longer valid, perhaps using end dates.

That said, as an alternative to SQL DELETE FROM (if you can't get the speed up in that scenario), you should be able to use MODIFY with REMOVE in a data step. I have never used it, but it should be quite fast since it will use your index.
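A minimal sketch of that MODIFY/REMOVE approach (untested here; it assumes HUGE carries a composite index named K12 on (K1, K2)):

data HUGE;
  set SMALL;                             /* drive the step with the small key list */
  modify HUGE key=K12;                   /* indexed lookup into HUGE               */
  if _iorc_ = %sysrc(_sok) then remove;  /* key found: delete the row in place     */
  else _error_ = 0;                      /* key absent: clear the error flag       */
run;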
Data never sleeps
ChrisNZ
Tourmaline | Level 20

As mentioned (sorry if I repeat some of the valid points made before), the key to speed for this type of operation is not reading the whole table and writing it out again.

 

So no

proc sql ; create ...

or

data HUGE; set/merge HUGE;

 

Instead, modifying the table in place is the way to go.

The data step's modify statement and proc sql's delete statement are prime candidates for this. They allow "soft deletion", where the observation is still there (hence no rewriting of the whole data set) but is marked as deleted; proc contents shows the deleted-observation count.

 

Sadly, proc sql doesn't use the index for the syntax that @PGStats proposes (I can't find a better way; any other proposal?), but the modify statement does, and is hence very fast.

 

 

data HUGE(index=(A=(I J))) ;
  array A [50];
  do I=1 to 50e6; 
    J=I;
    output;
  end;
run;
data SMALL;
  do I=1 to 50e6; 
    J=I;
    if ranuni(0)> .9999 then output;
  end;
run;

options msglevel=i;

%* MODIFY statement: 0.5 seconds;
data HUGE;
  set SMALL ;
  modify HUGE key=A;
  if _iorc_=%sysrc(_sok) then remove;
  else _error_=0;
run;
proc contents data=HUGE; 
run;

%* MERGE statement: 110 seconds;
data HUGE;
  merge HUGE SMALL(in=BAD);
  by I J;
  if not BAD;
run;
 
%* DELETE SQL statement: hours?;
proc sql;
   delete from HUGE as H
   where exists (select I, J from SMALL where I=H.I and J=H.J);
quit;

proc contents data=HUGE; 
run;

 

 

More benchmarks and discussions about performance in https://www.amazon.com/High-Performance-SAS-Coding-Christian-Graffeuille/dp/1512397490

PGStats
Opal | Level 21

If table HUGE has a single-field UNIQUE index, it seems you can get decent performance out of proc SQL by translating your two-field key into the single key and using IN instead of EXISTS. The only reason I see for going to the trouble of doing this is the rollback ability of SQL, which isn't provided by data step MERGE or MODIFY operations. I modified @ChrisNZ's test code as follows:

 

options msglevel=i;

data HUGE(index=(A=(I J) ID/unique)) ;
  array A [50];
  do I=1 to 10e6; 
    J=I;
    ID + 1;
    output;
  end;
run;

data SMALL;
  do I=1 to 10e6; 
    J=I;
    if ranuni(0)> .9999 then output;
  end;
run;

proc sql /* undo_policy=none */ ;
/* Optimized with index A */
create table IDS as 
select H.ID 
from HUGE as H inner join SMALL as S on H.i=S.i and H.j=S.j; 
/* No mention of index ID, but runs in a couple of minutes */
delete from HUGE as H
where ID in (select ID from IDS);
quit; 

In my tests, adding undo_policy=none didn't improve performance.

PG
ChrisNZ
Tourmaline | Level 20

 @PGStats SAS did not use the ID index in your example.

Even using the idxwhere= option doesn't trigger the index.

A fixed list does, though. I reported this issue long ago, and it seems it is still there.

 

data HUGE1(index=(A=(I J) ID/unique)) 
     HUGE2(index=(A=(I J) ID/unique)) 
     SMALL(keep=I J);
  array A [50];
  do I=1 to 10e6; 
    J=I;
    ID + 1;
    if ranuni(0)> .9999 then output SMALL;
    output HUGE1 HUGE2;
  end;
run;

proc sql noprint; /* Optimized with index A */
create table IDS as 
  select H.ID 
  from HUGE1 as H inner join SMALL as S 
  on H.i=S.i and H.j=S.j; 
select ID into :ids separated by ',' from IDS;
quit;

proc sql;  /* No index used: 11 seconds */
delete from HUGE1(idxwhere=yes) where ID in (select ID from IDS);
quit;

proc sql; /* index ID used: 0.2 seconds */
delete from HUGE2 where ID in (&IDS);
quit; 

 

The modify statement looks like the easiest way to speed up this job.

@Edoedoedo Did this work for you?

Edoedoedo
Pyrite | Level 9

Hi guys,

I tried all your suggestions, here are the results. (HUGE is 70GB, SMALL is 10MB)

 

  • HASH solution, by @KachiM: amazing, it does the whole job in 7 minutes, including CPU time and I/O time (writing 70GB to disk takes time; just doing "cp HUGE HUGE2" takes 3 minutes)
    Pro: fastest solution
    Cons: since it creates a new table, it wastes I/O time, needs double disk space, and needs to recreate the index after the operation
    Comments: I didn't know about hashes, so I read up on them in the documentation. In your opinion, would loading both tables into memory speed things up further? Are there more sophisticated strategies to increase performance with hashes? Also, what happens if the system does not have enough RAM to hold the tables: does it just slow down, or does SAS crash?

  • DELETE solution, by @PGStats: thank you for the correct syntax, it will surely be useful! However, I aborted it after 1 hour.
    Pro: Rollback available, no double space, no need to recreate the index
    Cons: too much time
    Comments: too slow for a production environment

  • MODIFY solution, by @ChrisNZ: quite good solution, it takes about 20 minutes to complete
    Pro: no double space, no need to recreate the index, quite fast
    Cons: not as fast as HASH

  • MODIFY with HASH solution, by Me:
    data HUGE;
        modify HUGE;
        if _n_ = 1 then do;
            declare hash h(dataset:'SMALL');
            h.definekey('K1','K2');
            h.definedone();
        end;
        if h.find() = 0 then remove;  /* key found in SMALL: delete the row */
    run;
    I thought I had a good idea mixing your solutions, however after 1 hour I aborted it 🙁

 

So I think I'll use the first solution; any other comments and suggestions are very welcome!

 

Thanks a lot

Regards

 

Ksharp
Super User

This is a good fit for PROC DS2.

 

 

data have;
 set sashelp.class;
run;


proc ds2;
data Male(overwrite=yes);
  method run();
    set have;
    if sex eq 'M' then output;
  end;
enddata;
run;
quit;
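For the original problem, DS2 also offers a hash package whose methods broadly mirror the data step hash object. A hypothetical, untested sketch along those lines (the output name HUGE_CLEAN and the exact method spellings are assumptions on my part, not verified code):

proc ds2;
data HUGE_clean(overwrite=yes);
  dcl package hash h();
  method init();
    h.keys([K1 K2]);       /* composite lookup key         */
    h.data([K1 K2]);
    h.dataset('SMALL');    /* load the key list from SMALL */
    h.defineDone();
  end;
  method run();
    set HUGE;
    if h.find() ne 0 then output;  /* keep rows not in SMALL */
  end;
enddata;
run;
quit;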
Edoedoedo
Pyrite | Level 9
I didn't know DS2 either, SAS is full of surprises 🙂
@Ksharp could you show me how would you solve the problem using DS2?

Thanks!


Discussion stats
  • 35 replies
  • 9762 views
  • 12 likes
  • 6 in conversation