Many to Many Merge using Hashing..?

Reply
Contributor
Posts: 34

Many to Many Merge using Hashing..?

How to perform many to many megrge using Hashing.....?

We need to pefomr merge on two tables A and B...

Table B needs to be loaded into memory....I need to implement three scenarios:

If A=1 and B=1 then

...........

.........

We can implement this using RC=0

If A=1 and B=0 then

.......

We can implement this using RC<>0

If A=0 and B=1 then

..............

What about this scenario...How to implement this scenario(Basically Right Join) using Hashing...

Please suggest..

Contributor
Posts: 34

Re: Many to Many Merge using Hashing..?

Please find the existing data step too... I need to perfom this using Hashing...

Ideas are welcomed...

data matching;

merge abc(in=a) def (in = b);

by x;

if a and b then aa = 1;

if b and not a then bb = 2;

if a and not b then cc = 3;

run;

Contributor
Posts: 34

Re: Many to Many Merge using Hashing..?

Any updates by any Experts..?

Respected Advisor
Posts: 3,156

Re: Many to Many Merge using Hashing..?

@bhoopesh:

I am no expert, but Any updates on sample data sets, both input and output? Your question is too general to be answered efficiently, so please facilitate it a little by give people something to work on.

Haikuo

Contributor
Posts: 34

Re: Many to Many Merge using Hashing..?

Hi Kuo,

Basically I need to implement Right Join using Hash....

I know Left join can be easily done by loading smaller table into memory and then math it one by one using RC....

But what about RIGHT JOIN...Is it possible using Hash...?

I think this is the most difficult question as of now...

PROC Star
Posts: 7,467

Re: Many to Many Merge using Hashing..?

You've posted this a couple of times but, thus far, still have not described the problem sufficiently for at least some of us to understand.  It would help if you provided to example datasets (provided as datasteps), the code you have tried, and a third dataset showing the results you want to obtain.

Why do you have to do this using a hash?  How many records will exist in the real two datasets?

Contributor
Posts: 34

Re: Many to Many Merge using Hashing..?

Hi Arthur and All,

I was not able to post the code, yesterday as i was busy... Sorry for inconvinence..

Please find the code as attached:

data load;

infile datalines dsd dlm='';

input numid:8. name: $8.;

datalines;

22 "john"

11 "peter"

33 "terry"

44 "dony"

55 "randy"

66 "tony"

77 "mrna"

88 "joe"

99 "jack"

100 "sam"

;

run;

data main;

infile datalines dsd dlm='';

input numid:8. address: $8.;

datalines;

12 "california"

22 "texas"

13 "indiana"

11 "nyc"

33 "xyz"

14 "def"

17 "dff"

18 "deee"

99 "jjjj"

19 "kkkk"

;

run;

proc sort data=load; by numid; quit;

proc sort data=main; by numid; quit;

data final;

merge

main(in=a)

load(in=b);

by numid;

if a and b then output;

if a and not b then output;

if b and not a then output;

run;

Merge works fine...but it takes lot of time...Thats why i need to implement Hashing....As i want to load table 'load' in memory and will read each record from Main table...

Main table is huge...So that why need to implement Hashing on it... Please suggest the Hashing code for above scenario...Thanks in advance....This task is very important to me...So pls help asap...

Respected Advisor
Posts: 3,156

Re: Many to Many Merge using Hashing..?

I can't comment on your real life problem since you have not posted any sample data. However, as far as methodology goes, I don't see any fundamental difference in between two joins, left and right in term of hash processing. We need to know what you have, and what you are expecting to be the outcome, something tangible to work on.

Haikuo

Respected Advisor
Posts: 3,156

Re: Many to Many Merge using Hashing..?

Hi,

Here is a Hash() approach that can mimic your outcome (except the order):

data  final_hash;

  if _n_=1 then do;

     if 0 then do; set load; set main; end;

       declare hash load(dataset:'load');

       load.definekey('numid');

       load.definedata(all:'y');

       load.definedone();

       declare hash main(dataset:'main');

       main.definekey('numid');

       main.definedata(all:'y');

       main.definedone();

     declare hiter hl('load');

       declare hiter hm('main');

  end;

rc=hl.first();

   do rc=0 by 0 while (rc=0);

      rc=main.find();

        if rc ne 0 then call missing(address);

        output;

rc=hl.next();

      end;

      rc=hm.first();

        do rc=0 by 0 while (rc=0);

rc=load.find();

            if rc ne 0 then do; call missing(name); output;end;

rc=hm.next();

        end;

        stop;

        drop rc;

run;


Now some comments:

1. I don't know why you call it "many to many" merge, as your data shows otherwise. Your data addressed an "one to one" merge, since for any  obs in one table, there is only ONE match at most from another table. Maybe you have "many to many" scenarios in your real data, then you need to lay out your rule of what to do if they happen. For now, I can only treat it as is: "one to one".


2. I have commented out part of your merge code, cause it does not make any difference in term of final outcome. In your code,  you have already exhausted all of the merging possibilities without outputting to different target, so it might not reflecting your intention?


data final;

merge

main(in=a)

load(in=b);

by numid;

/*if a and b then output;

if a and not b then output;

if b and not a then output;*/

run;


3. Efficiency wise, I am not sure if Hash() holds significant advantage over 'data step merge', unless we are talking about millions of records with thousands of variables, while Hash() may save some time for direct access, but then the incoming table may have the risk for being too large to be accommodated by your computer RAM. In general,  Hash() will be more efficient  if comparing to SQL join when Cartesian product happens.


Haikuo


Contributor
Posts: 34

Re: Many to Many Merge using Hashing..?

Hi Kuo,

I appreciate your work on creating this Hash code to me... Thanks very much...Smiley Happy

But the problem is..

Actual Dataset Load is having few hundreds of rows and that can be loaded to memory(RAM) succesfully.

But other Dataset Main is having 30 millions of records and that cannot be loaded into the memory.

So i doubt that above code will work...Actual Main dataset was giving memory fill error to me...when i try to load into Hashbuckets using HASEXP:20...

Pls suggest

Respected Advisor
Posts: 3,156

Re: Many to Many Merge using Hashing..?

Like I said, why not using Merge? Hash(), if forced to use to serve your purpose, may not have an edge comparing to "data step merge". Also, you need to make sure that you only process those variables you need, variable numbers do make a big difference when sorting and merging.

Haikuo

PROC Star
Posts: 7,467

Re: Many to Many Merge using Hashing..?

Bhoopesh: A couple of comments.  If your data are already sorted, then a hash or array approach probably won't save you much time.  However, if you data have to be sorted with each run, then another approach probably will be quicker.

First, though, if you continue to use a merge approach, and want all of the matched and non-match records, why do you include the ina and inb statements and the three if then conditions at the end?  Are there any other possibilities I'm not thinking of?  I think you could get what you want/need with just:

data final;

  merge main load;

  by numid;

run;


And, if you do use a hash approach, I would think you only want to use it to build an index of the data in your load file.  However, I'll leave that for our more hash inclined colleagues.


A similar array approach, that also doesn't require the data to be sorted, might be:


data final (drop=i loc check: numids: namesSmiley Happy;

  array numids(&nrecs.);

  array names(&nrecs.) $8.;

  array check(&nrecs.);

  do i=1 to &nrecs.;

    set load;

    numids(i)=numid;

    names(i)=name;

  end;

  do until (eofmain);

    set main end=eofmain;

    loc=whichn(numid,of numids(*));

    if loc then do;

      name=names(loc);

      check(loc)=1;

    end;

    else call missing(name);

    output;

  end;

  call missing(address);

  do i=1 to &nrecs.;

    if not check(i) then do;

      name=names(i);

      numid=numids(i);

      output;

    end;

  end;

run;

However, given sorted data, it runs at about the same speed as a merge approach.

Contributor
Posts: 41

Re: Many to Many Merge using Hashing..?

Dear Hai.Kuo,

could you please explain this part of your code?

if 0 then do;

     set main;

     set load;

end;

What's the meaning of it if conditional expression is always false (zero).

Thank you beforehand.

upd. I think I've got it. Is it necessary to define this datasets in order to tell SAS in compilation step about this datasets? So then it's possible to use dataset: tag, right?

Respected Advisor
Posts: 3,156

Re: Many to Many Merge using Hashing..?

Posted in reply to ghastly_kitten

Hi,

To my understanding (which could be off), Before using rc=hiter.first() method (or any method that brings in Hash() data into PDV), Hash() requires there are already variables defined in PDV. So you need to load ONE record into PDV ahead of time.

Haikuo

Contributor
Posts: 41

Re: Many to Many Merge using Hashing..?

Well, it's still seem to be a little more tricky than I could realize, but this part of code:

if 0 then do;

     set ...;

     set ...;

end;

does not load any variables to PDV.

However, at the moment of executing of data step, SAS perfoms "compilation" and during the "forward run" calculates the length of PDV and initializes some variables (like one you place in nobs= option).

As SAS does not check if some condition is always FALSE/TRUE, he prepare the structure of PDV in order to be ready to load something from datasets which were used in set statement  (as well as for variables used in datastep code which weren't defined in input datasets).

In the nutshell, the PDV is really empty (except automatic variables), but you can use variable definitions. The key idea, which I still want to check, is that SAS also opens and locks (for r/w) that datasets, so you can use links to them in dataset: tag in declare hash statement without reading any data.

Anyway, that's a good idea to make an empty PDV, in case you need to have a foolproof from "accidential" ouptut. However, if the main reason is to read the first record again somewhere after hash definition (I have no example for that), I think the alternative could be the use of (firstobs=) option.

Thanks, anyway!

Ask a Question
Discussion stats
  • 15 replies
  • 4257 views
  • 3 likes
  • 4 in conversation