Solved: Merge with Hash Object

ssafmed · Posted 08-09-2023 09:51 AM

Hello,

I want to merge 2 tables using Hash Object by keyA and keyB that have the same values(but not the same name in the tables), but in my output I want just a few values from both tables,

Example

table A : keyA, Cl_A_1, Cl_A_2, Cl_A_3, Cl_A_4, CL_A5

table B : keyB, Cl_B_1,Cl_B_2,Cl_B_3,Cl_B_4,Cl_B_5,Cl_B_6

------------------------------------------------------------------

Output : KeyA, keyB, Cl_A_1, Cl_A_2, Cl_B_1,Cl_B_2,Cl_B_3

Thanks!

Patrick · Posted 08-10-2023 06:30 PM

Below how I would code for this. Make sure that you load the table with the lower volume into the hash. That's from a memory consumption and performance perspective much more important than if you've got a few more variables than you would need in the PDV. As long as you only write variables to the output table that you need (drop/keep) the difference to performance is negligible.

data Set1;
  if _n_ = 1 then
    do;
      if 0 then set class(keep=name sex age);
      declare Hash Match1 (dataset:'class');
      Match1.DefineKey ('Name');
      Match1.DefineData ('Name','Sex','Age');
      Match1.DefineDone ();
    end;
  call missing(of _all_);

  set cities(keep=Persona department);

  if Match1.find(key:Persona) = 0;

  /* potentially more statements */
  drop persona;
run;

Also be aware that at least in your sample data column Persona has a length of 10 but Name only a length of 8. Should the string length stored in Persona exceed 8 characters then the hash lookup will return an error like: ERROR: Argument length greater than length of key variable Name

View solution in original post

mkeintz · Posted 08-09-2023 06:19 PM

Plesse provide sample data in the form of a working DATA step, and a working DATA step with the desired result.

With that clarification, we'll be able to determine a way to get from sample input to desired output, with tested code.

Help us help you.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Patrick · Posted 08-09-2023 07:36 PM

Easiest is to just rename the key variable to what's in your base table when loading into the hash table.

dcl hash h1 (dataset:'your_table(rename=(keyB=KeyA))');

"but in my output I want just a few values from both tables"

You need to further specify what this is about AND you need to provide sample data and desired result. Please provide the sample data via a working SAS datastep that creates such data.

ssafmed · Posted 08-10-2023 08:15 AM

Here is an example :

data class;

input Name$ Sex$ Age Height Weight;

infile datalines dlm=' ' dsd;

datalines;

Alfred M 14 69.0 112.5

Alice F 13 56.5 84.0

Barbara F 13 65.3 98.0

Carol F 14 62.8 102.5

Henry M 14 63.5 102.5

;

data cities;

input zipcode City :$30. Department :$30. Persona:$10.;

infile datalines dlm=',' dsd;

datalines;

75008, Paris 8è ,Paris,Alice

91940, Les Ulis, Essonne,Carol

92330,Sceaux, Hauts-de-Seine,Henry

93140, Bondy , Seine-Saint-Denis,Henry

94150, Rungis, Val-de-Marne,Alice

;

data Set1 ;

if 0 then set class cities;

if _n_ = 1 then do;

declare Hash Match1 (dataset:'class');

Match1.DefineKey ('Name');

Match1.DefineData ('Name','Sex','Age');

Match1.DefineDone ();

end;

set cities;

if Match1.find(key:Persona) = 0 then output;

run;

What I want is an output set that contains Name Sex Age Department! I can do it with (keep=Name Sex Age Department), but I dont want to process this variables!

Thanks

Kurt_Bremser · Posted 08-10-2023 12:07 PM

Add these two statements to your IF N_ = 1 block:

length name $8 sex $8 age 8;
call missing(name,sex,age);

You have to tell the data step compiler about the new variables, so it can compile them into the PDV. The second statement is just there to avoid the "uninitialized" NOTE.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ssafmed · Posted 08-10-2023 01:15 PM

Hello,

In fact, in my output I just want de keep Name Sex Age and department and not (zipcode, City, Persona)

I dont even want that the variables (zipcode, City, Persona) went to PVD

Thanks

Kurt_Bremser · Posted 08-10-2023 01:30 PM

Use either the DROP= or KEEP= dataset option in your SET statement.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ssafmed · Posted 08-10-2023 01:37 PM

In this case it will be process in PDV, but I want to avoid that for efficiency because my data are very big!

Thanks

Kurt_Bremser · Posted 08-10-2023 02:34 PM

When you exclude variables in the SET statement, which controls the reading of a dataset, those variables will not be part of the PDV. You can verify this with a PUT _ALL_; statement.

BTW since the whole dataset file will be p?hysically read anyway, the number of variables in the PDV won't have an influence on processing time, only on memory used, and with so few as you describe, the effect of dropping them will be nil.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ssafmed · Posted 08-10-2023 03:03 PM

I put a keep just after If 0 (I dont know if it is that you mean by Set reading statment)

but still processing all the variables in PVD (please see log below)

Code :

if 0 then set class cities;

keep Name Sex Age Persona Department;

if _n_ = 1 then do;

declare Hash Match1 (dataset:'class');

Match1.DefineKey ('Name');

Match1.DefineData ('Name','Sex','Age');

Match1.DefineDone ();

end;

set cities ;

if Match1.find(key:Persona) = 0 then output;

PUT _ALL_;

run;

log :

NOTE: There were 5 observations read from the data set WORK.CLASS.

Name=Alice Sex=F Age=13 Height=. Weight=. zipcode=75008 City=Paris 8è Department=Paris Persona=Alice _ERROR_=0 _N_=1

Name=Carol Sex=F Age=14 Height=. Weight=. zipcode=91940 City=Les Ulis Department=Essonne Persona=Carol _ERROR_=0 _N_=2

Name=Henry Sex=M Age=14 Height=. Weight=. zipcode=92330 City=Sceaux Department=Hauts-de-Seine Persona=Henry _ERROR_=0 _N_=3

Name=Henry Sex=M Age=14 Height=. Weight=. zipcode=93140 City=Bondy Department=Seine-Saint-Denis Persona=Henry _ERROR_=0 _N_=4

Name=Alice Sex=F Age=13 Height=. Weight=. zipcode=94150 City=Rungis Department=Val-de-Marne Persona=Alice _ERROR_=0 _N_=5

data Set1;

Thanks

Kurt_Bremser · Posted 08-10-2023 03:51 PM

This is a KEEP Statement, not a KEEP= Dataset Option.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ssafmed · Posted 08-11-2023 07:07 AM

Hello,

It's what I did but still not solve the problem!

Thanks!

if 0 then set class cities;

keep Name Sex Age Persona Department;

if _n_ = 1 then do;

declare Hash Match1 (dataset:'class');

Kurt_Bremser · Posted 08-11-2023 11:38 AM

@ssafmed wrote:

Hello,

It's what I did but still not solve the problem!

Thanks!

if 0 then set class cities;

keep Name Sex Age Persona Department;

if _n_ = 1 then do;

declare Hash Match1 (dataset:'class');

You are still using the statement instead of the dataset option. Please read the documentation I linked for you.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 08-11-2023 12:05 PM

Let's see what your code does:

if 0 then set class cities;

Since SAS will evaluate zero as FALSE the SET statement will never run. But the data step compiler will see it and add all of the variables from both the CLASS and CITIES datasets.

keep Name Sex Age Persona Department;

This says to only output these 5 specific variables. If CLASS is like SASHELP.CLASS then the first three will be defined as in CLASS and the second two will be defined as in CITIES. If any of those five variables do not exist in either CLASS or CITIES and are not later defined explicitly or implicitly by other code in the step then SAS not create them and instead write a WARNING message to the SAS log.

Perhaps you meant to do something like:

if 0 then set class(keep=Name Sex Age) cities(keep=Persona Department);

Now the data step compile step will only add those specific 5 variables. But you will have to also check other places where you use either CLASS or CITIES in another SET or MERGE statement to make sure that does not bring in other variables.

Patrick · Posted 08-10-2023 06:30 PM

Below how I would code for this. Make sure that you load the table with the lower volume into the hash. That's from a memory consumption and performance perspective much more important than if you've got a few more variables than you would need in the PDV. As long as you only write variables to the output table that you need (drop/keep) the difference to performance is negligible.

data Set1;
  if _n_ = 1 then
    do;
      if 0 then set class(keep=name sex age);
      declare Hash Match1 (dataset:'class');
      Match1.DefineKey ('Name');
      Match1.DefineData ('Name','Sex','Age');
      Match1.DefineDone ();
    end;
  call missing(of _all_);

  set cities(keep=Persona department);

  if Match1.find(key:Persona) = 0;

  /* potentially more statements */
  drop persona;
run;

Also be aware that at least in your sample data column Persona has a length of 10 but Name only a length of 8. Should the string length stored in Persona exceed 8 characters then the hash lookup will return an error like: ERROR: Argument length greater than length of key variable Name

Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Re: Merge with Hash Object

Classroom Training Available!