Solved: Proc Sql -- Join problem

wlierman · Posted 07-01-2021 06:01 PM

I have a problem with a Proc SQL join.

I have two data sets with 124,100 rows each. Data set A has two columns:

Contact_ID Place_of_Employment

There are no missing contact_ids be there are sparse inputs for the Place_of_Employment because not every contact provided and answer.

The second data set Data set B has the following columns. These are sparse also.

The Place_of_employment in B matches to the Place_of-Employment in A.

Here are the columns for Dataset B:

Place_of_Employment NAICS_Sector Sector_Type Type_Firm Child_minor

Disabled More_than_one_Job Retired Unemployed__Not_working

Work_from_home_remotely

I used the following code

Proc sql noprint;
CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
   Select A.*, B.*
   From SASCDC_2.Contact_Person_Employ_field AS A
   Left Join SASCDC_2.Employment_Classify AS B
   ON A.Place_of_Employment = B.Place_of_Employment;
quit;

I would expect the query to take a few seconds but this is the result from the log

 Proc sql noprint;
39   CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
40      Select A.*, B.*
41      From SASCDC_2.Contact_Person_Employ_field AS A
42      Left Join SASCDC_2.Employment_Classify AS B
43      ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
         SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: There were 124803 observations read from the data set
      SASCDC_2.CONTACT_PERSON_EMPLOY_FIELD.
NOTE: There were 124803 observations read from the data set SASCDC_2.EMPLOYMENT_CLASSIFY.
WARNING: The data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS may be incomplete.  When this step
         was stopped there were 3036491921 observations and 11 variables.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 90.52
      percent.
      Compressed is 1061686 pages; un-compressed would require 11204768 pages.
WARNING: Data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS was not replaced because this step was
         stopped.
NOTE: PROCEDURE SQL used (Total process time):
      real time           25:19.26
      cpu time            22:39.62


44   quit;

When I stopped the query 3,036,491,921 observations had been processed.

Essentially all I want to do is to append the Contact_ID field from data set A to the front-end of data set B. That is why I chose the Left Join.

There is some problem here if over 3 billion obs were processed and the query was no where near to completion.

What changes do I have to make to just append the contact_ID field from dataset A to dataset B - I am using Place_of_Employment as the common field.

Thank you.

wklierman

Kurt_Bremser · Posted 07-02-2021 02:58 AM

Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, i think this is an impossible task, as you cannot determine which contact_id should be matched to a given observation in Employment_Classify. If a place of employment has 1000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

wlierman · Posted 07-01-2021 06:05 PM

Posted 2 seconds ago (0 views)

I have a problem with a Proc SQL join.

I have two data sets with 124,100 rows each. Data set A has two columns:

Contact_ID Place_of_Employment

There are no missing contact_ids be there are sparse inputs for the Place_of_Employment because not every contact provided and answer.

The second data set Data set B has the following columns. These are sparse also.

The Place_of_employment in B matches to the Place_of-Employment in A.

Here are the columns for Dataset B:

Place_of_Employment NAICS_Sector Sector_Type Type_Firm Child_minor

Disabled More_than_one_Job Retired Unemployed__Not_working

Work_from_home_remotely

I used the following code

Proc sql noprint;
CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
   Select A.*, B.*
   From SASCDC_2.Contact_Person_Employ_field AS A
   Left Join SASCDC_2.Employment_Classify AS B
   ON A.Place_of_Employment = B.Place_of_Employment;
quit;

I would expect the query to take a few seconds but this is the result from the log

 Proc sql noprint;
39   CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
40      Select A.*, B.*
41      From SASCDC_2.Contact_Person_Employ_field AS A
42      Left Join SASCDC_2.Employment_Classify AS B
43      ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
         SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: There were 124803 observations read from the data set
      SASCDC_2.CONTACT_PERSON_EMPLOY_FIELD.
NOTE: There were 124803 observations read from the data set SASCDC_2.EMPLOYMENT_CLASSIFY.
WARNING: The data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS may be incomplete.  When this step
         was stopped there were 3036491921 observations and 11 variables.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 90.52
      percent.
      Compressed is 1061686 pages; un-compressed would require 11204768 pages.
WARNING: Data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS was not replaced because this step was
         stopped.
NOTE: PROCEDURE SQL used (Total process time):
      real time           25:19.26
      cpu time            22:39.62


44   quit;

When I stopped the query 3,036,491,921 observations had been processed.

Essentially all I want to do is to append the Contact_ID field from data set A to the front-end of data set B. That is why I chose the Left Join.

There is some problem here if over 3 billion obs were processed and the query was no where near to completion.

What changes do I have to make to just append the contact_ID field from dataset A to dataset B - I am using Place_of_Employment as the common field.

Thank you.

wklierman

PaigeMiller · Posted 07-01-2021 06:11 PM

DUPLICATE THREAD

Do not reply in this thread

Reply here: https://communities.sas.com/t5/SAS-Programming/Proc-Sql-Join-problem/td-p/751580/jump-to/first-unrea...

--
Paige Miller

Kurt_Bremser · Posted 07-02-2021 02:47 AM

@PaigeMiller wrote:

DUPLICATE THREAD

Do not reply in this thread

Reply here: https://communities.sas.com/t5/SAS-Programming/Proc-Sql-Join-problem/td-p/751580/jump-to/first-unrea...

Merged.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PaigeMiller · Posted 07-01-2021 06:07 PM

Sparse data can cause this. Each missing in data set A is matched with each missing in data set B, in effect a Cartesian join of all records where Place_of_Employment is missing in both data sets. Instead, try this:

From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A
   Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B

--
Paige Miller

wlierman · Posted 07-01-2021 07:27 PM

Thank you for responding. This how I altered the code with your code.

Proc sql noprint;
CREATE TABLE SASCDC_2.Contact_ID_Employment_NAICS AS
   Select A.*, B.*
   From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS A
   Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
   ON A.Place_of_Employment = B.Place_of_Employment;
quit;

I still got over 11MM obs as can be seen in the log.

 Select A.*, B.*
48      From SASCDC_2.Contact_Person_Employ_field(where=(not missing(place_of_employment))) AS
48 ! A
49      Left Join SASCDC_2.Employment_Classify(where=(not missing(place_of_employment))) AS B
50      ON A.Place_of_Employment = B.Place_of_Employment;
WARNING: Variable Place_of_Employment already exists on file
         SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS.
NOTE: SAS threaded sort was used.
NOTE: Compressing data set SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS decreased size by 87.73
      percent.
      Compressed is 5015 pages; un-compressed would require 40858 pages.
NOTE: Table SASCDC_2.CONTACT_ID_EMPLOYMENT_NAICS created, with 11072295 rows and 11 columns.

51   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           5.89 seconds
      cpu time            5.54 seconds

Need some additional modification.

wklierman

ChrisNZ · Posted 07-01-2021 08:26 PM

Bottom line: Know your data.

You need to figure out which values have a high cardinality (repeated values) and cause the large Cartesian product.

High-Performance SAS Coding - Third Edition

Kurt_Bremser · Posted 07-02-2021 02:58 AM

Since there is only one possible variable for matching, and that variable has non-missing multiples in both datasets, i think this is an impossible task, as you cannot determine which contact_id should be matched to a given observation in Employment_Classify. If a place of employment has 1000 entries in each dataset, you get a million results, 99.9% of which are wrongly matched. You need to dig further for some fact that allows one-to-one matching.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

wlierman · Posted 07-02-2021 03:40 PM

Thanks. You are right. It can't be accomplished. So I did it the old fashion way.

Exported only the contact_person_id to an xlsx worksheet. Then I copy / pasted the employment data (which had been exported earlier to an xlsx worksheet) to the contact_person_id and I have the dataset that I needed.

I should have thought of that work-around before.

Thank you all for your help.

wklierman

Kurt_Bremser · Posted 07-02-2021 04:00 PM

If the order of both datasets is already correct, then a MERGE without BY can also do that side-by-side lineup.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PGStats · Posted 07-01-2021 09:25 PM

You should leave those two tables separate unless there is a field matching Contact_ID in table Employment_Classify to do the join on.

PG

Proc Sql -- Join problem

Re: Proc Sql -- Join problem

Proc Sql--Join Problem

Re: Proc Sql--Join Problem

Re: Proc Sql--Join Problem

Re: Proc Sql -- Join problem

Re: Proc Sql -- Join problem

Re: Proc Sql -- Join problem

Re: Proc Sql -- Join problem

Re: Proc Sql -- Join problem

Re: Proc Sql -- Join problem

Re: Proc Sql -- Join problem

Classroom Training Available!