getting the same result but spend a shorter time

Alexxxxxxx · Posted 08-10-2021 11:18 PM

Hello all,

I am using the following code,

data have1_1985;
	infile cards dsd  dlm=",";
	input
	Applicant_name :$8.
	appln_filing_year: 8.
	appln_id :8.
	cited_appln_id :8.
	pre_patent :8.
	;
	cards;
	A,2021,101,201,1
	B,2021,101,202,
	A,2017,102,203,
	A,2016,103,204,
	B,2018,104,205,
	;;;;
run;
data have2_1985;
	infile cards dsd  dlm=",";
	input
	appln_id_citing :8.
	appln_id_citing_p5 :8.
	;
	cards;
	101,201
	101,203
	102,203
	103,205
	104,205
	;;;;
run;
%MACRO Explo(year=);


PROC SQL;
	CREATE TABLE want1_&year. AS
	SELECT DISTINCT
	citing1.Applicant_name,
	citing1.appln_filing_year,
	citing1.appln_id,
	pre_patent,
	citing1.cited_appln_id,
	citing2.appln_id_citing_p5
	FROM
	have1_&year. (keep= Applicant_name appln_filing_year appln_id cited_appln_id pre_patent)as citing1
	JOIN have2_&year.(keep=appln_id_citing appln_id_citing_p5)  AS citing2 ON citing1.appln_id = citing2.appln_id_citing
	order by citing1.appln_id
	;
QUIT;
data want2_&year.;
	length cited_patent 3.;
	set want1_&year.;
	if cited_appln_id=appln_id_citing_p5 then cited_patent=1;
run; 
PROC SQL;
	CREATE TABLE want3_&year. AS
	SELECT DISTINCT
	Applicant_name,
	appln_filing_year,
	appln_id,
	cited_appln_id,
	pre_patent,
	cited_patent
	FROM want2_&year.
	GROUP BY Applicant_name,appln_id,cited_appln_id
	Having cited_patent=max(cited_patent) 
	;
QUIT;
%MEND Explo;
%Explo(year=1985)

however, these codes spend around 10 hours getting the results. Is there any method to simplify the steps, getting the same result but spend a shorter time?

I expect to get the "Want3" (want1, and want2 are just intermediary)

many thanks in advance.

Patrick · Posted 08-10-2021 11:24 PM

Is your actual have data in SAS or in a database? Are these yearly tables as the sample data indicates? What are the actual data volumes (like number of rows and volume in GB)?

Do you need all Want tables or only Want3 and the other two tables in your code are just intermediary?

Alexxxxxxx · Posted 08-10-2021 11:45 PM

Hi Patrick, thanks for your reply.

the data is current in the SAS. the data is so big that I do not have enough space to get the result, so I split the data by year. By using 1985 year data as an example, have1 has 279295 rows and 5 columns, have2 has 299680568 rows.

I expect to get the "Want3" (want1, and want2 are just intermediary).

Patrick · Posted 08-11-2021 12:10 AM

@Alexxxxxxx Your have1 dataset is actually not that big so it's only the join that blows things up. If reformulating things a bit it might be possible to execute for all years at once. Do you have a year or date column in your actual data which we need to add for joining?

jimbarbour · Posted 08-11-2021 12:39 AM

@Alexxxxxxx,

I would agree with @Patrick that if there is a way you can get all the data you need without generating so many rows, that will be your best method of reducing run time.

One thing I notice, is that it looks like the Data step really isn't necessary. A Case statement in the first SQL procedure appears to produce the same results. Check it and see what you think. This would eliminate having to read every row again in the Data step.

Here's the modified SQL (added Case statement):

	PROC SQL;
		CREATE TABLE want1_&year. AS
		SELECT DISTINCT
		citing1.Applicant_name,
		citing1.appln_filing_year,
		citing1.appln_id,
		pre_patent,
		citing1.cited_appln_id,
		citing2.appln_id_citing_p5
		,CASE	when	cited_appln_id=appln_id_citing_p5 then 1
			ELSE .	END	AS	cited_patent
		FROM
		have1_&year. (keep= Applicant_name appln_filing_year appln_id cited_appln_id pre_patent)as citing1
		JOIN have2_&year.(keep=appln_id_citing appln_id_citing_p5)  AS citing2 ON citing1.appln_id = citing2.appln_id_citing
		order by citing1.appln_id
		;
	QUIT;

Results:

Patrick · Posted 08-11-2021 12:43 AM

@Alexxxxxxx Looking at your have2 dataset and the number of rows I suspect that the two columns you give us are not the primary key and if de-duping volumes get eventually quite a bit lower.

To test the theory can you please execute below against your real data and tell us if that works (fits into memory) and what number n_obs prints in the log.

data _null_;
  if 0 then set have2_1985;
  dcl hash h1(dataset:"have2_1985");
  h1.defineKey('appln_id_citing','appln_id_citing_p5');
  h1.defineDone();
  n_obs=h1.num_items;
  put n_obs=;
  stop;
run;

And an added question: Is your original data source from which you create HAVE2 already a SAS table or is this data in a data base? If it's the latter then one should reduce volumes already on the DB side.

Below to illustrate my line of thought (fully working code) - but it all comes down if have2 can fit into memory else another coding approach will be required.

data have1_1985;
  infile cards dsd  dlm=",";
  input
    Applicant_name :$8.
    appln_filing_year: 8.
    appln_id :8.
    cited_appln_id :8.
    pre_patent :8.
  ;
  cards4;
A,2021,101,201,1
B,2021,101,202,
A,2017,102,203,
A,2016,103,204,
B,2018,104,205,
;;;;

data have2_1985;
  infile cards dsd  dlm=",";
  input
    appln_id_citing :8.
    appln_id_citing_p5 :8.
  ;
  cards4;
101,201
101,203
102,203
103,205
104,205
;;;;

%MACRO Explo(year=);

  data _null_;
    if 0 then set have2_&year.;
    dcl hash h1(dataset:"have2_&year.");
    h1.defineKey('appln_id_citing','appln_id_citing_p5');
    h1.defineData('appln_id_citing','appln_id_citing_p5');
    h1.defineDone();
    h1.output(dataset:"have2_dedup_&year.");
    stop;
  run;

  data want3_&year.;
    if _n_=1 then 
      do;
        if 0 then set have1_&year. have2_dedup_&year.;
        dcl hash h1(dataset:" have2_dedup_&year.",multidata:'y');
        h1.defineKey('appln_id_citing');
        h1.defineData('appln_id_citing','appln_id_citing_p5');
        h1.defineDone();
        length cited_patent 3.;
      end;

    set have1_&year.;
    do while(h1.do_over(key:appln_id) eq 0);
      if cited_appln_id=appln_id_citing_p5 then 
        do;
          cited_patent=1;
          output;
          leave;
        end;
    end;
    if cited_patent ne 1 then output;

    keep
      Applicant_name
      appln_filing_year
      appln_id
      cited_appln_id
      pre_patent
      cited_patent
      ;

  run;

  proc sort data=want3_&year. out=want3_&year.;
    by Applicant_name appln_filing_year cited_appln_id;
  run;

%MEND Explo;

%Explo(year=1985)

jimbarbour · Posted 08-11-2021 12:57 AM

@Patrick,

Good approach.

If I may ask: Why use multidata:'y'? It sounded as though we wanted a unique count. Wouldn't multidata:'y' permit duplicates?

Jim

Patrick · Posted 08-11-2021 12:59 AM

@jimbarbour wrote:

@Patrick,

Good approach.

If I may ask: Why use multidata:'y'? It sounded as though we wanted a unique count. Wouldn't multidata:'y' permit duplicates?

Jim

@jimbarbour Ooops! Copy/paste error. Yes, certainly not multidata. I've fixed the code.

jimbarbour · Posted 08-10-2021 11:52 PM

Can you post your log? Exactly which step is it that is taking a long time? The 2nd SQL step?

How many rows does Want1 wind up having?

The first thing that occurs to me is that a WHERE clause will be more efficient than a HAVING with a summary function.

Something like:

	PROC SQL;
		CREATE TABLE want3_&year. AS
		SELECT DISTINCT
		Applicant_name,
		appln_filing_year,
		appln_id,
		cited_appln_id,
		pre_patent,
		cited_patent
		FROM want2_&year.
		WHERE	cited_patent	>	0
		ORDER BY 
/*		GROUP BY*/
			Applicant_name,appln_id,cited_appln_id
/*		Having cited_patent=max(cited_patent) */
		;

However, if there are cases where the max(cited_patent) is going to be zero or missing and you're going to want to have them in the final output, my WHERE suggestion won't work.

Jim

andreas_lds · Posted 08-11-2021 01:31 AM

however, these codes spend around 10 hours getting the results. Is there any method to simplify the steps, getting the same result but spend a shorter time?

Start by showing what you have by posting proc contents of the original files, then explain what you are trying to achieve. And you should post some details about the hardware you are using.

Kurt_Bremser · Posted 08-11-2021 05:05 AM

If the example data represents your real data correctly, the first SQL will result in a (hidden) cartesian join that can easily blow up on you anytime. The two 101s in both datasets result in 4 observations in the output.

So you should get to know why there are repeats in both datasets.

Next, the use of the DISTINCT clause is always a big performance-eater and should only be done when absolutely necessary.

If you actually use DISTINCT to dedupe the result, then it is much better to dedupe before doing the join.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ · Posted 08-11-2021 07:11 AM

1. Why do you

order by citing1.appln_id

when you never use that order?

2. What is the HAVING supposed to achieve when the value is either missing or 1?

3. As mentioned, is the DISTINCT keyword useful? Is there a DQ issue with the data?

In any case this generates exactly the same result on your data and should be much faster:

PROC SQL;
	CREATE TABLE want AS
	SELECT unique
	Applicant_name,
	appln_filing_year,
	appln_id,
	cited_appln_id,
	pre_patent,
	case when cited_appln_id=appln_id_citing_p5 then 1 end as cited_patent
	FROM (
		SELECT  
		citing1.Applicant_name,
		citing1.appln_filing_year,
		citing1.appln_id,
		pre_patent,
		citing1.cited_appln_id,
		citing2.appln_id_citing_p5
		FROM
        have1_&year.       as citing1
		JOIN have2_&year.  as citing2 
		ON citing1.appln_id = citing2.appln_id_citing
	)
	GROUP BY Applicant_name,appln_id,cited_appln_id
	HaViNg cited_patent=max(cited_patent) 
	;
QUIT;

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 08-11-2021 07:20 AM

This also generates the same results and should be even faster:

PROC SQL;
	CREATE TABLE t AS
	SELECT unique * 
	FROM (
		SELECT  
			citing1.Applicant_name,
			citing1.appln_filing_year,
			citing1.appln_id,
			pre_patent,
			citing1.cited_appln_id,
			case when cited_appln_id=appln_id_citing_p5 then 1 end as cited_patent
		FROM have1_&year.  as citing1
		JOIN have2_&year.  as citing2 
		ON citing1.appln_id = citing2.appln_id_citing
	)
	GROUP BY Applicant_name,appln_id,cited_appln_id
	Having cited_patent=max(cited_patent) 
	;
QUIT;

High-Performance SAS Coding - Third Edition

jimbarbour · Posted 08-11-2021 11:20 AM

@ChrisNZ,

Select Unique? I know it works in Oracle, but is that valid in SAS? Is its behavior any different than Select Distinct?

Edit:

DISTINCT

eliminates duplicate rows. The DISTINCT argument is identical to UNIQUE.

Alias UNIQUE

Notes Although the UNIQUE argument is identical to DISTINCT, it is not an ANSI standard.

Well, I'll be darned. Learn something new every day.

Jim

ChrisNZ · Posted 08-11-2021 08:40 PM

Unsure how unique became my go-to syntax rather than distinct. Easier to type maybe? Yes, they are the same.

High-Performance SAS Coding - Third Edition

getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

Re: getting the same result but spend a shorter time

DISTINCT

Re: getting the same result but spend a shorter time

Alias	UNIQUE
Notes	Although the UNIQUE argument is identical to DISTINCT, it is not an ANSI standard.

DISTINCT

Classroom Training Available!