BookmarkSubscribeRSS Feed
Alexxxxxxx
Pyrite | Level 9

Hello all,

 

I am using the following code,

data have1_1985;
	infile cards dsd  dlm=",";
	input
	Applicant_name :$8.
	appln_filing_year: 8.
	appln_id :8.
	cited_appln_id :8.
	pre_patent :8.
	;
	cards;
	A,2021,101,201,1
	B,2021,101,202,
	A,2017,102,203,
	A,2016,103,204,
	B,2018,104,205,
	;;;;
run;
data have2_1985;
	infile cards dsd  dlm=",";
	input
	appln_id_citing :8.
	appln_id_citing_p5 :8.
	;
	cards;
	101,201
	101,203
	102,203
	103,205
	104,205
	;;;;
run;
%MACRO Explo(year=);


PROC SQL;
	CREATE TABLE want1_&year. AS
	SELECT DISTINCT
	citing1.Applicant_name,
	citing1.appln_filing_year,
	citing1.appln_id,
	pre_patent,
	citing1.cited_appln_id,
	citing2.appln_id_citing_p5
	FROM
	have1_&year. (keep= Applicant_name appln_filing_year appln_id cited_appln_id pre_patent)as citing1
	JOIN have2_&year.(keep=appln_id_citing appln_id_citing_p5)  AS citing2 ON citing1.appln_id = citing2.appln_id_citing
	order by citing1.appln_id
	;
QUIT;
data want2_&year.;
	length cited_patent 3.;
	set want1_&year.;
	if cited_appln_id=appln_id_citing_p5 then cited_patent=1;
run; 
PROC SQL;
	CREATE TABLE want3_&year. AS
	SELECT DISTINCT
	Applicant_name,
	appln_filing_year,
	appln_id,
	cited_appln_id,
	pre_patent,
	cited_patent
	FROM want2_&year.
	GROUP BY Applicant_name,appln_id,cited_appln_id
	Having cited_patent=max(cited_patent) 
	;
QUIT;
%MEND Explo;
%Explo(year=1985)

however, these codes spend around 10 hours getting the results. Is there any method to simplify the steps, getting the same result but spend a shorter time? 

 

I expect to get the "Want3" (want1, and want2 are just intermediary)

 

many thanks in advance.

 

 

15 REPLIES 15
Patrick
Opal | Level 21

Is your actual have data in SAS or in a database? Are these yearly tables as the sample data indicates? What are the actual data volumes (like number of rows and volume in GB)?

Do you need all Want tables or only Want3 and the other two tables in your code are just intermediary?

Alexxxxxxx
Pyrite | Level 9
Hi Patrick, thanks for your reply.

the data is current in the SAS. the data is so big that I do not have enough space to get the result, so I split the data by year. By using 1985 year data as an example, have1 has 279295 rows and 5 columns, have2 has 299680568 rows.

I expect to get the "Want3" (want1, and want2 are just intermediary).
Patrick
Opal | Level 21

@Alexxxxxxx Your have1 dataset is actually not that big so it's only the join that blows things up. If reformulating things a bit it might be possible to execute for all years at once. Do you have a year or date column in your actual data which we need to add for joining?

jimbarbour
Meteorite | Level 14

@Alexxxxxxx

 

I would agree with @Patrick that if there is a way you can get all the data you need without generating so many rows, that will be your best method of reducing run time.

 

One thing I notice, is that it looks like the Data step really isn't necessary.  A Case statement in the first SQL procedure appears to produce the same results.  Check it and see what you think.  This would eliminate having to read every row again in the Data step.

 

Here's the modified SQL (added Case statement):

	PROC SQL;
		CREATE TABLE want1_&year. AS
		SELECT DISTINCT
		citing1.Applicant_name,
		citing1.appln_filing_year,
		citing1.appln_id,
		pre_patent,
		citing1.cited_appln_id,
		citing2.appln_id_citing_p5
		,CASE	when	cited_appln_id=appln_id_citing_p5 then 1
			ELSE .	END	AS	cited_patent
		FROM
		have1_&year. (keep= Applicant_name appln_filing_year appln_id cited_appln_id pre_patent)as citing1
		JOIN have2_&year.(keep=appln_id_citing appln_id_citing_p5)  AS citing2 ON citing1.appln_id = citing2.appln_id_citing
		order by citing1.appln_id
		;
	QUIT;

 

Results:

jimbarbour_1-1628656663713.png

 

Patrick
Opal | Level 21

@Alexxxxxxx Looking at your have2 dataset and the number of rows I suspect that the two columns you give us are not the primary key and if de-duping volumes get eventually quite a bit lower.

To test the theory can you please execute below against your real data and tell us if that works (fits into memory) and what number n_obs prints in the log.

data _null_;
  if 0 then set have2_1985;
  dcl hash h1(dataset:"have2_1985");
  h1.defineKey('appln_id_citing','appln_id_citing_p5');
  h1.defineDone();
  n_obs=h1.num_items;
  put n_obs=;
  stop;
run;

 

And an added question: Is your original data source from which you create HAVE2 already a SAS table or is this data in a data base? If it's the latter then one should reduce volumes already on the DB side.

 

Below to illustrate my line of thought (fully working code) - but it all comes down if have2 can fit into memory else another coding approach will be required.

data have1_1985;
  infile cards dsd  dlm=",";
  input
    Applicant_name :$8.
    appln_filing_year: 8.
    appln_id :8.
    cited_appln_id :8.
    pre_patent :8.
  ;
  cards4;
A,2021,101,201,1
B,2021,101,202,
A,2017,102,203,
A,2016,103,204,
B,2018,104,205,
;;;;

data have2_1985;
  infile cards dsd  dlm=",";
  input
    appln_id_citing :8.
    appln_id_citing_p5 :8.
  ;
  cards4;
101,201
101,203
102,203
103,205
104,205
;;;;

%MACRO Explo(year=);

  data _null_;
    if 0 then set have2_&year.;
    dcl hash h1(dataset:"have2_&year.");
    h1.defineKey('appln_id_citing','appln_id_citing_p5');
    h1.defineData('appln_id_citing','appln_id_citing_p5');
    h1.defineDone();
    h1.output(dataset:"have2_dedup_&year.");
    stop;
  run;

  data want3_&year.;
    if _n_=1 then 
      do;
        if 0 then set have1_&year. have2_dedup_&year.;
        dcl hash h1(dataset:" have2_dedup_&year.",multidata:'y');
        h1.defineKey('appln_id_citing');
        h1.defineData('appln_id_citing','appln_id_citing_p5');
        h1.defineDone();
        length cited_patent 3.;
      end;

    set have1_&year.;
    do while(h1.do_over(key:appln_id) eq 0);
      if cited_appln_id=appln_id_citing_p5 then 
        do;
          cited_patent=1;
          output;
          leave;
        end;
    end;
    if cited_patent ne 1 then output;

    keep
      Applicant_name
      appln_filing_year
      appln_id
      cited_appln_id
      pre_patent
      cited_patent
      ;

  run;

  proc sort data=want3_&year. out=want3_&year.;
    by Applicant_name appln_filing_year cited_appln_id;
  run;

%MEND Explo;

%Explo(year=1985)
jimbarbour
Meteorite | Level 14

@Patrick

 

Good approach. 

 

If I may ask:  Why use multidata:'y'?  It sounded as though we wanted a unique count.  Wouldn't multidata:'y' permit duplicates?

 

Jim

Patrick
Opal | Level 21

@jimbarbour wrote:

@Patrick

 

Good approach. 

 

If I may ask:  Why use multidata:'y'?  It sounded as though we wanted a unique count.  Wouldn't multidata:'y' permit duplicates?

 

Jim


@jimbarbour Ooops! Copy/paste error. Yes, certainly not multidata. I've fixed the code.

jimbarbour
Meteorite | Level 14

Can you post your log?  Exactly which step is it that is taking a long time?  The 2nd SQL step?

 

How many rows does Want1 wind up having?

 

The first thing that occurs to me is that a WHERE clause will be more efficient than a HAVING with a summary function.

 

Something like:

	PROC SQL;
		CREATE TABLE want3_&year. AS
		SELECT DISTINCT
		Applicant_name,
		appln_filing_year,
		appln_id,
		cited_appln_id,
		pre_patent,
		cited_patent
		FROM want2_&year.
		WHERE	cited_patent	>	0
		ORDER BY 
/*		GROUP BY*/
			Applicant_name,appln_id,cited_appln_id
/*		Having cited_patent=max(cited_patent) */
		;

 

However, if there are cases where the max(cited_patent) is going to be zero or missing and you're going to want to have them in the final output, my WHERE suggestion won't work.

 

Jim

andreas_lds
Jade | Level 19

however, these codes spend around 10 hours getting the results. Is there any method to simplify the steps, getting the same result but spend a shorter time?

Start by showing what you have by posting proc contents of the original files, then explain what you are trying to achieve. And you should post some details about the hardware you are using.

Kurt_Bremser
Super User

If the example data represents your real data correctly, the first SQL will result in a (hidden) cartesian join that can easily blow up on you anytime. The two 101s in both datasets result in 4 observations in the output.

 

So you should get to know why there are repeats in both datasets.

Next, the use of the DISTINCT clause is always a big performance-eater and should only be done when absolutely necessary.

 

If you actually use DISTINCT to dedupe the result, then it is much better to dedupe before doing the join.

ChrisNZ
Tourmaline | Level 20

1. Why do you

order by citing1.appln_id

when you never use that order?

2. What is the HAVING supposed to achieve when the value is either missing or 1?

3. As mentioned, is the DISTINCT keyword useful? Is there a DQ issue with the data?

In any case this generates exactly the same result on your data and should be much faster:

PROC SQL;
	CREATE TABLE want AS
	SELECT unique
	Applicant_name,
	appln_filing_year,
	appln_id,
	cited_appln_id,
	pre_patent,
	case when cited_appln_id=appln_id_citing_p5 then 1 end as cited_patent
	FROM (
		SELECT  
		citing1.Applicant_name,
		citing1.appln_filing_year,
		citing1.appln_id,
		pre_patent,
		citing1.cited_appln_id,
		citing2.appln_id_citing_p5
		FROM
        have1_&year.       as citing1
		JOIN have2_&year.  as citing2 
		ON citing1.appln_id = citing2.appln_id_citing
	)
	GROUP BY Applicant_name,appln_id,cited_appln_id
	HaViNg cited_patent=max(cited_patent) 
	;
QUIT;

 

 

ChrisNZ
Tourmaline | Level 20

This also generates the same results and should be even faster:

PROC SQL;
	CREATE TABLE t AS
	SELECT unique * 
	FROM (
		SELECT  
			citing1.Applicant_name,
			citing1.appln_filing_year,
			citing1.appln_id,
			pre_patent,
			citing1.cited_appln_id,
			case when cited_appln_id=appln_id_citing_p5 then 1 end as cited_patent
		FROM have1_&year.  as citing1
		JOIN have2_&year.  as citing2 
		ON citing1.appln_id = citing2.appln_id_citing
	)
	GROUP BY Applicant_name,appln_id,cited_appln_id
	Having cited_patent=max(cited_patent) 
	;
QUIT;

 

jimbarbour
Meteorite | Level 14

@ChrisNZ,

 

Select Unique?  I know it works in Oracle, but is that valid in SAS?  Is its behavior any different than Select Distinct?

 

Edit:

DISTINCT

eliminates duplicate rows. The DISTINCT argument is identical to UNIQUE.

Alias UNIQUE
Notes Although the UNIQUE argument is identical to DISTINCT, it is not an ANSI standard.

Well, I'll be darned.  Learn something new every day.

 

Jim

ChrisNZ
Tourmaline | Level 20

Unsure how unique became my go-to syntax rather than distinct. Easier to type maybe? Yes, they are the same.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 1083 views
  • 10 likes
  • 7 in conversation