Hello all,
I am using the following code,
data have1_1985; infile cards dsd dlm=","; input Applicant_name :$8. appln_filing_year: 8. appln_id :8. cited_appln_id :8. pre_patent :8. ; cards; A,2021,101,201,1 B,2021,101,202, A,2017,102,203, A,2016,103,204, B,2018,104,205, ;;;; run; data have2_1985; infile cards dsd dlm=","; input appln_id_citing :8. appln_id_citing_p5 :8. ; cards; 101,201 101,203 102,203 103,205 104,205 ;;;; run; %MACRO Explo(year=); PROC SQL; CREATE TABLE want1_&year. AS SELECT DISTINCT citing1.Applicant_name, citing1.appln_filing_year, citing1.appln_id, pre_patent, citing1.cited_appln_id, citing2.appln_id_citing_p5 FROM have1_&year. (keep= Applicant_name appln_filing_year appln_id cited_appln_id pre_patent)as citing1 JOIN have2_&year.(keep=appln_id_citing appln_id_citing_p5) AS citing2 ON citing1.appln_id = citing2.appln_id_citing order by citing1.appln_id ; QUIT; data want2_&year.; length cited_patent 3.; set want1_&year.; if cited_appln_id=appln_id_citing_p5 then cited_patent=1; run; PROC SQL; CREATE TABLE want3_&year. AS SELECT DISTINCT Applicant_name, appln_filing_year, appln_id, cited_appln_id, pre_patent, cited_patent FROM want2_&year. GROUP BY Applicant_name,appln_id,cited_appln_id Having cited_patent=max(cited_patent) ; QUIT; %MEND Explo; %Explo(year=1985)
however, these codes spend around 10 hours getting the results. Is there any method to simplify the steps, getting the same result but spend a shorter time?
I expect to get the "Want3" (want1, and want2 are just intermediary)
many thanks in advance.
Is your actual have data in SAS or in a database? Are these yearly tables as the sample data indicates? What are the actual data volumes (like number of rows and volume in GB)?
Do you need all Want tables or only Want3 and the other two tables in your code are just intermediary?
@Alexxxxxxx Your have1 dataset is actually not that big so it's only the join that blows things up. If reformulating things a bit it might be possible to execute for all years at once. Do you have a year or date column in your actual data which we need to add for joining?
I would agree with @Patrick that if there is a way you can get all the data you need without generating so many rows, that will be your best method of reducing run time.
One thing I notice, is that it looks like the Data step really isn't necessary. A Case statement in the first SQL procedure appears to produce the same results. Check it and see what you think. This would eliminate having to read every row again in the Data step.
Here's the modified SQL (added Case statement):
PROC SQL;
CREATE TABLE want1_&year. AS
SELECT DISTINCT
citing1.Applicant_name,
citing1.appln_filing_year,
citing1.appln_id,
pre_patent,
citing1.cited_appln_id,
citing2.appln_id_citing_p5
,CASE when cited_appln_id=appln_id_citing_p5 then 1
ELSE . END AS cited_patent
FROM
have1_&year. (keep= Applicant_name appln_filing_year appln_id cited_appln_id pre_patent)as citing1
JOIN have2_&year.(keep=appln_id_citing appln_id_citing_p5) AS citing2 ON citing1.appln_id = citing2.appln_id_citing
order by citing1.appln_id
;
QUIT;
Results:
@Alexxxxxxx Looking at your have2 dataset and the number of rows I suspect that the two columns you give us are not the primary key and if de-duping volumes get eventually quite a bit lower.
To test the theory can you please execute below against your real data and tell us if that works (fits into memory) and what number n_obs prints in the log.
data _null_;
if 0 then set have2_1985;
dcl hash h1(dataset:"have2_1985");
h1.defineKey('appln_id_citing','appln_id_citing_p5');
h1.defineDone();
n_obs=h1.num_items;
put n_obs=;
stop;
run;
And an added question: Is your original data source from which you create HAVE2 already a SAS table or is this data in a data base? If it's the latter then one should reduce volumes already on the DB side.
Below to illustrate my line of thought (fully working code) - but it all comes down if have2 can fit into memory else another coding approach will be required.
data have1_1985;
infile cards dsd dlm=",";
input
Applicant_name :$8.
appln_filing_year: 8.
appln_id :8.
cited_appln_id :8.
pre_patent :8.
;
cards4;
A,2021,101,201,1
B,2021,101,202,
A,2017,102,203,
A,2016,103,204,
B,2018,104,205,
;;;;
data have2_1985;
infile cards dsd dlm=",";
input
appln_id_citing :8.
appln_id_citing_p5 :8.
;
cards4;
101,201
101,203
102,203
103,205
104,205
;;;;
%MACRO Explo(year=);
data _null_;
if 0 then set have2_&year.;
dcl hash h1(dataset:"have2_&year.");
h1.defineKey('appln_id_citing','appln_id_citing_p5');
h1.defineData('appln_id_citing','appln_id_citing_p5');
h1.defineDone();
h1.output(dataset:"have2_dedup_&year.");
stop;
run;
data want3_&year.;
if _n_=1 then
do;
if 0 then set have1_&year. have2_dedup_&year.;
dcl hash h1(dataset:" have2_dedup_&year.",multidata:'y');
h1.defineKey('appln_id_citing');
h1.defineData('appln_id_citing','appln_id_citing_p5');
h1.defineDone();
length cited_patent 3.;
end;
set have1_&year.;
do while(h1.do_over(key:appln_id) eq 0);
if cited_appln_id=appln_id_citing_p5 then
do;
cited_patent=1;
output;
leave;
end;
end;
if cited_patent ne 1 then output;
keep
Applicant_name
appln_filing_year
appln_id
cited_appln_id
pre_patent
cited_patent
;
run;
proc sort data=want3_&year. out=want3_&year.;
by Applicant_name appln_filing_year cited_appln_id;
run;
%MEND Explo;
%Explo(year=1985)
Good approach.
If I may ask: Why use multidata:'y'? It sounded as though we wanted a unique count. Wouldn't multidata:'y' permit duplicates?
Jim
@jimbarbour wrote:
Good approach.
If I may ask: Why use multidata:'y'? It sounded as though we wanted a unique count. Wouldn't multidata:'y' permit duplicates?
Jim
@jimbarbour Ooops! Copy/paste error. Yes, certainly not multidata. I've fixed the code.
Can you post your log? Exactly which step is it that is taking a long time? The 2nd SQL step?
How many rows does Want1 wind up having?
The first thing that occurs to me is that a WHERE clause will be more efficient than a HAVING with a summary function.
Something like:
PROC SQL;
CREATE TABLE want3_&year. AS
SELECT DISTINCT
Applicant_name,
appln_filing_year,
appln_id,
cited_appln_id,
pre_patent,
cited_patent
FROM want2_&year.
WHERE cited_patent > 0
ORDER BY
/* GROUP BY*/
Applicant_name,appln_id,cited_appln_id
/* Having cited_patent=max(cited_patent) */
;
However, if there are cases where the max(cited_patent) is going to be zero or missing and you're going to want to have them in the final output, my WHERE suggestion won't work.
Jim
however, these codes spend around 10 hours getting the results. Is there any method to simplify the steps, getting the same result but spend a shorter time?
Start by showing what you have by posting proc contents of the original files, then explain what you are trying to achieve. And you should post some details about the hardware you are using.
If the example data represents your real data correctly, the first SQL will result in a (hidden) cartesian join that can easily blow up on you anytime. The two 101s in both datasets result in 4 observations in the output.
So you should get to know why there are repeats in both datasets.
Next, the use of the DISTINCT clause is always a big performance-eater and should only be done when absolutely necessary.
If you actually use DISTINCT to dedupe the result, then it is much better to dedupe before doing the join.
1. Why do you
order by citing1.appln_id
when you never use that order?
2. What is the HAVING supposed to achieve when the value is either missing or 1?
3. As mentioned, is the DISTINCT keyword useful? Is there a DQ issue with the data?
In any case this generates exactly the same result on your data and should be much faster:
PROC SQL;
CREATE TABLE want AS
SELECT unique
Applicant_name,
appln_filing_year,
appln_id,
cited_appln_id,
pre_patent,
case when cited_appln_id=appln_id_citing_p5 then 1 end as cited_patent
FROM (
SELECT
citing1.Applicant_name,
citing1.appln_filing_year,
citing1.appln_id,
pre_patent,
citing1.cited_appln_id,
citing2.appln_id_citing_p5
FROM
have1_&year. as citing1
JOIN have2_&year. as citing2
ON citing1.appln_id = citing2.appln_id_citing
)
GROUP BY Applicant_name,appln_id,cited_appln_id
HaViNg cited_patent=max(cited_patent)
;
QUIT;
This also generates the same results and should be even faster:
PROC SQL;
CREATE TABLE t AS
SELECT unique *
FROM (
SELECT
citing1.Applicant_name,
citing1.appln_filing_year,
citing1.appln_id,
pre_patent,
citing1.cited_appln_id,
case when cited_appln_id=appln_id_citing_p5 then 1 end as cited_patent
FROM have1_&year. as citing1
JOIN have2_&year. as citing2
ON citing1.appln_id = citing2.appln_id_citing
)
GROUP BY Applicant_name,appln_id,cited_appln_id
Having cited_patent=max(cited_patent)
;
QUIT;
Select Unique? I know it works in Oracle, but is that valid in SAS? Is its behavior any different than Select Distinct?
Edit:
DISTINCT
eliminates duplicate rows. The DISTINCT argument is identical to UNIQUE.
Alias UNIQUE Notes Although the UNIQUE argument is identical to DISTINCT, it is not an ANSI standard.
Well, I'll be darned. Learn something new every day.
Jim
Unsure how unique became my go-to syntax rather than distinct. Easier to type maybe? Yes, they are the same.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.