BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
tritringuyen
Quartz | Level 8
Hi Peter, it does not work.
mkeintz
PROC Star

@tritringuyen wrote:
Hi Peter, it does not work.

What does "it does not work" mean?

 

It ran out of resources?

It produced erroneous results?

It generated error messages and stopped?

 

I think the time has come for you provide some more information to help us help you.  A good starting point would be @Tom 's questions.   And I, for one, would need to see the sas log and the code that produced it to be confident that I am understanding all the relevant issues for this task.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
tritringuyen
Quartz | Level 8
Hi @mkeintz, thank you for your response. Please find below for the SAS log when I run your code. The problem might be that my SAS system (windows) is too old and there might be some functions not available.

The advantage of your code is that it is very fast.

I will come back to @Tom soon.

Thank you very much!
Best regards,

***

NOTE: PROCEDURE PRINTTO used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds


141 data have;
142 input gvkey $6. sic fyear item :$5. ;
143 datalines;

NOTE: The data set WORK.HAVE has 12 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds


156 run;
157 proc sort data=have out=have_srt;
158 by sic fyear gvkey item;
159 run;

NOTE: SAS threaded sort was used.
NOTE: There were 12 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.HAVE_SRT has 12 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds


160 data want (keep=gvkeyi gvkeyj sic fyear gvkey_i_total gvkey_j_total
160! ij_intersection) ;
161 array _gvkeyid {200} $6; /* List of gvkey's for current SIC/FYEAR */
162 array _itemcount {200}; /* Count of items by gvkey, for current
162! SIC/FYEAR */
163 array _itemmatrix {200,200} $5 ;
164 if 0 then set have_srt /* Establish variables in the PDV, without
164! reading data*/
165 have_srt (rename=(gvkey=gvkeyi))
166 have_srt (rename=(gvkey=gvkeyj));
167 call missing(gvkey_i_total,gvkey_j_total,ij_intersection);
168 if _n_=1 then do;
169 declare hash itemlist(); /*gvkey/item Lookup table, within each
169! SIC/FYEAR*/
170 itemlist.definekey('gvkey','item');
171 itemlist.definedone();
172 declare hash ij_count(); /*For counting up gvkeyi/gvkeyj item
172! intersection*/
173 ij_count.definekey('gvkeyi','gvkeyj');
174 ij_count.definedata('ij_intersection');
175 ij_count.definedone();
176 end;
177 itemlist.clear();
178 ij_count.clear();
179 do g=1 by 1 until (last.fyear); /* Read all records for every gvkey
179! in current SIC/FYEAR */
180 do i=1 by 1 until (last.gvkey); /* Read each item for current gvkey
180! */
181 set have_srt ;
182 by sic fyear gvkey;
183 _itemmatrix{g,i}=item ; /* Populate matrix ... */
184 itemlist.add(); /* ... and the analogous lookup table
184! */
185 end;
186 _gvkeyid{g}=gvkey;
187 _itemcount{g}=i;
188 end;
189 /* Now process the accumulated data for this SIC/FYEAR */
190 do i=1 to g; /* For each gvkeyi ... */
191 gvkeyi=_gvkeyid{i};
192 gvkey_i_total=_itemcount{i};
193 do j=1 to g; /* For each gvkeyj ... */
194 if j=i then continue; /* Skip loop if i=j */
195 gvkeyj=_gvkeyid{j};
196 gvkey_j_total=_itemcount{j};
197 ij_intersection=0; /* Count common items */
198 if j>i then do k=1 to _itemcount{i};
199 item=_itemmatrix{i,k};
200 if itemlist.find(key:gvkeyj,key:item)=0 then
200! ij_intersection=ij_intersection+1;
201 end;
202 if j>i then ij_count.add();
203 else ij_count.find(key:gvkeyj,key:gvkeyi);
204 output;
205 end;
206 end;
207 run;

ERROR: Could not find method.
ERROR: Could not find method.
ERROR: Could not find method.
ERROR: Could not find method.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 12 observations read from the data set WORK.HAVE_SRT.
WARNING: The data set WORK.WANT may be incomplete. When this step was
stopped there were 6 observations and 7 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.12 seconds
cpu time 0.12 seconds


Tom
Super User Tom
Super User

You didn't answer the question about whether or not you need the INDUSTY and FYEAR as separate sets of observations per COMPANY PAIR.  If so then please supply some example data and expected output that demonstrates data of that type and how you want it summarized. 

 

Please clarify how large your data actually is. 

What are the types and lengths of the variables used in the analysis
How many total observations?
How many distinct COMPANY ids?

How many distinct ITEM codes?

How any distinct INDUSTRY codes?

How many FYEAR values?

 

Where does the dataset live?  Is it actually a SAS dataset? Or are you connecting to some external database?

If it is a SAS dataset does it live on a disk on the SAS computer? Or a shared disk?

 

Are there any duplicate observations? That is same COMPANY, INDUSTRY, FYEAR, ITEM values?  If so then making a dataset with just one observation per each might be the first step towards improved performance.

 

If you really need to process by INDUSTRY*FYEAR then you could subset the problem along those dimensions and re-run separate queries for each.

 

But if the problem is too large for the simple SQL join then perhaps a custom SAS data step solution could be faster.  The sizes will matter whether HASH objects or even simple ARRAYs could help in the summarization.  

mkeintz
PROC Star

Another comment, this time on the substance of the task.

 

SIC code is a 4-digit code.  Although not every SIC code is assigned to an industry, you are probably dealing with at least a couple hundred SIC values, some of which will have very few GVKEYs.   Before you go down this rabbit hole a lot more, are you convinced that you should not first be collapsing the SIC codes?  Perhaps just use the first 2 (or even first 3) digits.   (And remember the first 3 digits of SIC code 161  are 016).  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
tritringuyen
Quartz | Level 8

Hi @mkeintz. I tried your code in SAS University Edition and it worked for a small sample (did you revise the code recently? It did not work previously).

 

However, when I tried your code with my sub dataset of 1 year of SIC (1 digit), it stopped due to errors. Please see file attached. In addition to the errors, is there any way for your code to work on windows SAS old version (SAS University Edition will retire soon, while SAS on-demand for academic is not enough space for my data? 

 

Here are some points requested by @Tom: my dataset has about 10,000 firm (gvkey), about 100 industry codes (SIC two digits, thanks @mkeintz for this), and 10 years. The data I give in the original question is a couple of lines of my real dataset (I simplified everything to make the dataset light). My real dataset has 21m observations with four columns: firm, industry, year, and item as indicated in the sample.

 

Thank you both for your kindness and your help so far. I got a feeling that we are very close.

Best regards,

Thierry

mkeintz
PROC Star

@tritringuyen wrote:

Hi @mkeintz. I tried your code in SAS University Edition and it worked for a small sample (did you revise the code recently? It did not work previously).

 

However, when I tried your code with my sub dataset of 1 year of SIC (1 digit), it stopped due to errors. Please see file attached. In addition to the errors, is there any way for your code to work on windows SAS old version (SAS University Edition will retire soon, while SAS on-demand for academic is not enough space for my data? 

 

Here are some points requested by @Tom: my dataset has about 10,000 firm (gvkey), about 100 industry codes (SIC two digits, thanks @mkeintz for this), and 10 years. The data I give in the original question is a couple of lines of my real dataset (I simplified everything to make the dataset light). My real dataset has 21m observations with four columns: firm, industry, year, and item as indicated in the sample.

 

Thank you both for your kindness and your help so far. I got a feeling that we are very close.

Best regards,

Thierry


 

 

 

Your attachment has the notes:

NOTE: Numeric values have been converted to character
values at the places given by: (Line):(Column).
110:5
NOTE: Character values have been converted to numeric
values at the places given by: (Line):(Column).
117:12 121:14

Line 110 is

110 _gvkeyid{g}=gvkey;

which means that gvkey must be a numeric variable, which is being copied to a character variable: _gvkeyid{g}.  So change the array type _gvkeyid from character to numeric.

 

That will also eliminate the notes for lines 117 and 121, in which the character values in _gvkeyid{i} and _gvkey{j} are copied to the numeric vars GVKEYI and GVKEYJ.

 

Importantly, you also got the error message:

ERROR: Array subscript out of range at line 107 column 7.

what do you think that means?  Line 107 is:

107              _itemmatrix{g,i}=item ;       /* Populate matrix ... */

And another curiosity.  Your log messages reported the values of variables, which included a report that

item=derivativeassetdesignatedashedginginstrumentfairvalue gvkey=764065
 fyear=2010 sic=1000 gvkeyi=817366 gvkeyj=. gvkey_i_total=. gvkey_j_total=.
 ij_intersection=. g=1 last.fyear=0 i=201 last.gvkey=0 FIRST.sic=0 LAST.sic=0
 FIRST.fyear=0 FIRST.gvkey=0 j=. k=. _ERROR_=1 _N_=2

Take a look at the value of ITEM ("derivativeassetdesignatedashedginginstrumentfairvalue").  Is it an expected value?  What is the storage length of the variable ITEM?   You will have to increase the character length specified for the constituents of the array ITEMMATRIX to be that length.  I just used length $5 because that was the length in your sample data.  Please run a proc contents against the data set so we can see the attributes of the variables in use.  

 

 

Please run a proc contents on the dataset so we can see the attributes of the variables in use.  I've already made a mistake for GVKEY.  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
tritringuyen
Quartz | Level 8

Hi @mkeintz, thank you very much for your support. I am happy to let you know that your code works with larger datasets (e.g. with around 1m observations in an acceptable time). After carefully checking and modifying the code, it works perfectly. It is more than enough for me. Apologies that I did not accept your code as the solution because I was busy at the time you posted the code and the code did not work at that time (due to small errors).

 

Also, thank you @Tom for providing the code as well. Your code works well.

 

Thank you both for your kindness and patience. I am very happy with SAS communities and I believe that you both are making a difference.

 

Best regards,

Thierry

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 22 replies
  • 1800 views
  • 4 likes
  • 4 in conversation