@tritringuyen wrote:
Hi Peter, it does not work.
What does "it does not work" mean?
It ran out of resources?
It produced erroneous results?
It generated error messages and stopped?
I think the time has come for you provide some more information to help us help you. A good starting point would be @Tom 's questions. And I, for one, would need to see the sas log and the code that produced it to be confident that I am understanding all the relevant issues for this task.
You didn't answer the question about whether or not you need the INDUSTY and FYEAR as separate sets of observations per COMPANY PAIR. If so then please supply some example data and expected output that demonstrates data of that type and how you want it summarized.
Please clarify how large your data actually is.
What are the types and lengths of the variables used in the analysis
How many total observations?
How many distinct COMPANY ids?
How many distinct ITEM codes?
How any distinct INDUSTRY codes?
How many FYEAR values?
Where does the dataset live? Is it actually a SAS dataset? Or are you connecting to some external database?
If it is a SAS dataset does it live on a disk on the SAS computer? Or a shared disk?
Are there any duplicate observations? That is same COMPANY, INDUSTRY, FYEAR, ITEM values? If so then making a dataset with just one observation per each might be the first step towards improved performance.
If you really need to process by INDUSTRY*FYEAR then you could subset the problem along those dimensions and re-run separate queries for each.
But if the problem is too large for the simple SQL join then perhaps a custom SAS data step solution could be faster. The sizes will matter whether HASH objects or even simple ARRAYs could help in the summarization.
Another comment, this time on the substance of the task.
SIC code is a 4-digit code. Although not every SIC code is assigned to an industry, you are probably dealing with at least a couple hundred SIC values, some of which will have very few GVKEYs. Before you go down this rabbit hole a lot more, are you convinced that you should not first be collapsing the SIC codes? Perhaps just use the first 2 (or even first 3) digits. (And remember the first 3 digits of SIC code 161 are 016).
Hi @mkeintz. I tried your code in SAS University Edition and it worked for a small sample (did you revise the code recently? It did not work previously).
However, when I tried your code with my sub dataset of 1 year of SIC (1 digit), it stopped due to errors. Please see file attached. In addition to the errors, is there any way for your code to work on windows SAS old version (SAS University Edition will retire soon, while SAS on-demand for academic is not enough space for my data?
Here are some points requested by @Tom: my dataset has about 10,000 firm (gvkey), about 100 industry codes (SIC two digits, thanks @mkeintz for this), and 10 years. The data I give in the original question is a couple of lines of my real dataset (I simplified everything to make the dataset light). My real dataset has 21m observations with four columns: firm, industry, year, and item as indicated in the sample.
Thank you both for your kindness and your help so far. I got a feeling that we are very close.
Best regards,
Thierry
@tritringuyen wrote:
Hi @mkeintz. I tried your code in SAS University Edition and it worked for a small sample (did you revise the code recently? It did not work previously).
However, when I tried your code with my sub dataset of 1 year of SIC (1 digit), it stopped due to errors. Please see file attached. In addition to the errors, is there any way for your code to work on windows SAS old version (SAS University Edition will retire soon, while SAS on-demand for academic is not enough space for my data?
Here are some points requested by @Tom: my dataset has about 10,000 firm (gvkey), about 100 industry codes (SIC two digits, thanks @mkeintz for this), and 10 years. The data I give in the original question is a couple of lines of my real dataset (I simplified everything to make the dataset light). My real dataset has 21m observations with four columns: firm, industry, year, and item as indicated in the sample.
Thank you both for your kindness and your help so far. I got a feeling that we are very close.
Best regards,
Thierry
Your attachment has the notes:
NOTE: Numeric values have been converted to character
values at the places given by: (Line):(Column).
110:5
NOTE: Character values have been converted to numeric
values at the places given by: (Line):(Column).
117:12 121:14
Line 110 is
110 _gvkeyid{g}=gvkey;
which means that gvkey must be a numeric variable, which is being copied to a character variable: _gvkeyid{g}. So change the array type _gvkeyid from character to numeric.
That will also eliminate the notes for lines 117 and 121, in which the character values in _gvkeyid{i} and _gvkey{j} are copied to the numeric vars GVKEYI and GVKEYJ.
Importantly, you also got the error message:
ERROR: Array subscript out of range at line 107 column 7.
what do you think that means? Line 107 is:
107 _itemmatrix{g,i}=item ; /* Populate matrix ... */
And another curiosity. Your log messages reported the values of variables, which included a report that
item=derivativeassetdesignatedashedginginstrumentfairvalue gvkey=764065
fyear=2010 sic=1000 gvkeyi=817366 gvkeyj=. gvkey_i_total=. gvkey_j_total=.
ij_intersection=. g=1 last.fyear=0 i=201 last.gvkey=0 FIRST.sic=0 LAST.sic=0
FIRST.fyear=0 FIRST.gvkey=0 j=. k=. _ERROR_=1 _N_=2
Take a look at the value of ITEM ("derivativeassetdesignatedashedginginstrumentfairvalue
"). Is it an expected value? What is the storage length of the variable ITEM? You will have to increase the character length specified for the constituents of the array ITEMMATRIX to be that length. I just used length $5 because that was the length in your sample data. Please run a proc contents against the data set so we can see the attributes of the variables in use.
Please run a proc contents on the dataset so we can see the attributes of the variables in use. I've already made a mistake for GVKEY.
Hi @mkeintz, thank you very much for your support. I am happy to let you know that your code works with larger datasets (e.g. with around 1m observations in an acceptable time). After carefully checking and modifying the code, it works perfectly. It is more than enough for me. Apologies that I did not accept your code as the solution because I was busy at the time you posted the code and the code did not work at that time (due to small errors).
Also, thank you @Tom for providing the code as well. Your code works well.
Thank you both for your kindness and patience. I am very happy with SAS communities and I believe that you both are making a difference.
Best regards,
Thierry
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.