Hello everyone,
I tried merging two data sets by gvkey but I kept getting an error stating that the variable has been defined as both character and numeric. I did some search and was finally able to merge them. I'm sharing my codes here and I would be very grateful if someone would tell me if I did the right thing. I used these codes:
data data_set_1;
set data_set_1 (rename=(gvkey=gvkey_Character)) ;
gvkey = put(gvkey_Character, 1.) ;
drop gvkey_Character ;
run ;
data data_set_2;
set data_set_2 (rename=(gvkey=gvkey_Character)) ;
gvkey = put(gvkey_Character, 1.) ;
drop gvkey_Character ;
run ;
proc sort data=data_set_1;
by gvkey;
run;
proc sort data=data_set_2;
by gvkey;
run;
data merged;
merge data_set_1 data_set_2;
by GVKEY;
run;
Thank you very much.
The correct methodology is to ensure both variables are the same type, either character or numeric. In general, it's easier if ID's are character. However, since you got a warning, you shouldn't have to change both datasets only one. Your code changes both.
Another concern is that you've assigned it to a single character. As long as GVKEY never has more than a single digit this is fine. Otherwise you should use a different format, perhaps Best12
Yours:
gvkey = put(gvkey_Character, 1.) ;
Probably what you need:
gvkey = put(gvkey_Character, best12.) ;
To exactly determine the reason for your initial problem, we would need to see the variable list from proc contents for both datasets, or the log of your code execution, as the NOTEs would reveal the implicit type conversion(s) in your data steps.
Thank you sir, for the reply.
Below is the note I get from the log execution. I don't if this was what you meant. If meant I'm sorry; I started using SAS only recently.
Thank you once again.
data theo.tenure;
19 set theo.tenure;
20 run;
NOTE: There were 2134412 observations read from the data set THEO.TENURE.
NOTE: The data set THEO.TENURE has 2134412 observations and 13 variables.
NOTE: DATA statement used (Total process time):
real time 4.63 seconds
cpu time 0.71 seconds
21 data theo.debt;
22 set theo.debt;
NOTE: There were 506856 observations read from the data set THEO.DEBT.
NOTE: The data set THEO.DEBT has 506856 observations and 22 variables.
NOTE: DATA statement used (Total process time):
real time 6.95 seconds
cpu time 0.61 seconds
23 proc sort data=theo.tenure;
24 by gvkey;
25 run;
NOTE: There were 2134412 observations read from the data set THEO.TENURE.
NOTE: The data set THEO.TENURE has 2134412 observations and 13 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 6.24 seconds
cpu time 1.98 seconds
26 proc sort data=theo.debt;
27 by GVKEY;
28 run;
NOTE: There were 506856 observations read from the data set THEO.DEBT.
NOTE: The data set THEO.DEBT has 506856 observations and 22 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 7.49 seconds
cpu time 1.18 seconds
29 data merged;
30 merge theo.tenure theo.debt;
ERROR: Variable gvkey has been defined as both character and numeric.
31 by GVKEY;
32 run;
Sorry, I meant the log of your conversion data steps.
I'm so sorry,sir. I don't really get what you mean by my conversion data steps. So, I have provided the proc contents of the two variable here.
Thank you.
SAS Output
The SAS System |
THEO.DEBT | 506856 |
DATA | 22 |
V9 | 0 |
02/11/2017 05:31:48 | 424 |
02/11/2017 05:31:48 | 0 |
NO | |
NO | |
WINDOWS_64 | |
wlatin1 Western (Windows) |
65536 |
3292 |
1 |
154 |
146 |
0 |
YES |
C:\Users\toshiba-pc\Desktop\sas_stuff\debt.sas7bdat |
9.0401M3 |
X64_8HOME |
Currency | Char | 11 | $11. | $11. |
GVKEY | Num | 8 | BEST12. | BEST32. |
USD | Char | 22 | $22. | $22. |
VAR5 | Char | 16 | $16. | $16. |
VAR6 | Char | 28 | $28. | $28. |
VAR7 | Char | 34 | $34. | $34. |
VAR8 | Char | 21 | $21. | $21. |
VAR9 | Char | 18 | $18. | $18. |
VAR10 | Char | 38 | $38. | $38. |
VAR11 | Char | 16 | $16. | $16. |
VAR12 | Char | 19 | $19. | $19. |
VAR13 | Char | 16 | $16. | $16. |
VAR14 | Char | 15 | $15. | $15. |
VAR15 | Char | 15 | $15. | $15. |
VAR16 | Char | 15 | $15. | $15. |
VAR17 | Char | 15 | $15. | $15. |
VAR18 | Char | 15 | $15. | $15. |
VAR19 | Char | 30 | $30. | $30. |
VAR20 | Char | 25 | $25. | $25. |
VAR21 | Char | 1 | $1. | $1. |
VAR22 | Char | 18 | $18. | $18. |
_ | Char | 22 | $22. | $22. |
The SAS System |
THEO.TENURE | 2134412 |
DATA | 13 |
V9 | 0 |
02/11/2017 04:48:53 | 96 |
02/11/2017 04:48:53 | 0 |
NO | |
YES | |
WINDOWS_64 | |
wlatin1 Western (Windows) |
65536 |
3135 |
1 |
681 |
651 |
0 |
YES |
C:\Users\toshiba-pc\Desktop\sas_stuff\tenure.sas7bdat |
9.0401M3 |
X64_8HOME |
COMPANYID | Num | 8 | 11. | 11. | This is a unique indentifier denoting a company inside many tables of Capital IQ data feeds |
ENDYEAR | Num | 8 | 11. | 11. | End year of this record |
FSTYEAR | Num | 8 | |||
LnTenure | Num | 8 | |||
PERIODENDDATE | Num | 4 | YYMMDDN8. | DATETIME22.3 | Period End Date |
PERSONID | Num | 8 | 11. | 11. | Unique identifier for a Person in the Capital IQ database |
PROFUNCTIONID | Num | 8 | 11. | 11. | This is the identifier of a given proFunction. A proFunction is a standardized way to describe titles because titles can be different across firms. |
PROID | Num | 8 | 11. | 11. | Unique identifier for a professional . A professional is the instance of a person being associated to a company (or educational institution, etc.) |
STARTYEAR | Num | 8 | 11. | 11. | Start year of this record |
TENURE | Num | 8 | |||
YEAR | Num | 4 | 11. | 11. | Fiscal Year |
YEARBORN | Num | 8 | 11. | 11. | Year the professional was born |
gvkey | Char | 1 |
gvkey |
YES |
ANSI |
You realize the first two steps do nothing?
While writing code as a newbie, comment it. Before each step, write a statement of what you expect to be occuring in the data step.
It'll help you organize your code and future you when you need to read it again in 6 months.
Are you sure your data is correct? GVKEY being a single character seems weird to me, It's usually a longer character string.
Your code is now referencing Tenure and Debt but your conversion code using dataset 1/dataset 2. We can't know how they map to each other.
Your code should be:
1. Convert numeric to character for GVKEY (data step).
2. Sort by relevant fields, once for each data step
3. Merge data sets.
Thank you, sir for your reply.
Sir, dataset 1 and 2 are tenure and debt, respectively. I just used them as examples. As to the codes I used, as I said before, I searched online and saw a solution to a similar problem so I edited the codes and used it. I think the gvkey appearing as a single character in my codes reflect my limited knowledge of SAS and that's why I wanted to make sure I was doing the right thing.
Once again, thank you.
Not a sir.
You need to re-import or fix your original data first - not sure how you imported it but clearly GVKEY was not imported properly.
Remember to review your data after each step. Open the dataset and visually examine it. You should do this regardless of the software you use, but as a step in your analytical process.
Its also better if you make sure when importing that it's character in both to start off with, rather than mess around with trying to change it after.
The correct methodology is to ensure both variables are the same type, either character or numeric. In general, it's easier if ID's are character. However, since you got a warning, you shouldn't have to change both datasets only one. Your code changes both.
Another concern is that you've assigned it to a single character. As long as GVKEY never has more than a single digit this is fine. Otherwise you should use a different format, perhaps Best12
Yours:
gvkey = put(gvkey_Character, 1.) ;
Probably what you need:
gvkey = put(gvkey_Character, best12.) ;
Notice from the proc contents results that you posted: GVKEY is numeric in your first data set, but character in your second data set. To make them match, you could convert GVKEY from numeric to character, but only do that for your first data set.
Looking at your program, you could get rid of the second DATA step entirely. The remaining steps in combination should work.
Thank you everybody for time and patience.
I'm new to SAS and research in general and as has been pointed out to me, I think the mistake is coming from me. I'm sorry.
I think the correct identifier to use for merging my data is COMPANYID and not GVKEY. I have gone back and I'm compiling my data again. I'll tell you what happens.
Once again, thank you for your time and patience. I really appreciate it.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.