Hi, as Patrick suggested in my last post on SAS macro to generate a list of variables, let me describe a bit the data set I have downloaded from Datastream.
To download monthly data for each variable, I have an Excel workbook with 27 sheets. Each sheets contains 9,000 firms (ISIN codes) and data is collected over 17 years. So each sheet has 9,000 rows and 200+ columns, each column corresponding to each quarter of each year. The raw data looks like below
Code _1_Jan_95 _1_Feb_95........_1_Mar_11 _1_Apr_11
ABC 534 353 353 555
CDF 445 554 252 252
GHI $$ERRORS 556 335 552
I have been download data for around 30 variables.
My tasks include:
1) Import all 27 Excell sheets into SAS
2) Clean invalid values such as $$, ERRORS, NA, vv... whenever the firm doesn't have data for one particular quarter or year
3) Convert all those values for each of the quarters from character value to numeric values. This is because if the firm in the 1st row in a given Excell sheet doesn't have data or have invalid data then after importing SAS documents it as a character values rather than numeric values, so 85.65 is arranged to left because it is recognized as characters
4) Stack up/append vertically all these 27 datasets
5) Transpose the final stacked-up dataset to panel data form, rename variables, etc...
For importing 27 Excell sheets, I came up with the following macro
/*Import multiple Excel sheets in an Excel workbook*/
%let filepath=C:\RESEARCH\DataCollection\DS RequestManager;
%let filename=VO.xlsm;
%macro import_Excel_sheets(first_sheet=, last_sheet=);
%local i; /*Loop over the datasets/sheets*/
%do i=&first_sheet. %to &last_sheet. %by 1;
proc import datafile="&filepath.\&filename."
out=Sheet&i. dbms=EXCEL replace;
sheet="Sheet&i.";
getnames=yes;
mixed=yes;
run;
%end;
%mend import_Excel_sheets;
I'm really, really reluctant to use INFILE as one of you suggested. I'm not familiar with it, and to be honest, I still find it confusing to understand the advantage of using INFILE or anything involving INPUT X Y Z. Why? If I use INPUT X Y Z, I always have to count the columns or sth like that? How can I count? I have a list of 250,000 international firms, some firms have more codes than the others, how can I know what is the longest codes so I can declare sth like INPUT X $ longest_char_num (say, INPUT X $ 25)?
Secondly, with 200++ columns I bet that I will have an extremely hard time typing all those variables INPUT var1 $ 12 var2$ 15 ..... var240 $ 22;
So I think I must use Proc Import.
Previously when I collect data like this, what I did was I cleaned data in Excell directly. I renamed variables, stacked sheets on top of each others. This time, one Excell file is really heavy. Even openning one file takes like 1 minute to open, then saving it also. Sometimes it just freezes.
I can't change the way an Excel sheet of data looks because the data vendor dictates the way it looks.
Back to the data cleaning, when it comes to transpose the stacked-up dataset, I must declare the BY variables, which is 200++, and so I initially must type all these _1_Jan_95 up until _1_Apr_11. I made errors once and it took days to spot where it is. The SAS logs runs like forever, and I intially must sit there and clear the logs (now I know I can use Proc Printto).
Then it comes the invalid date _1_Feb_95 and I must extract parts of it and transform them to day, month and year. I have a long list of something like: else if substr(_name_,4,3)="Feb" & substr(_name_,8,2)="95" then date='01Feb1995'D;
By the way, I still find it hard to use INFILE and INPUT? How do you do it in practice? Do you manually count the position of the columns? Manually count the number of spaces for each variable?
Message was edited by: smilingmelbourne
Message was edited by: smilingmelbourne
... View more