smilingmelbourne
Fluorite | Level 6
Hi, as Patrick suggested in my last post on a SAS macro to generate a list of variables, let me describe the data set I have downloaded from Datastream.

To download monthly data for each variable, I have an Excel workbook with 27 sheets. Each sheet contains 9,000 firms (ISIN codes), and data is collected over 17 years. So each sheet has 9,000 rows and 200+ columns, one column per month of each year. The raw data looks like this:

Code  _1_Jan_95  _1_Feb_95  ...  _1_Mar_11  _1_Apr_11
ABC   534        353             353        555
CDF   445        554             252        252
GHI   $$ERRORS   556             335        552

I have downloaded data for around 30 variables.
My tasks include:
1) Import all 27 Excel sheets into SAS
2) Clean invalid values such as $$, ERRORS, NA, etc., wherever a firm has no data for a particular quarter or year
3) Convert the values in each of these columns from character to numeric. If the firm in the first row of a given Excel sheet has missing or invalid data, SAS imports the whole column as character rather than numeric, so 85.65 is left-aligned because it is recognized as text
4) Stack up/append all 27 datasets vertically
5) Transpose the final stacked-up dataset into panel-data form, rename variables, etc.

For importing the 27 Excel sheets, I came up with the following macro:



/*Import multiple Excel sheets in an Excel workbook*/
%let filepath=C:\RESEARCH\DataCollection\DS RequestManager;
%let filename=VO.xlsm;

%macro import_Excel_sheets(first_sheet=, last_sheet=);
   %local i; /*Loop over the datasets/sheets*/
   %do i=&first_sheet. %to &last_sheet. %by 1;
      proc import datafile="&filepath.\&filename."
            out=Sheet&i. dbms=excel replace;
         sheet="Sheet&i.";
         getnames=yes;
         mixed=yes;
      run;
   %end;
%mend import_Excel_sheets;

/*For example, to import sheets 1 through 27:*/
%import_Excel_sheets(first_sheet=1, last_sheet=27)


I'm really, really reluctant to use INFILE as one of you suggested. I'm not familiar with it, and to be honest I still find it hard to see the advantage of using INFILE or anything involving INPUT X Y Z. Why? If I use INPUT X Y Z, don't I always have to count the columns or something like that? How would I count? I have a list of 250,000 international firms, and some firms have longer codes than others, so how can I know the longest code in order to declare something like INPUT X $ longest_char_num (say, INPUT X $ 25)?

Secondly, with 200+ columns I bet I will have an extremely hard time typing all those variables: INPUT var1 $ 12 var2 $ 15 ... var240 $ 22;

So I think I must use Proc Import.

Previously when I collected data like this, I cleaned the data directly in Excel: I renamed variables and stacked sheets on top of each other. This time, the Excel file is really heavy. Even opening it takes about a minute, and so does saving it. Sometimes it just freezes.

I can't change the way an Excel sheet of data looks because the data vendor dictates the way it looks.

Back to the data cleaning: when it comes to transposing the stacked-up dataset, I must declare the BY variables, all 200+ of them, so initially I had to type everything from _1_Jan_95 up to _1_Apr_11. I made an error once and it took days to spot where it was. The SAS log runs forever, and initially I had to sit there and clear it (now I know I can use Proc Printto).
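(For what it's worth, PROC TRANSPOSE accepts a name-prefix variable list, so the 200+ names never need to be typed out. A minimal sketch, where the dataset names are illustrative and the columns are assumed to be all numeric by that point:

[pre]
proc sort data=stacked;
   by Code;
run;

proc transpose data=stacked out=long (rename=(col1=value));
   by Code;       /* one group per firm */
   var _1_: ;     /* prefix list: every variable whose name starts with _1_ */
run;
[/pre]
)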

Then comes the awkward date format _1_Feb_95, and I must extract parts of it and transform them into day, month and year. I have a long list of lines like: else if substr(_name_,4,3)="Feb" & substr(_name_,8,2)="95" then date='01Feb1995'D;
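(That whole if/else chain can be collapsed into a single INPUT call with the MONYY informat, since the month and year always sit in fixed positions of _NAME_. A sketch, assuming all the names follow the _1_Mon_yy pattern:

[pre]
/* "Feb" || "95" -> "Feb95" -> 01FEB1995 */
date = input(substr(_name_, 4, 3) || substr(_name_, 8, 2), monyy5.);
format date date9.;
[/pre]
)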

By the way, I still find it hard to use INFILE and INPUT. How do you do it in practice? Do you manually count the positions of the columns? Manually count the number of spaces for each variable?

Message was edited by: smilingmelbourne
20 REPLIES
Reeza
Super User
If your variables are changing then proc import may work better.

However, you should never need to type the variables out in a BY statement; that sounds insanely inefficient. You may want to look into array processing.

See a brief example below that might help. I'd suggest cleaning/validating within the loop as well to output the final clean value rather than using a separate step.

HTH,
Reeza
[pre]
*Create random data set;
data have;
   do i=1 to 10;
      a1=i;
      a2=i*2;
      output;
   end;
run;

data have2;
   *rename variables to look like your format;
   set have (rename=(a1=_1_APR_95 a2=_1_MAY_95));

   *create array based on your data structure;
   array orig{*} _1_:;

   *loop over array;
   do i=1 to dim(orig);
      *get variable name to process;
      temp=vname(orig(i));
      *convert name to a month date value;
      variable_name=input(substr(temp, 4, 3)||substr(temp, 8, 2), monyy5.);
      *get value of observation - you may need a temp variable and clean it up before assigning the final value;
      value=orig(i);
      *output to transposed dataset;
      output;
   end;
   *drop variables that you don't need anymore;
   drop _1_: i temp;
   format variable_name monyy7.;
run;
[/pre]
smilingmelbourne
Fluorite | Level 6
Thank you, Reeza. Each of my datasets contains roughly 200+ columns, and I don't know whether they are numeric or character due to data errors in the first rows (if there are numbers in the first row, the column is numeric in SAS; otherwise it's character), although they should all be numeric.

So you mean I can use _CHARACTER_ and _NUMERIC_ to create two arrays, one for numeric variables and the other for character variables, and after that I should convert the character array from character to numeric?
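(One common pattern, though not the only one: since a variable's type cannot be changed in place, you rename the character column and read it back with INPUT; the ?? modifier turns $$ERR, NA, etc. into missing values without filling the log with notes. A sketch for a single column, with illustrative names:

[pre]
data want;
   set stacked (rename=(_1_Jan_95=c_1_Jan_95));   /* suppose this column imported as character */
   _1_Jan_95 = input(c_1_Jan_95, ?? best12.);     /* invalid text becomes missing */
   drop c_1_Jan_95;
run;
[/pre]
)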

Thanks a lot
Ksharp
Super User
[pre]
options validvarname=any;
[/pre]
will allow you to use '01/21/2010'n as a variable name; use the function vname() to get it, and then you do not need to transform the names.
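(A minimal illustration of the name-literal syntax that VALIDVARNAME=ANY enables:

[pre]
options validvarname=any;

data _null_;
   '01/21/2010'n = 42;    /* a name literal as a variable name */
   put '01/21/2010'n=;    /* refer to it the same way */
run;
[/pre]
)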

Ksharp
smilingmelbourne
Fluorite | Level 6
Thank you very much for the hint!
Patrick
Opal | Level 21
Hi

From how you describe your data structure, I believe the code below should do the job.

Just change the path in the libname statement to the location of your Excel workbook and run the code.


libname MyWB EXCEL 'c:\temp\test.xls';
options source2;

proc sql noprint;
   create table ctrl as
   select libname, memname, name, type, varnum
      from dictionary.columns
      where libname='MYWB'
      order by libname, memname
   ;
   select name into :FirstCol_VarName
      from ctrl
      where varnum=1
   ;
quit;

filename SET temp;
filename ASSIGN temp;

data _null_;
   set ctrl;
   by libname memname;

   file SET;
   if first.memname then
      put libname +(-1) ".'" memname +(-1) "'n" "(rename=(";
   if type='char' and substr(name,1,1)='_' then
      put name "=C_C" name;
   if last.memname then
      put "))";

   file ASSIGN;
   if type='char' and substr(name,1,1)='_' then
      do;
         put name "=coalesce(" name ",input(C_C" name ",?? 8.));";
         put "drop C_C" name ";";
      end;
run;

data want(keep=&FirstCol_VarName date value indsname);
   set %include SET; indsname=inds;
   %include ASSIGN;

   format date date9.;
   array CharDt {*} _: ;
   do i=1 to dim(CharDt);
      if not missing(CharDt(i)) then
         do;
            /* strip the underscores: _1_Jan_95 -> 1Jan95 */
            date=input(compress(vname(CharDt(i)),'_'),?? date9.);
            value=CharDt(i);
            indsname=inds;
            output;
         end;
   end;
run;

libname MyWB;


HTH
Patrick
smilingmelbourne
Fluorite | Level 6
Thank you, Patrick. Your code is quite advanced for me (but obviously very useful, since I will learn a lot from it). I will read it to understand, and will come back later if I have questions.
Patrick
Opal | Level 21
Hi

It sounds like all your columns in the Excel sheets have a majority of numeric values, and what you want is a resulting SAS dataset where all these columns are numeric, with missing values for the source cells that contain character values.

If so, then using "mixed=no" for Proc Import will probably give you what you want (it also depends on your registry settings).
Give it a try. More details in the doc: http://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#a003103761.htm


Regarding my "advanced code", what I do there is:
- use the Excel libname engine (this makes all the Excel sheets of a workbook accessible as SAS data sets).
- read the dictionary information for these newly available SAS datasets.
- use the dictionary information ("metadata" = data about the data/the tables) to dynamically create statements that read all the existing tables and change character variables to numeric.

"filename temp" assigns a temporary external file. I then use a "data _null_" step to create SAS statements which I write ("put") to these external files.

In the next data step I include these external files ("%include"). %include is processed before the data step gets compiled, so it's basically just inserting the SAS code I've previously written to the temporary external files.

Using "options source2" you see in the log what code the %include statements actually add.

So again: %include just inserts the text (SAS code) written to the temporary external text file at the place where it's used, and only after this does the SAS compiler come and compile the data step.
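A stripped-down illustration of the pattern (the fileref and statement here are made up): write a statement to a temporary file in one step, then %include it inside a later step.

[pre]
filename gencode temp;

data _null_;
   file gencode;
   put 'x = y * 2;';            /* generate one SAS statement */
run;

data demo;
   y = 21;
   %include gencode / source2;  /* inserts x = y * 2; before compilation */
run;
[/pre]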

HTH
Patrick

Message was edited by: Patrick
smilingmelbourne
Fluorite | Level 6
Hi, it doesn't work with "MIXED=NO". After importing into SAS, a lot of columns/variables have values that look something like:

- $$ERR,2660, NO DATA AVAILABLE
- $$ERR,INVALID CODE
- NA

etc...

If a column has a numeric value in the first row, then it's perfectly OK and the whole column looks exactly how I like. Otherwise, everything down the column is character, which I later have to convert to numeric.

Previously, when I collected smaller datasets with such problems, I relied on a simple trick of entering a numeric value like 1, 2, 3 in the first row of the Excel sheet. Then I imported them into SAS and everything was numeric: $$ERR, NA, etc. became dots, which is exactly what I want. This time, for 230K firms over 17 years, the Excel file becomes rather heavy. It takes forever to open or to save a change.

Thank you very much
Patrick
Opal | Level 21
Hi

'doesn't work with "MIXED=NO". '
That's why I referred you to the doc as it's written there that the behavior of "MIXED=NO" depends on the setting of "guessingrow".

But nevertheless: If the data structure is as you've described it then the code I've posted should work.

HTH
Patrick
smilingmelbourne
Fluorite | Level 6
Hi Patrick, I'm trying to absorb your code, from which I've learnt a lot. Actually, I'm currently a student, and the only SAS book I've read so far is The Little SAS Book, plus several SUGI papers for beginner/intermediate users.

This I found interesting, since I'd never seen it before: after running it I saw all 27 Excel sheets in SAS and could look at each of the datasets. I'd never heard of or seen this kind of thing before!

libname MyWB EXCEL 'C:\DataCollection\DS RequestManager\Sales.xlsm'; /*Assign the Excel libname engine*/

Why are there plus/minus signs in your code?

if first.memname then put libname +(-1) ".'" memname +(-1) "'n" "(rename=(";

Actually, I ran the program but it only almost got through. There was an error saying one variable is both numeric and character, so I didn't get a chance to look at the final dataset (it has 0 rows) to understand what is going on in the code.

Thanks a lot for the code. I've surely learnt a lot, a lot...
Peter_C
Rhodochrosite | Level 12
smilingmelbourne

Sounds like you should read and "transpose" your input in the first step that touches your data. That isn't supplied "out-of-the-box".
For your data source, a libname statement pointing at the workbook could simplify access to the data, but you will need to ensure that the first 8 rows of each sheet establish the data types for those columns.
Although you could read excel directly through DDE, most experts deprecate that technology.
If your data are CSV-type, then a simple data step allows you to achieve directly what you need.
Because you would "transpose" as you read data, you won't need to define much more than "code", "date" and "value" as output columns.
If you want a special flag for those "$$ERRORS" cells, that is not much more work either.
Let the INFILE and INPUT do the hard work
Even if your datastream is created with 100,000 columns (highly unlikely) the program becomes no larger.

But of course, as you say, you don't want (need?) to learn INFILE/INPUT. Still, with a little learning the task gets simpler: I would suggest your solution will become more manageable once you understand these things:
* trailing @
* output
* do loop
More examples and discussion can be found in
http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/a001302699.htm
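To make the read-and-transpose idea concrete, here is a sketch for a CSV export of one sheet (the path, lengths and month range are assumptions, and it further assumes that cells containing commas, like "$$ERR,2660, NO DATA AVAILABLE", are quoted in the file):

[pre]
data want (keep=code date value);
   infile 'c:\temp\vo.csv' dsd firstobs=2 truncover;
   length code $25 cell $40;
   input code @;                         /* read the firm code, hold the line */
   do m = 0 to 195;                      /* Jan 1995 .. Apr 2011 = 196 monthly fields */
      date = intnx('month', '01jan1995'd, m);
      input cell $ @;                    /* next delimited field, still holding */
      value = input(cell, ?? best12.);   /* $$ERR, NA etc. become missing */
      output;
   end;
   format date date9.;
run;
[/pre]

However many columns the sheet has, only the loop bound changes.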

peterC
art297
Opal | Level 21
Other than his comment about DDE (i.e., not using something simply because of others' opinions, especially when we don't all agree), I agree with Peter's comments. However, if these are data that you scraped from the web, why add the extra layer of complexity of first putting the data into Excel files?

Using the same techniques that Peter suggested, you could download the files directly and not risk introducing the whole range of import and conversion problems that can result from trying to interface with Excel.

Art
smilingmelbourne
Fluorite | Level 6
Hi Art, thank you for the reply. The data vendor (Datastream/Worldscope, etc.) has an Excel add-in that I must use to collect their batch data, so I must set up a list of firms in Excel and download the data for those firms into Excel. I can't use SAS/CONNECT to collect data from them.
smilingmelbourne
Fluorite | Level 6
Thank you so much for the very useful material...


Discussion stats
  • 20 replies
  • 6225 views
  • 0 likes
  • 8 in conversation