Write a code to reference the folder where the excel files are stored.
To get the list of files you can use the DOPEN() and DREAD() functions. Or take advantage of some existing code that does that, such as
https://github.com/sasutils/macros/blob/master/dirtree.sas
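If you want to roll your own, a minimal sketch with DOPEN()/DNUM()/DREAD() might look like this (the folder path is an assumption; adjust it to where your files actually live):

```sas
filename xldir 'C:\myfiles';        /* assumed folder with the Excel files */

data filelist;
  length fname $256;
  did = dopen('xldir');             /* open the directory via the fileref */
  if did then do i = 1 to dnum(did);
    fname = dread(did, i);          /* name of the i-th member */
    if lowcase(scan(fname, -1, '.')) = 'xlsx' then output;
  end;
  rc = dclose(did);
  keep fname;
run;
```

The result is one observation per XLSX file, which you can use later to drive the processing.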
Write a code to extract the two columns of interest from the first excel file and store into a new dataset.
That sounds very strange. What do the excel files look like? Normally when you IMPORT data you import the whole worksheet. You can always ignore the other variables if you don't need them.
Are they really EXCEL files? It is much easier if they are instead CSV files, which your PC might think of as belonging to Excel but are really just plain text files, since you can write your own data step to read a text file.
If they are Excel files (XLSX files) then each file could have multiple worksheets in it. Do your files have multiple worksheets? Do the worksheets always have the same name(s)?
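For an actual XLSX file, a sketch of reading one file with PROC IMPORT could look like the following. The path, the sheet name, and the variable names ID and CODE are all assumptions about your data:

```sas
proc import datafile='C:\myfiles\file1.xlsx'  /* assumed path */
            out=work.one
            dbms=xlsx
            replace;
  sheet='Sheet1';                             /* assumed sheet name */
run;

/* keep only the two variables of interest */
data work.one;
  set work.one(keep=id code);
run;
```

Run this on one or two real files first and check the log and the resulting dataset before automating anything.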
Then reiterate this code to continue extracting from the next files until it gets to the last file.
So figure out how to do one file. Then generate that same type of code for each file. This can be made easier by creating a macro that processes one file. Then use the list of files to generate one call to the macro for each file. That way if the logic for how to process a file changes you can just update the macro.
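A skeleton for such a macro might look like this. The macro name READ_ONE, the variable names, and the fixed lengths are assumptions to be adjusted once you have seen what your files actually contain:

```sas
%macro read_one(file, out=work.one);
  proc import datafile="&file" out=&out dbms=xlsx replace;
  run;

  /* force a consistent structure so every file can be appended later */
  data &out;
    length id $20 code $10;   /* assumed types/lengths */
    set &out(keep=id code);
  run;
%mend read_one;
```

Keeping all the per-file logic inside the macro means a change to that logic is made in exactly one place.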
Make sure to flag those files that have been imported.
If you just process the files in your list there is no need to "keep track of which files have been read". But if you expect to need to re-do this process again in the future when there may be some NEW files that need to be processed then save the list into a SAS dataset. So that next time you want to look for and process new files you can exclude the files you processed already.
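As a sketch of that idea, save the list to a permanent library and exclude already-processed names on the next run (the PERM libref and the dataset names are assumptions):

```sas
proc sort data=filelist;        by fname; run;
proc sort data=perm.processed;  by fname; run;   /* saved from the last run */

data newfiles;
  merge filelist(in=innew) perm.processed(in=old);
  by fname;
  if innew and not old;   /* keep only files not seen before */
run;
```

After the run, append the newly processed names to PERM.PROCESSED so the next run starts from an up-to-date list.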
The new dataset will now contain the following columns: ID-1, Code-1, ID-2, Code-2, ID-3, Code-3, ... Therefore, next step is to transpose.
I can perhaps guess what you mean by this. For example you might mean that the spreadsheets have two variables named ID and CODE and you want to somehow convert that useful organizational structure of having each ID/CODE pair on its own observation into some type of wide structure with many ID variables and many CODE variables. Why would that help anything at all?
I would expect it would be better to instead just add another variable (or more) that can be used to tie those observations together. So perhaps the name of the file that they came from. Or perhaps a value like a DATE or COUNTRY or DISEASE that is derived from the filename.
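A sketch of that idea: after importing one file, add a variable naming its source instead of transposing anything. The dataset name and the filename here are placeholders:

```sas
data work.one;
  length source $256;
  set work.one;
  source = 'C:\myfiles\file1.xlsx';  /* ties observations to their source file */
run;
```

If the filename encodes something meaningful, such as a date or a country, you can parse that out of SOURCE into its own variable with SCAN() or INPUT().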
Extract both columns of interest from each excel file into a dataset. This will create ~120 datasets*.
Not really. More like it will re-create the same dataset 120 times. Each new one replacing the old one.
Transpose each of the new datasets.
Don't do this. Or if you need to do it, wait until you have all of the data so you know how many variables you will need. I suspect each spreadsheet will have a different number of observations, hence a different number of variables when transposed.
Merge all new datasets.
More like keep appending each new dataset as it is created.
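A sketch of that appending step, run once per imported file (WORK.ALLDATA and WORK.ONE are assumed names):

```sas
proc append base=work.alldata data=work.one;
run;
```

PROC APPEND will complain if the variable names, types, or lengths differ between the files, which is exactly why forcing each file into a consistent structure first matters.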
Delete the ~120 datasets.
Probably not really needed. First, if you process them one by one there is only one temporary dataset to be deleted. But also, SAS work datasets are deleted automatically as soon as your SAS session ends.
So your coding process should be:
Locate one or two example EXCEL files.
Use PROC IMPORT (or a LIBNAME with the XLSX engine) to convert them into SAS datasets.
Figure out what variables they have. Are the ID and CODE variables really in the spreadsheet? Or are the spreadsheets some complex layout that is poorly constructed for use as a data source?
Figure out how to convert the INCONSISTENT mess you will get when PROC IMPORT GUESSES the variable names, variable types, and character lengths into a consistent structure that you can append.
Once you have that mainly sorted try wrapping that into a macro that takes as input the name of the file to read and possibly other things like the sheet within the file to read and the name of the dataset to create. Test that to see if it works for the couple of files you explored already. Then try it on a couple more.
Once the macro is mainly working you can add a PROC APPEND step to aggregate the new file into a consolidated file. This is where making sure the variable names and types are consistent is very important so that each new dataset created from each new file has the same variables so they can be combined.
Now look into the process of getting the list of files. How automatic does it need to be? Is the list static? Then just use normal operating system commands like DIR (or ls on unix) to make a text list of names and paste it into your program. If the list will vary then look into using a tool like the %DIRTREE() macro to build a dataset with the list of filenames.
Once you have the list of filenames you can use it to generate one call to the macro for each file.
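One common way to do that is CALL EXECUTE, sketched below. The macro name READ_ONE and the folder path are assumptions; FILELIST is whatever dataset holds your filenames:

```sas
data _null_;
  set filelist;
  /* %nrstr delays the macro call until after the data step finishes */
  call execute(cats('%nrstr(%read_one)(C:\myfiles\', fname, ')'));
run;
```

Check the log afterwards: each generated macro call is echoed there, which makes it easy to see which file any error came from.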