ta
I get all the fun! :smileyconfused: I have a task which conceptually is fairly easy, but difficult in practice. I have thousands of Excel files that consist of many tabs or worksheets. The files were created as forms for customers to fill out and return - so there are locked fields, irregular data locations, multiple rows as column headers but not necessarily starting at row 1. There are a lot of legacy forms before someone finally decided that "hey, maybe it would be a good idea to standardize". After that point, there are files with a revision control (although so far, I have identified at least 20 different revisions). Way to many files to manually process and originally I started with Python because I don't see an easy way to do this in SAS. The process in Python is pretty straight forward, but.... (and there is always a but!) the Python package uses the Open Office API to open the Excel files and it throws an error when fields or tabs are protected or locked. So it looks like Python isn't going to work for a vast majority of the files.
Here is a subset of my Python code :
for f in filelist[:]:
created = time.ctime(os.path.getmtime(f))
modified = time.ctime(os.path.getctime(f))
try:
book = xlrd.open_workbook(f)
sheets = book.sheet_names()
if sheets[1] == '1-Summary':
sheet = book.sheet_by_index(1)
cellC1 = sheet.cell_value(0,2)[:14]
else :
cellC1 = 'Not valid Rev doc'
newline = f + ',' + sheet.name + ',' + created + ',' + modified + ','+ cellC1 + '\n'
VerList.write(newline)
except Exception:
ExLst.write(f)
ExLst.write('\n')
This code loops through all the Excel files found in the subdirectories and opens them, checks for a sheet labelled '1-Summary' and then grabs for a text field out of a specific field which should contain the Revision control number. This produces two csv files - VerList : contains list of the readable files with the version control field listed and an exception list of files which would not open.
Once that list exists, then I can select all the files of a specific version. The data on various tabs and various locations is consistent for any given version. I have found that I can open the Excel files where Python will not, so it should work. But after numerous google searches, I have yet to see a clean way to determine what sheets exist and then if a specific sheet exists, import but import based on specific Row/Col limits.
If anyone knows of a white paper or posting along those lines, it would be greatly appreciated. Oh, and this could be Base SAS or EG, but I do have to loop across many thousands of files.
Thanks!
FG