@SAS0005 I still recommend that you register with Prams and get the proper data for analysis. But just for fun and to improve my Python skills a bit...
When downloading and looking into these prams Excel reports found here I realised that for each table a named ranges had been defined (Table1, Table2, etc.) and that all tables in all sheets have the same structure. In the following an approach how to read such Excel data into a SAS table.
I've run the Python script separately but there are also various methods how to run a Python script directly out of a SAS session (available methods depend on SAS version) or alternatively you can also call SAS out of Python.
Step 1: Using Python read each table (range) into a data frame and then export to a .csv
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.cell import range_boundaries
import re
workbook_path = r"C:\temp\prams_test\\"
workbook_name = "PRAMS-MCH-Indicators-2020-508.xlsx"
outfile_root = r"C:\temp\prams_test\source_csv\\"
wb = load_workbook(workbook_path + workbook_name)
def replace_non_alphanumeric(string, replacement=''):
return re.sub(r'[^0-9a-zA-Z_]+', replacement, string.strip())
def extract_tables(ws, sheet_name):
dfs_tmp = {}
for name, table_range in ws.tables.items():
# Get position of data table defined by named range
min_col, min_row, max_col, max_row = range_boundaries(table_range)
# read value of cell relative to top left corner of named range
reference_cell = ws[table_range.split(':')[0]]
table_name = reference_cell.offset(row=-1).value
table_name = table_name.replace(',',' ')
# Convert table to DataFrame
table = ws.iter_rows(min_row, max_row, min_col, max_col, values_only=True)
header = next(table)
df = pd.DataFrame(table, columns=header)
# add columns workbook, sheet_name and table_name to data frame
df.insert(0, "workbook_name", workbook_name)
df.insert(1, "sheet_name", sheet_name)
df.insert(2, "table_name", table_name)
# write data frame to .csv without header row and index column
df.iloc[1:].to_csv(outfile_root
+ replace_non_alphanumeric(workbook_name.split('.')[0]).lower()
+ "_" + replace_non_alphanumeric(sheet_name).lower()
+ "_" + replace_non_alphanumeric(table_name).lower() + ".csv"
, index=False, header=False)
dfs_tmp[name] = df
return dfs_tmp
# Dictionary to store all the dfs in.
# Format: {table_name1: df, table_name2: df, ...}
dfs = {}
for ws in wb.worksheets:
dfs.update(extract_tables(ws, ws.title))
Above creates per table a separate .csv with a naming convention: <workbook name>_<sheet name>_<table name (derived from the table title)>.csv
Step 2: Read the .csv's into SAS
You can use a wildcard in the csv name to for example read all the tables belonging to a workbook or to only read the tables belonging to a single worksheet. And you can of course also already sub-set your data as part of reading it.
data want;
infile "C:\temp\prams_test\source_csv\pramsmchindicators2020508_breastfeedingpractices_*.csv"
truncover dsd dlm=",";
attrib
workbook informat=$200. label='Name of Excel Workbook'
sheet informat=$31. label='Name of Excel Sheet'
table_name informat=$200. label='Name of table in Excel Sheet'
site_name informat=$100. label='State'
denominator informat=best32. label='N (Denominator) - Unweighted Sample Size'
numerator informat=best32. label='N (Numerator) - Unweighted Frequency'
weight informat=best32. label='Weighted %'
lower95 informat=best32. label='Lower 95% - Confidence Interval'
upper95 informat=best32. label='Upper 95% - Confidence Interval'
;
input
workbook
sheet
table_name
site_name
denominator
numerator
weight
lower95
upper95
;
if site_name in ('Alabama','Florida');
run;
proc print data=want;
run;
Btw: The value for table name is sourced from the title above the table
And because this cell was not part of the named range for the table I needed below code to derive it
# read value of cell relative to top left corner of named range
reference_cell = ws[table_range.split(':')[0]]
table_name = reference_cell.offset(row=-1).value
table_name = table_name.replace(',',' ')
... View more