How do I split up a large table from an imported xlsx file?

SAS0005 · Posted 04-04-2024 08:03 PM

I am using a dataset from PRAMS. I imported the file which has several different tabs for different variables. Under each tab, there are different survey questions in each individauls from each state replied. Since they are on the same tab however, SAS is creating one giant table that includes each survey question for each variable. This is causing there to be up to 800 obs.

I am also wondering how I can code this so that I have a select few states from the set instead of all of them. Below you will find my code. I clearly am new to this, so any advice would be appreciated. Thanks!

libname Prams 'D:\PhotonUser\My Files\Home Folder\Prams';

run;

/* import substance use sheet */

PROC IMPORT FILE="D:\PhotonUser\My Files\Home Folder\Prams\Prams20.xlsx"

OUT=Prams20_Substance_Use

DBMS=xlsx

REPLACE;

SHEET="Substance Use";

run;

/* import health ins. status prenatal sheet */

PROC IMPORT FILE="D:\PhotonUser\My Files\Home Folder\Prams\Prams20.xlsx"

OUT=Prams20_Health_Ins_Prenatal

DBMS=xlsx

REPLACE;

SHEET="Health Ins. Status Prenatal";

run;

/* import health ins. status postpartum sheet */

PROC IMPORT FILE="D:\PhotonUser\My Files\Home Folder\Prams\Prams20.xlsx"

OUT=Prams20_Health_Ins_Postpartum

DBMS=xlsx

REPLACE;

SHEET="Health Ins. Status Postpartum";

run;

/* import depression sheet */

PROC IMPORT FILE="D:\PhotonUser\My Files\Home Folder\Prams\Prams20.xlsx"

OUT=Prams20_Health_Depression

DBMS=xlsx

REPLACE;

SHEET="Depression";

run;

/* import pregnancy intention sheet */

PROC IMPORT FILE="D:\PhotonUser\My Files\Home Folder\Prams\Prams20.xlsx"

OUT=Prams20_Preg_Intent

DBMS=xlsx

REPLACE;

SHEET="Pregnancy Intention";

run;

/* import health care services sheet */

PROC IMPORT FILE="D:\PhotonUser\My Files\Home Folder\Prams\Prams20.xlsx"

OUT=Prams20_Health_Srv

DBMS=xlsx

REPLACE;

SHEET="Health Care Services";

run;

/* import intimate partner violence sheet */

PROC IMPORT FILE="D:\PhotonUser\My Files\Home Folder\Prams\Prams20.xlsx"

OUT=Prams20_IPV

DBMS=xlsx

REPLACE;

SHEET="Intimate Partner Violence (IPV)";

run;

LinusH · Posted 04-05-2024 06:29 AM

I don't have any insight in PRAMS, and maybe not so many other here at he communities, so a bit more details would be helpful to understand your challenge.

"SAS is creating one giant table that includes each survey question for each variable."

Please provide a PROC CONTENTS to describe your imported data set.

For the State filter, isn't there a variable that holds that information?

Again, more detail required to help you.

Data never sleeps

ballardw · Posted 04-05-2024 10:00 AM

I am a bit familiar with PRAMS. Apparently the access to the public use data sets requires some set up that I'm not going to attempt to get access at this time.

I am a bit surprised that @SAS0005 is dealing with such a spreadsheet as most of the CDC sponsored projects like this will also have data available in SAS data set forms and often has analysis code which can be pretty critical for the sample designs of such surveys.

I suggest that you go back to the source of your data and first see if you can get SAS data directly as Proc Import, or somewhat better the LIBNAME EXCEL or XLSX engines may have issues with inconsistent variable lengths and possibly types for the same variables given the way that the import and Libname engines work with Excel files.

Patrick · Posted 04-05-2024 10:26 AM

It appears you've sourced your Excel from https://www.cdc.gov/prams/prams-data/form/arf-access-form.htm under the Selected ...Indicators section. These are REPORTS and not in a form that's ready made for import into a table for further processing.

Post processing such imports of Excel reports to get to a proper rectangular structure required by tables can be tedious (and would be for these reports).

If I understand this right then you can register with PRAMS and get access to more detailed and much more suitable source data for use with SAS. Based on the codebooks it appears PRAMS is also using SAS.

Patrick · Posted 04-06-2024 11:40 PM

@SAS0005 I still recommend that you register with Prams and get the proper data for analysis. But just for fun and to improve my Python skills a bit...

When downloading and looking into these prams Excel reports found here I realised that for each table a named ranges had been defined (Table1, Table2, etc.) and that all tables in all sheets have the same structure. In the following an approach how to read such Excel data into a SAS table.

I've run the Python script separately but there are also various methods how to run a Python script directly out of a SAS session (available methods depend on SAS version) or alternatively you can also call SAS out of Python.

Step 1: Using Python read each table (range) into a data frame and then export to a .csv

import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.cell import range_boundaries
import re

workbook_path = r"C:\temp\prams_test\\"
workbook_name = "PRAMS-MCH-Indicators-2020-508.xlsx"
outfile_root  = r"C:\temp\prams_test\source_csv\\"

wb = load_workbook(workbook_path + workbook_name)

def replace_non_alphanumeric(string, replacement=''):
    return re.sub(r'[^0-9a-zA-Z_]+', replacement, string.strip())

def extract_tables(ws, sheet_name):
    dfs_tmp = {}

    for name, table_range in ws.tables.items():
        # Get position of data table defined by named range
        min_col, min_row, max_col, max_row = range_boundaries(table_range)

        # read value of cell relative to top left corner of named range
        reference_cell = ws[table_range.split(':')[0]]
        table_name = reference_cell.offset(row=-1).value
        table_name = table_name.replace(',',' ')

        # Convert table to DataFrame
        table = ws.iter_rows(min_row, max_row, min_col, max_col, values_only=True)
        header = next(table)
        df = pd.DataFrame(table, columns=header)

        # add columns workbook, sheet_name and table_name to data frame
        df.insert(0, "workbook_name", workbook_name)
        df.insert(1, "sheet_name", sheet_name)
        df.insert(2, "table_name", table_name)

        # write data frame to .csv without header row and index column
        df.iloc[1:].to_csv(outfile_root
                                  + replace_non_alphanumeric(workbook_name.split('.')[0]).lower()
                            + "_" + replace_non_alphanumeric(sheet_name).lower()
                            + "_" + replace_non_alphanumeric(table_name).lower() + ".csv"
                            , index=False, header=False)
        dfs_tmp[name] = df

    return dfs_tmp


# Dictionary to store all the dfs in.
# Format: {table_name1: df, table_name2: df, ...}
dfs = {}

for ws in wb.worksheets:
    dfs.update(extract_tables(ws, ws.title))

Above creates per table a separate .csv with a naming convention: <workbook name>_<sheet name>_<table name (derived from the table title)>.csv

Step 2: Read the .csv's into SAS

You can use a wildcard in the csv name to for example read all the tables belonging to a workbook or to only read the tables belonging to a single worksheet. And you can of course also already sub-set your data as part of reading it.

data want;
  infile "C:\temp\prams_test\source_csv\pramsmchindicators2020508_breastfeedingpractices_*.csv"
         truncover dsd dlm=",";

  attrib
    workbook    informat=$200.    label='Name of Excel Workbook'
    sheet       informat=$31.     label='Name of Excel Sheet'
    table_name  informat=$200.    label='Name of table in Excel Sheet'
    site_name   informat=$100.    label='State'
    denominator informat=best32.  label='N (Denominator) - Unweighted Sample Size'
    numerator   informat=best32.  label='N (Numerator) - Unweighted Frequency'
    weight      informat=best32.  label='Weighted %'
    lower95     informat=best32.  label='Lower 95% - Confidence Interval'
    upper95     informat=best32.  label='Upper 95% - Confidence Interval'
    ;

  input
    workbook
    sheet
    table_name  
    site_name   
    denominator 
    numerator   
    weight      
    lower95     
    upper95     
    ;

 if site_name in ('Alabama','Florida');
run;

proc print data=want;
run;

Btw: The value for table name is sourced from the title above the table

And because this cell was not part of the named range for the table I needed below code to derive it

        # read value of cell relative to top left corner of named range
        reference_cell = ws[table_range.split(':')[0]]
        table_name = reference_cell.offset(row=-1).value
        table_name = table_name.replace(',',' ')

Tom · Posted 04-07-2024 10:51 AM

So assuming you are right and those are the XLSX files of interest they should not be hard to read with SAS because all of the variables appear to be numeric. You just need to start reading in cell A4 and tell SAS to not invent variable names. You can then read the whole sheet and transpose it to get the metadata for the variables.

I tried it for the first two sheets for one of the files and the metadata (column headers) are exactly the same.

filename x  "C:\downloads\PRAMS-MCH-Indicators-2016-2021.xlsx";

proc import dbms=xlsx datafile=x out=meta_2016 replace ;
  getnames=NO;
  range='2016$A1:';
run;
proc import dbms=xlsx datafile=x out=data_2016 replace ;
  getnames=NO;
  range='2016$A4:';
run;

proc transpose data=meta_2016(obs=3) out=names_2016;
  var _all_;
run;

proc import dbms=xlsx datafile=x out=meta_2017 replace ;
  getnames=NO;
  range='2017$A1:';
run;
proc import dbms=xlsx datafile=x out=data_2017 replace ;
  getnames=NO;
  range='2017$A4:';
run;

proc transpose data=meta_2017(obs=3) out=names_2017;
  var _all_;
run;

proc compare data=names_2016 compare=names_2017;
run;

You can process the headers to generate your own unique variable names.

data newnames_2017 ;
  set names_2017 ;
  length varname $32 label $256 indicator variable $200 ;
  retain indicator variable ;
  if _N_=1 then varname='STATE';
  else do;
    if col1 ne ' ' then do;
      ind_num+1;
      indicator=col1;
      var_num=0;
    end;
    if col2 ne ' ' then do;
      variable=col2;
      var_num+1;
      sub_num=0;
    end;
    sub_num+1;
    varname=cats('var',ind_num,'_',var_num,'_',sub_num);
    label=catx(' - ',indicator,variable,col3);
  end;
run;

proc print;
 var _name_ varname label ;
run;

So you get something like this:

Which you can use to generate code to RENAME the variables and attach a label.

Or perhaps use to reshape the data into more normal form since it looks like that third level of name is the same for every variable:

How do I split up a large table from an imported xlsx file?

Re: How do I split up a large table from an imported xlsx file?

Re: How do I split up a large table from an imported xlsx file?

Re: How do I split up a large table from an imported xlsx file?

Re: How do I split up a large table from an imported xlsx file?

Re: How do I split up a large table from an imported xlsx file?

How do I split up a large table from an imported xlsx file?

Re: How do I split up a large table from an imported xlsx file?

Re: How do I split up a large table from an imported xlsx file?

Re: How do I split up a large table from an imported xlsx file?

Re: How do I split up a large table from an imported xlsx file?

Re: How do I split up a large table from an imported xlsx file?

Registration is open

SAS Training: Just a Click Away