Data Synthesis with SAS Viya and Python – Part Two

In my previous article I outlined some SAS Studio Custom Steps that use the Python package faker to generate synthetic data.

As promised in that article, this follow-up article will look at the details of the code in the Custom Steps, and how they work.

There were three Custom Steps in the “toolkit” - two of which deal with the mechanisms for accessing Python packages, and one that uses faker for data synthesis.

This article is going to focus on the last of the three - Generate Data with Faker.

I have to thank my colleague Duncan Bain for the vast majority of the work on these, as I am really only starting out with Python, but I have got to grips with it well enough to collaborate with Duncan on these steps.

SAS and Python.

There are a number of Python integration mechanisms available with SAS Viya – SASPy, SWAT and PROC PYTHON to name just three. You can find out (a lot) more about SAS – Python integration here.

In our case, we used PROC PYTHON, so this article will be restricted to that.

PROC PYTHON is a SAS language procedure that executes Python code within a SAS session, and allows that Python code to read data from the SAS session and send data back to it. This lets a SAS programmer embed Python in their SAS code.

I’m only going to scratch the surface of using PROC PYTHON, covering how it has been used to leverage faker.

Putting PROC PYTHON to use.

Rather than present the complete code from the “Generate Data with Faker” step, I’m going to distil that code down to its essentials, for clarity, to focus on the core aspects of how we used PROC PYTHON with faker.

The “full” code is more or less just this “core code” but elaborated to deal with some variations in how some of the providers present their results, and to implement some of the “advanced options” like the uniqueness constraints, and to add some error handling.

So, the “core” code looks like this.

proc python;
submit;
from faker import Faker
import pandas as pd
import gc

# start faker with the selected locale
fake = Faker(SAS.symget("locale"))

# populate a dataframe with "numfakes" rows using "provider"
df_fakes = pd.DataFrame([getattr(fake,SAS.symget("provider"))() for _ in range(int(SAS.symget("numfakes")))],columns=[SAS.symget("provider")])

# export the dataframe to the designated SAS dataset
ds_fakes = SAS.df2sd(df_fakes,SAS.symget("_output1"))

endsubmit;
run;

/* use SAS data step to add idx */
data &_output1;
     set &_output1;
     idx = _N_ + &indexOffset;
run;

Let’s break this down.

The main PROC PYTHON block is

proc python;
submit;
<python code here>
endsubmit;
run;

which runs the Python code between submit and endsubmit.

So, let’s look at the Python code inside PROC PYTHON that we've used (omitting the comments).

First of all, we have the typical importing of modules, in this case faker, pandas and gc. The faker module had to be added to our Python environment using the Custom Steps discussed in the previous article, whereas pandas and gc were already available.

from faker import Faker
import pandas as pd
import gc

Next, we create the faker generator, specifying the locale to be used. This is where we see the first bit of SAS integration – SAS.symget("locale") – which reads the SAS macro variable "locale" and returns its value to the Python code.

fake = Faker(SAS.symget("locale"))
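The SAS object only exists inside PROC PYTHON, so the line above cannot be run in a plain Python session. As a rough sketch for experimenting outside SAS, a small stand-in can mimic symget. StubSAS and its sample values are hypothetical, and the sketch assumes (as the int(...) call in the core code suggests) that macro variable values come back as text.

```python
class StubSAS:
    """Hypothetical stand-in for the SAS object that PROC PYTHON provides."""

    def __init__(self, macro_vars):
        self._macro_vars = macro_vars

    def symget(self, name):
        # Assumption: macro variable values come back as text.
        return str(self._macro_vars[name])


# Outside PROC PYTHON there is no SAS object, so build the stand-in by hand.
SAS = StubSAS({"locale": "en_GB", "provider": "name", "numfakes": 10})

locale = SAS.symget("locale")
numfakes = int(SAS.symget("numfakes"))  # convert text to a number before use

print(locale, numfakes)
```

The int(...) conversion is the part worth noting: a count passed in through a macro variable has to be turned into a number before it can drive range().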

Then we create a dataframe and populate it with values generated by faker, again using the SAS.symget mechanism to pass values from SAS macro variables into the Python code.

df_fakes = pd.DataFrame([getattr(fake,SAS.symget("provider"))() for _ in range(int(SAS.symget("numfakes")))],columns=[SAS.symget("provider")])
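That one-liner packs several steps together, so here is an unrolled sketch of the same idea: look up the provider method by name with getattr, then call it once per row. It is an illustration rather than the code in the Custom Step. generate_fakes and StubFaker are hypothetical names, the stub is only there so the sketch runs where faker is not installed, and the result is a plain list instead of a dataframe.

```python
import random

try:
    from faker import Faker  # the real package, if it is installed
except ImportError:
    Faker = None


class StubFaker:
    """Hypothetical stand-in used only when faker is not installed."""

    def __init__(self, locale):
        self.locale = locale

    def name(self):
        return random.choice(["Ann Lee", "Bo Chan", "Cy Diaz"])


def generate_fakes(locale, provider, numfakes):
    """Return a list of `numfakes` values produced by the named provider."""
    fake = Faker(locale) if Faker is not None else StubFaker(locale)
    # Look up the provider method by name, e.g. "name" -> fake.name
    provider_method = getattr(fake, provider)
    # Call the provider once per requested row
    return [provider_method() for _ in range(int(numfakes))]


values = generate_fakes("en_GB", "name", "5")
print(values)
```

Passing the row count as the string "5" mirrors how the value would arrive from a macro variable.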

Finally, we export the contents of the dataframe to a SAS dataset, using SAS.df2sd.

ds_fakes = SAS.df2sd(df_fakes,SAS.symget("_output1"))

And then, after the PROC PYTHON block, we add the variable idx to the dataset we loaded the Python results into, using a SAS DATA step.

/* use SAS data step to add idx */
data &_output1;
     set &_output1;
     idx = _N_ + &indexOffset;
run;
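For comparison, the same numbering could be done on the Python side before export. This is only a sketch of the arithmetic, not what the Custom Step does; add_idx and the sample values are made up for illustration.

```python
def add_idx(values, index_offset=0):
    """Pair each value with idx = row number + offset, as the DATA step does."""
    # _N_ starts at 1 in a DATA step, so enumerate from index_offset + 1.
    return [
        {"idx": idx, "value": value}
        for idx, value in enumerate(values, start=index_offset + 1)
    ]


rows = add_idx(["Ann", "Bo", "Cy"], index_offset=100)
print(rows)
```

With an offset of 100 the three rows are numbered 101, 102 and 103, which matches _N_ + &indexOffset for the first three observations.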

SAS.symget and SAS.df2sd are PROC PYTHON callback methods. They are accessed through a virtual SAS module that is available to Python only while it runs inside PROC PYTHON. You can read more about the callback methods here.

So, if we were to distil what we’ve done overall here, using the Custom Step, we:

captured some options on how to run faker in some SAS macro variables
used the values of the macro variables in Python code
exported the contents of the resulting dataframe to a SAS dataset
added to that dataset using a straightforward SAS DATA step

However, by implementing all of this in a SAS Studio Custom Step, the user doesn’t need to know about the coding details at all. They just have to add the Custom Step to their Flow, make some menu selections, run it and use the results.

It’s a “drag and drop”, low-code/no-code experience.

Using the Data Synthesis toolkit yourself

We will be adding the three Custom Steps to the GitHub repository shortly. You'll be able to download them from the repository, add them to your own Viya environment, and try them out.

You can read about the GitHub repository here.

We may have to refactor the steps slightly for this, probably only just renaming them to conform to the naming patterns used by other steps.

These steps are offered as-is, on the basis of Duncan and me having put them to use for our own purposes and got them working nicely. We have only implemented the providers and locales that we needed or thought looked interesting. It's easy enough to add in any others from what's available in faker if needed.

They've been used on a number of different installations and versions of Viya, so we are pretty confident that they should work for everyone. We have used them with SAS Studio Analyst and Studio Engineer.

From my own perspective, these steps allow you to very quickly generate synthetic data for all sorts of topics, well enough to use for most purposes. I'd particularly highlight the option to generate data from different locales, allowing you to collate multi-lingual, multi-national data.

However, whilst faker is very good, it isn't perfect. Some of the issues I noted in the previous article could be "showstoppers" for some use cases.

What I have learnt over many years is that there is no substitute for the actual data you are going to have to work with. It's almost impossible to simulate the types of errors, omissions and out-and-out mischief to be found in "real world" data. That said, the toolkit we outlined lets us create a solid alternative to using "real data".

I'll post an update to both articles when the Custom Steps are available.

AngusLooney · 12-08-2022

The Custom Steps are now available in GitHub - https://github.com/sassoftware/sas-studio-custom-steps