When building demonstrations or prototypes, or even when developing real-world systems, a problem that regularly comes up is sourcing data to support the work. This can be particularly awkward when you need data relating to people – addresses, credit cards, email addresses – what is sometimes called “Nominal Data”.
With the adoption of data privacy regulations like GDPR and others, and growing concerns about Data Risk Management in general, it is rightly not sustainable to use genuine real data for these purposes, or at least the circumstances where this can be done are more and more limited these days.
What you want is data that “looks like” real data but isn’t. This is where Data Synthesis – the generation of fake but realistic data – can help.
One widely used resource for Data Synthesis is the faker Python module.
This article is going to look at putting this to work using SAS Studio on SAS Viya, building on work by my colleague Duncan Bain and myself.
Aspirations
For the project I was working on, I needed to create a collection of data containing details of:
People – names, dates of birth, UK National Insurance Numbers, job titles, email addresses
Telephone Numbers
Credit Card Numbers
I needed a reasonable volume of these items – as many as 5 Mn person records.
This data was to be used to simulate the kind of data an organisation might hold for its customers or users, and would likely contain repetitions of the same real-world entity across divergent pieces of data.
In approaching this, I wanted to put together a reusable set of tools for Data Synthesis that were general-purpose and composable, and suitable for subsequent no-code/low-code usage, rather than a one-time, single-purpose development.
The solution we arrived at utilises Custom Steps in SAS Studio on SAS Viya, Viya’s integration with Python, and the no-code/low-code approach of SAS Studio Flows, so we will be covering all of these in this blog.
What is faker
Faker is a Python package that “generates fake data for you”. It can do this for a wide variety of topics, like addresses, company names, credit card numbers, and specifically for a range of locales, like the US, the UK, India and many others.
The faker website goes into great depth on all the functionality it offers, but I’ll very briefly summarise it in terms of two key aspects:
Providers
Providers are the different types of data that faker can generate, for example “company” for the names of Companies or “automotive” for licence plates. Providers vary by Locale – see below – both in terms of which ones are available, and the content they produce.
Locales
Locales are a combination of language and country for instance “fr_FR” is French as used in France, whereas “fr_CH” is French as used in Switzerland. Each Locale supports a different set of providers, and the output of each provider will reflect the locale being used.
For example, the “ssn” provider will produce a fake UK National Insurance Number when used with the “en_GB” locale, but a US Social Security Number when used with “en_US”.
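In raw Python, providers are simply methods on a locale-specific Faker instance. A minimal sketch (the printed values are illustrative, not fixed outputs):

    from faker import Faker

    fake_uk = Faker("en_GB")   # UK English locale
    fake_us = Faker("en_US")   # US English locale

    print(fake_uk.ssn())       # a National Insurance Number style value
    print(fake_us.ssn())       # a Social Security Number style value
    print(fake_uk.company())   # a fake company name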
There’s a lot of depth available under the covers, accessible in raw Python coding, but our objective is to make this power available in an easily used form within SAS Studio, so let’s look at how we go about that.
Python in SAS Studio
SAS Viya provides a simple way to integrate Python processing into your workflows, allowing you to combine Python and SAS based processing and leverage the strengths of both.
Within Python itself, a key feature is the use of additional packages that add new functionality, and for this undertaking we need to use faker, so we are going to ensure we have access to it.
As we will be working in a multi-user SAS Viya environment, we need to be careful to avoid affecting other users of the platform. We can’t just add faker into the “central” Python instance available to us, so we will take care to ensure our usage of faker is discrete and non-intrusive to the environment in general.
Building a Data Synthesis toolkit
In this article I’m going to describe the “toolkit” we developed, without going into the inner details of the SAS and Python code involved. This will be covered in a follow-up post, together with sharing the Custom Steps via the GitHub Custom Step repository, which you can find out more about here.
UPDATE 2022/12/08: The Custom Steps are now available in the GitHub repository.
SAS Studio Custom Steps
Custom Steps allow you to encapsulate complex code in reusable components that can be used in Flows just like the built-in steps like Query, Filter, and Sort.
This allows Studio users to build workflows in a drag-and-drop, no-code/low-code fashion, concentrating on the logic of what they are setting out to do rather than having to write code manually.
This is a great way for experienced programmers to support less experienced or less technical users, extending the set of steps available to them in a form they are already familiar with.
The Toolkit steps
The “Data Faking” toolkit that we developed comprises three Custom Steps, listed here in reverse order of use:
Generate Data with Faker
Uses Python and Faker to generate a dataset of fake data based on a specified provider and locale. Faker has to have been made available previously (see below).
Add Python Path
Loads a previously downloaded (see below) Python package from a file location for subsequent use. This is restricted to the current user session and does not affect any other users.
Download Python Package
Downloads, or updates, a specified Python package using pip, storing it in a user-specified location. It doesn’t load the package for use by Python.
In summary, for our Data Synthesis work, these Custom Steps need to be used in the following order (a Python sketch of the sequence follows the list).
Download Python Package – specifying the faker package and the download location. This does not need to be run every time: once the package files have been downloaded, you just need to run Add Python Path.
Add Python Path – specifying the location where faker was downloaded to previously.
Generate Data with Faker – specifying the options to use with faker. This step is intended to be used multiple times with different options, combining various types of data into the shape needed – for example, joining person-related attributes with financial details or addresses.
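To make the sequence concrete, here is a hypothetical sketch of the same three-step pattern in plain Python; the pip invocation and the /shared/python_packages location are illustrative assumptions, not the actual Custom Step code:

    import subprocess
    import sys

    pkg_dir = "/shared/python_packages"   # assumed user-specified location

    # Download Python Package: fetch faker into a private directory via pip;
    # nothing is loaded into the Python session at this point
    subprocess.run([sys.executable, "-m", "pip", "install",
                    "--target", pkg_dir, "--upgrade", "faker"], check=True)

    # Add Python Path: expose that directory to the current session only
    if pkg_dir not in sys.path:
        sys.path.insert(0, pkg_dir)

    # Generate Data with Faker: the package now imports as normal
    from faker import Faker
    fake = Faker("en_GB")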
For example, the Studio Flow below uses all three Custom Steps to generate a list of Names and Telephone numbers.
Let’s look at each of these Custom Steps in turn to see how they’re used.
Download Python Package
There are two values we need to specify:
the name of the Python package to acquire
the location to store the package files
Running this will download or update the files for the specified Python Package in the specified location.
Add Python Path
There’s only one value needed, the path to the downloaded package.
This should be the location that was used for the previous step.
Running this will add the specified path to the list of paths that Python will load package files from.
Generate Data with Faker
This step has quite a few options.
Let’s have a look at the options and elaborate on what they do.
Locale
Specify the locale to use, so that generated data is specific to the country and language you need – for instance, addresses, telephone numbers, and personal and company names will follow the local patterns.
Provider
This selects the faker provider you want to use, for example “name” for names of people, “street_address” for addresses, or “credit_card_number” for credit card numbers.
Provider Options
Some providers support options – refer to the faker documentation for more information.
Number of fakes
The number of records you want to have generated. Depending on your environment, and the provider selected, you can set this to multiple millions.
Ensure unique
Specifies that faker should attempt to make each record unique, throwing an exception if it can’t. This option has to be used with care, as not all combinations of Provider and Number of Fakes can sustain it.
Seed
Allows a specific seed value to be specified if repeatable results are needed. A value of 0 results in random values for each execution.
Offset
The step outputs a field “idx” that increments from 1 for each output row; Offset allows you to easily start the numbering at another value. The “idx” field is intended to facilitate assembling data from multiple instances of the step using multiple providers and/or locales.
Running this step will run Python and generate the specified data with faker.
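As a rough illustration of how those options map onto faker calls, here is a simplified, hypothetical sketch; the real Custom Step code is covered in the follow-up post:

    import pandas as pd
    from faker import Faker

    # Illustrative values for the step's options
    locale, provider = "en_GB", "name"     # Locale, Provider
    n_fakes, seed, offset = 1000, 42, 0    # Number of fakes, Seed, Offset
    ensure_unique = False                  # Ensure unique

    fake = Faker(locale)
    if seed != 0:
        Faker.seed(seed)                   # a fixed seed gives repeatable output

    source = fake.unique if ensure_unique else fake   # .unique raises if exhausted
    make_value = getattr(source, provider)            # e.g. fake.name

    df = pd.DataFrame({
        "idx": range(offset + 1, offset + n_fakes + 1),
        provider: [make_value() for _ in range(n_fakes)],
    })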
A general guide to using the toolkit
To scratch the surface of the possibilities, here are a couple of usage patterns.
Getting what you want
Delving into the details of the various providers available in faker, you find they individually tend to be limited in scope - "name" does only person names, "credit_card_number" just does, well, credit card numbers.
There are a number of "collection" providers, like "profile", that combine a number of separate providers into a single output. The "profile" provider outputs a number of directly person-related data items like name, date of birth and gender, as well as things like job title and employer.
If you were generating data for a customer record, you could use the "profile" provider and then simply discard any of the "profile" fields you're not interested in, or you could call each of the smaller, focused providers like "name" and "company" and join the results together afterwards.
This is exactly what the idx field is intended for, adding a simple key value to the output to facilitate collating disparate sets of records together.
For example, creating a Contact Record of a Person Name with a Telephone Number, as sketched below.
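In Python terms, that collation is just a join on idx. A minimal sketch using pandas:

    import pandas as pd
    from faker import Faker

    fake = Faker("en_GB")
    n = 100

    # Two separate generation runs, both keyed by idx
    names = pd.DataFrame({"idx": range(1, n + 1),
                          "name": [fake.name() for _ in range(n)]})
    phones = pd.DataFrame({"idx": range(1, n + 1),
                           "phone_number": [fake.phone_number() for _ in range(n)]})

    contacts = names.merge(phones, on="idx")   # one contact record per idx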
Mixing it up
Another idea would be to generate data for multiple locales and combine that.
You can choose a different locale for each instance of Generate Data with Faker, then stack the outputs from each to create a set of names in multiple languages, or from multiple countries.
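A minimal sketch of that pattern in pandas, with an idx offset per locale so the key stays unique across the stacked output:

    import pandas as pd
    from faker import Faker

    chunk = 100
    frames = []
    for i, loc in enumerate(["en_GB", "fr_FR", "de_DE"]):
        fake = Faker(loc)
        offset = i * chunk                  # keep idx unique across locales
        frames.append(pd.DataFrame({
            "idx": range(offset + 1, offset + chunk + 1),
            "locale": loc,
            "name": [fake.name() for _ in range(chunk)],
        }))

    mixed = pd.concat(frames, ignore_index=True)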
Some observations and considerations about faker
Faker is certainly very convenient, and the Custom Steps presented here make it very simple to use. In using these for some recent experimentation, some issues and phenomena were observed which are worth discussing.
The data is demonstrably “fake” and “inconsistent”
A number of the data items that faker can generate are unlikely to represent real-world entities.
Using the UK as an example, whilst generated addresses resemble real ones, they don’t contain real postcodes or towns; however, generated telephone numbers could well coincide with real ones.
The “profile” provider - which outputs a variety of person-level attributes like name, gender, date of birth, email address etc. - generates these attributes in a disconnected fashion, so the gender field is regularly inconsistent with the name field, and the email addresses don’t tally with the name as you might expect. You will see records like:
Name: John Smith Email: iroberts@hotmail.com Sex: F
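You can see this independence directly in raw Python:

    from faker import Faker

    fake = Faker("en_GB")
    for _ in range(3):
        p = fake.profile()
        # name, mail and sex are drawn independently, so they often disagree
        print(p["name"], p["mail"], p["sex"])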
Additionally, the values generated in the ssn (social security number) field clearly resemble the high-level structure of UK National Insurance Numbers but are not valid values.
So, the data generated by faker will have the general appearance of “real” data but when examined more closely can be seen to be “fake”. This can limit the usefulness of the data faker produces, for instance if you wanted to use the data to test data pipelines that validate and enrich addresses or detect invalid National Insurance Numbers.
But it also visibly avoids the suggestion that the data is genuine – a human looking at the data can tell it’s not “real”, or at least looks wrong.
Realism gaps
In addition to some of the points made above, a couple of the data attributes output by faker, at least for the UK locale, are poorly formatted.
Addresses are not correctly cased, e.g. town names aren’t capitalised as you would expect
Job Titles are not proper cased
These are purely cosmetic, and easily “corrected” later on using readily available SAS functions, or more thoroughly using some of the QKB-based standardisations.
Indeed, the fact that faker output isn’t “perfect” allows scope to demonstrate how the data can be improved - using the tools available to us in the SAS toolset.
Uniqueness
Faker appears to use a limited pool of “seed values” in its generation process, and the fact that this pool is limited in size can cause issues, but equally creates some opportunities.
The following Studio Flow illustrates what seems to be going on under the covers.
The steps in this Flow:
generates 10,000 UK telephone numbers – in a field “phone_number”
adds a “target” field to hold a standardised version of the “phone_number”
standardises “phone_number” into “Stnd_Phone_Number” using the SAS QKB
calculates the overall row count and the count of distinct values in “phone_number” and “Stnd_Phone_Number”
Let’s look at the result we get:
Looking at the results from step one - the core data generation - we see values in “phone_number” that are visibly UK style telephone numbers and show the variety of ways that these tend to be entered in real data:
0909 879 0660
+44118 496 0876
+443069990475
(020)74960382
0121 496 0219
0808 157 0489
So pretty much exactly what we are looking for.
However, when we calculate how distinct the values of “phone_number” are, and even more so how distinct those values are after they’ve been standardised, we can clearly see that faker is “reusing” values from a limited pool. This becomes even more prevalent when the size of the generated data is increased.
Total Rows                                 10,000    500,000
Distinct values in “phone_number”          9,769     202,584
Distinct values in “Stnd_Phone_Number”     7,917     22,788
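You can reproduce this kind of distinctness check outside SAS with a few lines of Python; the regex clean-up below is only a crude stand-in for the QKB standardisation used in the Flow:

    import pandas as pd
    from faker import Faker

    fake = Faker("en_GB")
    phones = pd.Series([fake.phone_number() for _ in range(10_000)])

    # crude normalisation: rewrite the +44 prefix, then drop spaces and brackets
    std = (phones.str.replace(r"^\+44\(0\)", "0", regex=True)
                 .str.replace(r"^\+44", "0", regex=True)
                 .str.replace(r"[\s()]", "", regex=True))

    print(len(phones), phones.nunique(), std.nunique())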
This isn’t so much a problem as something to be aware of: you can see it as representing the kind of “duplication” often seen in real data.
Indeed, you can treat such duplication – together with the fact that the generated data displays the variety of value formatting that generally occurs in “real” data – as an opportunity to demonstrate the benefits of Entity Resolution in clarifying activity within customer data.
Generating large row counts
One issue that can occur when trying to create large row counts of data with faker, e.g. multiple millions, is that the Python process may run out of memory.
Whether it is an issue with faker, or Python in general, or a configuration issue particular to the environment I have been using, I can’t say, but I have encountered this when trying to create more than 5Mn rows using the “profile” provider for the en_GB locale.
There is a simple way to accommodate this - the code inside the Generate Data with Faker Custom Step resets the Python session at the end of each invocation, deletes its data frame and does garbage collection. This means that each run is “self-contained” and there shouldn’t be an issue with running multiple separate large generation runs.
So, for example, if you needed 20 Mn rows of person data, you could use 4 separate instances of Generate Data with Faker in the same Flow, possibly with different locales, setting the offset value for the idx field at 0, 5Mn, 10Mn and 15Mn. Once you’ve appended the four output datasets together, this yields a single table with idx values running from 1 all the way up to 20 Mn.
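A rough Python equivalent of that chunked pattern, with deliberately small numbers (5,000 rows per chunk standing in for 5 Mn) and an explicit clean-up between chunks:

    import gc
    import pandas as pd
    from faker import Faker

    def generate_chunk(locale, n, offset):
        fake = Faker(locale)
        df = pd.DataFrame({
            "idx": range(offset + 1, offset + n + 1),
            "name": [fake.name() for _ in range(n)],
        })
        del fake
        gc.collect()       # mirror the step's end-of-run clean-up
        return df

    n = 5000               # stand-in for 5 Mn per run
    parts = [generate_chunk("en_GB", n, i * n) for i in range(4)]
    people = pd.concat(parts, ignore_index=True)   # idx runs 1 to 20,000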
Advanced use of faker
The faker package offers a range of functionality beyond the scope of what we've implemented here.
As mentioned above, a future article will look at the details of the code inside the Custom Steps I've discussed here, and we'll touch on some aspects of "advanced" usage there.