When building demonstrations or prototypes, or even when developing real-world systems, a problem that regularly comes up is sourcing data to support the work. This can be particularly awkward when you need data relating to people – addresses, credit cards, email addresses – what is sometimes called “Nominal Data”.
With the adoption of data privacy regulations like GDPR and others, and growing concerns about Data Risk Management in general, it is rightly not sustainable to use genuine real data for these purposes, or at least the circumstances where this can be done are more and more limited these days.
What you want is data that “looks like” real data but isn’t. This is where Data Synthesis – the generation of fake but realistic data – can help.
One widely used resource for Data Synthesis is the faker Python module.
This article is going to look at putting this to work using SAS Studio on SAS Viya, building on work by my colleague Duncan Bain and myself.
Aspirations
For the project I was working on, I needed to create a collection of data containing details of:
People – names, dates of birth, UK National Insurance Numbers, job titles, email addresses
Telephone Numbers
Credit Card Numbers
I needed a reasonable volume of these items – as many as 5 Mn person records.
This data was to be used to simulate the kind of data an organisation might hold for its customers or users, and would likely contain repetitions of the same real-world entity across divergent pieces of data.
In approaching this, I wanted to put together a reusable set of tools for Data Synthesis that were general-purpose and composable, and suitable for subsequent no-code/low-code usage, rather than a one-time, single-purpose development.
The solution we arrived at utilises Custom Steps in SAS Studio on SAS Viya, Viya’s integration with Python, and the no-code/low-code approach of SAS Studio Flows, so we will be covering all of these in this blog.
What is faker
Faker is a Python package that “generates fake data for you”. It can do this for a wide variety of topics, like addresses, company names, credit card numbers, and specifically for a range of locales, like the US, the UK, India and many others.
The faker website goes into great depth on all the functionality it offers, but I’ll very briefly summarise it in terms of two key aspects:
Providers
Providers are the different types of data that faker can generate, for example “company” for the names of Companies or “automotive” for licence plates. Providers vary by Locale – see below – both in terms of which ones are available, and the content they produce.
Locales
Locales are a combination of language and country for instance “fr_FR” is French as used in France, whereas “fr_CH” is French as used in Switzerland. Each Locale supports a different set of providers, and the output of each provider will reflect the locale being used.
For example, the “ssn” provider will produce a fake UK National Insurance Number when used with the “en_GB” locale, but a US Social Security Number when used with “en_US”.
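In raw Python, providers are simply methods on a locale-specific Faker instance. A minimal sketch (the printed values are illustrative, not fixed outputs):

    from faker import Faker

    fake_uk = Faker("en_GB")   # UK English locale
    fake_us = Faker("en_US")   # US English locale

    print(fake_uk.ssn())       # a National Insurance Number style value
    print(fake_us.ssn())       # a Social Security Number style value
    print(fake_uk.company())   # a fake company name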
There’s a lot of depth available under the covers, accessible in raw Python coding, but our objective is to make this power available in an easily used form within SAS Studio, so let’s look at how we go about that.
Python in SAS Studio
SAS Viya provides a simple way to integrate Python processing into your workflows, allowing you to combine Python and SAS based processing and leverage the strengths of both.
Within Python itself, a key feature is the use of additional packages that add new functionality, and for this undertaking we need to use faker, so we are going to ensure we have access to it.
As we will be working in a multi-user SAS Viya environment, we need to be careful to avoid affecting other users of the platform. We can’t just add faker into the “central” Python instance available to us, so we will take care to ensure our usage of faker is discrete and non-intrusive to the environment in general.
Building a Data Synthesis toolkit
In this article I’m going to describe the “toolkit” we developed, without going into the inner details of the SAS and Python code involved. This will be covered in a follow-up post, together with sharing the Custom Steps via the GitHub Custom Step repository, which you can find out more about here.
UPDATE 2022/12/08: The Custom Steps are now available in the GitHub repository.
SAS Studio Custom Steps
Custom Steps allow you to encapsulate complex code in reusable components that can be used in Flows just like the built-in steps like Query, Filter, and Sort.
This allows Studio users to build workflows in a drag-and-drop, no-code/low-code fashion, concentrating on the logic of what they are setting out to do rather than having to write code manually.
This is a great way for experienced programmers to support less experienced or less technical users, extending the set of steps available to them in a form they are already familiar with.
The Toolkit steps
The “Data Faking” toolkit that we developed comprises three Custom Steps, listed here in reverse order of use:
Generate Data with Faker
Uses Python and Faker to generate a dataset of fake data based on a specified provider and locale. Faker has to have been made available previously (see below).
Add Python Path
Loads a previously downloaded (see below) Python package from a file location for subsequent use. This is restricted to the current user session and does not affect any other users.
Download Python Package
Downloads, or updates, a specified Python package using pip, storing it in a user-specified location. It doesn’t load the package for use by Python.
In summary, for our Data Synthesis work, these Custom Steps need to be used in the following order (a Python sketch of the sequence follows the list).
Download Python Package – specifying the faker package and the download location. This does not need to be run every time: once the package files have been downloaded, you just need to run Add Python Path.
Add Python Path – specifying the location where faker was downloaded to previously.
Generate Data with Faker – specifying the options to use with faker. This step is intended to be used multiple times with different options, combining various types of data into the shape needed – for example, joining person-related attributes with financial details or addresses.
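To make the sequence concrete, here is a hypothetical sketch of the same three-step pattern in plain Python; the pip invocation and the /shared/python_packages location are illustrative assumptions, not the actual Custom Step code:

    import subprocess
    import sys

    pkg_dir = "/shared/python_packages"   # assumed user-specified location

    # Download Python Package: fetch faker into a private directory via pip;
    # nothing is loaded into the Python session at this point
    subprocess.run([sys.executable, "-m", "pip", "install",
                    "--target", pkg_dir, "--upgrade", "faker"], check=True)

    # Add Python Path: expose that directory to the current session only
    if pkg_dir not in sys.path:
        sys.path.insert(0, pkg_dir)

    # Generate Data with Faker: the package now imports as normal
    from faker import Faker
    fake = Faker("en_GB")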
For example, the Studio Flow below uses all three Custom Steps to generate a list of Names and Telephone numbers.
Let’s look at each of these Custom Steps in turn to see how they’re used.
Download Python Package
There are two values we need to specify:
the name of the Python package to acquire
the location to store the package files
Running this will download or update the files for the specified Python Package in the specified location.
Add Python Path
There’s only one value needed, the path to the downloaded package.
This should be the location that was used for the previous step.
Running this will add the specified path to the list of paths that Python will load package files from.
Generate Data with Faker
This step has quite a few options.
Let’s have a look at the options and elaborate on what they do.
Locale
Specify the locale to use, so that generated data is specific to the country and language you need – for instance, addresses, telephone numbers, and personal and company names will follow the local patterns.
Provider
This selects the faker provider you want to use, for example “name” for names of people, “street_address” for addresses, or “credit_card_number” for credit card numbers.
Provider Options
Some providers support options – refer to the faker documentation for more information.
Number of fakes
The number of records you want to have generated. Depending on your environment, and the provider selected, you can set this to multiple millions.
Ensure unique
Specifies that faker should attempt to make each record unique, throwing an exception if it can’t. This option has to be used with care, as not all combinations of Provider and Number of Fakes can sustain it.
Seed
Allows a specific seed value to be specified if repeatable results are needed. A value of 0 results in random values for each execution.
Offset
The step outputs a field “idx” that increments from 1 for each output row; Offset allows you to easily start the numbering at another value. The “idx” field is intended to facilitate assembling data from multiple instances of the step using multiple providers and/or locales.
Running this step will run Python and generate the specified data with faker.
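As a rough illustration of how those options map onto faker calls, here is a simplified, hypothetical sketch; the real Custom Step code is covered in the follow-up post:

    import pandas as pd
    from faker import Faker

    # Illustrative values for the step's options
    locale, provider = "en_GB", "name"     # Locale, Provider
    n_fakes, seed, offset = 1000, 42, 0    # Number of fakes, Seed, Offset
    ensure_unique = False                  # Ensure unique

    fake = Faker(locale)
    if seed != 0:
        Faker.seed(seed)                   # a fixed seed gives repeatable output

    source = fake.unique if ensure_unique else fake   # .unique raises if exhausted
    make_value = getattr(source, provider)            # e.g. fake.name

    df = pd.DataFrame({
        "idx": range(offset + 1, offset + n_fakes + 1),
        provider: [make_value() for _ in range(n_fakes)],
    })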
A general guide to using the toolkit
To scratch the surface of the possibilities, here are a couple of usage patterns.
Getting what you want
Delving into the details of the various providers available in faker, you find they individually tend to be limited in scope - "name" does only person names, "credit_card_number" just does, well, credit card numbers.
There are a number of "collection" providers, like "profile", that combine a number of separate providers into a single output. The "profile" provider outputs a number of directly person-related data items like name, date of birth and gender, as well as things like job title and employer.
If you were generating data for a customer record, you could use the "profile" provider and then simply discard any of the "profile" fields you're not interested in, or you could call each of the smaller, focused providers like "name" and "company" and join the results together afterwards.
This is exactly what the idx field is intended for, adding a simple key value to the output to facilitate collating disparate sets of records together.
For example, creating a Contact Record of a Person Name with a Telephone Number, as sketched below.
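In Python terms, that collation is just a join on idx. A minimal sketch using pandas:

    import pandas as pd
    from faker import Faker

    fake = Faker("en_GB")
    n = 100

    # Two separate generation runs, both keyed by idx
    names = pd.DataFrame({"idx": range(1, n + 1),
                          "name": [fake.name() for _ in range(n)]})
    phones = pd.DataFrame({"idx": range(1, n + 1),
                           "phone_number": [fake.phone_number() for _ in range(n)]})

    contacts = names.merge(phones, on="idx")   # one contact record per idx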
Mixing it up
Another idea would be to generate data for multiple locales and combine that.
You can choose a different locale for each instance of Generate Data with Faker, then stack the outputs from each to create a set of names in multiple languages, or from multiple countries.
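A minimal sketch of that pattern in pandas, with an idx offset per locale so the key stays unique across the stacked output:

    import pandas as pd
    from faker import Faker

    chunk = 100
    frames = []
    for i, loc in enumerate(["en_GB", "fr_FR", "de_DE"]):
        fake = Faker(loc)
        offset = i * chunk                  # keep idx unique across locales
        frames.append(pd.DataFrame({
            "idx": range(offset + 1, offset + chunk + 1),
            "locale": loc,
            "name": [fake.name() for _ in range(chunk)],
        }))

    mixed = pd.concat(frames, ignore_index=True)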
Some observations and considerations about faker
Faker is certainly very convenient, and the Custom Steps presented here make it very simple to use. In using these for some recent experimentation, some issues and phenomena were observed which are worth discussing.
The data is demonstrably “fake” and “inconsistent”
A number of the data items that faker can generate are unlikely to represent real-world entities.
Using the UK as an example, whilst generated addresses resemble real ones, they don’t contain real postcodes or towns; however, generated telephone numbers could well coincide with real ones.
The “profile” provider - which outputs a variety of person-level attributes like name, gender, date of birth, email address etc. - generates these attributes in a disconnected fashion, so the gender field is regularly inconsistent with the name field, and the email addresses don’t tally with the name as you might expect. You will see records like:
Name: John Smith Email: iroberts@hotmail.com Sex: F
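You can see this independence directly in raw Python:

    from faker import Faker

    fake = Faker("en_GB")
    for _ in range(3):
        p = fake.profile()
        # name, mail and sex are drawn independently, so they often disagree
        print(p["name"], p["mail"], p["sex"])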
Additionally, the values generated in the ssn (social security number) field clearly resemble the high-level structure of UK National Insurance Numbers but are not valid values.
So, the data generated by faker will have the general appearance of “real” data but when examined more closely can be seen to be “fake”. This can limit the usefulness of the data faker produces, for instance if you wanted to use the data to test data pipelines that validate and enrich addresses or detect invalid National Insurance Numbers.
But it also visibly avoids the suggestion that the data is genuine – a human looking at the data can tell it’s not “real”, or at least looks wrong.
Realism gaps
In addition to some of the points made above, a couple of the data attributes output by faker, at least for the UK locale, are poorly formatted.
Addresses are not correctly cased, e.g. town names aren’t capitalised as you would expect
Job Titles are not proper cased
These are purely cosmetic, and easily “corrected” later on using readily available SAS functions, or more thoroughly using some of the QKB-based standardisations.
Indeed, the fact that faker output isn’t “perfect” allows scope to demonstrate how the data can be improved - using the tools available to us in the SAS toolset.
Uniqueness
Faker appears to use a limited pool of “seed values” in its generation process, and the fact that this pool is limited in size can cause issues, but equally creates some opportunities.
The following Studio Flow illustrates what seems to be going on under the covers.
The steps in this Flow:
generates 10,000 UK telephone numbers – in a field “phone_number”
adds a “target” field to hold a standardised version of the “phone_number”
standardises “phone_number” into “Stnd_Phone_Number” using the SAS QKB
calculates the overall row count and the count of distinct values in “phone_number” and “Stnd_Phone_Number”
Let’s look at the result we get:
Looking at the results from step one - the core data generation - we see values in “phone_number” that are visibly UK style telephone numbers and show the variety of ways that these tend to be entered in real data:
0909 879 0660
+44118 496 0876
+443069990475
(020)74960382
0121 496 0219
0808 157 0489
So pretty much exactly what we are looking for.
However, when we calculate how distinct the values of “phone_number” are, and even more so how distinct those values are after they’ve been standardised, we can clearly see that faker is “reusing” values from a limited pool. This becomes even more prevalent when the size of the generated data is increased.
Total Rows                                 10,000    500,000
Distinct values in “phone_number”          9,769     202,584
Distinct values in “Stnd_Phone_Number”     7,917     22,788
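You can reproduce this kind of distinctness check outside SAS with a few lines of Python; the regex clean-up below is only a crude stand-in for the QKB standardisation used in the Flow:

    import pandas as pd
    from faker import Faker

    fake = Faker("en_GB")
    phones = pd.Series([fake.phone_number() for _ in range(10_000)])

    # crude normalisation: rewrite the +44 prefix, then drop spaces and brackets
    std = (phones.str.replace(r"^\+44\(0\)", "0", regex=True)
                 .str.replace(r"^\+44", "0", regex=True)
                 .str.replace(r"[\s()]", "", regex=True))

    print(len(phones), phones.nunique(), std.nunique())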
This isn’t so much a problem as something to be aware of: you can see it as representing the kind of “duplication” often seen in real data.
Indeed, you can treat such duplication – together with the fact that the generated data displays the variety of value formatting that generally occurs in “real” data – as an opportunity to demonstrate the benefits of Entity Resolution in clarifying activity within customer data.
Generating large row counts
One issue that can occur when trying to create large row counts of data with faker, e.g. multiple millions, is that the Python process may run out of memory.
Whether it is an issue with faker, or Python in general, or a configuration issue particular to the environment I have been using, I can’t say, but I have encountered this when trying to create more than 5Mn rows using the “profile” provider for the en_GB locale.
There is a simple way to accommodate this - the code inside the Generate Data with Faker Custom Step resets the Python session at the end of each invocation, deletes its data frame and does garbage collection. This means that each run is “self-contained” and there shouldn’t be an issue with running multiple separate large generation runs.
So, for example, if you needed 20 Mn rows of person data, you could use 4 separate instances of Generate Data with Faker in the same Flow, possibly with different locales, setting the offset value for the idx field at 0, 5Mn, 10Mn and 15Mn. Once you’ve appended the four output datasets together, this yields a single table with idx values running from 1 all the way up to 20 Mn.
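A rough Python equivalent of that chunked pattern, with deliberately small numbers (5,000 rows per chunk standing in for 5 Mn) and an explicit clean-up between chunks:

    import gc
    import pandas as pd
    from faker import Faker

    def generate_chunk(locale, n, offset):
        fake = Faker(locale)
        df = pd.DataFrame({
            "idx": range(offset + 1, offset + n + 1),
            "name": [fake.name() for _ in range(n)],
        })
        del fake
        gc.collect()       # mirror the step's end-of-run clean-up
        return df

    n = 5000               # stand-in for 5 Mn per run
    parts = [generate_chunk("en_GB", n, i * n) for i in range(4)]
    people = pd.concat(parts, ignore_index=True)   # idx runs 1 to 20,000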
Advanced use of faker
The faker package offers a range of functionality beyond the scope of what we've implemented here.
As mentioned above, a future article will look at the details of the code inside the Custom Steps I've discussed here, and we'll touch on some aspects of "advanced" usage there.