Hello all,
I am beginning a master's thesis project requiring enterprise datasets. The intended approach is to use process mining techniques to detect anomalies and to perform other analytics. Does SAS provide any such datasets for research? Could anyone suggest a good source where these could be found (academic or governmental, especially)? The requirement is for a dataset capable of describing organizational processes and activities. Large amounts of ERP data would probably suffice. But it also need not be enterprise/private in nature; it could just as well be governmental or healthcare data.
Example scenario:
A manufacturer tracks its production/orders using a unifying planning and inventory system (e.g., Dataflo). Occasionally, defects occur when individuals deviate from the process. We wish to use the data generated by the system to detect anomalies, such as deviations from known processes (employee A suddenly inverts steps in a method, or software developer B closes a bug without approval, etc.).
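To make the scenario concrete, here is a minimal Python sketch of the kind of check I have in mind. Everything in it is hypothetical: the event-log layout (case id, activity, timestamp), the activity names, and the expected step order are made up for illustration and do not come from any real system.

```python
from collections import defaultdict

# Hypothetical event log: (case_id, activity, timestamp). All values are made up.
EVENTS = [
    ("bug-1", "open", 1),
    ("bug-1", "fix", 2),
    ("bug-1", "approve", 3),
    ("bug-1", "close", 4),
    ("bug-2", "open", 1),
    ("bug-2", "fix", 2),
    ("bug-2", "close", 3),      # closed without approval
    ("bug-3", "open", 1),
    ("bug-3", "approve", 2),    # steps inverted
    ("bug-3", "fix", 3),
    ("bug-3", "close", 4),
]

# Assumed "known process": the order in which the steps should occur.
EXPECTED = ["open", "fix", "approve", "close"]


def build_traces(events):
    """Group events by case and order each case's activities by timestamp."""
    traces = defaultdict(list)
    for case_id, activity, _ts in sorted(events, key=lambda e: (e[0], e[2])):
        traces[case_id].append(activity)
    return dict(traces)


def find_deviations(traces, expected):
    """Return the cases whose activity sequence differs from the expected one."""
    return {case: trace for case, trace in traces.items() if trace != expected}


deviations = find_deviations(build_traces(EVENTS), EXPECTED)
for case, trace in sorted(deviations.items()):
    print(case, "->", " > ".join(trace))
```

A real analysis would need something far more tolerant than an exact sequence match, but this is the shape of the question: given traces per case, which ones depart from the known process?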
It has been exceptionally difficult to find enterprise-oriented datasets for research. Despite their abundance "in the wild", most are confidential in nature. Any help or suggestions are appreciated. I thought it would be appropriate to ask, since this approach closely aligns with the solutions that SAS has provided to various clients.
Thanks!
Hi Yeti,
Not sure if I get the right idea of your example scenario. What would this data look like (inputs, target, ids, etc)?
Show us a concrete mockup of this data. Maybe there is something similar from a SAS course or a data set in sashelp or sampsio libraries.
In the meantime, this book is a go-to for research and proofs of concept:
by Rick Wicklin
Good luck!
-Miguel
What would this data look like (inputs, target, ids, etc)?
Basically data that I can assign to processes, or from which I identify processes. Process mining lit usually refers to "event logs" and other event-oriented data, but practitioners are usually extracting and normalizing their own event-log datasets from multiple, heterogeneous data sources. In several examples, they generate hospital event logs from the admittance, care-progression, or other data warehouses at a hospital. Another example might be process-mining a computer hardware company by aggregating event data from a defect-log system, an inventory system, and a communication system, and incorporating these sources into a single dataset describing their processes.
The benefit is that process mining is capable of looking across multiple data sources, rather than at a single data source, to identify processes and their characteristics: bottlenecks, anomalies, or just overall organizational descriptions. No software system fully encompasses any organization's needs, so there is typically some sort of aggregation involved: taking multiple data sources and converting them to something that neatly describes processes (events) and is consumable by common process mining methods.
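As a rough mockup of that aggregation step, here is a small Python sketch. The three source extracts, their field names, and the sample rows are all hypothetical, and it assumes the sources share a common case identifier, which in practice is often the hard part.

```python
from datetime import datetime

# Hypothetical extracts from three separate systems, each with its own field names.
defect_log = [{"ticket": "T1", "status": "defect reported", "when": "2024-01-02 09:00"}]
inventory = [{"ref": "T1", "action": "part shipped", "date": "2024-01-03 14:30"}]
messages = [{"thread": "T1", "subject": "customer notified", "sent": "2024-01-02 10:15"}]


def normalize(rows, case_key, activity_key, time_key, source):
    """Map one source's rows onto a common event schema."""
    return [
        {
            "case_id": row[case_key],
            "activity": row[activity_key],
            "timestamp": datetime.strptime(row[time_key], "%Y-%m-%d %H:%M"),
            "source": source,
        }
        for row in rows
    ]


# Combine all sources into one event log, ordered by case and then by time.
event_log = sorted(
    normalize(defect_log, "ticket", "status", "when", "defects")
    + normalize(inventory, "ref", "action", "date", "inventory")
    + normalize(messages, "thread", "subject", "sent", "messages"),
    key=lambda e: (e["case_id"], e["timestamp"]),
)

for event in event_log:
    print(event["case_id"], event["timestamp"], event["activity"], event["source"])
```

The result is one row per event with a case id, an activity, and a timestamp, which is roughly the minimum that event-log-based methods expect.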
Sorry, I should have clarified. I assume I'll have to do the aggregation/extraction myself, so I guess the question is where one might find organizational datasets of that scale, encompassing multiple data sources. Many companies and state agencies use such all-encompassing ERP systems for inventory, defect-tracking, CRM, and so on. The difficulty is finding anything of that scale in the public domain, and suitable for research. But you can imagine the benefit it might have for something like a large healthcare organization, whether state or private. Which I mention suggestively: there is so much data tracking in healthcare that it seems like a fruitful source of datasets; I just haven't found any yet.
Thanks for your response!