Quentin
Super User

Warning: this is really a python question, using SAS as background explanation.  I'm asking here because it's helpful to explain it via SAS, it's a question about python for analytics, and I don't know any friendly python forums (suggestions welcome).

 

If I have a complex SAS program, I will often use macros to modularize the code, following recommendations from Ed Heaton's excellent paper, https://www.lexjansen.com/nesug/nesug01/at/at1010.pdf.  So I might end up with a program that looks like:

 

%macro makereport(...);
  %getdata(...)
  %cleandata(...)
  %fitmodel(...)
  %plotit(...)
%mend makereport;

%makereport()

 

I've started playing with python, and I'm curious whether folks writing a complicated analytic program would use functions to modularize their code, or whether they go further into OOP and write classes.  I've scanned a couple of python analytics books, but they seem to show how to call pandas (or whatever) in a script to get things done, and say little about how to structure your code.  I saw one blog post in favor of data scientists fully embracing OOP and building classes, but other posts basically say 'just because python is object-oriented doesn't mean you have to create your own objects, if creating objects isn't useful don't do it.'

 

As an example, consider a simple python script to read a CSV with X and Y, fit a regression, and make a plot:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

#get data
df=pd.read_csv("linear.csv")

#fit model
model = smf.ols('y ~ x', data=df)
res=model.fit()

#make plot
y_hat=res.predict()
plt.plot(df.x,df.y, 'o')
plt.plot(df.x, y_hat, linewidth=2)
plt.show()

You could use functions to modularize it like:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

def getdata(csv):
    df=pd.read_csv(csv)
    return df

def fitmodel(df):
    model = smf.ols('y ~ x', data=df)
    res=model.fit()
    return res 

def plotit(df,res):
    y_hat=res.predict()
    plt.plot(df.x,df.y, 'o')
    plt.plot(df.x, y_hat, linewidth=2)
    plt.show()
    
def runall(csv):
    df=getdata(csv)
    res=fitmodel(df)
    plotit(df,res)
        
runall("linear.csv")
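One practical consequence of the function version is that each step can be checked on a tiny in-memory dataset, without touching linear.csv. Below is a minimal sketch of that idea. To keep it runnable without third-party packages, it swaps pandas and statsmodels for the standard library's csv module and a hand-written least-squares fit, so the getdata and fitmodel shown here are simplified stand-ins for the functions above, and the sample data is made up.

```python
import csv
import io


def getdata(source):
    """Read (x, y) pairs from an open CSV file-like object with x and y columns."""
    return [(float(row["x"]), float(row["y"])) for row in csv.DictReader(source)]


def fitmodel(points):
    """Ordinary least squares for y = intercept + slope * x (stand-in for smf.ols)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data lying exactly on y = 1 + 2x, so the expected fit is known in advance.
sample = io.StringIO("x,y\n0,1\n1,3\n2,5\n")
intercept, slope = fitmodel(getdata(sample))
print(intercept, slope)
```

Because getdata accepts any file-like object and fitmodel accepts plain data, neither step depends on the other's internals, which is roughly what passing parameters between SAS macros buys you.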

Or define a class, and use it like:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

class curve:
    def __init__(self, csv):
        self.df = self.getdata(csv)       
        # reuse the fitmodel method instead of repeating its body here
        self.res = self.fitmodel(self.df)
    
    def getdata(self,csv):
        df=pd.read_csv(csv)
        return df
    
    def fitmodel(self,df):
        model = smf.ols('y ~ x', data=df)
        res=model.fit()
        return res 
    
    def plotit(self):
        y_hat=self.res.predict()
        plt.plot(self.df.x,self.df.y, 'o')
        plt.plot(self.df.x, y_hat, linewidth=2)
        plt.show()

mycurve=curve("linear.csv")    
mycurve.plotit()

Clearly if you are building an application, there are benefits to creating classes.  And I recognize that in analytic work, there is a wide gray zone between an ad hoc script for a one-off analysis, and an analytic application.  (e.g. is a program that you manually run once a month to generate a monthly report an application?)  In my real life SAS programming I don't always modularize my code.

 

So when you're writing python code for data analytics: 

  • do you modularize it?
  • do you create functions to modularize it?
  • do you create classes to modularize it?

 

Related: when you modularize code with functions or classes, do you keep the code to define the functions/classes in your main .py script, or do you put each function/class definition into its own .py file and import them?  I guess putting them into their own .py file and importing them would be analogous to storing SAS macro definitions as .sas files in an autocall library, which is my usual practice.
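On the autocall analogy: Python resolves an import by searching the directories listed in sys.path, which plays roughly the role that the SASAUTOS search path plays for autocall macros. The sketch below illustrates this under stated assumptions: the module name report_steps and the function cleandata are hypothetical, and the module is written to a temporary directory only so that the example is self-contained. In practice the .py file would be one you maintain by hand in a shared folder.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical module source; normally this lives in its own hand-edited .py file.
MODULE_SOURCE = textwrap.dedent('''
    def cleandata(rows):
        """Drop rows where x or y is missing."""
        return [r for r in rows if r.get("x") is not None and r.get("y") is not None]
''')

# Stand-in for a shared code directory (roughly, the autocall library folder).
libdir = Path(tempfile.mkdtemp())
(libdir / "report_steps.py").write_text(MODULE_SOURCE)

# Python searches the directories on sys.path when it meets an import statement.
sys.path.insert(0, str(libdir))
importlib.invalidate_caches()

import report_steps  # located through sys.path, not through the current script

rows = [{"x": 1, "y": 2}, {"x": 3, "y": None}]
cleaned = report_steps.cleandata(rows)
print(cleaned)
```

Instead of editing sys.path inside the script, the same directory can be listed in the PYTHONPATH environment variable, which keeps the main script free of path handling.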

 

Would also welcome any suggestions for good books / sites etc about python coding patterns / best practices when using python for data management/analytics.  Most of the python sites are about programming, rather than analytics.

