Quentin
Super User

Warning: this is really a python question, using SAS as background explanation.  Asking here because it's helpful to explain it via SAS, it's a question about python for analytics, and I don't know any friendly python forums (suggestions welcome).

 

If I have a complex SAS program, I will often use macros to modularize the code, following recommendations from Ed Heaton's excellent paper, https://www.lexjansen.com/nesug/nesug01/at/at1010.pdf.  So I might end up with a program that looks like:

 

%macro makereport(...);
  %getdata(...)
  %cleandata(...)
  %fitmodel(...)
  %plotit(...)
%mend makereport;

%makereport()

 

I've started playing with python, and I'm curious whether folks writing a complicated analytic program would use functions to modularize their code, or whether they go further into OOP and write classes.  I've scanned a couple of python analytics books, but they seem to show how to call pandas (or whatever) in a script to get things done, and say little about how to structure your code.  I saw one blog post in favor of data scientists fully embracing OOP and building classes, but other posts basically say 'just because python is object-oriented doesn't mean you have to create your own objects, if creating objects isn't useful don't do it.'

 

As an example, consider a simple python script to read a CSV with X and Y, fit a regression, and make a plot:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

#get data
df=pd.read_csv("linear.csv")

#fit model
model = smf.ols('y ~ x', data=df)
res=model.fit()

#make plot
y_hat=res.predict()
plt.plot(df.x,df.y, 'o')
plt.plot(df.x, y_hat, linewidth=2)
plt.show()

You could use functions to modularize it like:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

def getdata(csv):
    df=pd.read_csv(csv)
    return df

def fitmodel(df):
    model = smf.ols('y ~ x', data=df)
    res=model.fit()
    return res 

def plotit(df,res):
    y_hat=res.predict()
    plt.plot(df.x,df.y, 'o')
    plt.plot(df.x, y_hat, linewidth=2)
    plt.show()
    
def runall(csv):
    df=getdata(csv)
    res=fitmodel(df)
    plotit(df,res)
        
runall("linear.csv")

Or define a class, and use it like:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

class curve:
    def __init__(self, csv):
        self.df = self.getdata(csv)
        self.res = self.fitmodel(self.df)
    
    def getdata(self,csv):
        df=pd.read_csv(csv)
        return df
    
    def fitmodel(self,df):
        model = smf.ols('y ~ x', data=df)
        res=model.fit()
        return res 
    
    def plotit(self):
        y_hat=self.res.predict()
        plt.plot(self.df.x,self.df.y, 'o')
        plt.plot(self.df.x, y_hat, linewidth=2)
        plt.show()

mycurve=curve("linear.csv")    
mycurve.plotit()

Clearly if you are building an application, there are benefits to creating classes.  And I recognize that in analytic work, there is a wide gray zone between an ad hoc script for a one-off analysis, and an analytic application.  (e.g. is a program that you manually run once a month to generate a monthly report an application?)  In my real life SAS programming I don't always modularize my code.

 

So when you're writing python code for data analytics: 

  • do you modularize it?
  • do you create functions to modularize it?
  • do you create classes to modularize it?

 

Related: when you modularize code with functions or classes, do you keep the code to define the functions/classes in your main .py script, or do you put each function/class definition into its own .py file and import them?  I guess putting them into their own .py file and importing them would be analogous to storing SAS macro definitions as .sas files in an autocall library, which is my usual practice.
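To make the autocall analogy concrete, here is a minimal sketch of the "definitions in their own .py file" option. The file name analysis_steps.py and the scratch directory are hypothetical, invented for illustration. To keep the sketch runnable without extra packages, it swaps pandas and statsmodels for the standard-library csv module and a hand-written least-squares fit, and it writes the module and a tiny made-up linear.csv to a temporary folder first. In real use the module would simply be a file you maintain alongside your driver script.

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical module holding the reusable definitions (name is an assumption).
# It is written to disk here only so that the sketch is self-contained.
module_source = textwrap.dedent('''
    import csv

    def getdata(path):
        """Read a CSV with numeric x and y columns into a list of (x, y) tuples."""
        with open(path, newline="") as handle:
            return [(float(row["x"]), float(row["y"])) for row in csv.DictReader(handle)]

    def fitmodel(points):
        """Ordinary least squares for y = intercept + slope * x."""
        n = len(points)
        mean_x = sum(x for x, _ in points) / n
        mean_y = sum(y for _, y in points) / n
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
        sxx = sum((x - mean_x) ** 2 for x, _ in points)
        slope = sxy / sxx
        return mean_y - slope * mean_x, slope
''')

workdir = Path(tempfile.mkdtemp())
(workdir / "analysis_steps.py").write_text(module_source)
# Made-up data lying exactly on y = 1 + 2x, so the fit is easy to check.
(workdir / "linear.csv").write_text("x,y\n0,1\n1,3\n2,5\n3,7\n")

# import searches the directories listed in sys.path, which plays a role
# loosely similar to the SAS autocall search path.
sys.path.insert(0, str(workdir))

import analysis_steps  # found through the sys.path entry added above

# The "main" script is now just the driver, like %makereport calling other macros.
points = analysis_steps.getdata(workdir / "linear.csv")
intercept, slope = analysis_steps.fitmodel(points)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

One difference from autocall: python does not look up a function by name on first use. The driver has to import the module (or specific names from it) explicitly before calling anything in it.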

 

Would also welcome any suggestions for good books / sites etc about python coding patterns / best practices when using python for data management/analytics.  Most of the python sites are about programming, rather than analytics.

