10-08-2024
ChrisBrooks
Ammonite | Level 13
Member since
05-29-2014
- 751 Posts
- 93 Likes Given
- 107 Solutions
- 242 Likes Received
-
Latest posts by ChrisBrooks
Subject Views Posted 2670 05-25-2023 04:05 AM 2758 05-23-2023 07:23 PM 3000 05-19-2023 04:46 AM 961 02-17-2023 03:59 AM 946 12-02-2022 03:52 AM 1886 09-12-2022 06:34 PM 1946 09-12-2022 03:49 AM 2655 08-15-2022 12:05 PM 2722 08-15-2022 05:20 AM 3212 07-22-2022 12:06 PM -
Activity Feed for ChrisBrooks
- Got a Like for Re: Round with PROC SQL. 11-14-2024 12:45 AM
- Got a Like for Re: PROC SQL UNION. 08-10-2023 12:30 AM
- Posted Re: Fun with Integer Sequences and SAS on SAS Communities Library. 05-25-2023 04:05 AM
- Posted Re: Fun with Integer Sequences and SAS on SAS Communities Library. 05-23-2023 07:23 PM
- Posted Fun with Integer Sequences and SAS on SAS Communities Library. 05-19-2023 04:46 AM
- Posted What's in a Name? Exploring Album Titles with SAS on SAS Communities Library. 02-17-2023 03:59 AM
- Posted Plan Your Next Vacation with SAS on SAS Communities Library. 12-02-2022 03:52 AM
- Posted Re: Programming with CASL in SAS Viya for Learners – basic data manipulation on SAS Communities Library. 09-12-2022 06:34 PM
- Posted Programming with CASL in SAS Viya for Learners – basic data manipulation on SAS Communities Library. 09-12-2022 03:49 AM
- Posted Re: Programming with CASL in SAS Viya for Learners – first steps on SAS Communities Library. 08-15-2022 12:05 PM
- Posted Programming with CASL in SAS Viya for Learners – first steps on SAS Communities Library. 08-15-2022 05:20 AM
- Posted Re: Explore Twenty Years of Spotify Data with SAS on SAS Communities Library. 07-22-2022 12:06 PM
- Posted Re: Explore Twenty Years of Spotify Data with SAS on SAS Communities Library. 07-22-2022 09:09 AM
- Posted Explore Twenty Years of Spotify Data with SAS on SAS Communities Library. 07-15-2022 04:31 AM
- Posted Re: Draw Your Family Tree with SAS on SAS Communities Library. 06-04-2022 03:47 AM
- Posted Draw Your Family Tree with SAS on SAS Communities Library. 06-03-2022 05:02 AM
- Posted Build your first Custom Transformation in SAS Viya for Learners on SAS Communities Library. 05-10-2022 05:53 AM
- Posted Discover San Francisco's Favourite Street Food with SAS on SAS Communities Library. 03-18-2022 05:03 AM
- Got a Like for Re: delete files with sas Data step. 03-15-2022 12:46 PM
- Posted Tracking the rise and fall of an industry with SAS on SAS Communities Library. 02-18-2022 02:51 AM
-
Posts I Liked
Subject Likes Author Latest Post 13 7 41 3 1 -
My Liked Posts
Subject Likes Posted 1 04-04-2018 08:27 PM 1 01-25-2018 07:20 PM 1 03-31-2018 04:56 PM 2 01-31-2019 08:51 PM 1 12-29-2018 05:19 PM -
My Library Contributions
Subject Likes Author Latest Post 7 1 2 1 4
05-25-2023
04:05 AM
Thanks @Tom you're right using the SAS function comb(n,2) does generate triangular numbers and I'd missed that. The only difference is that for values less than 2 the comb function returns a missing value whereas mine returns 1 where n=0 and 0 where n=1 which is in line with the encyclopaedia.
To answer your other question I wanted to create a function which would return the nth number in a sequence. As you say it's much easier to generate full sequence every time but I wanted to show how you could do quite complex things with Proc FCMP and of course it's more effient to only generate the number(s) you need.
... View more
05-23-2023
07:23 PM
Thanks @MichelleHomes that's a really interesting article. The encyclopaedia can get quite addictive browsing through the sequences and trying to figure out how to code them in SAS!
Also it's great to see the alternative versions from @yabwon and @Ksharp - I'd have to agree that arrays are probably simpler, particularly if you're not overly familiar with hash tables. I've done some simple testing of the two different methods with increasing numbers of padovan numbers calculated with the following results
The timings for the hash table version seem to be constant upto 2,500 calls but the array version times start to rise at some point over 1,000 where the two methods are neck and neck. If you try to calculate more than about 2,500 numbers you get an overflow error with both versions, which I was sort of expecting. I'd therefore say that the hash table method has a tiny theoretical advantage in speed on calculating a long sequence but it's not really noticeable in practice.
... View more
05-19-2023
04:46 AM
7 Likes
SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
We’ve all seen those number puzzles where you are given a sequence of integers and have to work out the next number in the sequence. Once you’ve got the hang of them, they tend to be pretty straightforward, involving adding some of the previous numbers in the sequence, possibly skipping one or more. There are, however, much more complex ways of generating sometimes interesting and useful sequences. In this edition of Free Data Friday, we will be looking at The Online Encyclopaedia of Integer Sequences which holds data for over a quarter of a million integer sequences of varying complexity and usefulness. We will see how you can use SAS to create custom functions to generate the nth number in a sequence. Even where there is no immediate practical value to a sequence, they can be useful for practicing your programming techniques, setting problems for trainees or even code golf type puzzles for the more experienced, so don’t dismiss any sequence out of hand.
Get the data
There’s no real data which needs to be downloaded here. Instead, each encyclopaedia entry consists of a web page giving the first few numbers of a sequence along with a description, formula for calculating the sequence, cross references to other sequences, programs for generating the sequence in mathematica and other languages (but not SAS) and other information. You can also search for a particular sequence or request a random sequence to be displayed.
Get started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Examples
We’re going to choose two sequences to show the process of creating custom SAS functions to generate the nth number in each sequence.
Triangular Numbers
A triangular number is a number that can be represented by a pattern of dots arranged in an equilateral triangle. For example, 1, 3, 6 and 10 are all triangular numbers.
We can use the following code to create a function which calculates a triangular number.
proc fcmp outlib=work.maths.intseqs;
function gettriangular(n);
retval=n*(n+1)/2;
return(retval);
endsub;
run;
options cmplib=work.maths;
data triangular;
do i=0 to 53;
triangular=gettriangular(i);
output;
end;
run;
As you can see the data step generates the first 54 triangular numbers (starting at zero). You can check the output against the list supplied in the encyclopaedia.
We are using the FCMP Procedure with the outlib option. This saves the function to the specified library, dataset and package removing the need to regenerate the function every time you need to use it if you save it to a permanent library (as here I usually save to the temporary work library when I am developing functions). However, I strongly recommend saving the source code in case the dataset gets lost or accidentally deleted.
Triangular numbers have quite a few different uses including solving “handshake problems” e.g. In a room of x number of people how many unique handshakes must take place before everyone is introduced to everyone else?
Padovan Numbers
Padovan numbers are defined by the following sequence and have applications in number theory and architecture.
P(0)=1
P(1)=0
P(2)=0
Thereafter
P(n)=P(n-3)+P(n-2)
This is much more complex to program as in order to calculate a number in the sequence beyond P(2) we need to know two other numbers in the sequence, which are themselves also Padovan numbers. We could, of course, calculate the full sequence of numbers every time we run the function but that would be very inefficient. Instead we will store all previously calculated numbers in a hash object and check to see if we have already calculated the required numbers. If we have, we will use them, if not then the function will call itself in a process known as recursion.
Here is the code (I've tried to make it and the comments as easy to follow as possible).
proc fcmp outlib=work.maths.intseqs;
function padovan(n);
/* Create a hash table - this persists for the life of */
/* the data step in which it will be called. It will */
/* NOT be recreated on subsequent calls within the */
/* data step */
declare hash h();
rc=h.definekey("seqno");
rc=h.definedata("padno");
rc=h.definedone();
/* Set the first 3 padovan numbers */
seqno=n;
if n=0 then do;
notfound=h.check();
if notfound ne 0 then do;
padno=1;
rc=h.add();
end;
retval=padno;
end;
else if n=1 then do;
notfound=h.check();
if notfound ne 0 then do;
padno=0;
rc=h.add();
end;
retval=padno;
end;
else if n=2 then do;
notfound=h.check();
if notfound ne 0 then do;
padno=0;
rc=h.add();
end;
retval=padno;
end;
/* Calculate subsequent padovan numbers */
else do;
seqno=n-2;
notfound=h.find();
if notfound ne 0 then do;
/* The Function calls itself recursively */
padno=padovan(seqno);
rc=h.add();
end;
prev=padno;
seqno=n-3;
notfound=h.find();
if notfound ne 0 then do;
/* The Function calls itself recursively */
padno=padovan(seqno);
rc=h.add();
end;
prevprev=padno;
retval=prev+prevprev;
end;
num=h.num_items();
return(retval);
endsub;
run;
options cmplib=work.maths;
/* Calculate a single Padovan Number */
data padosingle;
i=30;
b=padovan(i);
run;
/* Calculate a sequence of Padovan numbers */
data test;
do i=0 to 60;
b=padovan(i);
output;
end;
run;
There are two things to note:
Hash objects in an FCMP function behave slightly differently to “normal” hash objects. In particular they persist for the life of the data step they are being used in. The object is only created once and not every time the declare function is encountered; and
The function calls itself in a process known as recursion. It is vital that in recursive functions there is a guaranteed way out of the recursive loop otherwise the program can fall into a “black hole” of infinitely calling itself, running out of memory and crashing horribly. In our function the guaranteed way out is when the parameter n is less than 3, if not earlier.
Again, you can check the output from the list in the encyclopaedia.
Now it's your turn!
Did you find a better way to generate these sequences or another really interesting sequence in the encyclopaedia? Share in the comments. I’m glad to answer any questions.
Hit the orange button below to see all the Free Data Friday articles.
Free Data Friday
... View more
- Find more articles tagged with:
- Data for learning
- Free Data Friday
02-17-2023
03:59 AM
1 Like
SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
There’s a famous British TV situation comedy series called I’m Alan Partridge in which the eponymous hero (played by Steve Coogan) is a failed TV talk show host making a precarious living as a DJ on the graveyard shift of a small local radio station. In one episode Alan is asked what his favourite Beatles album is. He pauses for a moment and then says “Tough one……I think I would have to say, The Best of the Beatles.” We’re meant to laugh at Alan’s woeful lack of musical knowledge but of course most bands of any longevity will have a “Best of” album and it raises the question of how do bands choose names for their albums. Some (particularly for their first album) will choose the name of the band as their title (Led Zeppelin took this to extremes with Led Zeppelin, Led Zeppelin II, III and IV), some will pick one of the individual song titles and some will think up something totally unrelated and random. Whatever they choose though, they will want it to be memorable and in this edition of Free Data Friday we will be looking at the Wasabi Corpus of music albums to see if we can find any pattern in the length of an album title and whether that is affected by musical genre.
Get the data
The data can be downloaded as a CSV file from Github
Get started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the data ready
The main problem with this data file is it is HUGE. It is far too big to be uploaded into SAS OnDemand for Academics in its original form. You can drastically reduce its size by opening the CSV in something like Excel or Python and removing all columns except those you actually need before uploading the file. Once the resulting file was uploaded, I imported it into SAS with a simple PROC Import.
filename reffile '/home/chris52brooks/Albums/albumsshort.csv';
proc import datafile=reffile
dbms=csv
out=import;
getnames=yes;
run;
The results
I then used the countw function to count the number of words in each title. While I was scanning through some of the observations, I noted that quite a large number of them had the title “Other Songs.” I’m guessing these may be compilation albums of some sort and I decided to delete them to stop them skewing the results.
data lengths;
set import;
words=countw(title);
if title="Other Songs" then delete;
run;
Having done that, we can now move on to some analysis. The first thing I want to do is calculate, for each genre, how many album titles have specific numbers of words in them.
proc sql;
create table genrecount as
select genre,
words,
count(words) as num
from lengths
group by genre, words;
quit;
We can see, for example, that there are 15 A Cappella albums with one-word titles.
Next, we can use PROC Rank to rank the word counts for each genre to find out which is the most common word count per genre. Then we create a file holding the most common word count (ranking equal to one) for every genre.
proc rank data=genrecount out=ranking descending;
by genre;
var num;
ranks ranking;
run; proc sql; create table topranks as select genre, words, ranking from ranking where ranking=1 order by words; quit;
Of course, there are a large number of genres with the same ranking for a lot of the word counts. As a tie-break I ranked the results file by number of albums in the genre and selected the largest genre for each word count.
proc sql;
create table genretotals
as select genre,
count(title) as num
from lengths
where genre ne ""
group by genre
order by num desc;
quit;
proc sql;
create table results
as select
t1.genre,
t1.words,
t1.ranking,
t2.num as genrecount
from
topranks t1,
genretotals t2
where t1.genre=t2.genre;
quit;
proc rank data=results out=resultsranking descending;
by words;
var genrecount;
ranks ranking2;
run;
proc sql;
create table last
as select *
from resultsranking
where ranking2=1;
quit;
Finally we can get the overall totals for all genres by word lengths:
proc sql;
create table overall
as select
words,
count(words) as numb
from lengths
group by words
order by numb desc;
quit;
While I wouldn't claim this anaysis is perfect (if only because of the sometimes eccentric nature of album naming) I think it is safe to say that shorter titles predominate. In particular there is quite a sharp drop in numbers after about four or five words. It's hard to determine any pattern between genres although I was surprised to see Progressive Rock top the list of single word titles as I often think of that genre as somewhat pretentious and would have expected titles to match!
Now it's your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Hit the orange button below to see all the Free Data Friday articles.
Free Data Friday
... View more
- Find more articles tagged with:
- Data for learning
- Free Data Friday
12-02-2022
03:52 AM
2 Likes
SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
This is the time of year when people often start to look forward and plan for next year's vacation. It helps to lift spirits Cardiff Castle during the cold, dark winter months. However, people’s idea of what makes a great vacation varies a lot. Some people like to spend their time sunning themselves on a beach while others like to learn about the history and culture of the places they visit. In this article we will be using SAS to analyze data from the Mapping Museums Project to help people who fall into that latter category and are visiting the UK to plan their vacation.
Get the data
The data can be downloaded from the project web site here in several different formats. I initially downloaded the data as a CSV file but as I had a couple of minor issues importing it I converted the file to XLSX format and uploaded it to SAS OnDemand.
Get started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the data ready
Firstly I used the XLSX libname engine to read the file as a SAS data set.
libname xl xlsx "/home/chris52brooks/Museums/MappingMuseumsData2021_09_30.xlsx";
The file contains a large number of variables and records for places which are no longer open. In order to to make it a little easier to work with I used the following code to remove those closed places (open venues have a value in the year_closed variable of “9999”) and to only keep variables which might be interesting.
data museums_abbr(drop=year_closed);
set xl.MAPPINGMUSEUMSDATA2021_09_30(where=(substr(year_closed,1,4) eq "9999") keep=
name_of_museum
size
subject_matter
admin_area
postcode
latitude
longitude
year_closed);
run;
The first thing I wanted to do was find out which were the predominant type of museums in the UK. To do this all I need is a simple PROC FREQ followed by a PROC SORT.
proc freq data=museums_abbr;
table subject_matter / out=freq_subject;
run;
proc sort data=freq_subject out=subject_sorted;
by descending count;
run;
I then used PROC SGPLOT to generate a bar chart showing me the top 10 museum categories.
ods graphics / reset;
proc sgplot data=subject_sorted(obs=10);
title1 "Top 10 Museum Subjects in the UK";
footnote j=r "Data From: The Mapping Museums Project";
hbar subject_matter / response=count
datalabel datalabelattrs=(weight=bold) categoryorder=respdesc ;
xaxis grid label="Total Museums";
yaxis grid label="Museum Subject";
run;
This gives me the answer but the labels on the vertical axis look ugly, with hyphens and underscores where you would expect spaces. I therefore ran a Data step to replace those characters with spaces using the translate function and regenerated my chart.
data subject_sorted2;
set subject_sorted;
subject_matter=translate(subject_matter," ","-_");
run;
ods graphics / reset;
proc sgplot data=subject_sorted2(obs=10);
title1 "Top 10 Museum Subjects in the UK";
footnote j=r "Data From: The Mapping Museums Project";
hbar subject_matter / response=count
datalabel datalabelattrs=(weight=bold) categoryorder=respdesc ;
xaxis grid label="Total Museums";
yaxis grid label="Museum Subject";
run;
I can see that the two biggest categories are local history and large (what we often call “stately” houses).
Of course, if you’re planning a vacation you will want to know what’s available in the area you’re planning to visit. For example, if you’re visiting my home city, Cardiff, you could use the following Proc SQL statement to filter the places using the admin_area field and then print a list of what is available.
proc sql;
create table cardiff_museums
as select *
from museums_abbr
where find(admin_area,"Cardiff");
quit;
proc print data=cardiff_museums;
var name_of_museum subject_matter size postcode;
title "Museums in Cardiff";
footnote
"Data from The Mapping Museums Project";
run;
I’ve visited most of these places and can strongly recommend them – there’s the added bonus of them all being within a reasonable distance of each other and you have their Post Code to plug into your Sat Nav to find your way there!
References
Data downloaded from the Mapping Museums website at www.mappingmuseums.org, Accessed on 5 November 2022.
Now it's your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Hit the orange button below to see all the Free Data Friday articles.
Free Data Friday
... View more
- Find more articles tagged with:
- Data for learning
- Free Data Friday
09-12-2022
06:34 PM
Hi @dipand and thanks for the question. You can use the same deleteRows action to remove rows from any CAS table that you have write access to. It's important to remember though that a CAS table is an in-memory structure and you are NOT deleting rows from the underlying data set (or any other file used to create the table).
I hope this answers your question - if not please let me know, preferably with a cncrete example.
... View more
09-12-2022
03:49 AM
1 Like
In the first article of this series on CASL programming in SAS Viya for Learners we discovered how to use SAS Studio to make a connection with the CAS server, assign libraries to explore the sample data, perform a simple analysis and terminate the CAS session.
In this article we will be looking at basic data manipulation of the CAS tables. It is important to mention here that in CASL, as in all other languages, there is often more than one way to perform a task. It isn’t possible to cover all possible methods in these articles so we will explore what I believe to be the simplest and clearest way to achieve our goals. In time, as you practice, read the documentation and learn more you will discover other techniques which may well, in the right circumstances, be improvements to those you will see here. However, these methods will work and will get you up and running with CASL programming.
Starting up
Firstly, we will start our session and set up our libraries.
/* Connect to the cas server and call the connection casconn*/
cas casconn;
/* Assign SAS librefs for all existing caslibs */
caslib _all_ assign;
The data set we will use for this article is called “Accepts” and is held in the ACADEMICS library. One important thing to note is that in SAS Viya for Learners we don’t have write access to that library. This is, actually, a good thing as it is a global library used by others and in any case it is not generally good practice to alter the original copy of your source data. We do, however, have write access to the CASUSER library. We can see that with the caslibinfo action in the tables action set.
/* Get information about caslibs */
proc cas;
table.caslibinfo caslib="academic";
run;
table.caslibinfo caslib="casuser";
run;
quit;
You can see from the results that in contrast to the ACADEMICS library CASUSER is both personal and transient. This means that only you can access that data and any data held in it only has session scope i.e. it only exists for the lifetime of the session so you must rerun your analysis to regenerate it if you terminate and restart your session. Again, this is not a bad thing as it prevents mistakes if your data changes or you alter any of your code.
Copy a data set
Unfortunately, there is no copy action currently available, but we can utilise a little trick to mimic one. If we execute an action which has a casout parameter (not all do hence the need to sometimes make a copy) but don’t specify the parameter which alters the table we can simply make a direct copy. It is easier to understand this with our example.
/* Make a copy of the table */
proc cas;
table.index
table={caslib="academic", name="accepts"}
casout={caslib="casuser", name="accepts_1", replace=true};
quit;
Here we have used the index action but haven’t used the indexVars parameter to specify variables to create indexes for. Therefore, a straight copy will be written to CASUSER.
Dropping and keeping columns
If we only want to use a small number of the columns in a table, we can use the alterTable action with either the drop or keep parameter. As the names imply drop doesn’t copy the columns specified to the output table whereas keep only copies the specified columns. Which one you choose will largely depend on how many columns you want to drop or keep. It is simply easier to use the parameter which requires you to specify fewest columns. In our case we only want to keep three columns, so we’ll use keep. This can be confirmed with the columnInfo action.
/* Alter the table keeping only 3 columns */
proc cas;
table.altertable
caslib="casuser",
name="accepts_1",
keep={"app_id", "down_pyt", "loan_amt"};
quit;
/* Get information about the columns in the table */
proc cas;
table.columninfo
table={caslib="casuser", name="accepts_1"};
quit;
Creating a Calculated Column
We may wish to create a new column in the table, possibly based on values from other columns. We can do this by creating what is known as a calculated column. We can do this by using the table action parameter to an action, along with the table actions “subparameters” computedVars (which names the new columns) and computedVarsProgram (which specifies the formulae for creating them). Again, this is probably clearer in our example.
/* Create a calculated column */
proc cas;
table.index
table={
caslib="casuser"
name="accepts_1"
computedVars={{name="perc"} }
computedVarsProgram="perc=(down_pyt/loan_amt)*100;"
}
casout={caslib="casuser", name="accepts_2", replace=true};
quit;
/* Get information about the columns in the table */
proc cas;
table.columninfo
table={caslib="casuser", name="accepts_2"};
quit;
Subsetting a table
This is quite simple – we merely use a where statement in the table parameter when we copy our table. Here we only want to keep rows where the value of the perc column is greater than zero.
proc cas;
table.index
table={caslib="casuser", name="accepts_2" where="perc>0"}
casout={caslib="casuser", name="accepts_3", replace=true};
quit;
Updating values in a table
If we want to alter values in the table, we can do so by using the set parameter in the update action. Here we want to multiply all the values in the app_id column by 10.
proc cas;
table.update
table={caslib="casuser", name="accepts_3"}
set={{var="app_id", value="app_id*10"}};
quit;
Deleting rows from a table
If we want to delete rows from a table we can use the deleteRows action specifying the condition to use in deciding whether to delete the rows. Here (after first creating a copy of the table as shown earlier) we delete all rows where the value of loan_amt is less than twenty thousand.
proc cas;
table.index
table={caslib="casuser", name="accepts_3"}
casout={caslib="casuser", name="accepts_4", replace=true};
quit;
proc cas;
table.deleterows
table={caslib="casuser", name="accepts_4", where="loan_amt<20000"};
quit;
Finally we terminate our session:
/* Terminate the session */
cas casconn terminate;
I would encourage you to try out some of the techniques shown in this article and if you have any questions or comments, please leave a message below.
... View more
08-15-2022
12:05 PM
Thanks @Quentin - unfortunately SAS Viya for Learners access is currently limited to those with an accredited university affiliation. Having said that I know SAS is a company which does listen to its customers so if there is enough interest maybe it will be opened up to independent learners in the future.
... View more
08-15-2022
05:20 AM
4 Likes
I often used to joke with younger members of one of the programming teams I worked with that “real programmers don’t use GUIs”. I was only half joking though as while I use GUIs (graphical user interfaces) for programming and I’ve been a big fan of SAS Enterprise Guide and SAS Viya’s visual programming products I still believe that in order to be a really good professional programmer nothing can beat actually learning to hand craft the code yourself rather than relying too much on code generators.
Therefore in this series of articles I shall be demonstrating how to use CASL (Cloud Analytical Services Language) in SAS Viya for Learners using SAS Studio. We will start by seeing how to initialize a CAS session, examine the many sample data sets provided, run a simple statistical process and terminate the session.
What is CASL?
CASL is the scripting language used to execute CAS actions in a CAS server. CAS actions perform tasks and are grouped together in action sets based on common functionality.
Starting a session
Start SAS Viya for Learners and open SAS Studio from the menu.
Create a new SAS program and start a CAS session. This can very simply be done by running the CAS statement and (optionally) naming the session. In this case we will call it CASCONN (if we don’t give it an explicit name the session will, by default, be called CASAUTO).
/* Connect to the cas server and call the connection casconn*/
cas casconn;
Discover which data sets are available
SAS Viya for Learners has a large number of sample data sets preloaded and organized into caslibs (libraries on the CAS server). We can discover what they are by assigning SAS librefs for all the existing caslibs. We can also run Proc Caslib to get details on the data sets held in a specific caslib.
/* Assign SAS librefs for all existing caslibs */
caslib _all_ assign;
/* List the tables in one of the sample caslibs */
proc casutil;
list tables incaslib="busana4m";
run;
Perform our first analysis
As previously mentioned, CAS actions operate on the data sets in the CAS server and are grouped into action sets. We will use the COUNTRIES data set in the BUSANA4M caslib and run the CORRELATION action in the SIMPLE action set to see if there is any correlation between years of education, UN id and Gross Domestic Product per head. This will produce a matrix showing the Pearson Coefficient for each combination of the variables and is equivalent to running Proc Corr in a non-CAS session. The code is quite simple, although the action set does have many options which you can add to vary the methods used.
/* Perform a simple correlation test */
proc cas;
simple.correlation /
table={name="countries" caslib="busana4m"}
inputs={"gdp_per_capita" "years_of_education" "un_id"};
run;
quit;
We can see from the results (highlighted in green) that, not surprisingly, there is a strong positive correlation between years of education and GDP per head and a negative correlation between UN ID number and GDP per head.
Terminate the session
Finally, once you have finished your work it is good practice to explicitly terminate your CAS session to free up resources on the server.
/* Terminate the session */
cas casconn terminate;
In future articles we will be taking a deeper look into CAS programming in SAS Viya for Learners but if there are any areas you would like covered or any other questions or comments, please leave a message below.
... View more
07-22-2022
12:06 PM
Ah I didn’t know you could do that - we use Apple Music as I have a bundle which gives us Apple TV+, free Apple Arcade games and a larger iCloud allowance included.
I suppose it all depends on what Spotify know about each individual account holder within the family account. It would certainly be interesting to know if you could identify demographics from song selection, not only from an academic point of view but from a marketing one as well…
... View more
07-22-2022
09:09 AM
That would be very interesting but unfortunately it’s not available in this data set. I have seen Spotify chart data by country so that should be possible but I think age groups would be difficult. Imagine a family scenario with one account holder (probably a parent) but every member of the family listening to different songs. You’d probably end up with Taylor Swift, Rammstein and Deep Purple all mixed in together with no way to programmatically split them up.
... View more
07-15-2022
04:31 AM
4 Likes
SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
Spotify long ago established itself as one of the worlds leading music streaming platforms. It seems that virtually every song you might wish to listen to is available on the platform and so it can be regarded as a good guide to what music is and is not popular.
In this edition of Free Data Friday we will be looking at data from the web site Kaggle which details the most popular songs on Spotify over a twenty year period.
Get the data
The data can be downloaded from Kaggle as a CSV file.
Get started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the data ready
The CSV file was imported into SAS with Proc Import.
filename reffile '/home/chris52brooks/Spotify/songs_normalize.csv';
proc import datafile=reffile
dbms=csv
out=songs
replace;
getnames=yes;
run;
I then used Proc SQL to see how many songs are included for each year in the span.
proc sql;
select
distinct year,
count(song) as num_songs
from songs
group by year;
quit;
I noticed that there were only a handful of songs for the years 1998 and 2020. I suspect that these figures are for only part-years and so I decided to delete those records in case they skewed any of my results.
proc sql;
delete from songs
where year = 1998 or year=2020;
quit;
The results
I first decided to discover who was the most popular artist.
proc sql;
create table artists
as select
distinct artist,
count(song) as num_songs
from songs
group by artist
order by num_songs desc;
quit;
ods graphics / reset;
proc sgplot data=artists(obs=10);
title1 "Top 10 Most Popular Artists on Spotify";
title2 "1999-2019";
footnote j=r "Data From: Kaggle";
hbar artist / response=num_songs
datalabel datalabelattrs=(weight=bold) categoryorder=respdesc ;
xaxis grid label="Total Songs";
yaxis grid label='Artist Name';
run;
The leading artist turned out to be Rihanna, followed by Drake and Eminem. I was a little surprised to see Katy Perry so high on the list as I hadn’t thought she’d had a long enough period of extended popularity to make the top ten but perhaps that reflects my own lack of knowledge of modern pop music! The other thing which surprised me was that all the top ten are solo acts. You have to go all the way down to joint thirteenth to find the first group – The Black Eyed Peas.
Next, I wanted to find the most popular genre of music. Here I had a problem since many songs are assigned to multiple genres. I found it impossible to designate a “major genre” programmatically so decided to leave them combined.
proc sql;
create table genres
as select
distinct genre,
count(song) as num_songs
from songs
group by genre
order by num_songs desc;
quit;
ods graphics / reset;
proc sgplot data=genres(obs=10);
title1 "Top 10 Most Popular Genres on Spotify";
title2 "1999-2019";
footnote j=r "Data From: Kaggle";
hbar genre / response=num_songs
datalabel datalabelattrs=(weight=bold) categoryorder=respdesc ;
xaxis grid label="Total Songs";
yaxis grid label='Genre Name';
run;
Not surprisingly “Pop” and its cross-over genres dominate the listing. The top non pop cross-over genre is hip hop. It’s clear from this that pop is by far the predominant genre listened to on Spotify.
Finally, I wated to see if there was any correlation between the various attributes of the top song. To do this I ran Proc Corr specifying the variables I wanted to analyse.
proc corr data=songs;
var danceability energy loudness valence popularity;
run;
Proc Corr generates a correlation matrix where the closer the value at the intersection is to one the more closely correlated are the variables. The most closely correlated values are (perhaps not surprisingly) loudness and energy.
Now it's your turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Hit the orange button below to see all the Free Data Friday articles.
Free Data Friday
... View more
- Find more articles tagged with:
- Data for learning
- Free Data Friday
06-04-2022
03:47 AM
Hi @NinaL I'd never heard of GEDCOM before but Wikipedia tells me it's a plain text file consisting of different record types holding genealogical data. I can't find anything which will convert them directly to SAS format but these are the type of files I used to work with a lot when I started on COBOL many years ago so it should be possible to write your own parser without too much difficulty providing you work out a good plan first (I see for example that an individual can have muliple birth dates shown where there is some doubt - you'd have to decide how you handle that).
It also appears that there are some freely available converters to e.g. Excel or JSON. It might be easier to first convert GEDCOM to one of them, import that file into SAS and then make whatever changes to the SAS file that you want, just google GEDCOM to whatever format you want to try.
... View more
06-03-2022
05:02 AM
2 Likes
SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
This year the United Kingdom is celebrating the platinum jubilee of the accession to the throne of Queen Elizabeth II. A reign of seventy years makes her the longest serving monarch in British history. One thing that is of vital importance in a monarchy is the line of succession, often expressed by a family tree showing the relationship between the monarch and his or her relatives. Of course we all have a family tree and many people take great interest in researching their family and depicting it graphically.
In this episode of Free Data Friday, we will be using data from Wikipedia with Proc Netdraw to draw the family tree of Queen Elizabeth II as an example of how you can use SAS to draw your own family tree or any other hierarchy such as an organisation chart, system menu diagram or object hierarchy.
Get the data
The data is transcribed from Wikipedia. There isn’t a huge amount of data to type in manually, so this isn’t a significant burden.
Get started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the data ready
The preparation necessary for drawing the family tree is all done in the creation of the data file to be used. Proc Netdraw is very versatile but for our purposes we need a data set with an activity (parent) variable and a successor (child) variable. Also, we need _x_ and _y_ variables to tell the procedure where to place the nodes and a _pattern variable to control the appearance of the nodes. Here is the code used to create the data set
data royalfamily;
length name $20 successor $20 _x_ 8. _y_ 8. _pattern 8.;
infile datalines delimiter="," missover;
input name $ successor $ _x_ _y_ _pattern;
datalines;
Elizabeth,Charles,1,3,1
Elizabeth,Anne,1,3,2
Elizabeth,Andrew,1,3,2
Elizabeth,Edward,1,3,2
Charles, ,2,3,1
Anne, ,2,6,2
Andrew, ,2,2,2
Edward, ,2,1,2
Charles,William,2,3,1
Charles,Harry,2,3,2
Anne,Peter & Zara,2,6,2
Peter & Zara, ,3,6,2
Andrew,Beatrice & Eugenie,2,2,2
Beatrice & Eugenie, ,3,2,2
Edward,Louise & James,2,1,2
Louise & James, ,3,1,2
William, ,3,3,1
Harry, ,3,5,2
William,George,3,3,1
William,Charlotte,3,3,2
William,Louis,3,3,2
Harry,Archie & Lilibet,3,5,2
Archie & Lilibet, ,4,5,2
George, ,4,3,1
Charlotte, ,4,4,2
Louis, ,4,2,2
;
run;
A few points to note
Because of space constraints on the page/screen I chose to omit spouses from the tree and to only show a single node for children of everyone except the eldest child in the direct line of succession. If you were creating your own family tree you may want to add them in, using separate pages if necessary. You can use an annotate data set if desired.
I decided to show the nodes for those individuals in the direct line of succession in a different color to the others, so I used values 1 and 2 in the _pattern field to do this.
You can see how the data pattern works in the following figure with the parent and child nodes of the direct line numbered (the _x and _y numbers are row and column numbers respectively in the resultant output). Ifyou accidentally omit a required node or specify an illogical hierarchy Proc Netdraw will generate an error.
This gives me a data set looking like this
The results
Drawing the family tree is now straightforward. After two pattern statements and a title statement I simply run Proc Netdraw specifying the activity and successor variables and setting text height.
pattern1 color=red;
pattern2 color=yellow;
title "British Royal Family";
proc netdraw data=royalfamily;
actnet / act=name
succ=(successor)
htext=1.5
;
run;
This generates the following chart.
Now it's your turn!
There is a lot more you can do with Proc Netdraw to design and enhance different types of network diagrams - why not build your own family tree and explore all the possibilities? I’m glad to answer any questions.
Hit the orange button below to see all the Free Data Friday articles.
Free Data Friday
... View more
- Find more articles tagged with:
- Data for learning
- Free Data Friday
05-10-2022
05:53 AM
1 Like
In a recent article in the Free Data Friday series, I demonstrated a simple approach to identifying and handling outliers in “classic” SAS using SAS OnDemand for Academics. There we used SAS Procedures (SGPlot and SQL) to deal with some unrealistically large numbers. In this article I will be using SAS Viya for Learners to show how we can do the same thing with the SAS Cloud Analytic Services Language (CASL) during the Data Preparation phase of a project by creating a Custom Data Transformation.
What is SAS Viya for Learners
SAS Viya for Learners is a free, cloud-based software offering which allows educators and students to explore the entire analytics life cycle from data preparation through analysis and vizualization - see here for more information and getting started.
The method
We will be using the Bank Customers data set which can be found by selecting Prepare Data from the SAS Viya for Learners Main Menu as shown below:
We need to use a Plan so choose New Plan:
This takes us to a screen where we can choose which data set to use - choose the Bank Customers Data Set from the left hand panel and click OK:
This returns us to the Prepare Data Screen with the Bank Customers file loaded ready to add Transforms to get our data ready for processing.
There are a large number of pre-built transformations available from the Transforms menu, however, none of them facilitate outlier handling so we will need to create our own. To do this we can double-click the Code option under Custom Transforms. This adds a Custom Code transform to the plan where we can choose Data Step or CASL from the drop-down menu. We need to choose CASL from the drop down menu and some boilerplate code will automatic be added to the code window.
This text can now be replaced with the code to carry out our outlier detection and handling. This can be seen below.
The important points to note are as follows:
Line 2 - Outlier handling is part of the Data Preprocess Action Set.
Line 3 - The input library and table names are both held in variables (_dp_InputCaslib and _dp_InputTable). This allows you to change the input table name without changing the code.
Line 4 - The variable hh_income is the one to be analysed for outliers.
Line 5 - There are a number of different outlier detection methods available. For the sake of simplicity I have chosen the Udflimits method. This allows me to specify minimum and maximum values outside of which values will be treated as outliers.
Line 6 - Any observations with values found to be outliers will be removed from the file.
Line 9 - The minimum and maximum values described at Line 4 are specified.
Line 10 - Outlier information can be written to a separate file.
Line11 - As with the input library and table name the output library ad table names are held in variables.
Finally when we run the plan, the transform will be exexcuted and observations outside the specified minimum and maximum ranges will be removed from the output file.
This is only one example of what you can do with Custom Transforms in both data step and CASL code. The next time you need to use a process which isn't covered by one of the pre-built transforms why not try creating one of your own?
... View more
- Find more articles tagged with:
- SAS Viya
Labels: