10-08-2024
ChrisBrooks
Ammonite | Level 13
Member since
05-29-2014
- 751 Posts
- 93 Likes Given
- 107 Solutions
- 242 Likes Received
Activity Feed for ChrisBrooks
- Got a Like for Re: Round with PROC SQL. 11-14-2024 12:45 AM
- Got a Like for Re: PROC SQL UNION. 08-10-2023 12:30 AM
- Posted Re: Fun with Integer Sequences and SAS on SAS Communities Library. 05-25-2023 04:05 AM
- Posted Re: Fun with Integer Sequences and SAS on SAS Communities Library. 05-23-2023 07:23 PM
- Posted Fun with Integer Sequences and SAS on SAS Communities Library. 05-19-2023 04:46 AM
- Posted What's in a Name? Exploring Album Titles with SAS on SAS Communities Library. 02-17-2023 03:59 AM
- Posted Plan Your Next Vacation with SAS on SAS Communities Library. 12-02-2022 03:52 AM
- Posted Re: Programming with CASL in SAS Viya for Learners – basic data manipulation on SAS Communities Library. 09-12-2022 06:34 PM
- Posted Programming with CASL in SAS Viya for Learners – basic data manipulation on SAS Communities Library. 09-12-2022 03:49 AM
- Posted Re: Programming with CASL in SAS Viya for Learners – first steps on SAS Communities Library. 08-15-2022 12:05 PM
- Posted Programming with CASL in SAS Viya for Learners – first steps on SAS Communities Library. 08-15-2022 05:20 AM
- Posted Re: Explore Twenty Years of Spotify Data with SAS on SAS Communities Library. 07-22-2022 12:06 PM
- Posted Re: Explore Twenty Years of Spotify Data with SAS on SAS Communities Library. 07-22-2022 09:09 AM
- Posted Explore Twenty Years of Spotify Data with SAS on SAS Communities Library. 07-15-2022 04:31 AM
- Posted Re: Draw Your Family Tree with SAS on SAS Communities Library. 06-04-2022 03:47 AM
- Posted Draw Your Family Tree with SAS on SAS Communities Library. 06-03-2022 05:02 AM
- Posted Build your first Custom Transformation in SAS Viya for Learners on SAS Communities Library. 05-10-2022 05:53 AM
- Posted Discover San Francisco's Favourite Street Food with SAS on SAS Communities Library. 03-18-2022 05:03 AM
- Got a Like for Re: delete files with sas Data step. 03-15-2022 12:46 PM
- Posted Tracking the rise and fall of an industry with SAS on SAS Communities Library. 02-18-2022 02:51 AM
01-23-2020
07:57 PM
Hi @tomrvincent - yes, there certainly will be a lag in some cases between purchase date and play date. Unfortunately I don't think Steam releases sales figures for individual games (certainly not at the time granularity we'd need for this), so I used the relative change in active players as what I think is a reasonable proxy. Of course, at least some of the increase in active players following an expansion pack release will be people returning to the game after a period of inactivity, but in any case an active player base seems a good pointer to continuing sales.
01-10-2020
06:33 AM
10 Likes
Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
If, like me, you have spent a large part of the Christmas holidays binge-watching the new Netflix series, The Witcher, you will probably be aware that before the TV series came the Witcher video games, and before the games came the books written by Polish author Andrzej Sapkowski. These sorts of crossover products have become increasingly common, with books, comics, games and film or TV titles all feeding interest in each other.
In this edition of Free Data Friday, we will use data from Steamdb – a web site providing data downloaded from Steam, the popular video game platform – to examine what factors influence the number of active players of the most recent game in the sequence, The Witcher 3: Wild Hunt.
Get the Data
There are three pieces of data we will be using from the Witcher 3 page on Steamdb – a CSV file of the maximum number of active players each day, a CSV of the price of the game each time it changed (Steam holds frequent sales, sometimes at quite deep discounts) and details of items of downloadable content (DLC). The DLC details are not available as a downloadable file, but that isn't a problem for our analysis.
I used two simple PROC Import steps to read the CSV files into SAS like so:
/* Files downloaded from https://steamdb.info/app/292030/graphs/ */
/* data to 29/12/2019 */
filename userfile '/folders/myshortcuts/Dropbox/chart.csv';
proc import datafile=userfile
dbms=csv
out=witcherstats
replace;
getnames=yes;
run;
filename costfile '/folders/myshortcuts/Dropbox/price-history-for-292030.csv';
proc import datafile=costfile
dbms=csv
out=witcherprice
replace;
getnames=yes;
run;
Get Started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the Data Ready
The files imported perfectly, but I decided to extract the date portion from the datetime field in both files, as I wasn't interested in the time portion (although that could form an interesting analysis in itself). There were also a handful of records in the player statistics file earlier than 19 May 2015, the game's official release date. Apart from one day, these showed a player count of one. I'm guessing this was either an error or some form of pre-release testing, so I deleted these earlier records.
data witcherstats(drop=datetime flags);
format date yymmdd10.;
set witcherstats;
date=input(substr(datetime,1,10),yymmdd10.);
run;
data witcherprice(drop=datetime);
format date yymmdd10.;
set witcherprice;
date=input(substr(datetime,1,10),yymmdd10.);
run;
proc sql;
delete
from witcherstats
where date < '19May2015'd;
quit;
Finally, I appended the two files together and extracted the day of the week from the date field.
data alldata;
length weekday $9;
set witcherstats witcherprice;
weekday=left(put(date,downame.));
run;
The Results
I decided to plot the maximum number of players per day against the price in order to see if price reductions encouraged more players (purchase data isn't available from Steamdb, but this feels like a reasonable proxy). The problem here is that we have such a long timeline that plotting it all on one chart would make it difficult to see the level of detail we need. I therefore decided to plot one year at a time. Also, I used a series plot for the player data but a step chart for the price data. You don't see step charts used very often, but they can be extremely useful where values move in "jumps" without passing through intermediate points, such as interest rates or, as in our case, prices.
Here’s the PROC SGPlot code for 2015 – to produce charts for other years simply amend the dates in the where clause:
ods graphics / reset imagemap;
title1 'The Witcher 3 - Player Numbers & Price';
title2 'Data for 2015';
footnote1 j=r 'Data from https://steamdb.info/app/292030/graphs/';
proc sgplot data=alldata(where=(date>='01Jan2015'd and date<='31Dec2015'd));
series x=date y=players / legendlabel="Maximum Number of Active Players";
step x=date y=final_price / y2axis lineattrs=(color=red) legendlabel="Price";
yaxis label="Maximum Number of Active Players" valuesformat=comma6.;
y2axis label= "Price";
xaxis label="Date";
run;
This is what the code produces for each year from 2015 to 2019:
There are a few points to note from these charts:
There appears to be a pattern of multiple peaks and troughs during each month which we should investigate; and
It does appear that temporary price reductions trigger an increase in players, but only in 2017 and beyond. This indicates that there may be little point in discounting prices early in a game's lifespan; and
There are two spikes (a small one in October 2015 and a large one in May 2016) uncorrelated with price reductions. The data from the DLC page tells us that these were the release dates of the two major expansion packs, "Hearts of Stone" and "Blood and Wine". Other minor pieces of downloadable content (new armour sets, appearances etc.) have no significant effect on the number of players; and
In late 2019 there is a massive spike in players, no doubt triggered by the release of the Netflix series based on the books.
In order to confirm some of these findings I used some SQL, firstly to check the monthly pattern of peaks and troughs. I suspected that there may be a difference in player numbers on different days of the week, so I averaged the number of players for each day of the week.
title1 "Average Number of Players on each Day of the Week";
proc sql;
select distinct weekday as Day_Of_the_Week,
avg(players) as Average_Number_of_Players format=comma9.
from alldata
group by weekday
order by Average_Number_of_Players desc;
quit;
This, not surprisingly, shows a very strong “weekend effect” with the number of active players peaking on Saturday and Sunday.
I also used SQL to check the number of active players during each price point for each year. In order to do this I had to merge the witcherstats file with the witcherprice file and fill in the missing values for the price where it didn't change from the previous value.
data merged(drop=pre_price);
merge witcherstats(in=a) witcherprice(in=b);
by date;
/* on days with no price record, carry the last known price forward */
if a and not b then final_price=pre_price;
pre_price=final_price;
retain pre_price;
run;
title1 "Number of Players at each Price Point";
title2;
proc sql;
select
distinct final_price as final_price,
year(date) as year,
avg(players) as avg_players format=comma9.
from merged
group by year, final_price
order by year desc, final_price desc;
quit;
This shows that only in 2015, the year of the game's release, was the highest number of players seen at the game's highest price point. In 2016 the record number was at the third lowest price point, and since then the highest number of players has always been seen at the lowest price point. This makes me question whether placing the game on sale in its first year to eighteen months is justified, as it appears not to increase the popularity of the game in terms of player numbers.
Now it's Your Turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Visit [[this link]] to see all the Free Data Friday articles.
12-10-2019
08:32 PM
@JackHamilton this is an intriguing problem. I've found an old paper from SUGI 31 which shows how you can create named ranges using the EXCEL engine. I've tried this with the XLSX engine and it works with that too. Unfortunately you can only create new ranges starting at the top left-hand cell, but if the range already exists elsewhere you can write to that. For illustration, something like the sketch below is what the paper describes.
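The workbook path and range name here are hypothetical:
libname xl excel "C:\temp\report.xls";
/* With the EXCEL libname engine, a member name without a trailing $
refers to a named range rather than a worksheet; in my testing the
XLSX engine behaved the same way */
data xl.myrange;
set sashelp.class;
run;
libname xl clear;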
I'll keep pondering this and come back if I find anything better.....
12-09-2019
08:19 PM
Hi @JackHamilton - I'm afraid I don't know of any straightforward way of writing to an Excel named range directly from SAS either. Some years ago I wrote a SAS Global Forum paper on using VBA with ADO to call SAS programs and exchange data, but that's possibly a little out of date now and maybe overkill for most scenarios. It does work well though, and I have used it a number of times for customers. The second part of the paper covers creating Excel add-ins to wrap these calls, where the user specifies the output range - which is the closest I can come up with 🙂
You and @tomrvincent are of course correct about Santa being more modern than 'time immemorial' but I think I can defend the argument for his predecessor in the job, St Nick, here. I checked Wikipedia and found that in English Common Law time immemorial is anything before 6 July 1189......
12-06-2019
07:27 PM
6 Likes
Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
Polls consistently show that one of the things people most enjoy about Christmas is spending time with their families. It’s a time when families gather, exchange presents, eat lots of good food and observe time-honoured family traditions. However, for some people this isn’t possible because they have to, or choose to, work on Christmas Day. In this edition of Free Data Friday we will be looking at an occasional series published by the UK Office for National Statistics (ONS) which tells us what percentage of workers turn up for work on Christmas Day.
Get the Data
The data is available for download as an XLS file from the ONS web site. The latest period for which the data is available is Christmas 2016.
Get Started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the Data Ready
Unfortunately, the ONS has published a number of different data tables on the same worksheet, which makes importing the data into SAS a little more complicated. In this scenario the easiest way to proceed is to use the RANGE statement in Proc Import. You can either use an explicit cell reference or an Excel range name. It's at this point that I have to make a confession – I am a former spreadsheet programmer, and I have always said that the key to good spreadsheet programming is using range names. For example, in our case, if the table we want is positioned in different cells in different years, we can still use the same program without amendment. This wouldn't be possible if we had used cell references. With this in mind I opened the spreadsheet, created range names for each of the tables and saved the file in the more modern XLSX format. You can find more information about using range names with Proc Import in an excellent blog post from Chris Hemedinger.
I decided to use two of the tables on the sheet, so I issued two Proc Import steps as follows:
proc import datafile="/folders/myshortcuts/Dropbox/worker_christmas.xlsx"
out=region_import
replace
dbms=xlsx;
range="region";
run;
proc import datafile="/folders/myshortcuts/Dropbox/worker_christmas.xlsx"
out=occupation_import
replace
dbms=xlsx;
range="occupation";
run;
This gave me two data sets looking like this
The next steps were to sort the regional file, renaming and dropping some variables, and to tidy up the occupation file, removing the codes at the start of the occupation names, stripping quotation marks, and similarly renaming and dropping variables.
proc sort data=region_import(where=(k ne 'Total'))
out=region(rename=(k=region as_percentage_of_region=perc) drop=level);
by descending as_percentage_of_region;
run;
data occupation(keep=occupation perc);
length occupation $40;
set occupation_import(rename=(as_percentage_of_occupation=perc));
occupation=dequote(strip(substr(f,6)));
run;
The Results
The data is already aggregated, so no further manipulation is required. Instead, I first printed the region table using Proc Print like so:
title1 'Who Works Over Christmas?';
title2 'UK Regional Percentages of People Working on Christmas Day';
footnote1 j=c 'Data from the UK Office for National Statistics';
footnote2 j=c 'https://www.ons.gov.uk';
proc print data=region noobs split="*";
label region='Region*';
label perc='Percentage Working*';
run;
This gave me the following listing
One interesting point from this is the position of Scotland at the head of the table. More than twice as many workers, in percentage terms, work in Scotland on Christmas Day as in London or Northern Ireland. It is true that in Scotland New Year is considered a more important holiday than it is in other parts of the country, and Christmas Day a little less so, but to my mind this doesn't wholly explain the disparity between Scotland and all the other regions.
For the occupation data set I decided to create a vertical bar chart. Here is the code I used followed by the output
ods graphics / imagemap;
title1 'Who Works Over Christmas?';
title2 'Top 15 Occupations Working on Christmas Day in the UK 2016';
footnote1 j=l 'Data from the UK Office for National Statistics';
footnote2 j=l 'https://www.ons.gov.uk';
proc sgplot data=occupation;
styleattrs
backcolor=biyg
wallcolor=biyg;
vbar occupation /
response=perc
categoryorder=respdesc
fillattrs=(color=red)
dataskin=sheen
datalabel=perc
datalabelpos=data
tip=(occupation perc)
tiplabel=("Occupation:" "Percentage Working:");
xaxis label="Occupation";
yaxis label="Percentage Working";
run;
My initial thought was that this chart would be dominated by the emergency services - fire, police and medical staff - but when I saw clergy at the top of the list it made perfect sense. In fact 51.4% seems a little low, even though this would be almost exclusively Christian clergy. The other surprise to me was the 8.5% of sales supervisors working when virtually all shops are closed on Christmas Day.
There is, of course, one occupation missing from this chart - it has only one person working in it and is a curious amalgam of delivery services, animal wrangling and toy manufacture. It is, as you might have guessed, the one man who has worked every Christmas Eve and Christmas Day since time immemorial - Santa Claus.
Now it's your Turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Visit [[this link]] to see all the Free Data Friday articles.
11-22-2019
07:57 PM
4 Likes
Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
You may be aware that there is currently a general election campaign underway in the UK. These tend to be quite lively affairs and one story in particular caught my attention recently. Jo Swinson, the leader of the Liberal Democrat Party, was the subject of a spoof article on a web site claiming that she was in the habit of shooting squirrels in her garden with a slingshot. Bizarrely this story gained traction on social media with so many people believing it to be true that she was forced to deny it in a radio interview.
At the same time, I was noticing that in the park where I regularly walk our dog, the squirrel population was booming. They seemed to be everywhere, usually being chased up trees by dogs who thankfully weren't fast enough to catch them! This led me to think about squirrels generally, and so I was intrigued to find that volunteers had carried out a squirrel census in New York's Central Park and published the data in New York's Open Data Portal. In this edition of Free Data Friday we will take a look at this data to examine the distribution of the squirrel population by its primary fur color.
Get the Data
You can download the data from the New York City Open Data portal as a CSV file and import it into SAS with the following PROC IMPORT code
filename sqfile '/folders/myshortcuts/Dropbox/2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv';
proc import datafile=sqfile
dbms=csv
out=work.squirrels
replace;
getnames=yes;
datarow=2;
guessingrows=500;
run;
The data imports without any issues and has a large number of fields relating to fur color, behaviour, sound and location.
Get Started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the Data Ready
Normally, my go-to procedure for basic data analysis is PROC MEANS. However, given that nearly all of the fields in this file contain categorical variables, I will be using a simple PROC FREQ to create counts and percentages of the squirrel population by primary fur color.
proc freq data=squirrels noprint;
tables primary_fur_color / out=colorfreq;
run;
This gives me a file looking like this:
Gray squirrels are by far the most numerous, followed by cinnamon and then black squirrels. A little Google searching told me that cinnamon and black squirrels are not separate species from gray squirrels, but are carriers of recessive genes that alter their fur color. We will ignore the 55 records for which no color is recorded.
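PROC FREQ already leaves missing values out of the percentages by default, but if you'd rather drop those 55 records explicitly, a WHERE statement does the job - a minimal sketch:
proc freq data=squirrels noprint;
/* keep only records where a primary fur color was recorded */
where not missing(primary_fur_color);
tables primary_fur_color / out=colorfreq;
run;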
The Results
Firstly I decided to create a vertical bar chart of the data in the color frequency file. I wanted the color of the bars to match the fur color (gray for the gray category, cinnamon for the cinnamon category and black for the black category). In order to do that I created an attribute map data set. When I want to use specific colours I normally get the codes from this page on the cloford.com web site.
data getattrs;
retain id 'Color';
infile datalines dlm=",";
length value $20 fillcolor $8;
input value fillcolor;
datalines;
Cinnamon,cxD2691E
Gray,cx808080
Black,cx000000
;
run;
Now that I have my attribute map here is the code for the bar chart along with the output:
ods graphics on;
title1 'Squirrel Population in Central Park';
title2 'Primary Fur Color Percentages';
footnote1 j=l 'Data from The Squirrel Census';
footnote2 j=l 'https://www.thesquirrelcensus.com/';
footnote3 j=r 'Data Available from NYC Open Data';
proc sgplot data=colorfreq dattrmap=getattrs noautolegend;
vbar primary_fur_color / response=percent
group=primary_fur_color
groupdisplay=cluster
dataskin=matte
datalabel=percent
attrid=color;
yaxis label="Percentage of Total";
xaxis label="Primary Fur Color";
run;
Following on from that, it seemed that the distribution of squirrels around such a large park might be interesting, so I used PROC SGMap to map the locations of the squirrels. Here is the code and map for gray squirrels:
ods graphics on;
title1 'Squirrel Population in Central Park';
title2 'Location of Gray Squirrels';
footnote1 j=l 'Data from The Squirrel Census';
footnote2 j=l 'https://www.thesquirrelcensus.com/';
footnote3 j=r 'Data Available from NYC Open Data';
proc sgmap plotdata=squirrels(where=(primary_fur_color="Gray")) noautolegend;
openstreetmap;
scatter x=x y=y /group=primary_fur_color markerattrs=(color=cx808080);
run;
We can see that gray squirrels are, not surprisingly, everywhere in the park! There are no discernible clusters.
Here is the corresponding code and output for cinnamon colored squirrels:
title2 'Location of Cinnamon Squirrels';
proc sgmap plotdata=squirrels(where=(primary_fur_color="Cinnamon")) noautolegend;
openstreetmap;
scatter x=x y=y /group=primary_fur_color markerattrs=(color=cxD2691E);
run;
Firstly, notice how I have only changed title2 - I want all the other titles and footnotes to be the same as on the first map, and they will carry forward onto all subsequent output until they are reset or changed. This map is more interesting: there is a definite trend for cinnamon squirrels to be seen more in the southern half of the park than the north, with a concentration in the extreme southern area.
Finally here is the code and map for black squirrels:
title2 'Location of Black Squirrels';
proc sgmap plotdata=squirrels(where=(primary_fur_color="Black")) noautolegend;
openstreetmap;
scatter x=x y=y /group=primary_fur_color markerattrs=(color=cx000000);
run;
This is the most interesting map of all - there are two clusters of black squirrels: one in the extreme south-east corner of the park and one in the north, with only a small number in the central area. I can't say I'm familiar enough with Central Park to say for sure, but it might be interesting if a genetic test could be carried out on the black squirrel population to see if they form two separate communities, or if they are related and have simply migrated to these clusters. If the latter is true, then why are there so few in the central part of the park?
Now it's your Turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Visit [[this link]] to see all the Free Data Friday articles.
11-18-2019
06:55 PM
Hi @tomrvincent - there's quite a lot of shooting incident data held on individual city data portals, for example New York's, which goes down to borough and precinct level. It should be possible to find a suitable shapefile to import into SAS and create a choropleth map for gun crime, along the lines of the sketch below. I don't know if the data is easily available at state or national level though.
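A minimal sketch of that workflow - the shapefile path, response data set and variable names here are all hypothetical:
/* convert a precinct shapefile into a SAS map data set */
proc mapimport out=precincts datafile="/folders/myshortcuts/Dropbox/nypd_precincts.shp";
run;
/* choropleth of incident counts, assuming incident_counts has
precinct and total_incidents variables */
proc sgmap mapdata=precincts maprespdata=incident_counts;
choromap total_incidents / mapid=precinct name="choro";
gradlegend "choro" / title="Shooting Incidents";
run;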
11-08-2019
07:41 PM
6 Likes
Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
While it is true to say that the United Kingdom has very little gun crime, it is also true to say that human ingenuity in finding ways to harm each other seems to know no bounds. There has been, over the last few years, increasing public concern over the level of knife crime in the country, particularly in London, where it is often said to have reached epidemic proportions.
In this edition of Free Data Friday, we will be looking at knife crime data from the House of Commons Library to map incidents of knife crime across the London boroughs to try to establish if there is any geographical pattern to this epidemic.
Get the Data
You can download the data as an Excel file from the House of Commons Library along with an explanatory PDF report. We will also need a shapefile for London showing borough boundaries. This can be downloaded from the London Datastore – note that the page states that while we are free to use the shapefiles to create maps, we must include two copyright notices on the output.
Get Started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the Data Ready
We will be using Proc SGMap, which can't directly use shapefiles, so firstly we must use Proc Mapimport to convert the shapefile into a SAS map data set.
%let mapfile=/folders/myshortcuts/Dropbox/statistical-gis-boundaries-london/ESRI/;
%let mapfile=&mapfile.London_Borough_Excluding_MHW.shp;
proc mapimport out=london datafile="&mapfile";
run;
Having done that, we can open the downloaded data file using the XLSX engine and read the required table into a SAS data set. As is usual in a spreadsheet, there are header rows and a grand total we don't need. In addition, we can ignore the value for Heathrow Airport - it is part of the London borough of Hillingdon, but the number of incidents there is tiny. So we can just read in the data from row 6 to row 38. We will also rename some of the variables and allocate the data to category bands (each total will be unique, which creates an issue with choropleth maps, as we will see later; creating bands is one way around the problem).
libname knife xlsx "/folders/myshortcuts/Dropbox/CBPSN04304.download.xlsx";
options validvarname=any;
data borough(keep=name total category);
length name $50 category $7;
set knife."T.A4C"n;
if _n_ > 5 and _n_ < 39 then do;
name=A;
total=input(F,8.);
if total < 250 then category="0-250";
else if total < 500 then category="250-500";
else if total < 750 then category="500-750";
else category="750+";
output;
end;
run;
One further point to note about the data is that it does not include figures for the City of London. This is the main financial district and, although it comprises only a little over one square mile, it has its own form of government and police force. It is a city in its own right and not a London borough, which explains its absence from the data.
Designing the Map
Creating any kind of visualization is always an iterative process. I tend to start with the simplest possible design and build on that, while trying not to let the chart or map become too complicated. Firstly, we'll see what a bare choropleth of the imported shapefile looks like. I do this because I find that some imported shapefiles just don't look "right" for my purposes.
proc sgmap mapdata=london;
choromap / mapid=name;
run;
This looks fine so we shall flesh it out with some data:
proc sgmap mapdata=london maprespdata=borough;
choromap category/ mapid=name;
run;
There are a few problems with this map:
The boroughs are not named so unless you are familiar with London the map is of limited use; and
The colours are not intuitive - you must consult the legend before having any idea which areas have higher figures than others; and
The map has no title, so we don't know what data is displayed or what the source is; and
The copyright notices are missing; and
The legend starts with higher figures at the left - I would expect the lower figures to be first.
We can start the process of improving the map by adding the borough names. We can do this by using the centroid macro supplied with SAS University Edition to determine the central point of each borough and then using the TEXT statement in PROC SGMap to display the borough names at those points:
proc sort data=london out=london_sort;
by name;
run;
%centroid(london_sort, centres,name);
proc sgmap mapdata=london maprespdata=borough plotdata=centres;
choromap category/ mapid=name;
text x=x y=y text=name;
run;
This wasn't a resounding success - the names were placed in the correct positions, but they are too long for the borough areas, so there is a lot of overlap. Instead of using names, I decided to use numbers for the boroughs:
data centres;
set centres;
num=_n_;
run;
proc sgmap mapdata=london maprespdata=borough plotdata=centres;
choromap category/ mapid=name name="choro";
text x=x y=y text=num;
keylegend "choro" / title="Total Incidents";
run;
This is much better, but we still have some issues to address.
Adding the title, data source and copyright notices is straightforward; and
We can make the colors more intuitive by using a gradlegend instead of a keylegend - this gives us a gradient that is not only a more accurate depiction of the relative values, but also allows us to tell at a glance which boroughs have the higher values. Note that a gradient legend needs a numeric response, so the final code maps the raw total rather than the category bands; and
A simple PROC Print provides a key to the borough numbers added in the previous step:
proc sort data=borough out=borough_sorted;
by category;
run;
title1 "London Knife Crime Incidents by Borough - 2018/19";
title2 "Excludes data for City of London & Heathrow Airport";
footnote1 j=l 'Data from the UK House of Commons Library';
footnote2 j=l 'Contains National Statistics data © Crown copyright and database right [2015]';
footnote3 j=l 'Contains Ordnance Survey data © Crown copyright and database right [2015]';
proc sgmap mapdata=london maprespdata=borough_sorted plotdata=centres;
choromap total/ mapid=name name="choro";
text x=x y=y text=num;
gradlegend "choro" / title="Incidents";
run;
title;
footnote;
proc print data=centres noobs;
var num name;
run;
The Results
We can see from the final map that, generally speaking, the number of incidents rises the closer you get to the centre of London, with the largest value being for borough 33 - Westminster. This is about as close to the centre of London as you can get. In addition, boroughs south of the river Thames tend to have less knife crime than those north of the river. Having said all this, London is still an incredibly safe place to visit, work and live in, so please don't let this dissuade you from including it in your holiday itinerary!
Now it's your Turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Visit [[this link]] to see all the Free Data Friday articles.
10-25-2019
07:24 PM
Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
Muir Woods, Part of Golden Gate NRA
Thanks to a bill signed in 1872 by President Ulysses S. Grant, which created Yellowstone, the first of many such parks, millions of Americans have enjoyed these national parks and historic monuments. In 1916, President Woodrow Wilson created the National Park Service to manage these places and ensure their conservation for future generations.
This edition of Free Data Friday uses decades of park service data to discover which site has attracted the most recreational visitors.
Get the Data
The data can be downloaded in a number of formats from the National Parks Service Statistics viewer. I chose National Reports from this page and Annual Summary Report 1904-Last Calendar Year. This presents you with a viewer and options for the level of detail to view and download. The file contains data for many types of visitors (recreational and non-recreational, campers, visitors in RVs etc). I chose to download an Excel file but CSV is also available.
Get Started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the Data Ready
From the Excel file I deleted some header rows at the top and the summary rows from the bottom, and renamed both the workbook and worksheet for convenience. I saved the resulting file in XLSX format so that I could use the XLSX engine to read it.
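As an alternative to hand-editing the workbook, Proc Import's RANGE statement (used in the Christmas workers article above) can skip the header and summary rows directly - a sketch, with a hypothetical sheet name and cell range:
proc import datafile="/folders/myshortcuts/Dropbox/NationalParksVisitors.xlsx"
out=visitorstats
replace
dbms=xlsx;
/* hypothetical sheet name and cell range - adjust to the real layout */
range="AnnualSummary$A4:F12000";
run;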
The Results
I decided to concentrate on recreational visitors and to look first at which park had received the most all-time recreational visitors. This was a simple task - firstly sorting the file by park name, then running PROC Means to aggregate recreational visitors for each park and then re-sorting by the total number of such visitors. Finally I ran PROC SGPlot to create a horizontal bar chart of the top ten parks. Here is the code I used followed by the chart generated:
options validvarname=any;
libname visitors xlsx "/folders/myshortcuts/Dropbox/NationalParksVisitors.xlsx";
proc sort data=visitors.visitorstats out=visitorsorted;
by parkname;
run;
proc means data=visitorsorted noprint sum;
by parkname;
var recreationvisitors;
output out=recvisitors sum=totrecvisitors;
run;
proc sort data=recvisitors;
by descending totrecvisitors ;
run;
ods graphics / imagemap;
title1 'US National Parks & Monuments';
title2 'All-Time Most Popular for Recreational Visitors';
footnote1 j=l 'Data from US National Parks Service';
footnote2 j=l 'https://irma.nps.gov/Stats/';
proc sgplot data=recvisitors(obs=10);
hbar parkname / response=totrecvisitors
categoryorder=respdesc
dataskin=pressed fillattrs=(color=vpab)
tip=(parkname totrecvisitors)
tiplabel=("Name:" "Total Visitors:")
tipformat=(auto comma12.0);
xaxis label="Number of Recreational Visitors" fitpolicy=none;
yaxis label="Park/Monument Name";
run;
We can see from this chart that Blue Ridge Parkway is by far the most visited park, but this isn't the whole story - parks were established at different times, so naturally the longer-established parks have an advantage. In order to level the playing field a little, I filtered the data to the most recent year available, 2018. Here is the code and output.
proc sort data=visitors.visitorstats(where=(year=2018)) out=visitors2018;
by descending recreationvisitors;
run;
ods graphics / imagemap;
title1 'US National Parks & Monuments';
title2 'Most Popular for Recreational Visitors in 2018';
footnote1 j=l 'Data from US National Parks Service';
footnote2 j=l 'https://irma.nps.gov/Stats/';
proc sgplot data=visitors2018(obs=10);
hbar parkname / response=recreationvisitors
categoryorder=respdesc
dataskin=pressed fillattrs=(color=vpab)
tip=(parkname recreationvisitors)
tiplabel=("Name:" "Total Visitors:")
tipformat=(auto comma10.0);
xaxis label="Number of Recreational Visitors" fitpolicy=none;
yaxis label="Park/Monument Name";
run;
Here we can see that in 2018 Golden Gate was the most popular park, closely followed by Blue Ridge. The file contains data for Blue Ridge from 1941 (shortly after its formation in 1936), whereas Golden Gate (formed in 1972) has data from 1973. It seemed to me that for Golden Gate to reach second in the all-time list and top spot in 2018 was a remarkable achievement, considering it was established only fairly recently in historical terms. We can, however, get a clearer picture by seeing how often each park achieved top spot in the rankings.
proc sql;
create table topbyyear
as select year, parkname, max(recreationvisitors) as visitornum format=comma10.
from visitorsorted
group by year
having recreationvisitors=max(recreationvisitors);
quit;
proc sql;
create table toppark
as select parkname, count(parkname) as topcount
from topbyyear
group by parkname
order by topcount desc;
quit;
ods graphics / imagemap;
title1 'US National Parks & Monuments';
title2 'Number of Years in Top Place for Recreational Visitors';
footnote1 j=l 'Data from US National Parks Service';
footnote2 j=l 'https://irma.nps.gov/Stats/';
proc sgplot data=toppark;
hbar parkname / response=topcount
categoryorder=respdesc
dataskin=pressed fillattrs=(color=vpab)
tip=(parkname topcount)
tiplabel=("Name:" "Times Top Park:")
tipformat=(auto comma10.0);
xaxis label="Number of Times Top Park" fitpolicy=none;
yaxis label="Park/Monument Name";
run;
Again, Blue Ridge is far ahead of the pack, making it consistently America's most visited park. Golden Gate is a long way behind, but still ahead of many parks which were established earlier. Finally, let's chart the attendance records of Blue Ridge and Golden Gate against each other.
ods graphics / imagemap;
title1 'US National Parks & Monuments';
title2 'Blue Ridge v Golden Gate Visitors Timeline';
footnote1 j=l 'Data from US National Parks Service';
footnote2 j=l 'https://irma.nps.gov/Stats/';
proc sgplot data=visitorsorted(where=(parkname in ("Blue Ridge PKWY" "Golden Gate NRA")));
series x=year y=recreationvisitors /
group=parkname name="pname"
markers
tip=(parkname year recreationvisitors)
tiplabel=("Name:" "year" "Total Visitors:")
tipformat=(auto auto comma10.0);
keylegend "pname" / type=markersymbol;
keylegend "pname" / type=linecolor;
run;
There are several points to note here:
Blue Ridge's visitor numbers grew steadily from its establishment until 2002, after which numbers declined until achieving stability relatively recently; and
Golden Gate saw explosive growth in visitor numbers during its early years, followed by a sharp decline after 1988, and has been relatively stable since; and
The last few years have seen a very close battle between Blue Ridge and Golden Gate, with neither park achieving supremacy. In fact, the topbyyear data set tells us that one of these two parks has occupied top spot every year since 1965, when National Capital Parks Combined took the number one place!
So, having completed my analysis I took a look at the chart showing the top parks for 2018 and realised I have visited five of the top ten:
Golden Gate
The Lincoln Memorial
The George Washington Memorial
The Grand Canyon
The Vietnam Veterans Memorial
Considering I'm British and have lived in the UK all my life - all of these visits took place during holidays and trips to the US for SAS Global Forum - I think five out of ten is pretty good. Tell us in the comments below how many you have visited.
Now it's your Turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Visit [[this link]] to see all the Free Data Friday articles.
10-24-2019
06:29 PM
Hi @tomrvincent there are a lot of very strange things going on in British politics at the moment. Bizarrely the opposition have twice blocked the holding of an early general election. Imagine Trump offering the Democrats an early presidential election and them turning it down....
As the saying goes "If you're not confused, you don't really understand the situation"!
10-11-2019
07:24 PM
1 Like
Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
British politics has been dominated by one subject for the last three years – Brexit. The unexpected victory of leave voters in the 2016 European Union referendum has pushed all other topics to the periphery.
Since that vote, two prime ministers have resigned, the ruling Conservative Party has lost its overall majority in the British Parliament, there have been large-scale defections from both Conservative and Labour parties in the House of Commons and new parties have risen and fallen at a rapid pace.
Now, with the October 31 deadline for leaving the EU getting closer and closer, another general election is expected at any time. One feature of this election is expected to be the geographical split between leave and remain areas.
Although the UK voted by a margin of 52% to 48% to leave, that narrow result masks wide geographical differences. In this edition of Free Data Friday, we will be using results data from the UK Electoral Commission to look at a way of measuring the degree of difference between areas in that vote.
Get the Data
You can download the results of the vote from the UK Electoral Commission web site. The file can be downloaded in CSV format and imported into SAS with PROC Import.
Get Started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the Data Ready
The file imported without any significant issues. There are a couple of minor points which we will need to address:
The British Overseas Territory of Gibraltar is part of the EU by virtue of its relationship to the UK and therefore took part in the vote. However, it is not represented in the UK Parliament and therefore we will exclude it from our calculations; and
Some of the fields which should be numeric were imported as character fields and will be converted.
Here is the code for the import and data cleaning:
filename reffile '/folders/myshortcuts/Dropbox/EU-referendum-result-data.csv';
proc import datafile=reffile
dbms=csv
out=import
replace;
getnames=yes;
guessingrows=700;
run;
/* Area Code GI is Gibraltar */
data area_results(keep=region area valid_votes remain leave);
set import(where=(area_code ne "GI"));
new_remain = input(remain, 8.);
drop remain;
rename new_remain=remain;
new_leave = input(leave, 8.);
drop leave;
rename new_leave=leave;
new_valid = input(valid_votes, 8.);
drop valid_votes;
rename new_valid=valid_votes;
run;
This is what the imported file looks like:
The Index of Dissimilarity
The Index of Dissimilarity can be used to determine the percentage of one constituent part of the population which would have to move areas to achieve a uniform geographic distribution amongst the sub-areas of a larger area. It is often used to measure racial balance across areas within a state or country, or gender balance within occupations.
In our case, an index of zero would mean that every counting area had a 52% leave to 48% remain split identical to the overall total. The method of calculating this was discussed in a 2016 SAS Communities forum thread, and we will be using the PROC SQL statement from that thread to perform the calculation.
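For reference, this is the quantity the SQL below computes, where $l_i$ and $r_i$ are the leave and remain votes in counting area $i$, and $L$ and $R$ are their national totals:

$$D = \frac{1}{2}\sum_i \left| \frac{l_i}{L} - \frac{r_i}{R} \right|$$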
The Results
Here is the PROC SQL statement referred to earlier which calculates the index:
proc sql;
create table uk_index as
select *, leave/sum(leave) as var1, remain/sum(remain) as var2
from area_results;
select 0.5*sum(abs(var1-var2)) as d1
from uk_index;
quit;
Here is the result:
This means that 16.5% of UK leave voters would have to move counting areas for the vote split to be uniform across all areas. On a turnout of 33,551,983 that is over 5.5 million people which is a significant imbalance.
This isn't the whole story, however - while England and Wales voted for leave, the other two constituent parts of the UK, Scotland and Northern Ireland, voted for remain. I decided to see if the internal dissimilarity in England, Wales and Scotland was roughly the same (Northern Ireland was a single counting area, so we can't calculate an index for it).
Here is the code for those calculations:
data regional_results_eng;
set area_results(where=(region not in ("Scotland", "Wales", "Northern Ireland")));
run;
proc sql;
create table eng_index as
select *, leave/sum(leave) as var1, remain/sum(remain) as var2
from regional_results_eng
;
select 0.5*sum(abs(var1-var2)) as d1
from eng_index;
quit;
data regional_results_scot;
set area_results(where=(region ="Scotland"));
run;
proc sql;
create table scot_index as
select *, leave/sum(leave) as var1, remain/sum(remain) as var2
from regional_results_scot
;
select 0.5*sum(abs(var1-var2)) as d1
from scot_index;
quit;
data regional_results_wales;
set area_results(where=(region ="Wales"));
run;
proc sql;
create table wales_index as
select *, leave/sum(leave) as var1, remain/sum(remain) as var2
from regional_results_wales
;
select 0.5*sum(abs(var1-var2)) as d1
from wales_index;
quit;
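As an aside, the three blocks above differ only in the region filter, so they could be collapsed into a small macro - a sketch using the same logic (note the IN list uses spaces rather than commas, so the filter passes cleanly as a macro argument):
%macro dissim_index(out,region_filter);
data regional_results;
set area_results(where=(&region_filter));
run;
proc sql;
create table &out as
select *, leave/sum(leave) as var1, remain/sum(remain) as var2
from regional_results;
select 0.5*sum(abs(var1-var2)) as d1
from &out;
quit;
%mend dissim_index;
%dissim_index(eng_index,region not in ("Scotland" "Wales" "Northern Ireland"));
%dissim_index(scot_index,region="Scotland");
%dissim_index(wales_index,region="Wales");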
Here is the index value for England:
A score of 15.5% is pretty close to the UK total (perhaps not surprising given the size of England's population relative to the UK total).
This is Scotland's score:
This is a lot less than the UK and England-only values and, of course, with remain having won in Scotland, only a shade over 273,000 remain voters would have to move areas to achieve uniformity.
Finally Wales' index is:
This is the smallest of the three index values - only just over 144,700 leave voters would need to move to achieve a perfect zero index.
The question is, then, what does this mean for the upcoming election? The results imply that, especially in England, leave areas are more pro-leave than the country as a whole and remain areas more pro-remain. If people vote along strict leave/remain lines, this is likely to lead to a very polarised result, particularly if parties campaign with a strategy of mobilising their base vote rather than winning over opposition voters. Having said that, events are moving at a very rapid pace and, as is often said, the only certain thing with Brexit is that nothing is certain!
Now it's your Turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Visit [[this link]] to see all the Free Data Friday articles.
09-30-2019
06:51 PM
2 Likes
Hi @ballardw - yes geography is important. For example here in Wales sheep farming is a huge part of the agricultural sector. It's mostly hill farming on land where growing crops or keeping other livestock would probably be impractical. I suspect if Wales wasn't such a hilly country there'd be more dairy or beef farming.
I confess I've never tried goat meat and I've never even seen it for sale here; lamb, naturally, is my favourite....
09-27-2019
07:47 PM
2 Likes
Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option. Hit the orange button below to start your journey with SAS OnDemand for Academics:
Access Now
Many, if not most, of us will at some point have used census data in the course of our work or study. Governments have been conducting censuses of their populations for centuries, but there are other types of census.
Eurostat (the statistical office of the European Union) conducts a census of agricultural animals present in the EU and candidate member countries. In this edition of Free Data Friday, we will be looking at that census data to find out where these animals are and how many of them there are.
Get the Data
The report of the census contains links to the individual data tables for four species of animal – cattle, pigs, sheep and goats. These tables can be downloaded in a number of formats. I chose to download the tables as CSV files and import them into SAS using PROC Import.
Get Started with SAS OnDemand for Academics
In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps:
Get Started
Getting the Data Ready
Given that I had four separate files that are all of the same format, I decided to create a macro to handle the import, data cleaning, and graphing of the results. Here is the macro code along with the calls to the macro for the import of each of the files:
%macro import_lstock(animal_name,long_desc,common_name);
%let reffile=/folders/myshortcuts/Dropbox/&animal_name..csv;
filename reffile "&reffile";
proc import datafile=reffile
dbms=csv
out=&animal_name;
getnames=yes;
datarow=2;
guessingrows=100;
run;
data live_&animal_name;
length num_animals 8;
format num_animals comma8.;
set &animal_name(where=(animals="Live &long_desc"
and geo not like "European%" and time="2018"));
if value=":" then value=missing;
if scan(geo,1)="Germany" then geo="Germany";
if scan(geo,1)="Kosovo" then geo="Kosovo";
num_animals=input(value,comma8.);
run;
proc sort data=live_&animal_name out=sorted_&animal_name;
by num_animals;
run;
ods graphics / reset width=6.4in height=8in imagemap;
title "Number of &common_name in each Eurostat Country - 2018";
footnote j=l 'Data from Eurostat';
proc sgplot data=sorted_&animal_name;
styleattrs backcolor=lightblue wallcolor=lightblue;
hbar geo / response=num_animals datalabel fillattrs=(color=gold);
yaxis discreteorder=data reverse label="Country";
xaxis grid label="Number of Animals (thousands)";
run;
ods graphics / reset;
%mend import_lstock;
%import_lstock(cows,bovine animals,Cows);
%import_lstock(sheep,sheep,Sheep);
%import_lstock(pigs,swine%str(,) domestic species,Pigs);
%import_lstock(goats,goats,Goats);
The PROC Import was straightforward. However, the files contained some aggregate records for EU totals that had to be removed in the following data step, along with some text in certain country names which was, for our purposes, extraneous (i.e. explanations concerning German reunification, Kosovo and Macedonia). Additionally, missing values were, for some reason, represented by a colon, so those needed to be handled. The files were then individually sorted and charted with PROC SGPlot.
The Results
Here is the output from the PROC SGPlot call for each of the files:
Who leads in cows, sheep, pigs, and goats?
Taking them in order, France is the country with the most cows, the United Kingdom has the most sheep, Spain the most pigs and Greece is home to the most goats. Other significant points from the results are:
Slightly surprisingly to me, Turkey has the second largest number of cows in the Eurostat countries; and
Spain figures highly in all four graphs coming in first in one category and second in two others; and
Despite the fame of Danish bacon, Denmark is only fourth in the list of pig rearing countries; and
Countries such as Germany and the United Kingdom, which figure heavily in the first three charts, fall a long way down in the final chart of goat population.
I then decided to find the total of each animal across all the Eurostat countries. Firstly, I appended the four files into one file and then used PROC SQL for the summation; as the numbers in the downloaded files are in thousands, I multiplied the totals by 1,000. Here is the code and the output:
%if %sysfunc(exist(all_animals)) %then %do;
proc datasets lib=work;
delete all_animals;
quit;
%end;
proc append base=all_animals data=live_cows;
run;
proc append base=all_animals data=live_sheep force;
run;
proc append base=all_animals data=live_pigs force;
run;
proc append base=all_animals data=live_goats force;
run;
title "Total Number of Farm Animals in Eurostat Countries";
proc sql;
select animals, sum(num_animals)*1000 as total format=comma12.
from all_animals
group by animals
order by animals;
quit;
The most common farm animals in the Eurostat countries are pigs (live swine) with 153 million animals, a number larger than the population of any of the individual countries.
Now it's your Turn!
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Visit [[this link]] to see all the Free Data Friday articles.
09-19-2019
06:32 AM
I think the problem is that, when there is no record for the snapshot date in the data set, the where option will ensure that the SET statement returns no observations. This means that the rest of the step will not be executed.
One way around this would be to use Proc SQL to return the latest date in the data set and then use that in the where option to ensure something is returned (as long as the data set isn't entirely empty). You could then compare that date with the current date to determine what action to take. Here is a simple example using a data set from the SASHELP library to give you an idea of how to proceed:
/* Doesn't Work */
data _null_;
set sashelp.air(where=(date=today()));
put "here";
run;
/* Works */
proc sql;
select max(date)
into :maxdate
from sashelp.air;
quit;
data _null_;
set sashelp.air(where=(date=&maxdate));
if date=today() then do;
put date=;
end;
else do;
put "New Record Needed";
end;
run;
09-15-2019
06:51 PM
1 Like
Hi @lexb57 and welcome to the community. Firstly, many thanks for taking the time to read the article, for your kind words of appreciation, and for asking a very interesting question!
I confess I am a SAS developer by training and not a statistician, so it is quite possible that someone else can give a better answer than I can, but I can give you an idea of one way you might approach this.
You can see from the article that we only have two years' data, so I think you'd firstly need a lot more data to be able to judge any trend. Assuming, however, that we had that, I'd take, say, the number of bites attributed to each of the top ten breeds and use Proc SGPlot to create a scatter plot of the number of bites per year for each breed in turn, adding a REG statement to fit a regression line, which should give you a picture of the trend in bite numbers. Of course, a serious study would want to include the number of dogs of each breed in the population for each year, but as I mentioned in the article, that data doesn't seem to be available. A sketch of the idea follows.
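A minimal sketch of that approach - the data set name, breed value and variable names here are hypothetical:
title "Trend in Reported Bites";
proc sgplot data=bites(where=(breed="Pit Bull"));
/* one point per year, with a fitted regression line over the top */
scatter x=year y=num_bites;
reg x=year y=num_bites / lineattrs=(color=red);
xaxis label="Year";
yaxis label="Number of Bites";
run;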