About DarthPathos

DarthPathos · ‎02-19-2017

I'm getting the same behaviour and didn't realise that's what was causing it (also MacBook, Chrome, VirtualBox, current version of SAS UE). @AmyP_sas - The steps I followed for the odd join behaviour were the same as those provided in the link in the OP; I wanted to see if I got the same thing using the "documented" steps (in case I had done something differently). Using the most recent version of UE. As always, let me know if you need anything else from me. Chris

DarthPathos · ‎02-18-2017

The MAX(pay) is just to select one item; you could've used MIN(pay) as well. Play around with it and see what I mean; if you need anything further, I'm around all day and will be checking in regularly 🙂 Good luck! Chris

DarthPathos · ‎02-18-2017

One reason why I don't like "dummy" code - you miss things that should be easy to catch! I've run this code and I think it's what you're looking for:

    proc sql;
        select min(a.id), a.act_no, a.date format=date9., min(a.time) format=time., max(b.pay)
        from work.A a, work.B b
        where a.id = b.id
        group by a.act_no, a.date
        order by a.act_no, a.date;
    quit;

Because the IDs and Pay are different for each of the act_no | date combinations, the MIN was picking up each one (because they were technically the minimum value for that grouping). I've tweaked the code so it gives you:

I realise 4 and 5 aren't in proper order; if you need the data sorted by ID, you can use PROC SQL; create table work.new_data as <SQL query>; quit; and then run PROC SORT on that table. Keeping my fingers crossed this is what you're looking for! Chris
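The "create a table, then sort it" suggestion in the reply above could look roughly like the sketch below. The table names work.A, work.B and work.new_data come from the post; the column aliases (first_id, first_time, max_pay) are hypothetical names added here so the sorted table has usable column names.

```sas
/* Sketch only: first_id, first_time and max_pay are illustrative aliases, */
/* not names from the original thread.                                     */
proc sql;
    create table work.new_data as
    select min(a.id)   as first_id,
           a.act_no,
           a.date      format=date9.,
           min(a.time) as first_time format=time.,
           max(b.pay)  as max_pay
    from work.A a, work.B b
    where a.id = b.id
    group by a.act_no, a.date;
quit;

/* Sort the saved result by the (minimum) ID within each grouping */
proc sort data=work.new_data;
    by first_id;
run;
```

Saving the query result first keeps the grouping logic and the final sort order separate, which makes each step easier to check on its own.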

DarthPathos · ‎02-18-2017

Apparently I've not had enough tea this morning - completely missed that! I think if you used something like this it should work:

    select var_a, var_b, var_c, min(var_d) as FirstInstance
    from work.have
    group by var_a, var_b, var_c
    order by var_a, var_b, var_c

That should give you what you need. If it still doesn't give you what you're expecting I'll log into my SAS and create your datasets. Good luck Chris

DarthPathos · ‎02-18-2017

Hi - It looks like your Output and your Data A are the same; I'm not sure what your goal is or what question you're trying to answer. I'm happy to help out, so I will watch for your reply. Chris

DarthPathos · ‎02-18-2017

Hi and apologies for no one replying sooner. I don't know anything about UTM data (all my GIS data comes to me in LAT/LONG format) but I did some searching and found a very old SAS paper that should be able to help: http://www.sascommunity.org/sugi/SUGI87/Sugi-12-89%20Chojnacky%20Tymcio.pdf . It's from 1987 but the math would still be right and the methods would still work. Your other option is to look at PROC GPROJECT - I saw people converting LAT/LONG to UTM, so I imagine it should be able to go the other way as well. Good luck and please post back what you end up using, I'm curious! Have a good day Chris

DarthPathos · ‎02-18-2017

Hi Peter, I'm trying this on UE (Google Chrome, MacBook, updated version of everything except VirtualBox). I'm getting similar behaviour but different symptoms. Dragging and dropping the second instance of CLASSFIT doesn't join as the documentation said it should, but my query runs (albeit with a full join). Odd, as I remember testing this when the functionality first came out and it worked. I'm going to tag @AmyP_sas in the hopes she has some ideas. I'll try on SAS Studio from my work computer today if I can and post back anything different. Thanks Chris

DarthPathos · ‎02-17-2017

Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option.

For this week's installment of Free Data Friday, I wanted to continue on the topic of messy data, this time around the idea of “filling in the blanks.” On Wednesday, as luck would have it, I had an analysis come up at work where I needed to expand the data set so we could get a more accurate picture of trends. My challenge was that it wasn’t just straight Year | Count data; there were multiple categories to take into account. I tried PROC EXPAND, but couldn’t get it to work. I reached out on social media and SAS Technical Support immediately got back to me with a link to an answer by @Ksharp that I was able to modify and use. I wanted to take his DATA step method and use it here on an amusing data set I found.

Get the data

I found this data set by searching “funny open data” and was intrigued by what turned up. The data comes from the City of Bristol in the UK and is an annual count of shopping carts found in the area’s rivers. I would love to see the business case put forward to track this! You can get the data here.

Get started with SAS OnDemand for Academics

In this 9-minute tutorial, SAS instructor @DomWeatherspoon shows you how to get your data into SAS OnDemand for Academics and other key steps.

Get the data ready

The data was already in CSV format, but I did modify the data set slightly:
1) I added an ID column grouped by the River.
2) I renamed the variable Number of Trolleys to Number_of_Trolleys.
Both of these could be done through SAS, but I did it in the base CSV as part of my pre-import data review.

The results

As usual, I ran my pre-analysis checks and noticed that there are a couple of years missing, in particular 2006. (What, did they not get funding that year to track this?) Because I’d eventually want to do time series analysis and possibly forecasting, I need the missing years to be present in the data. Because the data is broken down by River, PROC EXPAND doesn’t seem to work unless I split my data set into smaller ones by location. As I mentioned, @Ksharp answered a similar question, so let’s take a look at the DATA step I used:

    data import2;
        /* look ahead: pair each row with the next row's id and year */
        merge import
              import(firstobs=2 keep=id year rename=(year=_y id=_id));
        output;
        if id=_id then do;
            do i=year+1 to _y-1;
                year=i;
                Number_of_Trolleys=.;
                output;
            end;
        end;
        drop i _:;
    run;

Now, I’m no DATA step expert, but this appears to be a rather simple one. We’re taking the data set and merging it with a copy of itself that starts one row later, so each row can see the ID and Year of the row that follows it. Whenever there is a gap between the two years for the same ID, we write a new row for each missing year with the Number_of_Trolleys variable set to missing. From what I gather, I can provide my own calculation for an imputed value here, if I so choose.

Here’s my output before I run the DATA step:

And here it is with the missing year added in:

Now that I have a more complete data set, I can easily move on to deeper analyses. Just out of curiosity, let’s see if one river has more trolleys show up than the other:

For whatever reason, the River Frome (particularly in 2008) clearly had the highest volume. Not knowing anything about the area, I suppose possible reasons include more grocery stores, a denser population, or (based on the area I grew up in) a really good hill that kids use the carts to ride in. (Disclaimer: Do not attempt, it’s extremely dangerous!)
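To see the look-ahead merge from the article above in action without downloading anything, here is a minimal sketch on a toy data set. The rows are made up for illustration and are not the Bristol counts; only the variable names (id, year, river, Number_of_Trolleys) follow the article.

```sas
/* Toy data: invented values, sorted by id and year */
data work.import;
    input id year river $ Number_of_Trolleys;
    datalines;
1 2004 Avon 10
1 2005 Avon 12
1 2007 Avon 9
2 2005 Frome 20
2 2008 Frome 31
;
run;

data work.import2;
    /* Pair each row with the id and year of the row that follows it */
    merge work.import
          work.import(firstobs=2 keep=id year rename=(year=_y id=_id));
    output;                            /* always keep the original row     */
    if id = _id then do;               /* next row is for the same river   */
        do i = year + 1 to _y - 1;     /* one pass per missing year        */
            year = i;
            Number_of_Trolleys = .;    /* an imputed value could go here   */
            output;
        end;
    end;
    drop i _:;
run;

proc print data=work.import2;
run;
```

With these toy rows, the step adds 2006 for Avon and 2006 and 2007 for Frome, each with a missing count. The data must be sorted by id and year beforehand, because the merge only ever compares a row with its immediate successor.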

DarthPathos · ‎02-10-2017

@ballardw thanks for the real-world example! I have had similar experiences but nothing so dramatic - you definitely "win" the prize haha. Have a great weekend and chat soon!! Chris

DarthPathos · ‎02-10-2017

A friend has a poster that paraphrases a certain animated hunter who relentlessly pursued a certain wascally wabbit. “Be vewy vewy quiet," it reads. "I'm hunting outwiers!” It speaks to a great point: care and patience are absolutely critical as you track down outliers. Today's Free Data Friday post builds on a previous one -- How to Assess Messy Data: Step One - Data Review -- so that a closer examination of those data will prevent you from going down a rabbit hole. Outliers in your data are sneaky and elusive. You must handle them with finesse.

Get the data

As with my last post, How To Assess Messy Data: Step One - Data Review, I’m using data from the City of Edmonton, Alberta; it is a listing of the city-owned trees.

Get the data ready

The data was already in a format that I could use, as it was in standard CSV format.

The results

Now that we’ve done a review of the data, the next step is to find our outliers and figure out what we’re going to do with them. Outliers can be caused in many ways, for many reasons; they are not automatically “bad” and so require some investigation. I tend to classify outliers into three categories:

Level I – Clearly an error (for example, someone’s age coming up as 150 years old)
Level II – Likely an error (for example, a couple listed as having 10 kids under 6 years of age)
Level III – Potentially an error (for example, Harold J. Smith and Harry Joseph Smith with different addresses)

Each of the three levels requires some sort of investigation as to the cause; is it a data collection issue, an issue of transferring information from handwritten notes to electronic, or something more complex like a database join problem? I had one memorable experience when I was very new to analysis and I had a Level I outlier. It turned out the data was all correct, but when transferred from one system to another, some dates like 08-09-18 (September 18, 2008) were read as August 9, 2018 and so threw off all our forecasts. We didn’t realise this until we started doing a row-by-row comparison of the datasets.

So, turning our attention back to our Trees data, let’s see if we can find some outliers. First off, let’s take a look at the age of the trees:

    proc sql;
        create table work.TreesA as
        select ('03FEB2017'd-PLANTED_DATE)/365 as TreeAge
        from work.import;

        select min(TreeAge), max(TreeAge)
        from work.TreesA;
    quit;

Here’s the output, showing the Min and Max ages:

So clearly the -88 years is wrong. The 117-year-old tree(s), however, may be correct; if you recall from last week’s article, though, the date planted was December 1899. If you have ever been to Edmonton in winter, you know tree planting is not something you’d want to do, especially without the heavy machinery that would not have been available in 1899. When I use the Histogram task to see what the distribution looks like, I get this:

So from the data, about 70% of the trees were planted roughly 25 years ago. Without knowing the history of Edmonton, I would guess that there was a population boom and so many new houses were built, and therefore new trees planted.

What happens when I remove the outliers from the dataset (yes, I could have used the WHERE clause in the Histogram task, but it’s a personal quirk of mine that I like having separate tables when I’m doing my data review)?

    proc sql;
        create table work.TreesB as
        select ('03FEB2017'd-PLANTED_DATE)/365 as TreeAge
        from work.import
        where ('03FEB2017'd-PLANTED_DATE)/365 < 100
          and ('03FEB2017'd-PLANTED_DATE)/365 > 0;
    quit;

We now have a better sense of the data that can potentially be used in our analyses.

Next, I want to make sure that the geolocation data makes sense; although SAS University Edition doesn’t have mapping capabilities, there are still some interesting things you can do (topics for another time). Here’s a scatterplot of the data:

In looking at the Latitudes / Longitudes in Google Earth, I realised that the -113.7 was North, so the scatterplot is upside down. Realising that, you can see the path of the river that cuts through the middle of the city (bottom left through to the top right). There are a couple of potential outliers, but nothing that sets off warning bells.

Although it’s time consuming, doing a thorough review of your outliers gives you a better understanding of the data. Questioning the users and database administrators about the data that falls into the three categories not only helps you as an analyst to understand the business processes that go into the data collection, but also allows you to provide better insights for the people making decisions based on your reports.

Now it’s your turn! Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
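As a complement to dropping the out-of-range ages, the sketch below flags them instead, so the suspicious rows stay available for follow-up with the data owners. It assumes, as the article's code does, that work.import has a numeric SAS date variable PLANTED_DATE; the table name work.TreesFlagged, the variable AgeFlag and its labels are hypothetical.

```sas
/* Sketch only: TreesFlagged, AgeFlag and the label text are illustrative */
data work.TreesFlagged;
    set work.import;
    length AgeFlag $ 16;
    /* Check for missing dates first: in SAS a missing value compares */
    /* as lower than any number, so it would otherwise look negative. */
    if missing(PLANTED_DATE) then AgeFlag = 'Missing date';
    else do;
        TreeAge = ('03FEB2017'd - PLANTED_DATE) / 365;
        if TreeAge <= 0 then AgeFlag = 'Zero or negative';
        else if TreeAge >= 100 then AgeFlag = 'Review (100+)';
        else AgeFlag = 'OK';
    end;
run;

/* How many trees fall into each group? */
proc freq data=work.TreesFlagged;
    tables AgeFlag / missing;
run;
```

The cut-offs mirror the WHERE clause in the article (greater than 0 and less than 100 years); rows flagged 'OK' are the ones that query would keep.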

DarthPathos · ‎02-09-2017

This is fantastic, thanks so much for following up! Unfortunately I won't be able to attend SGF but will be following up with the authors. Have a great evening and chat soon 🙂 Chris

DarthPathos · ‎01-20-2017

@rogerjdeangelis - wow, that's impressive. I've been working on a "toolbelt" of code snippets, but each job I've been at requires different data checks. Will take a look at yours, thanks so much for sharing!! Chat soon Chris

DarthPathos · ‎01-20-2017

I recommend the paper here, specifically the sections on the INPUT and INTCK functions. As @ballardw indicated, I'm not sure how you get 1 from the dates given, but INPUT will convert the text to a proper SAS date value and INTCK will calculate the difference. Good luck Chris
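A minimal sketch of the INPUT and INTCK combination mentioned above. The two date strings and their DDMMMYYYY layout are assumptions for illustration, since the original poster's values are not shown here; a different layout would need a different informat.

```sas
/* Hypothetical example values; only INPUT and INTCK come from the reply */
data work.date_diff;
    start_txt = '15JAN2017';
    end_txt   = '20JAN2017';

    /* INPUT reads the text with the DATE9. informat and returns a SAS date */
    start_dt = input(start_txt, date9.);
    end_dt   = input(end_txt, date9.);

    /* INTCK counts the day boundaries crossed between the two dates (5 here) */
    days_between = intck('day', start_dt, end_dt);

    format start_dt end_dt date9.;
run;

proc print data=work.date_diff;
run;
```

Swapping 'day' for 'month' or 'year' changes the interval being counted; note that INTCK counts boundaries crossed, not elapsed time, so 31DEC2016 to 01JAN2017 is one 'year' by default.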

DarthPathos · ‎01-20-2017

Hi Kriss, Happy Friday 🙂 I'm trying to think of examples when I'd want additional, non-KM lines on a KM plot and I must admit I can't think of any. I've poked around my SAS books and other Data Viz / Time Series books and I can't find any examples of what you're describing; based on the nature of the graph, I would say that the data displayed is limited to KM survival analysis. If you can provide more detail about what other information you'd want to plot, I (or more than likely someone else) may be able to help. Chris

DarthPathos · ‎01-20-2017

I was having an interesting conversation with friends earlier this week and the topic of “data checks you should do before you do anything else” came up. I thought it was an appropriate topic for this week’s Free Data Friday post. A well-done data review can save you a lot of frustration, project delays, and back-and-forth emails / phone calls.

Get the Data

I wanted to look at data that was “messy” – missing values, bad data, etc.; one outcome of the increased interest in Open Data is that organizations are now becoming more aware of their data and cleaning it before it is published. After a lot of digging around (weather, sports, etc.) I found a dataset that met my requirements. It’s from the City of Edmonton Open Data Portal and is a listing of the city-owned trees.

Getting the data ready

The data was already in a format that I could use, as it was in standard CSV format.

The Results

The first thing I do after importing the data (assuming no errors that indicate a review of the raw data is needed) is run PROC CONTENTS...

    PROC CONTENTS DATA=WORK.IMPORT;
    RUN;

...which produces three tables, but I want to highlight two of them:

The first screenshot shows some rather boring information, but I’ve highlighted the key pieces. Knowing the number of observations lets me decide whether, when doing my analyses and data review, I want to limit my output to a smaller data set. For example, if I want to review the data that is missing Latitude or Longitude, I don’t necessarily need to see all the data – I am just looking for a pattern, and a sample of about 10% should give me that.

The second screenshot is the listing of the variables and their formats; this is something I put into Excel and have open as I’m writing my SAS code. Knowing this information is obviously important, and when working with a data set that has 3000 variables, I can’t remember all of their names and formats.

My next step is typically to run the Characterise Data task, specifically focussing on my Date and Numeric data; in this example, I’m only running the task on Diameter_Breast_Height, Condition_Percent, and Planted_Date:

Nothing really interesting except when I scroll to the bottom, and see this:

I’m suspicious of the 1899 date (planting a tree in December in Edmonton? I really don’t think so!) and the 2105 date is clearly a typo. It’ll be up to you and the person requesting the analysis to decide how to handle these.

My next task is to look at the missing data; I’m only looking at Categorical Data for this example. When I run the task I get this:

I have 3.2% of my SPECIES data missing; that may not be an issue, or if that’s the key variable, that may be enough to trigger a review of the data collection process.

I’ve decided in my review that I need more information on the SPECIES variable, so I’m going to go back to the Characterise Data task and run it specifically on this one column. I’ve truncated the output to highlight the three SPECIES I wanted to focus on.

For some reason, Cherry and Plum have brackets around them; that may be to have them sort to the top of the list, or for some other reason. The more (potentially) concerning fact is that there are almost 21,000 trees whose species is listed as “X”. Combined with the missing data, that’s almost 10% of the total; even if this column is going to be a secondary part of your analysis, it should be investigated before you get too much further.

The last piece I wanted to show was a Box Plot; these are extremely handy as they show your Mean, Median, and Interquartile Range all on a single output. I’m running my graph comparing the GENUS to the DIAMETER_BREAST_HEIGHT:

Here’s the output:

The first thing I see is that there are some clear outliers that I need to investigate; a diameter of >200cm seems to be a good benchmark overall, and there may be a couple of others I’d want to review (for example, the PICEA at 150cm seems to stick out). I’m also seeing that there are a number of “Unassigned” trees, and those should be pulled out of the data as well.

Context is everything

I should stress that this is neither an absolute nor a complete list; your data review will depend highly on what you're looking at. Missing data may be fine in genomic data, but not if you're doing housing-market analysis. "Unknown" may be appropriate if you're dealing with a catalogue of astronomical phenomena, but not when dealing with the diagnosis of a patient in healthcare data.

Finally, I’d like to recommend three fairly slim books that can set you up for success with your data analysis. I have a very strong 80/20 rule when it comes to data – I spend 80% of my time reviewing, exploring and understanding the data, and 20% of the time actually writing the code. These books all embrace that philosophy:

Data Management for Researchers by Kristin Briney (non-SAS, but very good and covers a wide range of topics such as managing your files, storing / documenting code, etc.)
Data Analysis Plans: A Blueprint for Success Using SAS by Kathleen Jablonski and Mark Guagliardo
A Recipe for Success Using SAS University Edition: How to Plan Your First Analytics Project by Sharon Jones

Now it’s your turn! Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
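Two of the checks described in the article above, drawing a roughly 10% sample and counting missing SPECIES values, can also be done in code rather than through the point-and-click tasks. This is a sketch under assumptions: work.import and SPECIES come from the article, while work.import_sample, the seed value and the $missfmt format are hypothetical, SPECIES is assumed to be a character variable, and PROC SURVEYSELECT requires SAS/STAT to be available in your environment.

```sas
/* Roughly 10% simple random sample; the seed is arbitrary and only */
/* there so the same rows are drawn on every run.                   */
proc surveyselect data=work.import out=work.import_sample
                  method=srs samprate=0.10 seed=20170120;
run;

/* Collapse SPECIES into two groups: blank vs anything else */
proc format;
    value $missfmt ' '   = 'Missing'
                   other = 'Not missing';
run;

/* Frequency and percentage of missing vs non-missing SPECIES */
proc freq data=work.import;
    tables SPECIES / missing;
    format SPECIES $missfmt.;
run;
```

Applying the format inside PROC FREQ leaves the stored values untouched; removing the FORMAT statement gives the full per-species breakdown, which is where values such as "X" would show up.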
