About Vince28_Statcan

Vince28_Statcan · ‎01-03-2014

PROC TRANSPOSE creates the new variables in the order in which the values of the variable(s) used to name them are first encountered. There are different approaches to the problem, depending on how large the file is and how long it takes to process.

You could add a "dummy" record at the very beginning of your untransposed file that lists the desired dates in descending order, instead of the ascending order in which the file naturally gets updated. You could also proc sort descending, or order by desc, on your date variable before transposing. Finally, you could take the transposed dataset and create a new dataset with the variables in a different order after the fact. This can be done with some macros, the dictionary tables and a retain statement, or really anything that makes the data step encounter the variable names in the order you want before the set statement (a keep statement on its own does not change the variable order).

I will only give an example of this last case, as it is the only one that requires some more complex programming statements.

proc sql;
    select name into :dt1-:dt9999 /* generic upper boundary on the number of date variables - only the necessary macro vars will be created */
    from sashelp.vcolumn
    where libname="WORK" /* or whatever library your NOW dataset is stored in */
      and memname = "NOW"
      and prxmatch("m/^[A-Za-z]{3}[0-9]{4}/", name)>0
      /* validates that the variable name follows the 3 char 4 digit pattern. If you have other odd
         variable names, you could be more specific and build a list of 3 letter prefixes corresponding
         to months, or do further year processing in the regular expression. I assumed it was not required */
    ;
quit;
%let ndates=&sqlobs; /* to be able to loop on dt1-dt&sqlobs. after other subsequent proc sql */

proc sql;
    select name into :leads separated by ' '
    from sashelp.vcolumn
    where libname="WORK" /* or whatever library your NOW dataset is stored in */
      and memname = "NOW"
      and prxmatch("m/^[A-Za-z]{3}[0-9]{4}/", name)=0
      /* other non-matching variables will be put in front of all date vars. You can change this rule
         if only a few specific variables are to be leads; that is merely my assumption of your intentions */
    ;
quit;

%macro write();
data NEXT_MONTH;
    retain &leads.
    %do i=&ndates. %to 1 %by -1;
        &&dt&i..
    %end;
    ; /* to close the retain statement */
    set NOW;
run;
%mend;
%write();

This is all untested, so there might be some syntax errors, but the idea is all there. Please note that SASHELP.VCOLUMN is naturally sorted by column number, hence the reverse %do loop. There are alternative approaches, like ordering "descending" inside the proc sql and then doing a regular %do loop.
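The "order inside the proc sql" alternative mentioned above could look roughly like the sketch below. It is untested and rests on assumptions: the table is WORK.NOW, the date variables follow the 3-letter/4-digit naming pattern, and at least one non-date variable exists (otherwise the hypothetical macro variable `leads` is never created). It collects both lists as space-separated strings, so no macro loop is needed.

```sas
proc sql noprint;
    /* Non-date variables, in their current column order */
    select name into :leads separated by ' '
    from sashelp.vcolumn
    where libname = "WORK" and memname = "NOW"
      and prxmatch("/^[A-Za-z]{3}[0-9]{4}/", name) = 0
    order by varnum;

    /* Date variables, listed from the last column back to the first */
    select name into :dates separated by ' '
    from sashelp.vcolumn
    where libname = "WORK" and memname = "NOW"
      and prxmatch("/^[A-Za-z]{3}[0-9]{4}/", name) > 0
    order by varnum desc;
quit;

data NEXT_MONTH;
    /* RETAIN before SET fixes the variable order at compile time;
       the values themselves still come from NOW */
    retain &leads. &dates.;
    set NOW;
run;
```

Reversing by `varnum` only yields descending dates if the transposed columns are currently in ascending date order, which is the situation described in the post.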

Vince28_Statcan · ‎01-03-2014

The issue, although it may also be the desired result since it was not specified, is that if you have 2 distinct PI_NO with the same x=max(x) value, that proc sql approach will output both. Similarly, depending on the size of your data, using proc summary instead of a data step or sql solution could take far more time than it should. Again, depending on the size of your data and whether it is already sorted by HH_NO, there are hash solutions, BY-group processing with point=, and others that could give you faster processing and solve the potential duplicate HH_NO issue of the proc sql approach. So if the solutions above run too slowly or don't fully meet your requirements, I will gladly provide either a hash or a BY-processing solution. In case your dataset is not that large and the above worked, I'll wait for a reply before providing an alternative.
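As a rough illustration of the BY-group idea (not the hash or point= variants), here is a minimal sketch. The dataset names `have`, `sorted` and `want` are hypothetical; only HH_NO and x come from the discussion. It keeps exactly one row per HH_NO, so ties on the maximum no longer produce duplicates.

```sas
/* Sort so that the largest x comes first within each household */
proc sort data=have out=sorted;
    by HH_NO descending x;
run;

data want;
    set sorted;
    by HH_NO;
    if first.HH_NO;   /* keep only the first row of each HH_NO group */
run;
```

When several PI_NO share the maximum, the row kept is simply the first one in sorted order, so an extra sort key would be needed if a specific tie-break matters.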

Vince28_Statcan · ‎01-03-2014

Building on an appropriate input() is typically better than the first solution, as multiplying a character value by a numeric one writes a conversion message to the log, and those should be avoided in any large process. Could you provide a datalines example with data that does not get read properly by solution #2? Odds are you can apply some simple string function to WERT inside the input function to make it readable. Likely strip to remove blanks, or possibly the same solution you've mentioned, using an alternative informat after removal of periods.
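A minimal sketch of that idea, under assumptions: the sample values and the names `have`, `want` and `wert_num` are made up, and COMMAX is only one candidate informat (it treats the period as a thousands separator and the comma as the decimal mark). The `??` modifier keeps the log clean and leaves the result missing when a value cannot be read.

```sas
data have;                      /* hypothetical sample values */
    input WERT $20.;
    datalines;
12
1.234,5
abc
;
run;

data want;
    set have;
    /* strip() removes leading and trailing blanks before the conversion */
    wert_num = input(strip(WERT), ?? commax32.);
run;
```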

Vince28_Statcan · ‎01-03-2014

Well, since you are stuck with SAS9.1.3, I'm afraid you will have to stick to the data step approach. You should be able to use macros and dictionary tables to at least relieve some of the coding burden of your table with 100 columns. I'm not knowledgeable enough about proc template to tell you with certainty whether it can be achieved or not, but realistically, I don't see how it could be faster than a data step with a single read of the data.

I spent a bit of time looking at the proc template documentation yesterday, as I had been meaning to learn the basics of it for a long time but could never find a project that required it. It turns out that even if you were to define a tagset specific to your task in proc template and then use proc print to do the output, proc print still has a lot of additional processing going on that does not occur in a data step. There are tons of events that drive the output layout, and since proc template is really meant to be a visual output/reporting/graphing tool much more than a flat file output tool, any custom template reuses the ODS-driven events that are bound to the procedures with output. In short, even if you did spend the time to build a custom tagset, it should still be slower than the data step solution you posted originally, as it will necessarily involve a lot of additional dynamic processing of ODS-driven procedure events.

Here's the passage in "What's new in SAS9.2 XML libname engine" discussing the new XMLMap feature for export purposes (it worked for imports with the XML libname engine, but export was only implemented for engine XML92):

New XMLMap Functionality
XMLMap functionality has the following enhancements for the XML92 engine nickname: You can now export an XML document from a SAS data set using the XMLMap that was created to import the XML document. The XMLMap tells the XML engine how to map the SAS format (variables and observations) into the specific XML document structure. See Exporting XML Documents Using an XMLMap.

Vince28_Statcan · ‎01-02-2014

Alright, I got to give it a go earlier than expected. The linked files are those relevant to your objectives. This was done in SAS9.2, and I don't think the xml engine was fully functional before then (at least not the xml92 engine). It was further improved in SAS9.3 if you need to create multiple files, but I don't have it installed, so you'd need to look at the documentation. Otherwise, you would need to create 1 libname statement per desired output file (even if they use the same map). From reading about it a while back, I believe the SAS9.3 improvements help alleviate that at least for input, maybe not for output though.

Please notice the "thisnamemustmatch" both in the SAS code and in the map file. The <TABLE name="thisnamemustmatch"> tag in the map HAS to be the same as the output data set name that you use with your xml engine, as otherwise it causes a client crash. You can name it however you want - it does not appear in the output xml anyway.

To provide some brief indications on how to write the map and adapt it to your desired solution: TABLE-PATH is essentially what dictates how often records are created for extraction, and thus conversely affects how the xml output appears. For example, test the same code with the map only altering <TABLE-PATH syntax="XPath">/Table/RECORD</TABLE-PATH> to <TABLE-PATH syntax="XPath">/Table</TABLE-PATH> and see the difference in the output xml file.

As well, suppose you have hierarchical data that has columns "carried over" - think of a household survey where you have one tag with all household-level data and then many subtags with person-level data. If you were to create a person-level table that keeps all the household-level data for each record, you'd have to add the retain="YES" attribute to all of the COLUMN tags in the map that use household-level data. However, for export, which is what your original request was, this is somewhat moot given the requirement of 2D tables. You would've needed to merge a household table with a person table into a single table with repeated household data for each person before exporting.

Also, I don't think it is possible to use _N_ directly as a column to export. However, unless the existing sort order is absolutely essential to save, it is fairly typical not even to bother with an instance= attribute or the like, as it is not required to import the data back into SAS. In particular, even if you did not create an instance= attribute, you could create a new rowid at extraction time using <INCREMENT-PATH> subtags of the <COLUMN> tag in the map. That is, the xml engine is able to mimic a _N_ column at extraction time, and since it reads and writes the data sequentially, the numbers would be the same.

Hope this helps and that my comments aren't all too confusing.
Vincent

Vince28_Statcan · ‎01-02-2014

Hi Yura, With this change, I believe it can be achieved through the xml libname engine and an appropriately defined .map file. I'll take a look at it during lunch break or if I am waiting on a program running through the day.

Vince28_Statcan · ‎12-31-2013

Not that this will make or break the feasibility with the xml engine or proc template, but is there a specific reason for your xml tag naming? The way you are building RECORD1, RECORD2, etc. is absolutely counterintuitive to the entire purpose of XML. It would be far more logical to have something such as <RECORD instance="1">subtags</RECORD>, for example. That is, use tag attributes to increment, so that neither reading nor writing needs to do funky string manipulation on tag names to transfer the data around.

Vince28_Statcan · ‎12-24-2013

In a similar vein to DN and Tom: had I been asked this question in an interview with my current knowledge of SAS, my answer would be straightforward: I will assume this is a trick question. Whether this can be achieved through macros or not is irrelevant. There are significantly better and more efficient tools for this task than macros, and in particular, any strictly macro solution for this task should be avoided.

As for a bad answer to a trick question, for the sole sake of showing skills with SAS macros:

%macro really();
proc sql;
    select distinct subject into :subj1-:subj999999
    from t;
quit;
%let nbyvar=&sqlobs;

data _null_;
    set t;
    call symput(cats('visit', '_', subject, '_', visit), '0'); /* the actual value put in the macro var is irrelevant, we only need it to exist. */
run;

proc sql;
    select visit into :v1-:v999999
    from r;
quit;
%let nvisit=&sqlobs;

data want;
    %do i=1 %to &nbyvar;
        %do j=1 %to &nvisit;
            /* look up the i-th subject value and the j-th visit value rather than the loop indices */
            %if %symexist(visit_&&subj&i.._&&v&j..) %then %do;
                subject="&&subj&i.."; visit=&&v&j..; missingvisit=.; output;
            %end;
            %else %do;
                subject="&&subj&i.."; visit=.; missingvisit=&&v&j..; output;
            %end;
        %end;
    %end;
run;
%mend;
%really();

Don't get me wrong, it's rarely the best choice to argue against a question. But if the rationale as to why macros shouldn't be used to solve this is accompanied by something like "but I assume this is merely for the sake of seeing if I am comfortable with handling macro variables and logic", then the above is an example solution that uses about as many macros as can be. It is probably wiser to demonstrate to the interviewer that not only do you know macros well, but you also know when not to macro~

Vince
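For contrast, one macro-free way to get at the same information might look like the sketch below. It assumes, as the macro code does, that table `t` holds one row per subject/visit actually made and that `r` lists the expected visits; the output name `want_sql` and the column `missing_flag` are hypothetical, and the layout differs slightly from the macro version (one flag column instead of two visit columns).

```sas
proc sql;
    create table want_sql as
    select g.subject,
           g.visit,
           /* 1 when the subject has no row in T for this expected visit */
           (t.visit is missing) as missing_flag
    from (select distinct a.subject, r.visit
          from t as a, r) as g            /* every subject paired with every expected visit */
         left join
         (select distinct subject, visit from t) as t
           on  t.subject = g.subject
           and t.visit   = g.visit
    order by g.subject, g.visit;
quit;
```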

Vince28_Statcan · ‎12-24-2013

Well, I thought this would take me forever, but I somehow nailed it on the first pass, although I might be overlooking something. Anyway, here's the solution - sorry for the copy/paste clutter. This may not be the absolute best case to demonstrate hash of hashes, but nonetheless it worked out alright.

In the event that your data is already sorted by ID but not subsequently by SCORE (as then using first. logic and a retained counter would be far more efficient than hashing), it is possible to simplify it a lot by constantly reusing the same inner/hinner, with .clear() or .delete() and a new declare for each by group (this can be managed with first. and last.). It would obviously reduce memory consumption significantly, since you only keep a small subset of data in the hash and output it through a loop with the hiterator upon last.ID type of logic.

As for the rationale why it can be worthwhile to learn hashing hashes: if you have data with many data points where a large subset of the variables retain the same values, then by using hash of hashes you allocate memory only once for all the data in the outer hash, and only the varying portion, however small, n times in the inner hash object. So for instance, say you had pacemaker data (bear with me here, I don't work with health statistics but it's the first thing that came to mind) with millions of timestamps, with some data elements about vital signs provided by the pacemaker at each timestamp, but you also needed to carry tons of invariants or little variants (age, sex, marital status, working status, etc.) as well as some slightly more variant data like medical visits and the additional health measures done at each such visit. Well, you could play with 3 hashes: the outer one having the age/sex/etc., the middle one with the medical visits data, and the inner one with timestamps+pacemaker data. So you might have 1M timestamps in total for a single person, but you only use memory once for all of his invariants, say 12 times (once a year) for his little variants, and then the 1M timestamps with just a few variables take up the bulk of the memory.

On top of that, this allows you to do some funky data manipulation, as you have 3 layers of searches. For instance, if you have twins that have a specific twin_id, you might be interested in adding a 4th layer of hashing with just twin ID as key and data, and loop over twin pairs to track events, then use the fourth (innermost) hash object to see if the twin had a similar event shortly before or shortly after the timestamp where it occurred for the first twin.

Anyway, I'm digressing here, but even though they are quite complex to code and especially hard to transfer or exchange with colleagues, as hashing is still foreign ground for most SAS end-users, there are many niches left to be explored where hash objects, and in particular the capacity to store other objects as data in hash objects, can significantly improve quality of life.

%Let n_highest=3;
%let hashexp=%sysfunc(ceil(%sysfunc(sqrt(&n_highest.))));

Data have;
Input ID $ Score;
Datalines;
A.B. 10
A.B. 10
A.B. 5
A.B. 8
A.B. 10
A.B. 7
A.B. 23
A.B. 10
K.L. 9
K.L. 12
K.L. 11
K.L. 11
K.L. 11
K.L. 2
K.L. 9
K.L. 7
;
Run;

data want2;
length id $8. score 8. counter 8.;
If _n_=1 Then Do;
    declare hash inner; /* not yet instantiated */
    declare hiter hinner; /* not yet instantiated */
    declare hash outer(Ordered:'a', multidata: 'n'); /* one entry per ID */
    declare hiter houter('outer'); /* hash iterator object declared on hash object OUTER */
    outer.DefineKey('ID');
    outer.DefineData('ID','inner', 'hinner');
    outer.DefineDone();
end;
set have end=last;
if outer.find() NE 0 then do; /* ID does not exist, create its own new hash object and iterator to track scores */
    inner = _new_ hash(ordered: 'y', multidata: 'n', hashexp: &hashexp.); /* instantiating / multidata:'n' is default but stated for clarity; we use the counter variable instead of an additional hash to mimic a NUM_KEYS attribute */
    hinner = _new_ hiter('inner'); /* instantiating */
    inner.definekey('score');
    inner.definedata('score', 'counter');
    inner.definedone();
    outer.add(); /* add the ID and the related inner objects to the outer hash object */
    counter=1; /* initiate inner counter variable */
    inner.add(); /* add the score and counter to the inner object */
end;
else do; /* else, an inner object already exists for this ID */
    if inner.find()=0 then do; /* if the score exists, increment its counter variable and replace. Otherwise, add it and handle the possibility of 4 distinct scores now existing in the object */
        counter=counter+1;
        inner.replace();
    end;
    else do;
        counter=1;
        inner.add();
        if inner.num_items>&n_highest. then do;
            hinner.first(); /* set pointer to first aka lowest score due to sorting order */
            rc=hinner.prev(); /* clear the pointer in an idiotic way so that the lowest key can be removed - this way inner hiterators will also have null pointers for the output at the end, so no need to do special handling of .first() instead of .next() to reinitialize */
            rc=inner.remove();
        end;
    end;
end;
/* hash logic is built, now we need a method to output all of this data. */
if last then do;
    do while(houter.next()=0); /* loop on all IDs */
        do while(hinner.next()=0); /* loop on all scores for that ID */
            do i=1 to counter; /* use our counter variable to recreate the appropriate number of records for an ID/Score pair */
                output;
            end;
        end;
    end;
end;
drop rc i counter;
run;

Cheers and wish you great holidays!
Vince
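For the case where sorting is affordable, the first./retained-counter logic alluded to above can be sketched as follows. This is an illustration under assumptions, not the hash approach: `sorted`, `want_by` and `score_rank` are hypothetical names, and the data is sorted by ID and descending Score first.

```sas
%let n_highest = 3;

proc sort data=have out=sorted;
    by ID descending Score;
run;

data want_by;
    set sorted;
    by ID descending Score;
    if first.ID then score_rank = 0;      /* restart the distinct-score count for each ID */
    if first.Score then score_rank + 1;   /* sum statement, so score_rank is retained */
    if score_rank <= &n_highest.;         /* keep all rows of the top N distinct scores */
    drop score_rank;
run;
```

Unlike the hash version, the rows come out in descending score order within each ID, and the cost of the sort is what the hash solution is trying to avoid on large inputs.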

Vince28_Statcan · ‎12-23-2013

DDE has not been supported since 2003, maybe even earlier than that, and is dying out with 64-bit platforms and remote submission environments like EG or grid. Nonetheless, for local processes it can still achieve the desired results more easily and faster than some other means. However, if you are looking for an expandable solution that will work for a long time, I would stay away from DDE. If you have VBA knowledge, odds are it would be easier to do the opposite. I suggest you give a read to https://communities.sas.com/thread/9841

I'd love to provide more insight, but I've really only played with DDE myself and have not had anything of large enough scale to get into VB-SAS interaction.

Vince

*edit Reeza -> Like!

Vince28_Statcan · ‎12-23-2013

I believe it could be reduced in number of hashes, number of statements and logic by using hash of hashes, but that adds an additional layer of complexity for code transferability (e.g. you leave your current position and someone has to pick up from it; if he's not already familiar with the hash object, he will struggle far more to figure out what the code does with hash of hashes than with the current implementation). I'll gladly take a stab at it and provide the code if you request it. But otherwise I'll assume it is not needed. I don't think it would be significantly faster than the current implementation either. I was actually amazed when I first read Black Belt Hashigana by Paul Dorfman about hash of hashes.

Vince

*edit* In practice, there would actually be n+1 hash objects, where n is the number of distinct IDs, but programmatically speaking, they would just be new instances of a single hash, so only 2 hash objects would need to be defined. One uses the current approach of instantiating upon declaration (followed by (), empty or with options like dataset: or ordered: ). The other ones would be built through a loop. The conditional logic would also be simpler with hash of hashes. Only the gist of understanding how multiple instances of a hash object, all with the same name, are handled is what would make the code difficult to understand.

*edit* Furthermore, from the perspective of wanting to use hash objects with complex conditional searches and/or with reduced memory usage, hash of hashes allows you to keep a by-variable only once in memory instead of once for each of however many multidata rows exist. It is really only worthwhile if you have big multidata groups and/or may want to search according to different keys based on the success of the first key search.

Vince28_Statcan · ‎12-20-2013

I did indeed simply assume that he might want to achieve both and not either/or, hence programming both within a single data step to save on what is the longest processing element, which is reading the data through the set statement.

I don't think the output method supports dataset options like OBS=3 for output, so you'd need to replace data _null_ with data want1 and use the iterator on the 2nd hash object to retrieve just the top three values. In fact, this would also solve the bug you've pointed out with my want1 output. However, I still think there's a lot to learn from playing around with hashing and the 2-hash solution for want2. It also solves the bug exposed by your own dataset for your want2 output and detailed below.

The issue with not using a double hash object for WANT2 is that hash objects have no num_keys property for multidata. The data set you've given as an example above also fails for your want2 because the threshold for adding new keys is _N_=3, so if there are only 1 or 2 keys across the first 3 lines of the input dataset, the output will only have one or two keys as well (and however many records share these keys). Still, it means that you have to consider the odd cases of your data in the "initialization phase" of the object. For most scenarios, and in particular the one exemplified here, it is fairly simple: you can just create a new counter variable, increment it if find() fails on the hash prior to addition, and then replace the _N_ by that new counter variable.

The main reason I went with the other approach is that it is completely independent of initialization. It uses NUM_ITEMS on the "KT" (key table) hash as a means to mimic the missing NUM_KEYS property on the object I really care about. If a NUM_KEYS property existed, it could've been done in a single hash as well, while still using the sorting power of the hash to avoid having to manually code a bunch of additional variables to keep track of what to add/remove. You simply systematically add to your hash object and remove the last key (and all data associated with the key) if NUM_KEYS is above your set threshold.

This is also one of the beauties of the hash object if you ever have to manipulate timestamp data from a system (like a voice recognition system, electronic questionnaire, transaction records, etc.). You can use hash of hashes to rebuild the sequence of events regardless of any existing sort order, and the internal instances of the hashes act as BY-groups sorted by the timestamp, so you can iterate through and go back and forth if need be for decision logic, and output a single row, all within a single data step instead of a couple of sorts. Anyway, I'm digressing, because I did have to work with oddly sorted, massive call system data not so long ago, where the hash knowledge I have now would've made my life so much easier (and I still have a lot of twists to learn).

But for all I know, even though the hash object was significantly improved in 9.2, judging from all the documentation I could find, it is still missing quite a few tools:

1. A way to clear the hash iterator pointer so as to not lock remove() from happening
2. A hash iterator tool that allows removing the current element (not the entire key but just the data element - similar to removedup but less tedious to use - it would also be extremely useful for traversing the hash object and removing rows after sorting or other constructs have happened)
3. A NUM_KEYS property

Some might exist and be undocumented, but I haven't found any paper discussing any such issues. Nonetheless, very interesting discussion so far.

Vince28_Statcan · ‎12-18-2013

It is possible to use DEVICE=ACTIVEX or JAVA as a goption, so that the data points used to generate the graph and the graph attributes (Java object or ActiveX control specific values) are written into the html file itself rather than into a separate image file. The drawback is that this gives users access to the data points, or even raw data, through the object or in the raw html file. It also does not support all styles and options, but there's typically a way around that to get the desired output. The user does not need to have SAS installed, as there is a prompt to download the appropriate ActiveX control from the SAS website (default path) to view the graphs. Alternatively, if you work on a closed network, you could download a copy of the ActiveX EXE from sas.com, store it on a shared location and use an option (I forget which) to change the default path, so that users lacking the ActiveX control on their machine are prompted to download and install it from there. Anyway, this is really the only alternative if you want to keep on using html. As Ballard pointed out, it is in the nature of HTML to need access to the image locations.
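A minimal sketch of what that looks like in practice, assuming SAS/GRAPH is licensed. The output file name is hypothetical and `sashelp.class` is only a stand-in dataset; the option for redirecting the control download location is deliberately left out since it is not named above.

```sas
ods html file="graphs.html";     /* hypothetical output file name */
goptions device=activex;         /* or device=java */

proc gchart data=sashelp.class;
    vbar age / discrete;         /* one bar per distinct age */
run;
quit;

ods html close;
```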

Vince28_Statcan · ‎12-18-2013

As mentioned, here's an alternative that will produce both outputs within a single pass of the data, effectively saving you a large amount of I/O, since the reason why you didn't load the entire table in a hash was memory.

data _null_;
if 0 then set have; /* avoid length statements */
If _n_=1 Then Do;
    Declare Hash ht1(Ordered:'d', multidata: 'Y', hashexp: 2);
    Declare Hiter hi1('ht1'); /* hash iterator object declared on hash object HT1 */
    ht1.DefineKey('Score');
    ht1.DefineData('Name','Score');
    ht1.DefineDone();
    declare hash ht2(ordered:'d', multidata: 'Y', hashexp: 2);
    declare hiter hi2('ht2');
    ht2.DefineKey('Score');
    ht2.DefineData('Name','Score');
    ht2.DefineDone();
    declare hash kt2(ordered: 'd', hashexp: 2);
    declare hiter ki2('kt2');
    kt2.DefineKey('Score');
    kt2.DefineData('Score');
    kt2.DefineDone();
End;
set have end=done;
ht1.add();
kt2.replace();
ht2.add();
if ht1.num_items>3 then do;
    hi1.last();
    rc=hi1.next();
    ht1.remove();
end;
if kt2.num_items>3 then do;
    ki2.last();
    rc=ki2.next();
    kt2.remove();
    ht2.remove();
end;
if done then do;
    ht1.output(dataset: "want1");
    ht2.output(dataset: "want2");
end;
run;

On a side note, I was very disappointed to realize that there is no documented method or trick to set the hash iterator pointer to NULL after using it to retrieve the last or first item for the purpose of removing data. Does anyone know a trick that is more straightforward, especially when it comes to reading the code, than having to use next() after a last() to clear the pointer without changing PDV values, since you ought to reuse the same keys?

The approach above saves you from having to declare, and call missing on every data step loop, a full subset of variables with the same names as those of the set itself. It relies solely on the sort order within the hash and an iterator to remove data. But because there is no way to set an iterator pointer to NULL, I had to waste PDV space and a bunch of useless iterations on RC=iterator.next(), as without the RC= it creates an error (the data step still processes). Plus, it does not feel natural to have to tell a pointer on the known last element of a sequence to point to the next element so as to stop it from locking the removal of data/key elements. Anyway, end of the rant about the iterator having no method to change the pointer position without affecting the PDV, other than next on a last or prev on a first.

Vince28_Statcan · ‎12-18-2013

I see. Well, obviously my solution makes a single pass of the data instead of two, cutting I/O by about half (well, only input, but that is still a large chunk of your total I/O). However, loading a full hash object isn't always manageable. It is possible, though, to output both want1 and want2 in a single read of the data using 3 hash objects, each with at most 4 distinct key values at any given time, so with reasonably little total memory, saving a giant input operation. I will try to provide the additional example during my lunch break, as the logic is slightly less natural, at least for my hash habits, and deserves a small break. It won't necessarily be a giant breakthrough, but if I think it's worthwhile self-learning, I would assume it can benefit others as well!

Vincent

Date Last Visited	‎07-02-2019 05:06 PM

Re: How to import this SDMX-ML data from Statistics Canada in SAS?

Re: Using the XML Mapper Utility

Re: Analysis by row

Re: SAS converting character variables to numeric while exporting to C...

Re: SAS converting character variables to numeric while exporting to C...

Re: If then statement to case statement

Re: using %sysfunc(cat() )

Re: proc contents

Re: Sas merge help

Re: Sas merge help

Re: put statement - format used contained in a variable

Re: Comparing one dataset with another without merging (with the help ...

Re: Comparing one dataset with another without merging (with the help ...

Re: Is it possible to run Excel VBA code using SAS

Re: FORMAT function

Re: Attempt to %GLOBAL a name (NAME) which exists in a local environme...

Re: Unable to export data to local folders (PROC EXPORT in SAS EG)

Re: Make first letter capital only

Re: Removing duplicate pairs i.e keeping only unique values that weren...

Re: Macro error

Re: sort columns from lft to right after proc transpose

Re: Select observations that meet a criteria

Re: Problem using Input to convert char with missing values to num

Re: Proc template can handle next case?

Re: Proc template can handle next case?

Re: Proc template can handle next case?

Re: Proc template can handle next case?

Re: Comparing one dataset with another without merging (with the help ...

Re: Keeping only highest values in hash table

Re: Is it possible to run Excel VBA code using SAS

Re: Keeping only highest values in hash table

Re: Keeping only highest values in hash table

Re: Html multiple graphs, how to?

Re: Keeping only highest values in hash table

Re: Keeping only highest values in hash table