|
|
-
I recently inherited some SQL that someone else had written and had the job of “tidying it up” before it gets pushed out to production. Here’s a slightly simplified (yes, simplified) version of that SQL: 1: --options
2: select asset_class
3: , case when volume > 0 then 'profit'
4: else 'loss'
5: end as profit_or_loss
6: , customer
7: , sum(volume) as total_volume
8: from t1
9: where asset_class = 'options'
10: group by
11: asset_class
12: , case when volume > 0 then 'profit'
13: else 'loss'
14: end
15: , customer
16: union all
17: --swaps
18: select asset_class
19: , case when volume > 0
20: then 'profit'
21: else 'loss'
22: end as profit_or_loss
23: , customer
24: , sum(volume) as total_volume
25: from t2
26: where asset_class = 'swaps'
27: group by
28: asset_class
29: , case when volume > 0 then 'profit'
30: else 'loss'
31: end
32: , customer
The SQL basically takes data from two tables (t1 & t2), aggregates each, carries out some inline expressions (for profit_or_loss) and finally unions it all together. On the surface it looks fine but there are a few problems here, namely that there is a lot of repeated code; it violates the principle of don’t repeat yourself (DRY) which preaches “single point of maintenance” and “deduplication” of code. If you take the time to check it out you’ll see that identical aggregations are carried out on the two datasets (lines 7 & 24) as are the same conversions for profit_or_loss (lines 3-5 & 19-22). Not only that but we have expressions appearing in both the SELECT clause and GROUP BY of both halves of the query (lines 12-14 & 29-31), another violation of DRY.
A bit of refactoring is called for. First job, eliminate those expressions for profit_or_loss which appear in the GROUP BY clauses:
1: --options
2: select asset_class
3: , profit_or_loss
4: , customer
5: , sum(volume) as total_volume
6: from (
7: select asset_class
8: , customer
9: , case when volume > 0 then 'profit'
10: else 'loss'
11: end as profit_or_loss
12: , volume
13: from t1
14: where asset_class = 'options'
15: )t1
16: group by
17: asset_class
18: , profit_or_loss
19: , customer
20: union all
21: --swaps
22: select asset_class
23: , profit_or_loss
24: , customer
25: , sum(volume) as total_volume
26: from (
27: select asset_class
28: , customer
29: , case when volume > 0 then 'profit'
30: else 'loss'
31: end as profit_or_loss
32: , volume
33: from t2
34: where asset_class = 'swaps'
35: )t2
36: group by
37: asset_class
38: , profit_or_loss
39: , customer
Ok cool, we’ve moved the expression for profit_or_loss into a subquery (aka derived table) and hence expressions have gone from our GROUP BY clauses … but we’ve still got DRY violations. The aggregation (lines 5 & 25) and expression for profit_or_loss (lines 9-11 & 29-31) still appear in two places. More refactoring….
1: select asset_class
2: , customer
3: , sum(volume) as total_volume
4: from (
5: select asset_class
6: , case when volume > 0 then 'profit'
7: else 'loss'
8: end as profit_or_loss
9: , customer
10: , volume
11: from (
12: --options
13: select asset_class
14: , customer
15: , volume
16: from t1
17: where asset_class = 'options'
18: union all
19: --swaps
20: select asset_class
21: , customer
22: , volume
23: from t2
24: where asset_class = 'swaps'
25: )t
26: )q
27: group by
28: asset_class
29: , customer
There, much better; the aggregation only occurs once as does the expression for profit_or_loss (which in the code that I inherited occurred four different times). Our SQL is DRY and its going to be a lot easier to maintain for whomever picks the code up from me.
You’ll notice we’ve got 2 levels of nested subqueries (aka derived tables). I make no apologies for that - derived tables are a great mechanism for eliminating repeated code and if you take but one bit of advice away from this blog post it would be this: derived tables are your friend.
Would you have refactored this the same way? Maybe you might have moved the derived tables into a dedicated view. Perhaps you might even have put the derived tables into a WITH clause like so:
1: with t as (
2: --options
3: select asset_class
4: , customer
5: , volume
6: from t1
7: where asset_class = 'options'
8: union all
9: --swaps
10: select asset_class
11: , customer
12: , volume
13: from t2
14: where asset_class = 'swaps'
15: ),
16: q as (
17: select asset_class
18: , case when volume > 0 then 'profit'
19: else 'loss'
20: end as profit_or_loss
21: , customer
22: , volume
23: from t
24: )
25: select asset_class
26: , customer
27: , sum(volume) as total_volume
28: from q
29: group by
30: asset_class
31: , customer
Let me know in the comments!
@Jamiet
Further reading:
Views keep your SQL queries DRY - Seth Schroeder
|
-
Just lately I’ve been doing a lot of messing around with Windows Azure the main focus of which has been exploring Azure storage (tables, queues, blobs) and how those things can be manipulated from Azure worker roles. One output from that “messing around” has been my Tweetpoll application (read more: Tweetpoll my first Windows Azure application is live). Since then I’ve found some blog resources that would have been useful during development had I found them earlier so I’m linking to them from here mainly so that I can easily find them again but also so that other people might be made aware of them because there’s some really good content here. Firstly, Dom Green has developed a library that provides rich application tracing capabilities on top of Azure’s bog standard logging mechanisms. Read more at AzureTrace … managing all your Azure Logging Needs. I’ve put together something slightly similar myself recently but its not as feature rich as Dominic’s and, as anyone at Conchango EMC Consulting will tell you, I’m no .Net expert by any stretch of the imagination so I’ll be working to put Dom’s library into my own stuff forthwith. Secondly, Dom linked to a great series of articles from Neil Kidd which cover how we could (and should) build reliable Azure worker roles. Those links: Thank you Dom. Thank you Neil. There’s some really useful material here that I’ll be devouring constantly over the coming weeks and months. @JamieT
|
-
Of late one may have noticed that I have become very interested in high-brow, generally vaporous, disciplines such as RESTful data services, data interoperability and cloud computing whilst coincident with that has been the inexorable rise of the term “mashup” in the information technology lexicon. “Mashup” means different things to different people but to me its simply the practice of combining data from multiple places with the aim of discovering or passing on knowledge that wasn’t known before. Well hey, that sounds a lot like what I do in my day job; the main difference being that I don’t generally hear the term “mashup” being bandied about the London meeting rooms that I frequent to the same extent that it does in the funky web 2.0 and swanky startup world; the term I hear (and use) is the considerably less cool “data integration”. Fundamentally though I don’t think there’s that much difference between the two so maybe enterprise data integration people like myself have something to learn from these so-called mashup players. One of my favourite mashup tools out there is one I’ve spoken about before – Yahoo Pipes. If you haven’t had a look at this it really is worth taking a glance. Yahoo Pipes enables you to extract data from multiple web-based data sources, transform it using a series of operations like sorting, joining, unioning and filtering before finally outputting that transformed data in one of a number of different formats; its a data pipeline for web-based data (A pipeline? Oh, there’s something else I’ve talked about before – noticing a pattern here?). Here is an example of a Yahoo Pipe: Yahoo Finance Stock Quote Watch List Feed w/Chart Extracting. Transforming. Sorting. Unioning. Filtering. Outputting. This Yahoo Pipes thing is starting to sound awfully like ETL tools such as SQL Server Integration Services (SSIS) wouldn’t you say? They even look a little bit like each other with their boxes joined up with lines between them: I’m now reminded of what my good friend Andy Britcliffe of Sharpcloud once said to me upon reading my blog post (and viewing the embedded video) Consuming web services in SSIS 2008 a full two years ago. I distinctly remember Andy’s words on that occasion: “SSIS is the ultimate mashup tool”! I didn’t disagree! Most mashup tools share one common characteristic in that they invariably require someone with some technical nous to set them up in advance so that they can be used by the less tech-savvy amongst us and the same applies in enterprises as well; data is distributed by the IT guys to the information workers and this distribution of data typically takes months whereas the consumers of that data want it available in hours. In both arenas I sense a shift occurring; now the consumers of the data are being empowered to find and interrogate data for themselves and in the enterprise this is happening through the adoption of tools such as Qlikview, Omniscope and (in the near future) Microsoft’s Gemini. I find this to be a fascinating development not because it means there may be less work for me to do (admittedly that would be nice) but because information workers now have the opportunity to be much more productive in their daily jobs and I expect those who invest in learning these new technologies to be the cream that rises to the top of enterprises in the near future. Up until recently I hadn’t been all that interested in Microsoft’s Gemini project, indeed I was very sceptical of it, but as I started to formulate some of the thoughts that I’m writing about here I began to realise how important it will be when it gets released sometime (hopefully) in early 2010. I earlier described mashups as being “the practice of combining data from multiple places with the aim of discovering or passing on knowledge that wasn’t known before” and that description fits very well with Gemini. If you don’t know what Gemini is take a look at this video: That demo glosses over the main point I’m making which is that here we see data that is originally pulled from multiple sources and combined in a familiar place (Excel) where the end user can consume it. The person speaking in the video is Donald Farmer and he has a blog entry with many other links to Gemini resources at Microsoft Project Gemini links. At the top of this email I also talked about how I’m interested in data services, that is data available over the web that we can consume via an API and use for our own knowledge discovery and I was introduced to such a data service just yesterday when listening to Jon Udell’s “Interviews with Innovators” podcast. In the most recent episode Jon interviewed Stephen Willmott whose company 3scale Networks has taken it upon themselves to make data held by the United Nations freely available via a data service to anyone that would like to consume it. For example, if you want to know the United Kingdom population’s annual growth rate since 1991, that data is available, for free, at http://undata-api.appspot.com/data/query/Population%20annual%20growth%20rate%20(percent)/United%20Kingdom?user_key=XXXX (you need to sign-up for a free user-key and substitute it for XXXX in order for this query to work) and is returned like so: “Wouldn’t it be cool” I thought, “if I could consume that data inside of Excel using Gemini”, perhaps in this example to combine it with birth rates over the same period to discover if there is a correlation between the two. At the time though I didn’t know if Gemini made it possible to consume data directly from data sources so I went straight to ask the man who would know, the aforementioned Donald Farmer. I contacted Donald over Twitter and here is the conversation that ensued: - Me: @donalddotfarmer Is there a list of data sources types from which #Gemini can get data? Interested in data from web APIs e.g. undata-api.org (link)
- Donald: @jamiet I'll need to check out that site in particular, but we do support Atom feeds. (link)
- Me: @donalddotfarmer Ahh that's good news. How about POX/RSS? Does Gemini allow us to parse it or use XQuery? (link)
- Donald: @jamiet No we don't support XQuery - we just consume Atom feeds as they come - the users can then filter and sort in Gemini (link)
- Me: @donalddotfarmer OK, so Atom only right now. Looking forward to getting hands dirty, think I know what 1st feature request will be :) (link)
Lots of techy abbreviations in there so let me summarise. Gemini will be able to consume data from web services that deliver it in the popular Atom XML dialect (more on Wikipedia) which is great news and no great surprise given that Microsoft announced in February 2008 that Atom would be their XML syndication format of choice going forward (see my blog post Windows Live Dev announcements for a more complete commentary). I happen to know that the United Nations data provided by Stephen Willmott is not currently delivered in Atom format but no matter, at least things are moving in the right direction and as I alluded during my last tweet to Donald I’ll be asking for support for other syndication formats in the future. This has turned into a rather rambling blog post so I’ll call a halt here. As always though I’d be interested to know other people’s thoughts on data services, usage of that data in enterprises or anything else I’ve mentioned herein so if you have any thoughts please leave comments in the space below! -Jamie
|
-
I like to think that most of the stuff I blog about on here is in some way useful but sometimes I just want to put something out there that i think is simply damn cool, this is one of the latter cases. Today I stumbled upon Joe Pruitt’s PoshBing – a Powershell library for Microsoft’s new Bing search engine. Its a wrapper around the Bing API and enables you to call the Bing search engine and consume the results inside the Powershell pipeline. A picture speaks a thousand words so here are some example of what you can do with it: Spellchecking  Translations Basic web search News articles Not sure how useful it is but its all very cool stuff I’m sure you’ll agree (if you’re a fan of Powershell)! Download the Powershell script file containing this good stuff from http://poshbing.codeplex.com/ and happy querying! As an aside, Joe has also built a Twitter wrapper around his Powershell Bing wrapper which means you can now use Twitter to use Powershell to query Bing; simply head for @askbing and ask it whatever you want. Probably not all that useful but nonetheless very very cool. -Jamie Please check out more of my Powershell related blog posts:
|
-
I’ve just read a mildly interesting article The Twitpocalypse is Near: Will Your Twitter Client Survive? which talks about how any Twitter client using a signed 32bit integer to store twitter status_ids is going to break in the next few days when the number of tweets surpasses the maximum possible value of a signed 32bit integer which is 2,147,483,648. That got me thinking about how much space Twitter must be using up to store all of those status updates. If we disregard storage of who posted a tweet and where it was posted from then we can assume (that’s the first of many assumptions in this blog post) that Twitter’s table of tweets looks something like this: CREATE TABLE statuses ( status_id BIGINT, status VARCHAR(140), post_dt TIMESTAMP ) According to the MySQL documentation BIGINTs take up 8 bytes and TIMESTAMPs take up 4 bytes. The space used by a VARCHAR depends both on the length of the value being stored and the character set being used so if we assume the character set is UTF-8 and we also assume that that means each character takes up 2 bytes (not a safe assumption apparently) then the maximum space used by a particular status will be 281 bytes (140*2 + 1 byte to represent the length). Hence, the maximum possible length of a record in this table is 293 bytes. Of course, very few tweets are 140 characters in length so we need to know what the average tweet length is. I’ve examined the underlying data for my Tweetpoll application* and rudimentarily calculated that the average tweet length is around about 92 characters. Thus, we can expect that the average length of a tweet in the table can be calculated as: | status_id | + | status | + | post_dt | | 8 | + | (92 * 2) + 1 | + | 4 | That’s 197 bytes in total per row. So, when twitter hits that maximum possible value of a 32bit signed integer sometime in the next 7 days I estimate that this table is going to be occupying 197 * 2147483648 = 423054278656 bytes. Or, in numbers we can understand, 394GB. By the way, don’t fret that Twitter itself is going to break any time soon; they use 64bit unsigned integers to store status_ids so the Twitter service itself is going to be OK for a while. Storing status_ids in 64bit unsigned integers means the theoretical maximum number of tweets is 18,446,744,073,709,551,615 or, as Programmable web point out, 2.7 billion tweets for every person on the planet. When that limit is (theoretically) reached in the year [fill in arbitrarily chosen year here] Twitter are going to need something in the region of 3.2million petabytes of disk space to store them all (i.e. 3,634,008,582,520,781,668,155 bytes). To put that into perspective, that’s about the same as 68 billion Blu Ray disks completely filled up with tweets. Now, back to some real work…. -Jamie Disclaimer: I’m sure I don’t need to point out that this blog post is purely based on assumptions and my, usually rather hopeless, mathematical abilities so the numbers are completely bogus and anyone else attempting this calculation would probably come up with totally different ones! *At the time of writing Tweetpoll is viewable because I don’t have to pay for it but if it is not available by the time you come to read this then its because Windows Azure has reached general availability in which case you can read more about Tweetpoll at Tweetpoll – My first Windows Azure application is live
|
-
Google have, today, released what looks like a very interesting product in Google Squared which is an extension to the search engine that tabulates your search results. Interesting because it presents results in a structured format that I assume one may, one day, be able to consume in any manner of ways (more on that in a minute). Here’s an example, the results of a Google Squared search for “prime ministers”: or how about “SSIS tasks”: Google Squared has pulled back relevant information and put it into a table format which does seem very useful but in many ways it highlights a problem with search engines in this day and age; the information simply isn’t complete and therefore can’t be trusted. We don’t have a definitive list of all prime minsters here and neither do we have a definitive list of all tasks provided with SQL Server Integration Services (SSIS) either. Based on this would you trust the results coming out of Google Squared without additional research? I certainly wouldn’t. On the subject of consuming this data, wouldn’t it be great if there were an easy mechanism of consuming structured data like this? Something none-proprietary, open and based on a standard? Well, if you had read and taken any notice of my recent blog post Whatever happened to Live Clipboard? you’d be aware that a mechanism that purports to solving this problem (i.e. Live Clipboard) has already been mooted but unfortunately has failed to gain any traction in the industry. That’s a real shame because Live Clipboard could provide a way to (for example) copy data from a Google Squared result into an Excel spreadsheet and vice versa. As I said in that previous blog post: Here’s hoping [the lack of adoption of Live Clipboard] changes soon because it sounds like a very useful technology and to a fella like me whose primary interest is data integration anything that uses well-known standards as a method of doing that is worthy of attention. All that being said Google Squared is a very interesting advancement in the increasingly intermingling worlds of data integration and search, even if it does think that George Clinton is an ex-US Vice President :)  Any thoughts? Let me know in the comments! -Jamie
|
-
[If you’re using SSIS to pull data out of Oracle then this may be of interest to you.] In SQL Server 2008 Microsoft introduced a feature called Change Data Capture (CDC) which basically enables you to query a SQL Server table from within SSIS and ask for all rows that have been created or changed since a given point-in-time. Attunity have now released a product called Oracle-CDC for SSIS that purportedly allows you to do the same with a table stored in Oracle too. Features include: - Non-intrusive, log-based Oracle CDC for Oracle 9i, 10g, and 11g
- Full integration into SSIS and the Business Intelligence Developer Studio
- Generation of SSIS packages for processing Oracle data, reducing development efforts and ensuring best practice implementation
- Automatic synchronization of full-load and CDC processing
- Monitoring and control of CDC processing
If you’re interested in knowing more then Attunity are running a webinar on 3rd June, details are at https://www1.gotomeeting.com/register/543263673. Attunity are the company that provide the officially sanctioned Microsoft connectors for Oracle and Teradata so they have a pretty good pedigree; Oracle-CDC might just be worth paying attention to. -Jamie
|
-
I’ve noticed some confusion in the community over the mythical “4000 character limit” in SQL Server Integration Services (SSIS) and thought it may be worth clarifying the situation. The most common misconception I hear is “A SSIS string variable has a maximum length of 4000 characters”. This isn’t true and its easy to prove; simply create a string variable and paste in a value that is greater than 4000 characters (which is easily constructed in Notepad using copy/paste). You can then use a script task to display the whole value, thus proving that variable values an be greater than 4000 characters:  So then where does this notion of a 4000 character limit come from? Well, 4000 is the maximum length of the result of an expression and that’s easily proved as well by using REPLICATE(“a”, 5000) in an expression. If you use that expression to return the value of a variable then you will get, at design-time, the following message:  If you examine the message in the screenshot above you’ll notice that it mentions the maximum length of a DT_WSTR value is 4000. This is true, but DT_WSTR is not the type of a SSIS string variable and hence as I’ve just proven we can place a string greater than 4000 characters into such a variable. The SSIS variable type system is basically a subset of the .Net type system whereas expressions and dataflow columns use a type system that is proprietary to SSIS (i.e. DT_WSTR, DT_STR, DT_I4 etc…). It is of course utterly ridiculous that SSIS has two different type systems; I’m completely miffed as to why that is the case. Hopefully that clears up the confusion! This information is true of both SSIS 2005 and SSIS 2008. -Jamie
|
-
Anyone out there remember Live Clipboard? It was a very interesting incubation technology that came out of Microsoft’s Live Labs group way back in 2006 (I think) and how now been open sourced under the Creative Commons Attribution-ShareAlike License. It was also backed from way up on high – Ray Ozzie (now Chief Software Architect at Microsoft and also the guy who took over from Bill Gates) was its chief backer. Live Clipboard was in fact one of the few technologies that Ray blogged about on his, now defunct, MSN Spaces blog but sadly that blog post is no longer available.

So, what was Live Clipboard? From the site itself:
“Live Clipboard uses JavaScript and standard XML formats to easily move data from one web site to another, or between the web and standard applications. It extends the clipboard concept (familiar to most computer users) to the web.”
(http://www.liveclipboard.org/)
In other words, it is copy and paste for the web. You may think we already have that; after all, I can copy and paste some text from a text box on one page into a text box on another web page; but that isn’t really what this is about. Live Clipboard used XML markup to describe the data that was being copied thus if that XML was a well-known representation of the data (e.g. a microformat) then the receiving website could act upon that data accordingly. The canonical example is the one given at http://www.liveclipboard.org/:
“Let's say you have two sites both of which understand calendar data. I want to move an appointment from one site to another. With Live Clipboard, there is now an icon on each site, next to each piece of data that can be transported. Bring site A to the front, click on the icon and choose Copy, then bring site B to the front, click on the icon and choose Paste.“
Its not hard to envisage many other uses for such a technology, http://microformats.org has a number of fledgling microformat specifications that could all benefit from Live Clipboard:
Imagine finding someone’s contact details on their website and easily being able to transport those details into your address book with a couple of clicks – that’s the promise of Live Clipboard and microformats. Copy and paste is nearing ubiquity for smart devices (iPhone is expected to announce support in the next couple of weeks) and I doubt anyone reading this would contemplate using a PC that didn’t support it so I’m surprised that this similar concept for the web hasn’t taken off.
A number of large organisations have started to support Microformats most notably Google who recently announced that Googlebot would start seeking out Microformats and Microsoft themselves who have released Oomph, a microformats toolkit. Given that the use of microformats is now taking off I’m surprised that Live Clipboard hasn’t been heard of in such a long time. Here’s hoping that changes soon because it sounds like a very useful technology and to a fella like me whose primary interest is data integration anything that uses well-known standards as a method of doing that is worthy of attention.
Does anyone out there have any information to share about Live Clipboard?
-Jamie
Links:
Live Clipboard main site - http://www.liveclipboard.org/ Live Clipboard on Wikipedia - http://en.wikipedia.org/wiki/Live_Clipboard
UPDATE: Ray is well known for inviting people to contact him so I did that and asked him what had happened to Live Clipboard. Paraphrasing his reply: even though there was initial interest, once it was open sourced there was a lack of take-up within the wider web developer community.
|
-
This evening I took over the running of the London SQL Server User Group because the usual hosts, Simon Sabin and James Rowland-Jones, were otherwise occupied. The theme of the night was “SQL Server in the cloud” so I did a session called “What’s new in SQL Data Services” which was a run through of the slides from Nigel Ellis’ session of the same name at MIX09. For those that asked about obtaining the slides they can be downloaded from http://videos.visitmix.com/MIX09/T06F where you can also see a full run-through of Nigel’s talk. For various reasons I had to rush my presentation so I highly recommend watching Nigel’s delivery of it so as to get the information that I didn’t have time to present. My colleague Simon Munro also did a session, this one entitled “The cloud DBA”, where he discussed some of the issues that we as SQL people need to be considering in regard to “the cloud” (whatever that may be). Doubtless Simon will make the slides available on his blog, perhaps even his demo too. During the QnA section there was a question about configuring parameters in the Execute SQL Task. I mentioned an article I knew of that provided guidance based on the OLE DB provider being used and said I would link to it later. Here is that article: http://totaldevpro.com/blogs/khaselden/archive/2009/04/07/sql-task-parameters.aspx. Thank you to everyone that turned up, I look forward to seeing you all at the next London meeting on 18th June. If you are interested in presenting at future user group events then please let Simon or James know; these guys do a wonderful job of making sure that the regular meetings are fruitful and fun for all of us but in order to do that they need people to “muck in” so if you think you have something to contribute please do get involved. -Jamie
|
-
Applications hosted on Windows Azure will create log files and given that we don’t have access to the running app instances the log files are the only means available to us for debugging and problem resolution so they’re a very important part of the whole service. Recently I wanted to take a look at the log files for my Tweetpoll Azure application and realised that viewing and analysing those log files is not exactly an easy process. Azure log files are stored in Azure BLOB storage from where you able to download them for offline analysis (N.B. I won’t be covering in this blog post the process of obtaining your log files – I’ll assume you already know how to do that).
Azure logs are delivered in XML files. Here’s one of them:
Given that I know quite a bit about SQL Server Integration Services (SSIS) I figured I would build a SSIS package that would parse my files for me; once parsed I can use SSIS to load the files to wherever I like. Perhaps I load the data into a database so I could query over it, perhaps I even push it into an Analysis Services cube so that I can slice-and-dice it; lots of possibilities.
First job though is to parse the files. I loop over them using a For Each loop container and upon each iteration push the log file contents into a raw file (SSIS’s proprietary storage format), appending the data a log file at a time. Once in the raw file I can consume all the log file data en masse and push it to wherever I want. TO visualise that for you, here is the control-flow of my package:
As I explained above, “DFT Consume all log files” does not actually push the data anywhere but I have hooked up some data viewers so you can see the data as it passes thru when the package is running in debug mode:
That’s all there is to it. If you’re familiar with SSIS you will know that the ‘Union All’ components shown here simply need to be replaced with OLE DB Destination components in order to push the data into a database table for analysis – what you do with the data thereafter is entirely up to you.
You can download the package from my Skydrive:
There is one simple piece of configuration you need to carry out before being able to run the package; you need to store the path to the folder containing your log files in the @[User::RootFolder] variable:
There are two pre-requisites to running the package. You need to have SQL Server Integration Services 2008 installed and you need to have read-write access to the folder containing your log files (write access is required because that is where the package attempt to write the raw file).
Hope that proves useful to people! Any questions please don’t hesitate to ask.
-Jamie
P.S. I recognise that the process of downloading all your files is laborious, I’m working on a more automated solution to that. Watch this space ;)
|
-
Simon Sabin has asked me to give a shout out to the first Cambridge-based meeting of the SQL Server UK user group. Its on Wednesday 20th May, details and agenda here: http://sqlserverfaq.com/events/176/Cambridge-SQL-UG-SSIS-and-SSRS-sponsored-by-Redgate.aspx I wouldn’t mind going to this given that Cambridge is the home town of SQL Server tool vendor Redgate but unfortunately Cambridge isn’t exactly just around the corner from South West London. Still, Simon always puts on a good meeting so if you can make it along I highly recommend you making the time to do so. -Jamie
|
-
I have noticed a few fellow SQL bloggers over on SQLBlog.com (most notably Aaron Bertrand) have started to use their blog to draw attention to various submissions to Microsoft Connect that they believe to be worthy of attention. I think this a great idea, the more attention that are paid to Connect items the better hence I’m going to start calling out some SSIS related Connect submissions that I think are worthy of your attention.
Take a read of the items below and if you agree with the aim of the submission, please click through and vote for it – it only takes a few seconds. If you want to leave a comment supporting your vote, that would be even better.
Don’t feel obliged to vote on everything carte blanche, only vote for those that you would like the SSIS product team to concentrate on. Most of these are already closed but that doesn’t mean they can’t be opened again. And remember, voting really does make a difference; if you don’t believe me take a read of Doug Laudenschlager’s blog post Your vote on Microsoft Connect influenced SQL Server 2008 Service Pack 1.
Ability to view the contents of a raw file in the raw file source adapter
There is currently no way to view the contents of a raw file without running a package or using a 3rd party tool. This request is to allow us to view the contents of a raw file from within a raw source file adapter.
#124839: Let us view the contents of a raw file in the raw file source adapter (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=124839)
Execute from this point forward
The “Execute Task" option in BIDS is incredibly useful but is only executes one task at a time (aside: it wouldn’t have even done that had a certain someone not asked for it during the SQL Server 2005 beta period). It would be very handy to begin package execution from a particular task so that it executes that task and all subsequent tasks.
This would be a great time saver in situations where the first step in a package is to fetch lots of data from a source and stage it somewhere for processing. Quite often we may already have the data available in the staging area thus we only want to execute the subsequent processing steps.
#126372: Execute from here in control-flow https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=126372
Drop a component into an existing data path
If we need to add a new component into an existing data-flow we need to detach data paths and re-attach them all again.
It would be better if we could drag the new component on top of a data path and it automatically gets inserted in between the 2 components that the data path is connecting. Depending on the behaviour of the component (synchronous or asynchronous) this may create 2 new data paths with the new component in between.
#126475: Drag component into a data path (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=126475)
View a variable in OLE DB Source component
If, in an OLE DB Source component, DataAccessMode="SQL Command from variable" you have no idea what SQL statement is being issued without coming out of the editor and looking at the variable.
The component editor should display the contents of the variable in the OLE DB Source Editor.
#127221: View the contents of a variable in OLE DB Source component (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=127221)
UPDATE: John Welch informed me that this one was actually fixed in SSIS2008 – which I have just confirmed to be true:
Let me know if you want me to carry on calling these out. I have many many more of them.
-Jamie
|
-
Two days ago I blogged about my first Windows Azure application Tweetpoll that I set into motion sometime on the morning of 6th May 2009 (i.e. 2 days ago). I’ve just been to the Azure portal to check out what metrics Microsoft are measuring. Here’s my hourly virtual machine usage: and the same data but per day: At approximately 1400 BST today (8th May 2009) I upped Tweetpool to use two worker and two web roles (previously it was using just one of each): and straightaway the impact on hourly usage was obvious as usage for that following hour virtually doubled. Bandwidth usage data was not available. There are some obvious (and dull) takeaways from this; virtual machine usage is simply a summation of the amount of time that each role instance is running. Azure costings are due to be made available this summer and I wouldn’t be surprised to see us having to pay per hour of virtual machine usage. Other things to note. - Metrics are shown for all of your hosted services. Currently we are only allowed one hosted service per Live ID (which, I think, equates to a billing account) so this isn’t a problem but I assume that they’ll eventually allow us to have multiple hosted services per billing account. If so one would hope the analytics page would differentiate per hosted service.
- Virtual Machine usage hours are extremely dull. I would be much more interested in seeing actual usage stats such as how many times each page on my web role has been accessed and how many log records have been written rather than simply how long my service had been “up”.
- There is no slice-and-dice functionality. Given that adCenter Analytics (RIP) provided demographical analysis (as I reported previously) I see no reason why similar shouldn’t be available for Azure web roles.
- Data is not available for us to consume offline and conduct our own analysis; there could be a great value for a service developer in being able to analyse web log and service log records. If and when this changes I wouldn’t be surprised to see a startup establish itself as a purveyor of detailed Azure analyses – there could be a big market for that.
In summary, the analytics available for Azure hosted services don’t provide anything of note for service developers, they simply tell you what you’re going to have to pay for. -Jamie
|
-
Intro My first Windows Azure app is (at the time of writing) up and running at http://tweetpoll.cloudapp.net/. I’ve not built it to provide anything particularly useful but what it does do is demonstrate the capabilities of Windows Azure storage, web roles and worker roles and has afforded me the opportunity to get my hands dirty with Windows Azure, something I’ve been meaning to do for quite a while. The source code is available at http://tweetpoll.codeplex.com. What does it do? Very simply, the application displays the distribution of the lengths of Twitter statuses (“tweets”) from Twitter’s public timeline. Here’s a screenshot of that display: Unsurprisingly 140 is the most common length for a Tweet however it seems that there are a great number in the 40-80 character range as well. It’ll be interesting to see if the graph smoothes out over time. The web page front-ending the application also does a few ancillary things: - Displays the total number of tweets that the application has fetched
- Displays a sample of messages currently residing on the application’s queue (more on the queue later)
- Displays the number of times that the web page has been accessed
TweetPoll won’t be available forever, I suspect that the moment Microsoft start charging for hosting on Azure that I’ll take it down but in the meantime feel free to stare transfixed as the numbers twitch before your very eyes!!!! Great fun I’m sure you’ll agree!!! How does it do it? My application consists of a worker role and a web role. The worker role has responsibility for: - Polling Twitter’s public timeline for the 20 most recent tweets
- Pushing a message onto an Azure queue for each tweet
- Popping a message from the head the Azure queue
- Calculating the length of the tweet in the popped message
- Incrementing the counter for that particular Tweet length which is persisted in an Azure table
The web role has responsibility for: Here is an architecture diagram that depicts the whole system: It is of course totally unnecessary for one thread to push a message onto a queue just for another to pop it off again but this isn’t meant to be a real-world scenario; I just wanted to use as many pieces of Azure as possible. Under the covers I used the Storage Client sample code from the Azure SDK which made interacting with Azure storage rather simple. Thanks go to… Howard van Rooijen for pointing me at Flot, a Javascript graphing engine. John O’Brien for his rather spiffing Azure Queue watcher gadget for Windows Sidebar which polls an Azure queue to detect the number of messages and which helped me immensely during development. Its also quite gratifying to deploy your app to the cloud, set it running, and watch as the gadget starts ticking over as messages are pushed on and popped off. That’s it. I hope the source code up on Codeplex helps anyone else that may be venturing into the world of Azure development. Any questions, just ask! -Jamie
|
|
|
|