Tag Archives: SQL

DB2 group_concat and other figments

For all that DB2 has to offer, there remain some features found in other relational databases that DB2 still lacks.  Multi-version concurrency control (MVCC) and Oracle-style optimizer hints (the /*+ ... */ comments) often top the list, but there are others.  I’m reminded of these when I find myself recommending something out of habit before I remember DB2 doesn’t have it (rhyme intended).

This happened today with a request for DB2 SQL to build an export command with an explicit column list (no “select *”).  The simple answer was quick:

select 'export ' || group_concat(colname) || ' from ' || tabname from syscat.columns where … group by tabname

Unfortunately, DB2 doesn’t have group_concat, so I went googling for an alternative.  Some folks offered group_concat, concat_all, and rowtocol implementations as UDFs and stored procedures, while others suggested really ugly recursive SQLs or UNION/CASE alternatives.  None were very general.  One IBMer figured out a way to use DB2’s xmlagg function to hack a solution, which I applied as follows:

select 'export ' || substr(replace(replace(xmlserialize(content xmlagg(xmlelement(name "COLNAME", colname)) as varchar(1000)), '<COLNAME>', ', '), '</COLNAME>', ''), 3) || ' from ' || tabname from syscat.columns where … group by tabname

It works, but it’s uuuugly!

I can understand some hesitation to add group_concat to DB2; for example, it’s very easy to exceed column limits.  And architectural differences mean that DB2 doesn’t always need to “clone” everything that other DBs have (there are other ways to do the same thing).  But it’s often helpful to keep up with the database Joneses in one area while you’re passing them in others.

Recent DB2 enhancements (currently committed semantics, literal replacement, optimization profiles, etc.) have shrunk many of these gaps, but several have yet to be closed entirely.  With the continued rapid addition of new DB2 features, perhaps we’ll see some of these soon.

CLPMinus

There are many great tools for running DB2 commands; I often use TOAD, Command Editor (db2ce), and a CLI-based tool I wrote.  And with DB2 9.7, I’ve enjoyed experimenting with the new CLPPlus, IBM’s answer to SQL*Plus for Oracle refugees.  But for those quick commands, I usually just pop open a DB2 Command Window (Windows) or bash shell (Linux) and type “db2” followed by the command.  It works great nearly all the time.  Emphasis on nearly.

Today, Wayne reported how this command taken verbatim from the DB2 documentation (the “File type modifiers for export” section) choked:

db2 export to delfile2 of del modified by timestampformat="yyyy.mm.dd hh:mm tt" select * from schedule

It failed with the error: SQL3192N  In the filetmod a user specified format “TIMESTAMPFORMAT” beginning with the string “yyyy” is not valid.  And no value for timestampformat worked.

This is a case where bugs in CLP’s command-line parsing (particularly with double quotes) get in the way.  The command works from inside CLP, Command Editor, and other tools, so you can just type db2 or db2ce and hit enter and then run the command (without “db2” in front) from there.  Using admin_cmd from a CLI/ODBC tool also works, like so:

call admin_cmd(' export to delfile2 of del modified by timestampformat="yyyy.mm.dd hh:mm tt" select * from schedule')
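
And here’s a quick sketch of the interactive CLP route mentioned above (db2 => is CLP’s interactive prompt):

db2
db2 => export to delfile2 of del modified by timestampformat="yyyy.mm.dd hh:mm tt" select * from schedule
db2 => quit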

Bugs like this have survived in the db2 command line for a while (this fails even in 9.7).  I’ll report it, but since CLPPlus is the new kid in town, the old db2 command line probably won’t get as much attention.

DML Insurance

In a chat today, we discussed how to protect data when formulating DML, particularly those SQL update and delete statements that can destroy perfectly good data if done incorrectly.  You may find yourself having to run such SQLs against large data sets (and, yes, even production data), so no amount of care is too much.

One option is to hold a transaction open, run the DML, select the (uncommitted) results, and then commit if good or rollback if not.  That’s clever, but it can hold a lot of locks for a long time, is error-prone, and doesn’t work for all cases.  Another is to keep full backups handy and restore the table(s) if something goes wrong.  This can be time consuming and can harm availability during restore if the table(s) are large.

A better approach is to create exports of exactly the data you will be changing or deleting.  That is, run an export command whose where clause is exactly the same as that in your update or delete statement.  That way, you can restore (via import) just the data you changed.  For DB2, I use IXF exports since they’re the richest.  So, for example, before you:

update company.employees set salary=100000000 where name='Derek'

You should:

export to empderek.ixf of ixf select * from company.employees where name='Derek'

If you realized maybe you shouldn’t have done that, you can put it back with:

import from empderek.ixf of ixf insert_update into company.employees

This can be used to recover from deletes as well (hence insert_update).  For example, before you:

delete from wikipedia.articles where title='Malamanteau'

You should:

export to articles.ixf of ixf select * from wikipedia.articles where title='Malamanteau'
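
And if the delete turns out to be a mistake, the same insert_update import puts the rows back:

import from articles.ixf of ixf insert_update into wikipedia.articles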

MySQL has fewer export/import options, but mysqldump can definitely help.  For example, export with:

mysqldump wikipedia articles --compact --where="title='Malamanteau'" > articles.sql

And, after the above delete, you can put it back with:

mysql wikipedia < articles.sql

For ease of reference, I often give these “backup” IXF and SQL files unique, descriptive names. It’s helpful to include the date and some tag that identifies what the data is.  In the above example, I could have used articles-Malamanteau-12-May-2010.ixf, although Malamanteau is hard to spell and probably isn’t even a real word.
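
For example, the export above would then become:

export to articles-Malamanteau-12-May-2010.ixf of ixf select * from wikipedia.articles where title='Malamanteau'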

Perhaps this simple technique of “export first, DML second” will rescue you from the occasional “oops moment.”

The Friday Fragment

It’s Friday, and time again for a new Fragment: my weekly programming-related puzzle.

This Week’s Fragment

This week’s fragment is programming-related (by a stretch), and is borrowed from a recent Car Talk episode.  A listener there wrote how he didn’t get a programming job partly because he failed to solve the following interview puzzle within the allotted five minutes:

You have a four-ounce glass, a nine-ounce glass, and an unlimited supply of water, which can be poured and dumped as much as needed.  Can you measure exactly six ounces of water in as few steps as possible?

If you want to “play along”, post the solution as a comment or send it via email.   To avoid “spoilers”, simply don’t expand comments for this post.  It’s your chance to demonstrate your qualifications for a job as a programmer.  Or at least a soda jerk.

Last Week’s Fragment – Solution

Last week was another SQL challenge, where IRS programmer Owen Moore had trouble joining his Employee table to the EmployeeAddress table.  When he ran his SQL, he found that it dropped some of the employees: those who did not have addresses on file.  He doesn’t mind having multiple rows for one employee (whenever there are multiple addresses), but the IRS never wants to overlook a taxpayer.  His SQL was:

select companyid, e.employeeid, hiredate, address
from employee e, employeeaddress ea
where e.employeeid = ea.employeeid
order by companyid, hiredate

Owen’s bane was that, by default, SQL joins are inner joins, meaning that results are included only if there are matching rows in both tables.  Owen needs an outer join, so that it includes all the rows from the Employee table, even if there aren’t matching rows in the EmployeeAddress table (the EmployeeAddress column values will be null where the rows are missing).  Outer joins can be left joins or right joins, depending on the order you list the tables.  Owen reads and thinks left-to-right, so he’ll list the Employee table first and use a left join, like so:

select companyid, e.employeeid, hiredate, address
from employee e
left join employeeaddress ea
on e.employeeid = ea.employeeid

Congratulations to Spencer for quickly spotting the problem and proposing the solution.  For that, he gets an extra hour to file his quarterly estimated tax payments.

If you’ve been following along, you know this was part of a larger SQL for Owen’s “pink slip pick list” report which shows IDs and addresses for all but the 49 most senior employees in each company.  The full SQL with the left join is now:

select companyid, e.employeeid, hiredate, address
from employee e
left join employeeaddress ea
on e.employeeid = ea.employeeid
where
not exists
 (select *
 from employee ei
 where ei.companyid = e.companyid
 and ei.employeeid = e.employeeid
 and 49 >
    (select count(*) from employee eii
    where eii.companyid = ei.companyid and
    eii.hiredate < ei.hiredate))

The Friday Fragment

It’s Friday, and time again for a new Fragment: my weekly programming-related puzzle.

This Week’s Fragment

Owen Moore at the IRS needs our help again.  He wants to add mailing addresses to his “pink slip pick list” report.  Easy enough: he just added in the EmployeeAddress table to get these fields.  But when he ran it, he found that it dropped some of the employees.  He doesn’t mind having multiple rows for one employee (whenever there are multiple addresses), but the IRS never wants to overlook a taxpayer.

Upon investigation, he discovered that the dropped ones didn’t have addresses on file. Here is his SQL:

select companyid, e.employeeid, hiredate, address
from employee e, employeeaddress ea
where e.employeeid = ea.employeeid
and ...
order by companyid, hiredate

Can you help Owen correct his mistake?  The “…” is the rest of his where clause (see last week’s solution below); since it’s not important to this solution, it’s omitted here.

If you want to “play along”, post the solution as a comment or send it via email.  You can use the attached file for sample data.  To avoid “spoilers”, simply don’t expand comments for this post.  Owen promises to add you to the “do not audit” list if you can help.

Last Week’s Fragment – Solution

Last week’s fragment was missing from a SQL statement.  IRS programmer Owen Moore needed a report of all but the 49 most senior employees of each company.  That is, fill in the dot, dot, dots (ellipsis) here:

select  companyid, employeeid, hiredate
from employee e
where ...
order by companyid, hiredate

Fortunately, Owen had a couple strokes of luck.  First, during his lunch break, the fast food drive-through attendant asked, “would you like a correlated subquery with that?”  Not knowing what such a thing was, he said “yes”, and it turned out to be just the thing he needed.  Second, upon arriving back at work, he was greeted with a couple of emails (including one from Spencer) suggesting a SQL like the following:

select companyid, employeeid, hiredate
from employee e
where 49 >
  (select count(*) from employee ei
   where ei.companyid = e.companyid and
   ei.hiredate > e.hiredate)
order by companyid, hiredate

That got him in the ballpark, but, alas, there was a gap:  it revealed the 49 newest employees in each company (a good thing), but not what he needed, which was everyone except the 49 most senior.  Well, Owen pulled up Google Translate and found that “all but” in English loosely translates to “not exists” in SQL.  So he wrapped an outer select around it, flipped the direction of the hiredate comparison (to count earlier hires instead of later ones), and came up with the following:

select companyid, employeeid, hiredate
from employee e
where not exists
  (select *
   from employee ei
   where ei.companyid = e.companyid
   and ei.employeeid = e.employeeid
   and 49 >
     (select count(*) from employee eii
      where eii.companyid = ei.companyid and
      eii.hiredate < ei.hiredate))

By the way, the “i” suffixes on table aliases mean “inner” (so “ei” is “employee inner” and “eii” is “employee inner-inner”), just a convention.

Owen has a “make it right, then make it fast” mentality, so he’ll consider tuning/rewriting later if performance is bad.  But if you’re interested in tuning it, he attached a script to create the table, load some sample data, and run the SQL.  This script also has SQLs and data to work on this week’s fragment.
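
That script isn’t reproduced here, but a minimal sketch of the same kind of setup looks something like the following (the table and column names come from the SQLs above; the column types and sample values are just placeholders):

create table employee (companyid integer, employeeid integer, hiredate date)
create table employeeaddress (employeeid integer, address varchar(100))
insert into employee values (1, 101, '2001-03-15')
insert into employee values (1, 102, '2009-10-01')
insert into employeeaddress values (101, '123 Peachtree St NE, Atlanta, GA')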

Guilt By Association

Anyone who has done a little data mining knows that simple association rules (a.k.a., market basket analysis) and decision trees can reveal some of the most strange and wondrous things.  Often the results are intuitive, which builds confidence in the techniques.  But then let it run loose and you’ll usually find some (strongly correlated) wild surprises.

Folks who fold their underwear tend to make their bed daily.  I’ll buy that.  But people who like The Count on Sesame Street tend to support legalizing marijuana – are you kidding?

Those are some of the conclusions reached at hunch.com.  This site will happily make recommendations for you on all your life decisions, big or small.  There’s no real wisdom here – it just collects data and mines it to build decision trees.  So, as with most data mining, the results are based on pragmatics and association, and they never answer the question, “why?”  Yet “just because” is usually good enough for things like marketing, politics, and all your important life decisions.

In school they made me work through many of these data mining algorithms by hand: classifiers, associations, clusters, and nets using Apriori, OneR, Bayes, PRISM, k-means, and the like.  When it got too rote, we could use tools like Weka and DMX SQL extensions.  It was, of course, somewhat time-consuming and pedantic, but it made me realize that most of these “complex data mining techniques” that seem to mystify folks are actually quite simple.  The real value is in the data itself, and having it stored in such a way that it can be easily sliced and diced into countless permutations.  (NoSQL fans: that typically means a relational database.  Oh the horror.)

Yet simple associations can be valuable and entertaining.  I’ve run enough DMX and SQLs against large database tables (housing contact management, payment, and contribution data) to find some surprising ways to “predict” things like risk and likely contributors.  But since “past performance is no guarantee of future results”, these outputs must be used carefully.  It’s one thing to use them to lay out products in a store, quite another to deny credit or insurance coverage.
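
The SQL side of those “simple associations” really is simple.  A minimal sketch against a made-up basket table (the table and column names here are invented just for illustration) counts how often two items show up together:

select a.item as item1, b.item as item2, count(*) as together from basket a join basket b on a.basketid = b.basketid and a.item < b.item group by a.item, b.item order by 3 desc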

American Express, Visa, and others have caught some flack lately for their overuse of these results.  “OK, so I bought something from PleaseRipMeOff.com and you’ve found that other cardholders who shop there have trouble paying their bills.  But that doesn’t mean I won’t pay my bill!  Don’t associate me with those guys!”  Well, associate is what data mining does best.  And, like actuarial science, it’s surprisingly accurate: numbers don’t lie.  But companies must acknowledge and accommodate exceptions to the rules.

Meanwhile, data mining will continue to turn wheels of business, so get used to it.  Just don’t let anyone know that you like The Count.

That No HLL Thing

Probably the worst thing about high level languages (HLL) is that they are so good in what they are doing. Good enough to conquer the entire market on programming and hold it for decades.

Wait! That is a bad thing? How?

It is a bad thing because high level languages are appropriate for a wide range of tasks, but not for every task. Yet it is exactly that that caused them to be used in contexts where they are not appropriate. In the last month alone, my strong recommendation for two different clients was that they need to switch to assembly language because it would greatly simplify the work that they need to do.

That met with some (justified) resistance, predictably. People think that high level languages are the way to write software. I decided to write a series of blog posts about the topic, trying to explain why you might want to work with a low level language.

High level languages have the following properties:

* Standardization
* Simplification
* Optimization
* Ease of trainification

Just about any of the No HLL approaches gives up on some of those properties; usually, it gives up on all of them. But think about how useful a HLL is, how flexible it can be. Why give it up?

Indeed, the most common reason to want to move from a high level language is running into the HLL limitations. In short, high level languages don’t scale. Actually, let me phrase that a little more strongly: high level languages cannot be made to scale.

The problem is inherent in the basic requirements of a high level language: it must be compiled, to handle things like converting text to machine language, etc. The problem, however, comes when trying to scale a high level language system over a set of coding styles. At that point, you run head on into the Abstraction Penalty, which states that if performance is your absolute requirement, you need to give up on either optimization or ease of coding.

In most high scaling environments, it is not feasible to give up on either option, so high level languages are out. That leaves you with the No HLL options. I am going to dedicate a post to each of the following; let me know if I missed something:

* MASM
* Gas
* Typing machine code into a hex editor
* Typing machine code into notepad or vi

Other languages, namely PL/S and inline assembly, exist. PL/S suffers from the same problem regarding the Abstraction Penalty as high level languages, and inline assembly is basically a special case of No HLL.

But seriously now… this post is, of course, an April Fools Day joke: a spoof on (and transliteration of) That No SQL Thing at Ayende.com.  (You can follow the link to the post, or just Topeka it).  Oren and other NoSQL advocates do make some good points and offer very good advice.  But be careful out there, especially when it comes to such architecture decisions, so that you pick the best tool for the job: SQL or NoSQL.  And don’t get fooled again.

Don’t Get Fooled Again

Surprisingly, I caught a little flack for this comment I made several weeks ago:

“If all you need are LOBs, you probably don’t need a relational database.  And if you need a relational database, you probably shouldn’t use LOBs.”

The surprising part was the question raised: “why would you ever use a relational database these days?”  The argument rested on some familiar and simple high-level reasons (trends, having to write SQL, taking the car apart to put it in the garage, etc.), and it came from a NoSQL fan.

This objection seemed anachronistic to me, since we just finished migrating one of our large products from a non-relational model to using DB2 for all data storage.  Why would we do such a thing?  Lots of reasons, but it only took one: paying customers wanted it.  Further, data mining has become so popular and successful lately (on the wings of relational databases) that it’s hard to imagine tossing aside that shiny new benefit.

The NoSQL revolution has taken on a reactionary bent (just consider the name), which is odd for a new movement.  Chesterton reminds us that “the business of progressives is to go on making mistakes, and the business of the conservatives is to prevent the mistakes from being corrected.”  NoSQL advertises itself like a progressive movement, but actually falls to the right of conservative.  BigTable, Cassandra, and HBase are no newer concepts than the ancient things grandpa tells war stories about: flat files, hash files, partition stores, CAM, BDAM, ISAM, etc.  So blindly applying NoSQL is making the new mistake of rejecting well-established corrections to old mistakes.

When making architectural decisions like whether to use a relational database or not, I often start with a blank 8 1/2 x 11 sheet of paper, turned landscape.  At the top I write the relevant business requirements (real, not imagined, ones).  I then do a simple left-and-right table of pros and cons.  At the bottom I write areas where architectural prototypes are needed, if any.  This, of course, helps me weigh the options and make a decision, but it also means I “own” the cons.  If I decide to use an RDBMS, I accept the costs.  If I decide not to use an RDBMS, I willingly reject the benefits they offer with eyes wide open.

Yet the war of words for and against NoSQL rages on, often without fully or correctly acknowledging the cons, nor the simple fact that you don’t do both at the same time.  Many problems are best solved with a relational database and many are best solved without one.

In the early 90s, I would joke that there existed a parallel universe where client/server has fallen out of fashion and the new, cool buzz was mainframes and 3270 terminals. Funny that we soon joined that parallel universe when web apps and cloud computing became our trend.  Andrew Tanenbaum notes that, while recapitulation doesn’t really happen in biology, computer history does rightfully repeat itself.  Old solutions should be applied to new technologies; at the very least it keeps consultants gainfully employed.  Let the pendulum swing, but truly grok all the pros and cons as history repeats itself.

Unlike Ted Dziuba, I don’t want NoSQL to die.  By definition, it can’t, and that’s arguing a strawman anyway.  I just want it to be used where it best fits.  And the same goes for relational databases.  Repeat after me: “no golden hammers; pick the best tool for the job.”  Just don’t be a lemming.  And don’t get fooled again.

Taking Inventory

Some questions came in today from a customer’s conversion programmer who was exploring uses of various related coded columns in the system’s DB2 tables.  In this case, the quickest way was to “dive in” and run some simple count and group by queries.  For example,

select mycodecolumn1, mycodecolumn2, count(*) from mytable group by mycodecolumn1, mycodecolumn2 with ur

Sometimes it’s helpful to rollup or cube these to get sub-totals:

select mycodecolumn1, mycodecolumn2, count(*) from mytable group by rollup(mycodecolumn1, mycodecolumn2) with ur

A common problem with DB2’s standard cube implementation is that it’s hard to distinguish null values from the summary (“all”) values.  If you don’t want to see null values, simply add that to the where clause:

select mycodecolumn1, mycodecolumn2, count(*) from mytable where mycodecolumn1 is not null and mycodecolumn2 is not null group by rollup(mycodecolumn1, mycodecolumn2) with ur

If you do want to see null values, use the grouping function to distinguish:

select mycodecolumn1, mycodecolumn2, grouping(mycodecolumn1), grouping(mycodecolumn2), count(*) from mytable group by rollup(mycodecolumn1, mycodecolumn2) with ur

Zero (0) means it’s the actual column value (null or otherwise); one (1) means it’s a summary row.
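
To make those summary rows obvious in the output, you can wrap each column in a case expression keyed off grouping().  A quick sketch, assuming the code columns are character types:

select case when grouping(mycodecolumn1) = 1 then 'ALL' else mycodecolumn1 end, case when grouping(mycodecolumn2) = 1 then 'ALL' else mycodecolumn2 end, count(*) from mytable group by rollup(mycodecolumn1, mycodecolumn2) with ur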

Translating those coded values into meaningful descriptions is often helpful.  This can be done by joining to lookup tables (those tables that map codes to descriptions), or by using case statements; the lookup route is sketched below.  But there are other, quicker ways.  More on that later.
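
For the lookup table route, the shape is the usual join; here’s a sketch (the mylookup table and its code and description columns are made up for illustration):

select l.description, count(*) from mytable t join mylookup l on l.code = t.mycodecolumn1 group by l.description with ur

A left join would keep rows whose codes have no matching lookup entry.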