Will Spooner: 2011

Tuesday 30 August 2011

Justification

I've just returned from 2 weeks living under canvas in Spain. Although my Spanish is, erm, basic, communication was simply not a problem. That's in stark contrast to my day job of science and business, even with all the tools of the modern internet at my disposal; there's so much I want to say that is intuitively obvious (to me), but defies logical description (by me). Compare the following: "I'd like bread, cheese and a litre of red wine in a plastic bottle" with "commercial involvement will make open science more successful". Everyone understands intuitively why I want to poison my body with saturated fats and alcohol, but polluting open communities with business ethics requires special justification for which I have yet to develop the language. But I'm working on it.

Friday 1 July 2011

The battle between "Open Science" and "Open Innovation"

"Open innovation" is a term that describes the sourcing of new methods, ideas, solutions etc. from outside the organisation.

I hate "open innovation"! I don't hate the process of "open innovation", I just hate the term. Because there's nothing "open" about it. The final "innovation" is just as closed as if it had been invented in-house.

The poster child of open innovation is, of course, InnoCentive. Clever name, brilliant business model: "Seekers" invite "solvers" to provide solutions to their problems for a cash reward. InnoCentive represents open-something for sure, but if not innovation, then what? Open questions? Not quite. Open quandaries? Better. Open befuddlement? Too far! I therefore humbly suggest;

InnoCentive; crowdsourced solutions to open quandaries.

However, so-called "open innovation" extends way beyond crowdsourcing of the InnoCentive mould. A great number of acquisitions and licensing deals, particularly in the pharmaceutical industry, can be seen in this light. Although such transactions open very little to the public domain, it is clear that innovation has technically come from outside the purchaser/licensee, hence "open innovation" still fits.

More on "open innovation": Pharma are increasingly looking to bypass the biotech middleman by partnering directly with academia. This represents the funding of public research by multinational corporations in exchange for first dibs on any intellectual property that may emerge. It could be argued that such initiatives pit "open innovation" in mortal combat with "open science". "Open science", remember, asserts that public research belongs in the public domain, for free, and for the good of all.

Semantic posturing aside, innovation is the key to progress, no matter how it is couched. Henry Chesbrough (who coined the term "open innovation") offers a nice recent example of how it can help pharma. So let's call a spade a spade; I hate the term "open innovation" because I can twist it to be in conflict with "open science", which is a movement that I truly value. I would rather "open innovation" revert to plain old "contract research", perhaps reserving "open quandary" to describe the crowdsourcing of same.

Thursday 16 June 2011

The brilliant Genome Analysis Crowdsourcing repository

In the days following the deadly German E. coli outbreak various 'rapid response' sequencing, assembly and annotation efforts washed across my radar (mainly via Twitter). In isolation each of these efforts represents little more than a shop-front for their respective creators' (albeit impressive) capabilities. There was always the nagging feeling that a coordinated effort would have been more credible, and ultimately more useful.

Having perused a couple of the available data sets to see which file formats were being distributed, I was hoping to find a blog post that summarised them all. That's when I found the E. coli O104:H4 Genome Analysis Crowdsourcing repository at GitHub. This goes way beyond being a simple blog. It represents a living repository linking all of the data generation efforts to date. If that in itself were not enough, there is also a day-by-day listing of analysis reports (mainly blog posts).

I now contend that "Genome Analysis Crowdsourcing", by pooling various independent data and analyses, makes these as credible and useful as any coordinated project could possibly have been, if not more so. The quantity and variety of data in the public domain, all generated within 2 weeks, linked from a central location, is staggering!

Thursday 2 June 2011

Open communities - build or reuse?

I have drafted this blog a few times and I'm bored with the narrative. I'm therefore going to spit out the conclusion right at the start: if you want to engage a developer community for your project go to them, don't ask them to come to you (they almost certainly won't)...

As funding for open databases like NCBI's OMIM is cut, there tend to be fairly rational calls for the database curation to be opened up to the community (e.g. Manuel Corpas' recent blog post). The typical method is the addition to the project of a wiki interface that accepts community annotation. I've been at the birth of a few such projects. No names named, and here's why; I've also sadly attended their inevitable deaths from neglect when no bugger ever used the darned things. Whilst notable successes exist (EcoliWiki, SNPedia, the Polymath Project), building an open community from scratch is hard, very hard, and most projects are doomed to failure. I, for one, limit myself to participating in two or three projects at any one time, and need a very compelling reason to start contributing to a new one.

So, given that collaborative development does produce valuable products and individuals can be motivated to contribute, how do we go about finding our contributors? The solution is actually pretty obvious; don't build an open community from scratch - use an existing one! The shining example of this approach is the Rfam adoption of Wikipedia itself as the source of community‐derived annotation, with advantages described in this NAR paper, including;

Access to a large existing community of curators,
Access to well maintained, user-friendly curation tools,
Entries subjected to automated QC tools (bots),
Leading to improved database content (around 2500 contributions/year),
Plus the side effect of improved discoverability of the resource via Wikipedia itself.

It will be interesting to see other annotation projects cotton on to this idea; Pfam already has, but it's from the same Bateman stable as Rfam, so might not count (I've already been chided for mixing the two over at this Tree of Life blog post on a similar subject). Away from annotation, for active and inclusive bioinformatics-specific open communities you have the OBF leading the way, and also Debian Med (now blogging here) who are leveraging the wider Debian Linux community for the benefit of the life sciences. Whether there will be open science projects that successfully leverage Twitter and other social media communities remains to be seen.

So; what's the point of all this? Oh yes - if you're serious about engaging a developer community for your project go to them, don't ask them to come to you. Got that?

Thursday 19 May 2011

Well that KEGGing sucks - but how much?

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a hugely important database of genomic pathways and interactions that has been used daily by countless molecular biologists over the past 15 years (up to 200K unique web site visitors per month).

Even though the data sources that KEGG integrates to build its database are predominantly available to all, free of restriction, the KEGG database itself has traditionally carried a dual license - free for academic use, but non-free for commercial use through their Pathway Solutions licensing agent. I'm no great lover of dual licenses as they discourage commercial use thereby restricting translational application of the resource 'for the good of humanity'. Well, two days ago KEGG announced that it would go even further, by charging up to $5000 for academics to download the database (starting July 1st).

Can we use this unfortunate circumstance to assess the impact of limiting access to an established resource such as KEGG? And do it using a scientific measure that really matters: citations? Given that KEGG have 1000 citations/year and a 15-year trading record, the returns for the next few years should be very revealing.

Footnote - funding for large integrated databases is notoriously difficult to maintain over the long term even though the resources themselves are enormously valuable. In February NCBI tackled budget challenges by throwing their SRA toys out of the pram (on which I have commented before), whereas KEGG have been far more pragmatic in looking for alternative sources. I have huge respect for both projects.

Wednesday 18 May 2011

Financial value of open vs closed science

I was visiting the University of East Anglia today, presenting to their Enterprise and Engagement Club. My rather experimental talk was titled "Extracting value from open science; a commercial perspective" (slides to be posted soon). I wanted to demonstrate my point using an example along the lines of; "project A with open science generated $big, whereas with closed science it would have only generated $small". Whilst not quite exactly that, I did find an interesting financial comparison from the human genome sequencing project;

Exhibit A, the perceived cost to the biotech industry of "opening" the human genome;

"In March 2000, President Clinton announced that the genome sequence could not be patented, and should be made freely available to all researchers. The statement sent Celera's stock plummeting and dragged down the biotechnology-heavy Nasdaq. The biotechnology sector lost about $50 billion in market capitalization in two days" [from Wikipedia].

Exhibit B, the total economic impact of the "open" human genome according to the recent Life-sponsored Battelle report;

"The Human Genome Project [...] wasn't just a money-sucking vanity initiative that only reaped profits for personal genetic testing companies like 23andMe. The project has, in fact, driven $796 billion in economic impact and generated $244 billion in total personal income" [from Fast Company].

So, setting the $244 billion in personal income against the $50 billion lost in market capitalisation, it looks like open science for the Human Genome Project is about $200 billion in the black! That's a fair chunk of change by any estimation.
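For anyone who wants to check the back-of-envelope sum behind that figure, here is a minimal sketch in Python. The constant and function names are my own, made up purely for illustration, and the numbers are simply those quoted in Exhibits A and B; nothing here makes the comparison any more rigorous than it already is.

```python
# Figures quoted in Exhibits A and B, in billions of US dollars.
MARKET_CAP_LOSS = 50   # biotech market capitalisation lost in two days (Exhibit A)
PERSONAL_INCOME = 244  # total personal income attributed to the project (Exhibit B)
ECONOMIC_IMPACT = 796  # total economic impact attributed to the project (Exhibit B)


def net_gain(benefit, cost):
    """Return benefit minus cost; both must be in the same units."""
    return benefit - cost


if __name__ == "__main__":
    # Hypothetical helper: just subtracts the quoted loss from each quoted gain.
    personal = net_gain(PERSONAL_INCOME, MARKET_CAP_LOSS)
    impact = net_gain(ECONOMIC_IMPACT, MARKET_CAP_LOSS)
    print(f"Personal income less market-cap loss: ${personal} billion")
    print(f"Economic impact less market-cap loss: ${impact} billion")
```

Using the larger "economic impact" number instead of personal income would make the gap wider still, which is why I stuck with the more conservative of the two.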

Disclaimer - I'm not an economist, and do not attempt to justify the validity of the comparison (other than both numbers having the same units) in any way.

Tuesday 10 May 2011

Open science; Open for business?

OK, let's get started.

What is open science? Well, my personal take is this - the scientific process has four outputs;

Methods/protocols that,
Generate data that,
Can be documented/published thereby,
Contributing new insight/knowledge to the scientific corpus.

I define 'open science' as any of the above that is made available for free, for use by anyone, with few if any restrictions. For example,

Methods => Open source
Data => Open data
Documentation => Open access
Knowledge => Open innovation (although formally open innovation can include patented/licensed ideas)

In the context of publicly-funded science I assert that the 'value' created by open scientific output generally exceeds that of its closed counterparts, although I'll leave the argument itself for a later post. And a truism - this value will never be realised unless it is actively extracted, in the same way a field crop withers and dies unless harvested.

Unfortunately, it is the very process of extracting value from open science, the business of open science, the trade in open science, that is widely ignored by the scientific community. Indeed, the assumption that peer-reviewed publication is the only scientific currency was the premise of this Guardian piece - 'Why won't Open Science work?'.

So - do we need a new way of dealing in the existing currency, epitomised by open access publishing? Or do we need a new currency, nano-publications perhaps? Whichever (and I'm desperately clutching for an appropriate idiom), where there's value, human nature will eke it out, come what may.