Tuesday, February 3, 2009

Throwing Down the Gauntlet for my Fellow Geeks

I love Wikipedia.  I think deep down we all do.  There’s something truly amazing about accessing hordes of (useless) information simply by entering a few keystrokes in a giant search engine.  At times, Wikipedia’s better than “Googling” simply by virtue of the fact that each topic is referenced (most of the time) and peer-reviewed.  By analogy, can you imagine the quality of published work if the ACS didn’t require references in a submitted manuscript or operate a peer-review-type system?

Wikipedia is great for getting an objective “big picture,” rapidly in a fairly organized format, but it has its limitations.  Do you need to know the origins of Evacuation Day in Boston?  Use Wikipedia.  Do you need to know the economic impact of the 11-month British seizure of Boston?  You’re better off consulting a textbook or bugging your local history scholar.

By contrast, my “ranking” professors largely despise search engines such as Wikipedia.  I think they frown at the ease of accessing a tool that anyone can alter for finding physical constants (e.g. the density of aniline) or understanding conceptual material (e.g. Zimmerman-Traxler transition state models).  I once heard a professor claim, “If it’s published on the internet, there’s really no way to verify if the information is true.”  In a sense, he was correct.  The internet is a terrific source for (mis)information, and Wikipedia is really no exception to this phenomenon.  Hell, my wife (trained as a chemical engineer) has witnessed physical constants change on Wikipedia on several occasions.

Science is a largely unspoken art.  Sure, there are lectures and textbooks that “guide the way.”  However, every research scientist mines information from the stockpiles of primary literature in an effort to piece together relevant aspects of his or her project.  I imagine that if I were to search for a procedure for using SOCl2 today (Lord knows I wouldn’t consult my PI), there are probably 49 other people in the world this week looking for a similar procedure.  This means that 50 of us will spend valuable time crawling through the literature looking for a similar ballpark procedure.  To make matters worse, on my campus, the SciFinder subscription is only available at a library where waiting for a computer is akin to waiting for the Kansas City Royals to win a World Series.  A lot of these problems could be fixed with the development of a free scientific database.

I think charging several hundred dollars for crappy textbooks is criminal.  I also think that a scientist’s time is too valuable to be wasted crawling through primary literature looking for the proverbial needle in a haystack.  Knowledge should be available to the general public, hence libraries.  But, we live in a digital age where copious and sufficient information can be accessed with the click of a mouse.  Don’t get me wrong.  I believe in peer-reviewed publishing.  However, the organization of that information (specifically scientific) is what kills me.

I propose the creation of a knowledge database.  In the spirit of Dan Carlin’s last podcast, I say let it be produced by the militia: by the people, for the people.  You sign on.  You contribute.  You enter the associated references.  Do you need to know the side reactions of a Pictet-Spengler reaction?  Maybe someone in Patrick Bailey’s group just added a reference from a recent paper last week.  Need a technique for depositing silver nanoparticles?  That would be easy to find if someone in Louis Brus’ team contributed a procedure.  Earlier this morning, a friend of mine came by my office looking for a quick, easy way to make trityl tetrafluoroborate.  Imagine how easy it would’ve been for him to access a free database that references 50 different procedures (BTW, his group has complete access to SciFinder outside of the library).  Don’t feel left out you biologists out in Internet-land.  You could have access to PCR techniques, free sequencing software, and even references to protein crystal structures.

My argument is this: there is so much useful information that needs to be organized in a format that is free, navigable and easy to access.  One person cannot do it alone; we all need to contribute for the betterment of science, in general.  I envision a hybrid of Doug Taber’s Organic Chemistry Portal, Wikipedia and a condensed version of SciFinder.  I’ll gladly contribute!  How do we get the ball rolling?

38 comments:

OilIsMastery said...

The obvious problems with Wikipedia are the politically and religiously motivated censorship that takes place as well as blatant violations of the alleged neutral point of view policy and censorship of dissident and minority viewpoints. But, like mainstream education, it can be useful.

J said...

Thank you for your comment.

I'm not convinced that Wikipedia posts are largely politically or religiously motivated because I haven't seen enough evidence to suggest so. However, insofar as a collective science wiki is concerned, I imagine such motivations would be relatively silent.

Aaron Rowe said...

That is a great idea. It would be best if we could rank the difficulty and reproducibility of each method or reagent with a five-point scale.

J said...

Thank you for your comment.

I like the rating system, but instead of "points" as you suggested, maybe we'll use flasks :-)... just a thought.

Rich Apodaca said...

J, you have some very interesting ideas here. I've posted a reply.

rpg said...

"Don’t feel left out you biologists out in Internet-land"

What, like this?

It's been around for a while, as have various protocol services. bionet.methds-reagents, anyone? The trick is getting a critical mass of people using them.

Or have I missed your point entirely?

J said...

Thank you for the feedback rpg. Your analysis is spot-on.

I was unaware of such a product; it's truly remarkable! I envision a similar resource that incorporates techniques with general chemical knowledge.

There are a couple ideas in the works. Truthfully, I'd like to collect some data first from some fellow chemists before I let the project take shape.

Olah said...

I've been thinking this same thing for a long time.

Imagine not only procedures, but full on MSDS type information for every imaginable chemical coupled with the analytical data of journals and sites like SDBS.

One of the major problems with most sites on the internet is the anonymity of posters. If we were to force people to use their real names when editing the site, maybe by registering through each person's lab, you'd have the same incentive for quality as with journals. Make a bad edit; face the scorn of your peers.

I also love Aaron's rating idea, not only for procedures, but for every piece of chemical data. Imagine that coupled with a youtube-like comment system, where people could share experiences, offer wisdom, or issue warnings.

We are the internet generation; it's about time our research tools caught up with us.

J said...

Thanks for your suggestion, Olah.

I like the MSDS idea. But instead, make the MSDS navigable (i.e. pull all of the pertinent information out of the MSDS and onto the product). Merging the "MSDS" with a "SDBS"-like interface would be huge. I can't tell you the number of times I needed the physical properties of a reagent (including NMR data), and it ended up taking 45 min to find a reasonable spectrum when a text presentation would've done just as well.

The only concern with "registering through each person's lab" is that it ultimately drags a PI into the resource. I'm paranoid that PIs will want to have a hand in the operation of the product, and (in my experience) PIs typically slow things down either through micromanaging or overanalysis.

Egon Willighagen said...

Hi J,

in 1995, doing a minor in organic chemistry, I started an organic chemistry dictionary website, which, unfortunately, is offline at the moment until I can put it online at my new workplace...

Putting organic chemistry knowledge and data online has been one of the things I have been doing for a long time now, but I agree there is no satisfying solution yet, even though the cheminformatics community, and the Blue Obelisk, has been promoting such websites for a few years on an international level...

At least, we have the technology, and there are certainly open submission chemistry databases around. I summarized a few yesterday, but will make that list longer:

http://chem-bla-ics.blogspot.com/2009/02/where-can-i-host-my-experimental-data.html

But, please let me know what data you would like to contribute, and I will look around. The Blue Obelisk community might even come up with new open submission chemistry databases to host new data types.

ChemSpiderman said...

Have you seen ChemSpider by any chance? www.chemspider.com. We put this online a couple of years ago and have been working hard to extend its capabilities to support the Open Notebook Science world, integration with external sources (e.g. Wikipedia, PubChem), chemistry document markup, deposition of new data (including images, spectral data, CIF files), and crowdsourced curation. If you are interested in working with me to test whether ChemSpider can already provide what you want (and I think we are close), drop me an email at antonyDOTwilliamsATchemspiderDOTcom (replace the dots and the at).

ChemSpiderman said...

Relative to Olah's comments

"Full on MSDS type information for every imaginable chemical"

Look at the Supplementary information here: http://www.chemspider.com/Chemical-Structure.5889.html


"coupled with the analytical data of journals and sites like SDBS."

Look at the spectra here: http://www.chemspider.com/spectra.aspx : and read this page:
http://www.chemspider.com/docs/Uploading_Spectra_onto_ChemSpider.htm

"One of the major problems with most sites on the internet is the anonymity of posters. If we were to force people to use their real names when editing the site, maybe by registering through each person's lab, you'd have the same incentive for quality as with journals. "

All edits and curations on a site are captured here: http://www.chemspider.com/feedbackcurated.aspx But I don't force people to sign in with their own name if they choose not to


"Imagine that coupled with a youtube-like comment system, where people could share experiences, offer wisdom, or issue warnings."

We have that....simply Post Comments or become a curator
http://www.chemspider.com/docs/The_Curators_Manual_for_ChemSpider.pdf

ChemSpiderMan said...

J, You threw down a gauntlet so I picked it up for fun. Take a look at this: http://www.chemmantis.com/Article.aspx?id=875

This is a copy of the last four of YOUR blogposts marked up and linked back to chemical structures, reactions etc. It's not perfect..in fact, when the links fail it's commonly spelling errors in the blog post...enjoy!

Egon Willighagen said...

Antony, I am asked to log in, when going to ChemMantis for the annotated blogs... that's not intentional, is it?

ChemSpiderMan said...

Oops...I forgot to hit PUBLISH...try it now...one button click..

Egon Willighagen said...

OK, works now. Indeed some glitches here and there... For example, 'imidoyl cyanide' gives 'Cannot convert name into structures'... Antony, can you please explain how the organic chemist can teach ChemMantis/ChemSpider what that compound is? I think that's the Open Submission they may be looking for...

Additionally, and interestingly, in the ChemMantis version of "Playing the Devils Advocate: What has Monsanto done for you?", ChemMantis seems to disagree over 'Agent Orange'... a single structure in ChemMantis, a code for two structures in the blog...

J, can you clarify?

J said...

Hi everyone (in particular Egon and ChemSpiderMan)!

Thank you for your comments. I've taken a quick peek at the links you both posted. I liked the SI on the aniline post. I have two concerns about chemspider. First (and it's minor), calculated/hypothesized data makes me a bit nervous. I'm jaded because my PI refers to calculated properties as "really expensive guessing." But like I said, it's a minor concern and for every one person who frowns on it, there's another person who can't get enough of it.

The bigger concern is the lack of experimental conditions. My point is that it might be fine to know the molecular weight of thionyl chloride (for example), but an experimental procedure(s) that outlines its use would be most beneficial. All in all, I haven't had much experience with ChemSpider, but I'm going to drop it in my browser's bookmarks.

The chemmantis is really neat. I like how it links back to sites such as Entrez and ChemSpider. The concern is that you have to cross programs (sorry, I'm not a computer person). I imagine a completely contained program that cross-links within itself.

Good news, though. There's been a lot of discussion, idea swapping and development behind the scenes (it's amazing what 3 days of collaboration will accomplish). I will soon be updating the blog with more information.

Egon Willighagen said...

Ad. "really expensive guessing".

:)

One reason why it is expensive and why it is guessing: (organic) chemistry PIs have not been putting their data in the Open, so the cheminformatics guessing is based on a too small data set, which cost way too much money to extract from published literature.

And, of course, it is much cheaper to have PhD students synthesize the compounds and measure reality ;)

Seriously, no computational chemist will ever deny that measured data is more exact than calculated data. Just try synthesizing 10M compounds in a week and measure their properties.

J said...

Touché

Jean-Claude Bradley said...

@J ChemSpider is not currently used to store experimental protocols. But what it does do (store spectra, properties, etc) it does very well.

When we have characterized a new product, we link from our lab notebook on a wiki to ChemSpider. For example, here is a Ugi product (UC-150D) that was made in my lab. You can look at a bunch of properties (including calculated ones if you choose); then, when you want to look at the experimental conditions, click on the link from the supplemental information section in ChemSpider to our wiki and see the lab notebook page.

Egon Willighagen said...

JC, very nice example! I noted that the CIF (nice to see Jmol in action) is marked as OpenData (Antony, cheers on the icon), but also saw it was missing for the spectra... JC, are those not open?

ChemSpiderMan said...

Response to Egon's question "... For example, 'imidoyl cyanide' gives 'Cannot convert name into structures'... Antony, can you please explain how the organic chemist can teach ChemMantis/ChemSpider what that compound is?"

I believe that there is no "imidoyl cyanide"...I believe that it is a non-specific name and therefore cannot be converted to a single structure. I have not heard the term imidoyl myself (not that that means much..I am a spectroscopist not an organic chemist) but there are no hits in MeSH, Wikipedia or ChemSpider. However, at the top of the structure balloon ABOVE "Cannot convert Name to structure" we allow one-click searches of ChemSpider, Entrez, Google, MeSH and Wikipedia to help people find info. A week ago we DID have a red dotted line under all extracted names that did not convert. We need to bring it back.

ChemSpiderMan said...

Some responses to J...

" I have two concerns about chemspider. First (and it's minor), calculated/hypothesized data makes me a bit nervous. I'm jaded because my PI refers to calculated properties as "really expensive guessing."

Without meaning to sound overly critical, there is the purist's world and there is the reality of industry. Calculated data have saved MANY a project in industry where the calculated data said "take another look at what you measured". I was involved with the development and marketing of commercial software tools for PhysChem prediction, nomenclature generation and NMR prediction. Look at the story of hexacyclinol for an example of NMR prediction DISPROVING the experimental method. Look at the report by Gernot Eller about nomenclature and how well chemists can generate chemical names themselves (http://tinyurl.com/ctofhj downloads the PDF). I have seen directly how PhysChem prediction can RESCUE projects where measurement protocols fail. I am NOT saying that prediction outperforms experimental measurement, but be very careful in judging it harshly, as you WILL meet it and be grateful should you move to industry. Your PI should not be so judgmental and purist in nature. By the way...upload a molecule here: http://www.chemspider.com/Services.aspx

J said... "The bigger concern is the lack of experimental conditions. My point is that it might be fine to know the molecular weight of thionyl chloride (for example), but an experimental procedure(s) that outlines its use would be most beneficial."

Yes, I agree. But someone has to CREATE that information in order for others to use it. ChemSpider is run on no funding, no grants, and no fees to use. We depend on the community, users like yourself, to help. We will provide an environment to deposit information but we need people to do it. CAS will charge for you to access information but they are aggregating OTHER people's work and work hard to do it. In Wikipedia people BUILD the platform and users populate it. If you'd like to see info on ChemSpider, tell me where it is or put it in there.

Take a look at the Description section for this molecule: http://www.chemspider.com/Chemical-Structure.8215220.html. The description was taken from Paul Docherty's blog and is VERY useful and elegant in its description. ANYONE can post their info to ChemSpider...


"The chemmantis is really neat. I like how it links back to sites such as Entrez and ChemSpider. The concern is that you have to cross programs (sorry, I'm not a computer person). I imagine a completely contained program that cross-links within itself."

That is totally contrary to the intention of ChemSpider. No one body should manage all data and information but link seamlessly to updated information. We COULD host all of Wikipedia but if there is one change at a record then we would miss the update. There is lots more to say here...


"There's been a lot of discussion, idea swapping and development behind the scenes (it's amazing what 3 days of collaboration will accomplish). I will soon be updating the blog with more information."

Here's my question for you...you said in the original post "I envision a hybrid of Doug Taber’s Organic Chemistry Portal, Wikipedia and a condensed version of SciFinder. I’ll gladly contribute! How do we get the ball rolling?" SO, I'm asking you to work with me to expose some interesting information, articles etc for the good of community and use ChemSpider to test your thoughts...ready?

ChemSpiderMan said...

Comment to Egon

"...I noted that the CIF (nice to see Jmol in action) is marked as OpenData (Antony, cheers on the icon), but also saw it was missing for the spectra... JC, are those not open?"

JC has already contacted me to ask whether we can go back across all of his spectral data and make it Open. I'll add it to the list of things to do and they will be available soon (no specific timeline promised).

ChemSpiderMan said...

Comment to JC "ChemSpider is not currently used to store experimental protocols."

ChemSpider CAN be used to store experimental protocols if we acknowledge upfront that it is not designed to be a Notebook system. It CAN hold experimental protocols if you wanted.

An example? Check out UC150 that JC Bradley was talking about. Took me 15 seconds to do this...copy the UC150 page from the Usefulchem wiki and paste it in the description section: http://www.chemspider.com/Chemical-Structure.21105601.html

ChemSpiderMan said...

J,
You wanted something associated with thionyl chloride...check out: http://www.chemspider.com/Chemical-Structure.22797.html

Also, check whether you are comfortable with "your contribution" regarding 5-amino levulinic acid being posted here: http://www.chemspider.com/Chemical-Structure.110200.html

Joerg Kurt Wegner said...

Another option is chemistry@freebase.

As discussed earlier, I still think that a proper chemistry ID @freebase is interesting.

On the other hand, with the current chemspider widget trend, a richer web service is possible than just linking information, e.g. getting chemical images, chemical structures and cross-links (ChemMantis text mining) as mentioned by ChemSpiderMan.

Still some work to do, but the trend is certainly going in the right direction. And I agree that information exchange should not be limited to single services or web interfaces, so if Geoff wants to get more linking, e.g. reaction conditions, then this meta-information should be linkable somewhere. One option could be to create one Google document for each reaction and just put all the information in it; of course, Wikipedia could allow the same depth in reactionXYZ(conditions) articles ;-)

Jean-Claude Bradley said...

Joerg,
One of the downsides with GoogleDocs is that ironically I have not found that they index well on Google. We use one wiki page per experiment on the publicly hosted Wikispaces and indexing is usually very quick.

I do however really like Google Spreadsheets for recording calculations of experiments so that anyone can check over them if something doesn't make sense. We typically have one per experiment - we set the access rights to allow anyone to edit with an immediate notification of changes. That way people can write comments directly into the sheet.

Joerg Kurt Wegner said...

This is clearly one strong point for MediaWiki, or even its semantic extension.

I was just reading the Google documents entry on Wikipedia, and I think the 2MB size limitation of GoogleDocuments is another strong point for using a wiki instead.

ChemSpiderMan said...

J, You asked about something SciFinder-like also...we have text based searching of the literature too. Over 300,000 papers now, and growing. I did a search on thionyl chloride on ONLY the Molecules (MDPI) subset on our ChemSpider literature search. See the results: http://tinyurl.com/aqcwlf

Please note that searching across all articles for such a general chemical is very slow at present and we are working on improving performance.

Joerg..I looked at chemistry@freebase. Can you point me to an "interesting article"? There are about 4000 chemicals only and I couldn't find an article with interesting content. I couldn't find common compounds like xanax or viagra yet...

mitch said...

Reading through all the comments I can't help but feel the chemical informatics guys don't get it. The critical error seems to derive from making tools that can do cool things but not having a critical mass of users to make it relevant. I've always approached problems from the critical mass side, and let others worry about indexing, tagging, and developing tools in the future.

What J and I are planning is truly awesome, but it is not going to fall within the realm of the type of collaborative work that the chemical informatics people know so well. J will announce the project in the next couple of days. I think you will find it completely awesome, or simply not get it. It could go either way from reading through the comments.

Egon Willighagen said...

Mitch and J: really looking forward to what you are cooking up!

ChemSpiderMan said...

Mitch...I look forward to seeing what J and yourself will be unveiling. I thought J was actually asking for help when he said "One person cannot do it alone; we all need to contribute for the betterment of science, ...I’ll gladly contribute! How do we get the ball rolling?" and that's why I offered some help. But based on your comment he wasn't asking for help...it was a set-up for what's to come from the both of you. Doh...sorry I consumed so much of your time giving feedback. But if there's any way that ChemSpider can help let me know.

Anonymous said...

Hello J,

As a graduate student of Library and Information Sciences, I agree with your comments here.

If I may make a suggestion, you might want to consult with a librarian, information specialist or research specialist in setting up your idea.

Joerg Kurt Wegner said...

I can only agree, it could not hurt asking for more input on this, especially when talking about critical mass.

I commented in response to your post, and a few other blog posts and discussions, on Mining Drug Space.

Well, if you want to make it simpler, shorter, and more pragmatic ... I wish you time ... "I would have written a shorter letter, but I did not have the time.—Blaise Pascal"

Eric Drexler said...

Success in this project would be extremely valuable, and I'm encouraged by the discussion here. A few thoughts --

What is generally called "critical mass" has an aspect that can be thought of as "critical density": covering a knowledge domain densely enough that the likelihood of finding a useful answer is high enough to reward the effort of looking for it. This speaks in favor of starting with a domain that is well-defined and not too large, making it more practical for a focused start-up effort to reach the required density. Excessive initial breadth is one cause of the problems with chemistry@freebase that were discussed above.

One way of quickly getting both mass and density is to provide a front-end to an already rich source of data, while making the system extensible for broader applications. This is how the Web got off the ground -- the http protocol and first browsers were immediately useful because they provided an improved front-end for accessing the already large body of ftp content. (The CAS Registry look-up example has some similarity.)

I recall legal rulings from some years ago that established that some copyrighted databases had a markedly limited copyright on their contents. This suggests that there may be ways to liberate data derived from the literature on a large-scale basis; legal advice would be necessary here. Linking to databases that themselves have restricted access is a second-best (but clearly legal) option for achieving density in a way that is accessible to a large block of potential users.

It's important to look toward broader objectives, not to pursue them immediately, but to see whether they give reason to adjust initial directions in some way. Downstream, I'd like to see a really effective merger of the Wikipedia model for contributions with a database model for search, and to see the content organized in a way that supports research in the domain of molecules and materials. There are some interesting questions here involving openness vs. review (review annotations are a middle ground), and combined technical and social questions involving extension of the database structure.

On a more concrete and immediate level, the NMRShiftDB is different in many ways, but may have relevant lessons.

dqualley said...

Admit it, one year ago you would have said "Tampa Bay Rays" instead of "Kansas City Royals".

:)