semantic web, George London

Reflections on the Semantic Technology and Business Conference, SF 2012

July 27, 2012 George London

In June, I attended the enlightening “Semantic Technology and Business” conference in San Francisco. These are my reflections.

First, a little background on myself – I graduated college in 2008 with a degree in philosophy and very little programming experience. I worked for three years at a very large, very technology-oriented hedge fund where I did macroeconomic modeling and built large statistical models. I left there in 2011 and ever since I’ve been pouring all of my energy into learning to program, learning semantic technologies, and learning the entrepreneurial ecosystem.

So, in short, I’ve come to the Semantic Web with a fairly non-traditional perspective compared to most Linked-Datanauts. For one thing, I could count on one hand the number of conference attendees under the age of 30. My informal impression is that most of the Semantic Web community consists of very experienced programmers with extensive backgrounds in either academia or in traditional large-corporation IT.

Some consequences of my relative youth and inexperience are that

1) I’ve completely missed the decade long back-history of the Semantic Web;

2) I have no experience in enterprise IT, and I don’t at all understand the technology problems large corporations face nor the politics that drive technology adoption;

3) I’m easily befuddled by technical discussions that go deeper than my limited (but growing) technological understanding

I bother with the extended preamble because it explains why my impressions will have to be very…well…impressionistic. Eighty percent of the content at SemTechBiz was either about the (increasingly successful) war to convince enterprises to deploy Semantic technology, or else was focused on deeply technical low-level details of infrastructure technologies. In other words, it went above my head. Nevertheless, I do have a few hopefully at least mildly interesting remarks.

My impressions all center around one central theme – the Semantic Web is an adolescent industry. And that’s fitting, since it’s around 15 years old. All over the conference there were clear signs that Linked Data is out of its infancy and is being used in real enterprise settings to solve real pressing problems. For example, Merck is using Linked Data to power a widely used internal research tool. The U.S. intelligence community seems to be heavily using Linked Data for counter-terrorism and other purposes. Nearly all the major database vendors (Oracle, IBM, etc.) have released adapters to facilitate working with RDF. Google, Microsoft, and Yahoo all have semantic research teams and have come together to support the Schema.org metadata standard (elements of which purportedly now appear on a startlingly high percentage of all webpages). Facebook has convinced webmasters to mark up something like 40% of all webpages with RDFa Open Graph tags.

The technology stack is also improving rapidly. There were a lot of technologies demonstrated at SemTech that I’m excited about, especially around the theme of “scalability.” There are now billions of triples in the cumulative open Linked Data Cloud, which is far more than existing single-machine Triplestores (Semantic Web databases) can handle. So a move toward distributed computing is going to be critical. Sindice Tech is making RDF processing much more scalable by applying Hadoop batch processing. The Bigdata triplestore (which I’m using for a project) is able to handle unprecedented volumes of RDF on a single machine or easily convert to a cluster. Google Refine is being applied to make ingesting large volumes of “idiosyncratic” RDF more manageable.

There was also a great deal of presumably-exciting-but-hard-for-me-to-understand technology targeted towards various enterprise problems.

For all the great stuff happening, there are also clear signs that the industry is not yet fully mature. I am very optimistic about the Semantic Web (which is why I’m devoting so much of my time to learning about it), and I sincerely believe that all the problems can and will be overcome. But the problems do exist.

From what I’ve been told, much of the (ever more robust) open source Semantic Web technology stack has been built primarily with government research money. Now the tap of money has turned off, and longstanding projects are struggling to generate cash flow to support themselves.

This is leading to what I see as one of the industry’s biggest challenges: fragmentation. There are many different small companies building and trying to sell important but basic infrastructure components like Triplestores or Enterprise Knowledge Management Platforms. Having tried most of the Triplestore options myself, I can attest that the offerings are probably not differentiated enough to really merit the existence of so many choices (and if anything the panoply of similar and usually under-documented options makes learning about and adopting the stack even more intimidating). “Forking” makes some sense in the Open Source world, but it makes a lot less sense in the commercial world where one team of twelve can accomplish a lot more than four teams of three.

The other major problem is the talent pipeline – there are clearly a lot of very smart people working on Semantic technology (I personally met many of them at the conference). But one of the Semantic Web community’s greatest strengths – its internationalism – is also a weakness because it makes finding and consolidating talent to work on a single project very difficult. It’s just not feasible to start a company with employees in 10 different countries. And because the Semantic Web community is still small and the total number of skilled Semantic developers in any given country can be measured in the low thousands, the problem is even worse. We’re caught in a catch-22 where even if an enterprise has a great use case and permission to spend millions of dollars on a Linked Data approach, they may (quite reasonably) demur simply for fear of not being able to find 10 developers to work on the project.

And on the other side of the pipeline, we’re suffering from a distinct lack of “MBA” types who are experienced at making emotionally compelling sales pitches or raising money from investors. Ironically, the rest of the tech world has an enormous glut of MBA types who are desperately scrounging for developers to latch on to, but none of them seem to have discovered the Semantic Web world yet. Perhaps because we don’t have the MBA types to sell it to them!

I think that each of these problems will work itself out over time, but the one guaranteed cure-all is a single highly visible home-run commercial success. And I suspect that the company that will hit that home run is being started right now in a garage.

It’s an exciting time to work on the Semantic Web. The technology is increasingly solid, powerful and mature, and it’s very clear that the community is still strong and still full of energy and ideas.

Big things are about to happen.

Quick reflections on “An Introduction to Bio-Ontologies” with Barry Smith

February 5, 2012 George London

Today, I woke up at 7:30am, which is easily the earliest I can remember waking up on a Sunday in literally as long as I can remember. After a hearty and healthy breakfast of Frosted Mini-Wheats, I set out into the hoary cold and began the hour-long journey to the New York Botanical Garden.

Why? Bio-ontologies!

Or, rather, an introduction to them led by Barry Smith, of the National Center for Ontological Research at SUNY-Buffalo.

I didn’t quite know what to expect. I was attending the event at the recommendation of the New York Lotico Semantic Web Meetup Group. I’m a fairly recent convert to the Semantic Web way of thinking, so over the last few months I’ve been trying to attend any event I can find and meet/talk to anyone I can in the field. And I’m now actively working on ontology construction for my current project LinerNotes.com, so I was curious to see how people in other knowledge domains are approaching ontological problems.

What I did get was confirmation of what I had long suspected: that poorly conceived and poorly labeled schematizations of data inside of traditional databases have led to as much poor findability and poor interoperability in biology as they have in music or (in my past life) in finance.

Dr. Smith opened the talk with a discussion of his previous work with genetics databases that were full of confusing and under-documented labeling, where the single location in which a field specimen had been found could be labeled “Japan”, “Japanese”, “In Japan”, or “J.”. Obviously trying to query such a database to find all the specimens collected in Japan is going to be an exercise in frustration.
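To make the labeling problem concrete, here is a minimal Java sketch of the crudest possible fix: a lookup table that collapses the variant labels onto one canonical term. The class name `LabelNormalizer` and the mapping table are my own hypothetical illustration, not anything Dr. Smith presented, and a real ontology does far more than string matching.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: collapses inconsistent location labels onto one canonical term.
class LabelNormalizer {
    // Variant spellings (lower-cased) mapped to the canonical label.
    private static final Map<String, String> CANONICAL = new HashMap<String, String>();
    static {
        CANONICAL.put("japan", "Japan");
        CANONICAL.put("japanese", "Japan");
        CANONICAL.put("in japan", "Japan");
        CANONICAL.put("j.", "Japan");
    }

    // Returns the canonical label, or the trimmed input unchanged if it is not recognized.
    static String normalize(String rawLabel) {
        String cleaned = rawLabel.trim();
        String canonical = CANONICAL.get(cleaned.toLowerCase(Locale.ROOT));
        return canonical != null ? canonical : cleaned;
    }

    public static void main(String[] args) {
        String[] labels = {"Japan", "Japanese", "In Japan", "J.", "Peru"};
        for (String label : labels) {
            System.out.println(label + " -> " + normalize(label));
        }
    }
}
```

The catch, of course, is that somebody has to build and maintain that table for every field in every database, which is exactly the kind of work a shared ontology is supposed to do once for everyone.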

Dr. Smith then began to dive into the technical details of his project in biological ontology, which was followed by a deep discussion of a Plant Ontology project apparently being carried out in part at the Botanical Garden. At that point, unfortunately, I ceased being able to extract much from the discussion as they dove deep into I’m-sure-fascinating-but-way-beyond-me questions a la “is designating a capable_of relationship between a cell type and a plant organ going to cause conflicts with medically-oriented biologists’ intuitive notion of totipotency?”

A few meta-points I could glean from that are:

1) Even when restricted to a single subject matter domain, ontologies are tremendously complex if expressed down to anywhere near the “real individual instances” level. And so building out robust ontologies without a huge capital investment is usually going to require significant broad collaboration. Meaning that getting a large team of enthusiasts to work together is extremely helpful.

2) Even in biology, ontology is no more suitable to rigid exactitude than normal human language is. There is no “right” answer for what “capable_of” should express, and the best we can hope for is a level of clarity that makes it easy to choose the right ontology for our particular purpose, and when necessary to algorithmically convert between ontologies (which should, by and large, be functionally equivalent most of the time.)
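As a toy illustration of what “algorithmically convert between ontologies” could mean at its very simplest, here is a Java sketch that rewrites relation names from one vocabulary into another. Everything except `capable_of` is made up for the example (the class `RelationTranslator`, the target relation `can_develop_into`, and the placeholder subject and object names), and real ontology alignment is much subtler than a rename table.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch: rewrites relation names from one ontology's vocabulary into another's.
class RelationTranslator {
    private final Map<String, String> relationMap = new HashMap<String, String>();

    // Declare that sourceRelation in ontology A corresponds to targetRelation in ontology B.
    void addMapping(String sourceRelation, String targetRelation) {
        relationMap.put(sourceRelation, targetRelation);
    }

    // Translates a {subject, relation, object} statement; unknown relations pass through untouched.
    String[] translate(String subject, String relation, String object) {
        String mapped = relationMap.containsKey(relation) ? relationMap.get(relation) : relation;
        return new String[] {subject, mapped, object};
    }

    public static void main(String[] args) {
        RelationTranslator translator = new RelationTranslator();
        // "can_develop_into" is a made-up relation name, used only for illustration.
        translator.addMapping("capable_of", "can_develop_into");
        String[] result = translator.translate("some_cell_type", "capable_of", "some_plant_organ");
        System.out.println(String.join(" ", result));
    }
}
```

A one-to-one rename like this only works when the two relations really mean the same thing, which is precisely the point under debate in the talk.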

All in all, I probably wouldn’t recommend this kind of event for those who don’t have a strong abstract interest in the process of ontology creation or in the specific subject matter being covered (in this case biology), because the talk was really a “by experts for experts” type thing.

But I think it was a worthwhile use of my Sunday morning. And certainly an entertaining contrast to my Superbowl afternoon.

[Quickstart] Using Neo4j and Tinkerpop to work with RDF. Part 1!

December 20, 2011 George London

[Warning: This is another super-technical post. If you don’t know what the Semantic Web and RDF are, this will be incomprehensible.]

In my last post, I talked about my attempt, as a novice programmer currently capable of only rudimentary Python and not much else, to use Neo4j as an RDF triple store so that I could work with the DBpedia dataset on my laptop. Tinkerpop is an open-source set of tools that lets you magically convert Neo4j into a fully functional triplestore.

My conclusion from that attempt was that using only Python to set up and control Neo4j for RDF is basically impossible.

To reiterate why I’m doing this in the first place: the DBpedia dataset is fascinating and I want to explore it. But the web interface has frustrating limitations (especially the fact that it will simply time out for non-trivial SPARQL queries, and also that I can’t easily download the input to feed into other programs). So, I want to host the data locally so that I can let my laptop chug away for as long as I damn please answering my queries.

I’m still determined to accomplish that goal, so my new plan is to just bite the bullet and teach myself “just enough Java” (JeJ. Palindr-acronym!) to make this all work. I’ve hesitated to learn Java, since it is, well…extremely daunting.

As of six months ago, I knew basically nothing about programming. Since then, I’ve taught myself rudimentary Ruby (+ Rails) and rudimentary Python (+ Django), both of which are very nice, syntactically simple languages with excellent online “getting-started” resources. For Ruby, I recommend The Little Book of Ruby, or if you’re in for a more psychedelic experience, The Poignant Guide to Ruby. For Rails, I used Michael Hartl’s online Ruby Tutorial (there’s a link to a free HTML version buried on that page somewhere.) For Python, you can’t go wrong with Learn Python the Hard Way. MIT’s Open Courseware Site also has an entire intro to CS class in Python. For Django, I’m working my way through the Django Book. Both languages also have strong, enthusiastic communities in New York which you can easily connect with in person through www.meetup.com. If I get a chance, I’ll write another post sharing all the cool resources I’ve found from trying to learn Ruby and Python.

Now for Java, on all of those points…not so much.

From my perspective as an outsider and a novice, the Java ecosystem looks huge, fragmented, confusing, and uninviting.

Now I will freely concede that I don’t know shit about Java (that’s why I’m trying to learn!), so many of the things I say in this post may be deeply ignorant and wrong. If so, please point out any errors/idiocy to me and I’ll happily correct myself.

In this post, I’m going to try to walk you through the whole process of going

FROM: Knowing nothing but a simple scripting language like Python

TO: Knowing enough Java to set up and run a publicly accessible Neo4j server that uses Tinkerpop to process and serve RDF data.

I’m going to try to stick to as few steps as possible so that you can follow along even if you’re a true beginner like me. I am going to have to presume that you know enough about the Semantic Web to know what RDF and SPARQL are and why you’d want to use them. If you don’t, that’s just too big a subject to tackle here, though I will try to eventually write an introductory blog post about those too. In the meantime, you can start with Wikipedia for a brief overview of RDF and SPARQL, or learn the hard way by reading the W3C specifications for RDF and SPARQL.

So:

STEP 1: Make sure you have Java.

This post presumes that you’re using a Mac. Speaking as a long-time Mac-avoider who just recently ditched his Windows laptop for a new Macbook – if you’re using Windows and want to develop modern software, you need to get a Mac. Just do it.

(Protip: buy used. I got a five-day old Macbook Pro for $2k on Craigslist. It actually had a faulty battery, so the Apple Store gave me a brand new one, no questions asked. Ebay also has substantial markdowns. And if AppleCare is not included, SquareTrade warranties are apparently 90% as good for 50% of the cost.)

So, the basic way that Java works is:

1) You write some code, and save it in a .java file.

2) You compile your source code into .class files, which contain Java bytecode.

3) A magical machine called “the Java Virtual Machine” magically translates your bytecode into binary which can be executed on whatever system you’re using. The JVM is what makes Java portable to so many different systems…you only have to write code that’s compatible with the JVM, which is the same on every system. Making the JVM compatible with the chipset in your refrigerator is someone else’s job.
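The three steps above can be sketched with the smallest possible program. This is a minimal example of my own, assuming you already have a JDK installed (more on that in a second); the class name `HelloWorld` and its `greeting` method are just illustrative.

```java
// Save as HelloWorld.java, then from a terminal in the same folder:
//   javac HelloWorld.java   (step 2: compiles the source into HelloWorld.class bytecode)
//   java HelloWorld         (step 3: the JVM loads and runs that bytecode)
class HelloWorld {
    // Kept as a separate method so the message is easy to reuse or test.
    static String greeting() {
        return "Hello from the JVM!";
    }

    public static void main(String[] args) {
        System.out.println(greeting());
    }
}
```

Note that you run the class name (`java HelloWorld`), not the file name with its `.class` extension, which tripped me up at first.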

So, from what I can tell, “having Java” on your computer means two different things:

1) You have “the Java Runtime Environment”, or “JRE”, which contains the JVM and lets your computer execute precompiled Java code.

2) You have “the Java Development Kit”, or JDK, which contains all the machinery to compile your raw Java source code into bytecode.

Some blogs are claiming that Apple has stopped shipping a JDK since Lion, though you probably have a JRE. I can’t honestly remember what was installed on my laptop when I got it, but to figure out what you have vs. need, just open a console and type:

% java -version

If you don’t have a JDK, you will apparently get explicit instructions on how to get one from Apple. (Oracle apparently just doesn’t feel like supporting Mac). You can also download the latest JDK and updates from the Apple Developer download site. I can’t find a static link but it should hopefully be obvious what to click. This stackoverflow post also has instructions. The latest version seems to be JDK6, though there seems to possibly be a version 7 on the near horizon.
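If you’re curious, the JRE-vs-JDK distinction can also be checked from inside Java itself. Here is a hedged sketch using the standard `javax.tools.ToolProvider` class, whose `getSystemJavaCompiler()` method returns null when no compiler is available. The class name `JdkCheck` is my own invention. You obviously need a JDK somewhere to compile it in the first place; the interesting part is what the compiled class reports when you run it on another machine.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Reports which Java version is running and whether a compiler (i.e. a JDK) is available.
class JdkCheck {
    // ToolProvider.getSystemJavaCompiler() returns null when no compiler is provided.
    static boolean hasCompiler() {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        return compiler != null;
    }

    // Builds a one-line summary of the running Java installation.
    static String describe() {
        String version = System.getProperty("java.version");
        String kind = hasCompiler() ? "JDK (compiler available)" : "JRE only (no compiler found)";
        return "Java " + version + ", " + kind;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```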

STEP 2: Get Eclipse

Unlike Python, which is happy to run your hello_world.py script by itself in some random folder, Java has fairly rigid requirements for how the filesystem of your project has to be laid out. So while you probably could do everything in emacs, you can save yourself a lot of pain by using an IDE.

One of the most widely-used open source IDEs is called Eclipse. In addition to being free, it has a plugin system that makes it (reasonably) easy to add in new functionality. Neo4j will ask us to install some plugins, so I recommend that you just use Eclipse for your development, unless you have a strong reason not to. You can download it here. Just unzip it and put the decompressed folder in whatever folder you want to keep your Java stuff in (for me it’s /Users/rogueleaderr/Programming/Java).

For some reason the drag-the-app-icon-into-your-applications-folder-to-install-on-Lion didn’t work for me (the app wouldn’t launch), but I was able to just put an alias to the app icon into the applications folder and thus add Eclipse to the launch dock.

STEP 3: Get Maven

Don’t you love how simple adding new packages in Ruby is? Isn’t “gem install cthulu-mod” easy and intuitive? Well, forget about that.

You’re going to be using Maven now. I’m still figuring out exactly what Maven does, but my understanding is that it’s a package manager on steroids. If you have Maven installed, you put an xml file “pom.xml” inside each Java project you do, and it specifies the complete structure and all dependencies of your project. So if you download someone else’s project, you can use Maven to automatically make sure that you have everything you’re going to need to run that project. I recommend scanning the wiki page for a quick overview of what Maven does.

To me, typing in “gem install XYZ” three times sounds easier, but hey…

You can download Maven from the Apache website here. Follow the directions on that page to install on Mac. Basically, decompress the file then put it where Apache tells you to, then add it to your shell path. (To add to your shell path, open your .bashrc or .zshrc file, which is a hidden file located inside your home directory “ ~/ ”. If this file doesn’t exist, just create it by typing “ % emacs .zshrc ” (or whatever your preferred text editor is). Then paste in the lines from the Apache install directions. Make sure you enter the right file locations, as I learned the hard way.)

STEP 4: Get Neo4j

As you hopefully know if you’ve read this far, Neo4j is a graph database. While I’ve been told that a graph database is theoretically formally equivalent to a relational database and can be used for almost all of the same things, graph databases are naturally particularly good at representing graph structures. RDF data naturally forms a graph structure, meaning that Neo4j is naturally pretty well suited for hosting RDF.
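To see why RDF “naturally forms a graph,” here is a toy in-memory sketch in plain Java: every triple is just an edge from a subject node to an object node, labeled with a predicate. This is my own illustration and has nothing to do with how Neo4j or Tinkerpop actually store data; the class `TripleGraph` and the `ex:` identifiers are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory graph built from (subject, predicate, object) triples.
class TripleGraph {
    // subject -> list of outgoing edges, each edge stored as {predicate, object}
    private final Map<String, List<String[]>> outgoing = new HashMap<String, List<String[]>>();

    // Adds one labeled edge from subject to object.
    void addTriple(String subject, String predicate, String object) {
        List<String[]> edges = outgoing.get(subject);
        if (edges == null) {
            edges = new ArrayList<String[]>();
            outgoing.put(subject, edges);
        }
        edges.add(new String[] {predicate, object});
    }

    // All objects reachable from subject by following one edge labeled predicate.
    List<String> objectsOf(String subject, String predicate) {
        List<String> result = new ArrayList<String>();
        List<String[]> edges = outgoing.get(subject);
        if (edges != null) {
            for (String[] edge : edges) {
                if (edge[0].equals(predicate)) {
                    result.add(edge[1]);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TripleGraph graph = new TripleGraph();
        // Made-up identifiers, not real DBpedia URIs.
        graph.addTriple("ex:MilesDavis", "ex:playsInstrument", "ex:Trumpet");
        graph.addTriple("ex:MilesDavis", "ex:genre", "ex:Jazz");
        graph.addTriple("ex:Trumpet", "ex:instrumentFamily", "ex:Brass");
        System.out.println(graph.objectsOf("ex:MilesDavis", "ex:genre"));
    }
}
```

A real graph database adds indexing, persistence, and traversal on top, but the underlying shape of the data is the same: nodes, and labeled edges between them.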

Neo4j is not as naturally well suited for RDF as a dedicated triplestore like Sesame or OWLIM. But it has one key advantage, which is why I’m testing it out in the first place:

The free open source version is apparently capable of working with billions of triples. Sesame works fine with up to ~100m triples, but even the pared down DBpedia dataset I’m trying to work with has around 1.5bln. My first attempt to “damn the torpedoes” and load everything into Sesame led to some bizarre behavior. There are commercial solutions like OpenLink Virtuoso and Ontotext OWLIM which claim to work with 10bln+ triples, but those are rather expensive.

Hence, Neo4j gets my attention for now.

Neo4j comes in two forms:

1) A standalone server which you can get by clicking the download button on the Neo4j homepage. The upside of the standalone server is that you can control it through REST. So if you want to stick with Python, this is probably the way to go. Neo4j does have some embedded Python bindings, but they’re fairly limited. The downside of the standalone server is that, as far as I know, there is no way to use additional plugins like Tinkerpop, so you’re limited to what Neo4j can do out of the box.

2) A set of Java libraries. This is what we’re going to need, so that we get the full range of control and so that we can use Tinkerpop. Neo4j has a fairly extensive manual which explains how to get these libraries (the specific page is here). Follow the directions there, including potentially installing an Eclipse plugin called M2Eclipse to let you use Maven directly inside of Eclipse. On my Eclipse install, M2E was already installed, but I’m not sure how to check the full plugin list (Eclipse is pretty freakin’ complicated). But if you open Eclipse–>Preferences and see a line for “Maven”, you’re probably good.

STEP 5: Learn Java

And this is where the paved roads end. From here on out, we’re going to be tying everything together directly in Java, and fighting bugs and dinosaurs as they attack.

User-friendly resources for learning Java seem to be rather scarce (please let me know if you find any). My solution pro tem is to just go directly to the Oracle Java Tutorial and work through it. Obviously this leaves you about 3652 days short of the ten years you’re going to need to be any good at Java. But assuming you already know the basics of some object oriented programming language, it will give you just barely enough to muddle your way through getting this basic setup working. And crucially, it will teach you how the Java package system works, which is not particularly intuitive but will be crucial if we want to use Tinkerpop.

STEP 6: Get Tinkerpop

Well, I hope you enjoyed learning Java. That must have taken a while. You did go learn Java, right?

Well, just in case you didn’t – I’ll walk you through how to create a Neo4j interface using Tinkerpop. Most of this is ripped directly off of a recent blog post by Davy Suvee, found here. Davy provides some very helpful code, but he assumes a high level of Java fluency. I, on the other hand, will assume that you know no more than I do (i.e. nothing.)

So, start by reading Davy’s post. If you can follow and implement that, you don’t need me!

If not, then let’s start by downloading Davy’s code. Head over to the Github repository. If you don’t know how to use Github, Google yourself a tutorial…it’s pretty easy.

Now, within Eclipse, go to File –> Import. A dialog will pop up. Click Git –> Project From Git

Now click Next. Then copy in the URL of Davy’s project – https://github.com/datablend/neo4j-sail-test.git – last I checked. Now click “clone.”

Make sure the URL is autopopulated in the next window, and click next again. You shouldn’t need to enter any Github credentials to do this, but if you get an error, try entering yours (definitely worth signing up for a free account if you don’t have one).

Just click next on the next screen.

And on the last screen, make sure you’re creating the repo where you want it. Then click finish. The repo should download. Eclipse will bring up the original import screen again, but just close it.

Now you have the files! But what to do with them?

For some reason, Eclipse does not let you open projects. ಠ_ಠ

So what you have to do is:

1. Create a new Java project. Make sure your Eclipse workspace is set to the same folder where you cloned the project off of Github (go to File –> Switch Workspace if it’s not). Give the new project the same name as the Github repo you cloned. Click okay, and Eclipse should automatically open the neo4j-sail-test project.

2. Now you should have a project open in Eclipse, and you can get started trying to fix all the dependency errors and make this code run.

3. To do that, we’re going to have to get the actual Tinkerpop libraries, and add them to our “classpath”, which is what Java uses to figure out where to look for the files you tell it to import.
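For a feel of what the “classpath” actually is, here is a small sketch that prints every location the JVM is currently searching. It relies only on the standard `java.class.path` system property and `File.pathSeparator`; the class name `ClasspathPeek` is my own, and this is a way to inspect the classpath, not a way to fix the Tinkerpop dependencies.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Prints each location the JVM is currently searching for classes and jars.
class ClasspathPeek {
    // Splits a raw classpath string on the platform's separator (":" on Mac/Linux, ";" on Windows).
    static List<String> entries(String rawClasspath) {
        List<String> result = new ArrayList<String>();
        for (String entry : rawClasspath.split(File.pathSeparator)) {
            if (!entry.isEmpty()) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String raw = System.getProperty("java.class.path");
        for (String entry : entries(raw)) {
            System.out.println(entry);
        }
    }
}
```

If you run it with something like `java -cp some-library.jar:. ClasspathPeek` (where `some-library.jar` is a placeholder), both the jar and the current folder should show up in the output. An `import` can only succeed if the class it names lives somewhere on that list.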

That’s hard. And I will try to figure that out tomorrow…stay tuned for part 2.