It should be noted that this is the initial draft of this document. Feedback should be directed to email@example.com. This draft should not be taken as an academic paper. I do not claim to have invented the concepts expressed in this document.
The plague of the modern age is not disease or global warming or ecological collapse or any other fashionable worry. Rather, the plague of the modern age is information. We are up to our eyeballs in it. So much so that we expend a great deal of effort attempting to organize the information in a way that lets us make sense of it. That is, we try to make the information we have useful. Some parts of the effort involve making the information searchable so we can find a relevant piece of it. Others involve categorizing and crossreferencing so we can find related information easily. Some efforts even try to reduce the amount of information by summarizing it. And, to top it all off, even though we feel snowed under by information, many of us actually seek out yet more of it. It is no wonder the current era is often called the Information Age.
A side effect of all the various attempts to record information is that we have a great many situation-specific means of organizing it. This is not a bad thing. After all, a situation-specific method can be tailored to the precise requirements of the situation. These means will never disappear. Nor should they. Even so, there is one problem that seems not to have been solved: how to relate different types of information. This is best explained by example.
Suppose we have a library with a collection of books. A system for tracking the books might keep track of authors, editors, titles, publishers, and so on. This works well for the library’s purposes. On the other hand, authors are also people and they participate in things other than writing. An author might be a famous actor or an astronaut or may have been involved in some significant event. This information cannot be captured by the library’s system (and, indeed, should not be). However, a person researching a particular event might be interested to know that a participant in the event wrote a book. Or, perhaps, the researcher is looking for all books written by survivors of a catastrophe.
The researcher would, of course, search for information on the catastrophe and come up with survivors and then check bibliographic records for any matches. By this time, the researcher has likely collected a great deal of information about the survivors and the catastrophe itself that has nothing to do with the books themselves but may have been instrumental in discovering the books or the survivors. The researcher possibly has a collection of genealogical, biographical, anecdotal, chronological, etc., information that needs to be kept organized in order to make sense of it, both during and after the research phase.
All the information collected by the researcher is related, possibly in multiple ways, to other information collected by said researcher. Additionally, many researchers will be investigating multiple things at the same time. Using an appropriate means of organizing information may well point to relationships between projects, possibly saving a great deal of duplicated effort. This is especially true as the volume of information increases to the point that the researcher can no longer keep track of it all mentally and is forced to rely on various mnemonic devices and systems.
Let us now examine a concrete example. We will suppose it is the year 2150 and we are doing biographical research on the creator of a fancy new concept in information management. We’ll pretend that a fellow by the name of John Smith invented the concept and that he has been dead for some time. (A stretch, I know, but fiction is just as useful in illustrating a concept as fact.) Let’s trace one possible research path.
First, we might find out something about Mr. Smith’s life: where was he born? When? To whom? Where did he grow up? Fortunately, this type of information is available to any sufficiently determined researcher, so we determine that Mr. Smith was born on December 1, 1972 to Herbert and Marian Smith in the back country of Northern British Columbia. We also learn that he did not attend formal school but holds a doctorate in information theory. We also dig up a copy of his doctoral thesis and several biographical articles from various periodicals that have featured his work.
Now we get to wondering just why he would have pursued a degree in information theory, so we start to investigate what life was like growing up in the back country. We put together a collection of stories from others living there, along with facts about the area including weather, politics, demographics, technology, and so on. Then we think that maybe his motivation came when he moved from the back country to a city of over a million people in 2010. So we investigate what it was like living in that city (Calgary) in 2010 and put together a set of comparisons with the information we gathered on the back country.
As we research, we find that we need to know more about why he moved to the city, so we investigate economic conditions in 2010. We also discover that his wife was a city girl and had been vacationing in the back country. He met her in 2006 and moved to the city with her when they were married.
We also discover that Mr. Smith’s daughter, Jessica, is still alive, so we decide to interview her and collect her recollections about her father. Her comments lead us to investigate his family history as a possible motivating factor, and we put together a family tree for Mr. Smith.
On a parallel line of research, we also investigate some of Mr. Smith’s contemporaries, especially those who might have influenced him or worked with him. We end up digging up much the same type of information on a number of them as we have on Smith himself.
Eventually, we have a massive pile of information that we now need to distill into something useful so we can write a biography. We know all the information is related because we would not have collected it otherwise. However, there is so much of it that we aren’t sure any more just what the interrelations are and what they mean. We need a means of expressing the relationships between different types of data. Because of the volume of information, keeping track of more than a small part of it manually is out of the question. Fortunately, we have a “knowledge web” application that we can feed our information into.
First, we start with John Smith as a peg to hang our web of knowledge on. Now, for every significant piece of information, place, time, person, event, etc., we add a node to the web and add crossreferences between them. Eventually, we have a collection of nodes that are all interconnected, some more than others. Now, having the computer display the relationships graphically and zooming in on specific areas can reveal relationships that were previously not obvious; things that are a result of two or more degrees of separation on the chart. We can also navigate through the information as though it were an encyclopedia with crossreferences and possibly trace paths that bring additional information to light. All of this can even suggest additional avenues of research.
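As a rough illustration of the idea, and not a description of any real application, the web can be sketched as a set of nodes joined by two-way crossreferences, with a breadth-first search reporting how many degrees of separation lie between one node and every other. The node names come from the John Smith story; the function names `build_web` and `degrees_of_separation` are invented for this sketch.

```python
from collections import defaultdict, deque

# Sample crossreferences taken from the John Smith story (illustrative only).
links = [
    ("John Smith", "Calgary"),
    ("John Smith", "Jessica Smith"),
    ("John Smith", "Herbert Smith"),
    ("John Smith", "Marian Smith"),
    ("John Smith", "doctoral thesis"),
    ("Calgary", "economic conditions in 2010"),
]


def build_web(links):
    """Return an adjacency map; every crossreference is navigable both ways."""
    web = defaultdict(set)
    for a, b in links:
        web[a].add(b)
        web[b].add(a)
    return web


def degrees_of_separation(web, start):
    """Breadth-first search giving each reachable node's distance from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in web[current]:
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return distance


web = build_web(links)
distance = degrees_of_separation(web, "Jessica Smith")

# Nodes two or more steps away are the relationships that are easy to miss.
indirect = sorted(name for name, d in distance.items() if d >= 2)
print(indirect)
```

Starting from Jessica Smith, everything except her father is an indirect connection, which is the kind of relationship a graphical display would be expected to surface.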
The real advantage of the knowledge web, however, is that we can then pull out genealogical information and use that separately if we need to. Or we can generate a timeline. Or pull out a bibliographical listing of what Mr. Smith or his colleagues wrote. Or, if we end up doing research on one of Mr. Smith’s colleagues, we now have a ready-made web of information that ties into our new research project.
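To make the extraction idea a little more concrete, here is a minimal sketch that assumes each node carries a set of type tags and each relationship carries a tag of its own. The tag names ("person", "event", "parent-of", "wrote"), the `Node` class, and the three helper functions are all hypothetical; the dates and names are the ones from the story.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    types: set = field(default_factory=set)  # a node may have several types
    year: Optional[int] = None               # only meaningful for dated nodes


# Hypothetical sample web built from the John Smith story.
nodes = {
    n.name: n
    for n in [
        Node("John Smith", {"person", "author"}),
        Node("Herbert Smith", {"person"}),
        Node("Marian Smith", {"person"}),
        Node("Jessica Smith", {"person"}),
        Node("doctoral thesis", {"document"}),
        Node("move to Calgary", {"event"}, 2010),
        Node("birth of John Smith", {"event"}, 1972),
        Node("John Smith meets his future wife", {"event"}, 2006),
    ]
}

# Each relationship is (source, relationship tag, target).
relationships = [
    ("Herbert Smith", "parent-of", "John Smith"),
    ("Marian Smith", "parent-of", "John Smith"),
    ("John Smith", "parent-of", "Jessica Smith"),
    ("John Smith", "wrote", "doctoral thesis"),
    ("John Smith", "took-part-in", "move to Calgary"),
]


def genealogy(relationships):
    """Pull out only the parent/child links."""
    return [(s, t) for s, tag, t in relationships if tag == "parent-of"]


def timeline(nodes):
    """Pull out dated events in chronological order."""
    events = [n for n in nodes.values() if "event" in n.types and n.year is not None]
    return [(n.year, n.name) for n in sorted(events, key=lambda n: n.year)]


def bibliography(relationships, author):
    """Pull out everything a given person wrote."""
    return [t for s, tag, t in relationships if tag == "wrote" and s == author]


print(genealogy(relationships))
print(timeline(nodes))
print(bibliography(relationships, "John Smith"))
```

Each helper is just a filter over the same web, which is the point: the genealogy, the timeline, and the bibliography are different views of one body of information and not separate databases.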
One can liken the idea to the current WWW or an encyclopedia or a library or a road map or any number of other current situations. Indeed, this idea may have been in the mind of Berners-Lee when he came up with the WWW. The WWW, however, is anarchy with contributions from everyone who can possibly swing access to the Internet. It is also not well crossreferenced and is littered with noise. It is also not possible to crawl through it in an automated way and pull out useful subsets of information.
To sum up, classic information systems are single purpose and highly structured. This works well for homogeneous information, and subsets are very easy to extract from it. However, it is far too rigid for heterogeneous information. On the other hand, the WWW and its like are very good at storing heterogeneous information but are so unstructured that it is very difficult even for humans, with their understanding of semantics and meaning, to extract a meaningful subset. Instead, a solution somewhere in the middle which has a regular structure should make it easier for humans to extract subsets and may even make it possible for computers to extract some types of subsets from the information.
Such a system, which I am calling a “knowledge web”, would have the following requirements:
- No type restrictions on any information node. Any node can be any type and may be more than one type.
- A relationship of a particular type must have a regular tag associated with it to ease identification of relationship types.
- A node of a particular type must have a regular tag associated with it to ease identification of a particular node type.
- Agents accessing the knowledge web must be able to ignore node types and relationship types they do not recognize.
- It must be possible to add additional node or relationship types at any time.
It should be noted that it may be possible to express relationship types as node types, so that every relationship is a separate node in the web. The implications of doing so are left as an exercise for the reader.
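The requirements listed above can be sketched in a few lines, assuming type tags are plain strings and relationships are kept as ordinary edges and not as nodes. The `KnowledgeWeb` and `Agent` classes and every tag name below are invented for illustration; this is one possible shape, not a specification.

```python
class KnowledgeWeb:
    """A toy web: nodes carry any number of type tags, edges carry one tag."""

    def __init__(self):
        self.nodes = {}  # node id -> set of node type tags
        self.edges = []  # (source id, relationship tag, target id)

    def add_node(self, node_id, *type_tags):
        # No type restrictions: new tags can be attached to a node at any time.
        self.nodes.setdefault(node_id, set()).update(type_tags)

    def relate(self, source, tag, target):
        # Unknown endpoints are created untyped so an edge never dangles.
        self.add_node(source)
        self.add_node(target)
        self.edges.append((source, tag, target))


class Agent:
    """An agent that understands only some node and relationship types."""

    def __init__(self, node_types, relationship_types):
        self.node_types = set(node_types)
        self.relationship_types = set(relationship_types)

    def view(self, web):
        # Silently skip anything tagged with a type this agent does not know.
        visible = {n for n, tags in web.nodes.items() if tags & self.node_types}
        return [
            (s, tag, t)
            for s, tag, t in web.edges
            if tag in self.relationship_types and s in visible and t in visible
        ]


web = KnowledgeWeb()
web.add_node("John Smith", "person", "author")  # one node, two types
web.add_node("doctoral thesis", "document")
web.add_node("Calgary", "place")
web.relate("John Smith", "wrote", "doctoral thesis")
web.relate("John Smith", "lived-in", "Calgary")

# A library-style agent that only cares about who wrote what.
library_agent = Agent({"person", "document"}, {"wrote"})
before = library_agent.view(web)

# New node and relationship types can appear later without breaking the agent.
web.add_node("Jessica Smith", "interviewee")
web.add_node("John Smith", "biography-subject")
web.relate("Jessica Smith", "recalls", "John Smith")
after = library_agent.view(web)

print(before == after, after)
```

The library agent sees the same single "wrote" relationship before and after the new types are added, which is the behaviour the fourth and fifth requirements ask for.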
Note added November 14, 2009
It turns out that something like MongoDB might be a good starting point for the data storage for a knowledge web. It enforces no strict schema and could easily be used in a way that meets the above requirements.