Signal to Noise and the Future of the Net
Tired of Wall of Mud Search returns?
As we gain more people on the network, more content on the network, and more raw data on the network, we have to get smarter about how we use it. Our tool-kits are going to grow in ability, but can the tools keep up with the pace?
One attempt at this is Wolfram Alpha, a tool designed to answer natural language questions approximately intelligently with an assemblage of pertinent data in a page, somewhat like a mash-up of data filtered by your parameters from what I have seen. Think of it as a Rosetta Stone for types of factual data accessible on the internet, since it does conversions, comparisons, and charts.
After watching one demo video it does appear exciting, but I fully expect WA to run into some of the same problems that Natural Language Interactive Voice Response units (NL-IVRs) have, along with the problems that Google and Wikipedia both have.
You’ve probably met one of these NL-IVRs over the phone. They generally go into their spiel and along the way tell you to “just ask” for what you want.
Taking away the voice recognition faults and dictionary tuning tasks, using a natural language dictionary for even limited applications like directing customers between the billing department and the customer service department does take some skill, because you can’t guess all combinations of how the customer might ask for billing or for customer service. One might say bill, another billing, another pay, another the name of the product and collections, etc. There’s also the question of what you do when someone asks for something outside the standard menu.
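To make the routing problem concrete, here is a minimal sketch in Python of the kind of keyword dictionary described above. It is an illustration under assumptions, not how any real NL-IVR product works: the synonym table, the department names, and the `route` function are all hypothetical. Note how the table can only cover the phrasings someone thought of in advance, and how anything off the menu has to fall through to a human.

```python
import re

# Hypothetical synonym table: the words a caller might use for each department.
# A real system would need far more entries, tuned from actual calls.
INTENT_KEYWORDS = {
    "billing": {"bill", "billing", "pay", "payment", "invoice", "collections"},
    "customer_service": {"help", "support", "service", "problem", "complaint"},
}


def route(utterance: str) -> str:
    """Pick the department with the most keyword matches.

    Falls back to "operator" when no keyword matches at all, which is the
    "outside the standard menu" case.
    """
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    scores = {dept: len(words & keywords) for dept, keywords in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "operator"


print(route("I want to pay my bill"))
print(route("What are your opening hours?"))
```

The first call matches two billing keywords, while the second matches nothing and is handed to the fallback. A typed query engine faces the same gap between the table and what people actually say, with typos standing in for mispronunciations.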
Wolfram’s advantage is that it doesn’t have to deal with interpreting language and dialects from sound, but on the other hand it does have its own “mispronunciations” in the form of typos and colloquialisms.
Wolfram also has a greater advantage over NL-IVRs in that the entire language is out there on the network and defined, and Wolfram will have access to all of that. On the other hand, it wouldn’t hurt Wolfram’s chances of success to have a chat with some of the NL-IVR industry leaders to see what they are struggling with in practice every day. Like some of the first natural language sound dictionaries (the University of Oregon and their 800 number survey collection of dialects comes to mind) and the major search engines, I suspect that Wolfram will use some crowd-sourcing to iteratively tune itself.
As the net grows, millions more people are added annually, new sources of raw data come online, and the content is becoming much richer. That said, the poor signal to noise ratio from those additions is quickly becoming alarming. When I was a roadie it was very important to keep the signal strong at the starting point, and as you amplified it was important to keep each amp in the chain leading to the last mixer at maximum strength without distortion if you wanted clean pure sound. (Let’s forgive the distortionists and their wall of mud sound for the moment. It was interesting as a primitive artistic experiment, but when the day is done you want to hear the expression of the individual instruments and voices woven together cleanly – and those are the songs that will truly last.)
For a search on any specific topic you have to wade through a morass of sites, some using keywords and Search Engine Optimization to just get you to eat their search of links rather than the one you first chose. Some are not information sites, but instead splogger, retread, and disinformation sites – they have all the right words but they are telling lies or redirecting to nonsense and away from the original or actual content. They are the distortion and the noise in the net. How do we get to authority, how do we get to relevant searches, how do we get to trusted and genesis sources?
The major search engines have made sorting signal from noise into their business – but they are beginning to fail under the load. Merely cataloguing data, ranking it through keywords, links, and page hits as authority measures, and popping that stack when searched for just isn’t enough. Indeed it’s rather a simplex approach, as the protagonist found out in Samuel Delany’s “Empire Star/Babel 17” upon meeting the simplex culture of the Galactic Encyclopedia the first time. (That pair of linked novelettes also explores language, its nature, and how it shapes culture, and so is an interesting and entertaining read in its own right.)
Back to the subject at hand: the tools we have for transforming raw data to usable information are improving, and you can see that Wolfram is taking advantage of those – whole databases have been put online, and places like Gapminder.org and others are working on methods to take that data and turn it into useful information in the form of tweakable charts. That is to the good, but as more data comes, how do we get from simple line graphs to the search that pops up the right control chart for a specific question? How do we ensure that the right data set is chosen? Therein lies the rub in taking data from raw to information form. The next step in the chain is taking that information and converting it to usable intelligence, and we really haven’t quite got there anywhere that I’ve seen yet. Perhaps Wolfram is the first reach to crack open that door.
In the meantime we are saddled with the growing babel of search optimization as commercial interests compete with social networks to steal the search mojo. In that open environment what tools can be used to get to factual pertinent content and genesis sources where required? There are various means in practice, but right now they amount to measuring popularity through a few means, and popularity does not usually equate to trustworthy or authoritative sources. Few of us have 100 years to live, and poking through three pages of links to get to the pertinent sites needed is a waste of valuable time. The novelty of noodling through the net is also wearing off in the general public. They want what they want now, not yesterday – they are growing tired of distortion and wall of mud searches.
Both commercial and social ranking sites have created another phenomenon of the web, something I’ll call “yellow searchalism” for now. The snarkier and more sensational your headline, excerpt, and tags are, the better the chance that your article will get ranked higher. It’s like the yellow journalism of the past: the more alluring headline sold more papers, and the more obnoxious or sensational tagline gets more hits.
There are also problems in communities that ding up and ding down, such as Digg and Little Green Footballs. While having completely different political bases, each has “thought leaders” who, if they plus something up, are more likely to be followed by others who plus things up. Yellow searchalism and time of day also affect ratings at these sites. An article posted at the right time of day with a snarky headline is more likely to go up in rank than the same article posted off-peak with a mundane, factual headline. Each community is attracted to specific interests, and you are more likely to find technology, entertainment, and humor on Digg while LGF is more news, politics, science, and technology.
To Charles Johnson’s credit, Little Green Footballs is also pioneering a filter system in the form of “monitor lizards” who remove links to non-factual sources and kookspiracy or hate sources, and who also clean out some of the hysterical and hyperbolic, while Digg doesn’t appear to have any similar mechanism in place.
One of the means of search ranking is through a mix of several methods: number of hits, links to that page, number of times your terms appear, and similar quotes and citations. Most search engines will not divulge their full means since that allows you to “hack the stack.” But as seen with Google bombs and search page ranking races, that’s not working effectively more than half of the time. People who were once attracted to the salacious and attractive are getting frustrated now because they aren’t getting exactly what they asked for – distortion and walls of mud searches are going out of style.
We have to get better at homing in on what is truly asked for versus what’s popular or what’s highly pimped, and some are trying through tailoring to stated individual preferences and past preferences. Some examples of this are iTunes’ Genius, YouTube’s “recommended for you,” and other examples are in this article. The negative with “tailored for you” approaches to ranking is that they box individuals into a room of the same, and they can lose all sight of the new. When wanting a new view of new things, coloring that with past bias is not really a good thing, and it can stultify creativity.
The other factor that weighs heavy on the net: search engines can’t tell when you are looking for empirical fact or when you are looking for entertainment or fantasy, and there are no dotted lines between the information and disinformation. So when searching for the empirical you might end up at a speculative entertainment site, a political site with bias, or others. One example: if you type in “carbon dating accuracy,” five of the top ten links will take you to young earth creationist pseudo-science sites that will tell you that carbon dating is bunk when it’s really a proven method.
So what means are there for trust and authority? Here are a few, some in use, some not:
- First mention of terms: Is “genesis” and authorship really ranked or given credence by most?
- Number of links back – (a traditional but last century approach to authority which sometimes confuses popularity with authority)
- Number of “updings” at a mix of social sites (popularity)
- Length of time spent on page vs. length of content (authority)
- Links from authoritative sites with authority measured in scholasticism instead of number of link backs (authority/trust)
- Ratio of facts / data (one that I haven’t a clue about how to measure)
- Think tank links (authority)
- .edu links (authority)
- Entertainment vs. Information: numbers of links from categories of sites. (entertainment, news, sports, humor, e.g. traditional classification.)
- Past preferences of the individual
- Filters: what’s in place to stop disinformation? (authority, and I haven’t a clue how to do this without human watchers who will have bias, à la the wiki page reversion wars we’ve seen)
Now if you mix those all together and drive it with pseudo AI in a well-mannered way, you might improve the system. The first ones to do this well have a great chance to displace Google. Also keep in mind that the “get your raw data online and accessible” movement is well underway, so similar tools will be needed for authority of databases. And as we move to a full rich media world, how do you mine a video for tags? Will natural language voice recognition be woven into search engines for audio and video content that right now relies on users and others to hand tag it with text?
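As a rough idea of what "mixing those all together" could look like, here is a small Python sketch that blends a few of the measures from the list into one score. Everything in it is an assumption made for illustration: the `PageSignals` fields, the weights, the saturation points, and the reading speed are invented numbers, not anything a real search engine is known to use. The hard parts (measuring facts, filtering disinformation without bias) are left out precisely because I haven't a clue how to measure them.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Hypothetical per-page measurements taken from the list above."""
    inbound_links: int      # number of links back (popularity)
    social_upvotes: int     # "updings" across social sites (popularity)
    edu_links: int          # links from .edu sites (authority)
    seconds_on_page: float  # average time readers spend on the page
    word_count: int         # length of the content
    is_first_mention: bool  # "genesis" source for the term


# Assumed weights, chosen so that authority outweighs raw popularity.
# They sum to 1.0, so the final score stays between 0 and 1.
WEIGHTS = {"links": 0.15, "upvotes": 0.10, "edu": 0.30, "dwell": 0.20, "genesis": 0.25}


def saturate(count: int, half_point: int) -> float:
    """Squash a raw count into 0..1 so huge counts give diminishing returns."""
    return count / (count + half_point)


def dwell_ratio(seconds: float, word_count: int, words_per_second: float = 4.0) -> float:
    """Time spent on the page relative to the time needed to read it, capped at 1."""
    if word_count <= 0:
        return 0.0
    expected_seconds = word_count / words_per_second
    return min(seconds / expected_seconds, 1.0)


def trust_score(page: PageSignals) -> float:
    """Weighted blend of popularity and authority signals."""
    signals = {
        "links": saturate(page.inbound_links, 100),
        "upvotes": saturate(page.social_upvotes, 100),
        "edu": saturate(page.edu_links, 5),
        "dwell": dwell_ratio(page.seconds_on_page, page.word_count),
        "genesis": 1.0 if page.is_first_mention else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())


# A heavily linked, heavily upvoted page that readers abandon quickly...
sensational = PageSignals(900, 900, 0, 20.0, 800, False)
# ...versus a modestly linked original source that people actually read.
original = PageSignals(100, 25, 15, 180.0, 800, True)

print(trust_score(sensational) < trust_score(original))
```

With these made-up weights the original source outranks the sensational page even though it has a fraction of the links and upvotes. Change the weights and the ordering can flip, which is exactly why tuning such a mix, and keeping it from being gamed, is the real work.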
Finally: What other means are there of classifying, codifying, and sorting the net? What are your ideas on it?