Toward solving clustered context: my thinking process

Dec 01, 2022

The golden age is not in the past but in our current effort.

The Context? What are you talking about?

In linguistics and sociology, context usually means “Hey, if we want to discuss this topic, we need to know the situation, the culture, and the whole story first,” which necessarily involves the relationships between the subjects connected to the event or topic. The issue is usually raised when people argue that an event or topic can’t be fully understood. ¹

But “context” is also a very vague word, one that itself needs context to pin down. Its meaning varies across fields, so it has to be defined up front. Let’s do that first.

What I want to focus on in this article is not the grand spectrum of internet infrastructure, which usually involves how we store information and how we connect to each other through various protocols. In this article, the boundary of context is much smaller than that.

Let’s first pick a central identifier for the context. It can be a person, company, event, product, project, and so on. Around these identifiers we generate a lot of data, ranging from articles and documents to videos and images. The context I would like to address in this article encloses all of it: an “identifier” plus every piece of material generated by, related to, or connected to that identifier. It takes the shape of a giant spider web.

I want to call this aggregate “Context”. (Whenever I use a capital C, I am referring to this definition.)
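As a rough illustration of this definition, a Context can be modeled as an identifier plus the materials linked to it. The sketch below is only one possible shape, not a settled design: the class name `Context`, the methods `addMaterial` and `backlinksOf`, and the sample ids are all hypothetical.

```javascript
// Hypothetical sketch: a Context is an identifier plus every material connected to it.
class Context {
  constructor(identifier) {
    this.identifier = identifier; // e.g. a person, company, event, or project
    this.materials = new Map();   // material id -> { id, kind, linksTo }
  }

  // Register a material and the ids of the other materials it points to.
  addMaterial(id, kind, linksTo = []) {
    this.materials.set(id, { id, kind, linksTo });
    return this;
  }

  // Collect the ids of every stored material that links to `id`.
  backlinksOf(id) {
    return [...this.materials.values()]
      .filter((m) => m.linksTo.includes(id))
      .map((m) => m.id);
  }
}

const ctx = new Context('Hannah Arendt')
  .addMaterial('book:origins', 'book')
  .addMaterial('essay:review', 'article', ['book:origins'])
  .addMaterial('video:lecture', 'video', ['book:origins', 'essay:review']);

// Materials that reference the book
console.log(ctx.backlinksOf('book:origins'));
```

Because every link is stored inside the Context, following a connection backward is a simple lookup rather than a crawl.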

The clustered Context

Imagine a city with a dozen libraries, each storing different categories of books, but with no connection between them. The southeast library doesn’t know whether the northeast library has books related to Hannah Arendt; it doesn’t even know that the northeast library stores a lot of philosophy books.

To borrow books from different categories, you have no choice but to go to each library in person, search for the book on their computer, and track the due dates yourself. In this example, our identifier is books, and its Context includes the books’ information (metadata), where you can find them, their references, and so on.

This would be chaotic, right? The citizens would be angry about the inconvenience and would stand up and ask for change. The same story is unfolding on the internet. Each company that builds a closed-source product without a public API becomes this kind of library. The only way to access their data is to search on their machine. Even a company with a private API still raises a wall around the information.

The reasoning behind this kind of policy is quite simple. They treat your data as their asset, and the more they have, the deeper their moat. But people don’t seem bothered by this inconvenience right now, or worse, they feel there is nothing they can do to make the situation better.

The consequence of clustered Context

Clustered Context has many detrimental effects, but there are two notorious ones I want to address in this article: “The singular failing point” and “Who has the data who can draw”.

The singular failing point: search engine

The fundamental reason we can comfortably argue that everything is fine is the existence of several powerful search engines. They glue the material together with hundreds of thousands of crawlers all over the web, crawling every second. But they have grown in a way that is unhealthy, even dangerous: they are becoming too big to fall.

Too big to fall

Clustered Context makes search engines essential. There is no way to live without them. Our society relies on search engines so heavily that we can even call them “the infrastructure of society”. In some countries, search engines are even under government control, which makes them notoriously dangerous. ²

It is not healthy to rely solely on a single structure to support our internet.

Free but expensive

Great power comes at a great price. All of these search engines are free to use on the surface, but deep in their veins they charge us with our data and display advertisements for their profit. Our attention becomes their currency and, beyond that, our knowledge is overshadowed by ads.

When people talk about search engines and the issues they bring to the table, they usually discuss the search engine as if it were a priori and forget that it is just another tool for solving a very specific problem. The reason it flourishes and affects us in so many ways, going back to first principles, is that we don’t have a well-connected web. Everything is clustered.

Some search engines, like DuckDuckGo, take a privacy-first stance. But I think the lack of a large cash flow and a supporting business model makes it hard for them to compete with Google. We are stuck in this vicious cycle: the big ones get bigger and the others struggle to catch up.

Search engines make us compete

To rank on the first page, we have to compete with each other under the search engine’s rules. This is detrimental. To get your work seen, what you need to do is not write a good piece but first research “keywords” and fill your article with them. Then you analyze your “competitors” and copy their insights with your own flavor added. Either way, the competitive environment makes a lot of things toxic. We won’t reference another page simply because it belongs to a competitor, no matter how good a job they do.

On the other side, people who ignore the search engine’s rules, consciously or not, suffer from noticeably low traffic even when their content is very good.

The completeness of search results makes our creativity fragile. It is hard to find interesting websites or content in the search results, so we fall back on word of mouth, especially on social media.

Keyword anchor effect

We rely heavily on keywords to find good sources of information. Without a good keyword or search phrase, it’s hard to find what we want. So we train ourselves to form better keywords and treat it as a skill to teach other people.

But keywords are very limited. Let’s compare the experience of searching with keywords to that of reading a Wiki page. Although they belong to different categories, they have something in common, so bear with me for a while.

When we search for a topic, we have only two or three suitable words in mind. It’s easy to use them as a set of keywords and search. The answers are very useful: under normal circumstances we find what we want within minutes and stick to it, almost believing that the first result Google gives us is the best. The strong bond between keyword and results, and how quick and easy it all is, strengthens this feeling.³

When we browse a Wiki page, we first notice the tremendous amount of information: not just the sentences but also the small, shiny blue links sprinkled around the page. We understand that each blue link represents another body of knowledge, and that they lead to more.

The result is very different. The keyword-to-results model, with its bullet-point-like list structure (which search engines tend to use), leads us to believe arbitrarily that some result is 100% correct. But the complex links between materials that form a Context, as on a Wiki, keep us humble and keep us searching and learning.

Inter-connectivity between keywords is not knowledge

The most crucial thing search engines bring to the table is the ability to gather tremendous amounts of data, connect keywords, and let us search them at lightspeed. But connectivity between keywords is not enough to form knowledge. When I search “Nextjs”, the results do not demonstrate collective knowledge; they have no meaning beyond their link to the keyword.

What I am trying to say is that the power of search engines only pushes us up one level. We still have a lot of work to do to turn the internet into a better knowledge garden. After a search engine has increased the usability of the web, we still need to collect, select, and connect by hand to form proper knowledge. We use our productivity tools to accomplish that. But what about the community and company levels? How should we cultivate knowledge, and where should we store it?

The tribe is lacking a house to store grain.

Who has the data who can draw

In this section, I would like to focus on the company level, especially the big tech companies that own lots of data from people around the globe.

The content we have right now follows a rule: those who store the data have the right to design the interface for accessing it. This is a double-edged sword. On one side, you could argue that the people who manage the data usually know it better than others and can design the best interface for it. On the other side, the interface is so limited that it is hard to customize for your personal or organizational needs.

Indeed, they usually provide an API, but with a rate limit ⁴. The amount of data you can retrieve and leverage is limited. Besides that, the connections we can build are confined to the specific platform; beyond it, the only tool we have is the hyperlink, which is hard to trace backward.

Lack of flexibility leads to lack of experiment

In a crowded, information-overloaded society, we desperately need the flexibility to experiment with different ways of displaying data. But the dominant position of these platforms and their preoccupation with building moats, not just against their competitors but also against their customers, has done a lot of harm to the flexibility we need.

These companies also face two problems that make it hard for them to experiment.

The platform is unified. A change in the UI will affect hundreds of thousands of users and, at worst, cost the platform its customers. Extreme caution leads to a loss of creativity.
The platform pursues cost-effectiveness. Every change in the UI needs to lead toward a specific financial goal, which greatly reduces the will to experiment.

Granularity is hard to suit both ends

Because these companies need to consider not only their customers but also their competitors, they seldom let users customize the UI or the way data is displayed. But different groups of people, with different purposes, usually have different needs.

A popular platform usually has to fit different roles. Take Facebook, for example: it needs to let people become editors and strengthen the reading experience, and it also has to be the right place to go for people who need data or insight. This further complicates the issue.

To sum up

Going back to first principles, the search engine is not a priori but a tool for solving a very specific problem. Our internet is clustered, and information has to be bound to keywords. This makes search engines slowly become the infrastructure of our society, and they are too big to fall.

Besides that, the company that owns the data decides how to present it. We have few ways to get the data and draw it ourselves. This leads to a lack of experimentation, and we need radical experiments to test which structure can solve the problems we have right now.

My thinking process toward a solution

I am still pursuing the solution. Here is my thinking process for solving these problems.

Rebuilding the web is a zero-sum game

Recently there has been a trend of rebuilding the web. Its proponents list many disadvantages of the old web, such as the lack of a direct currency, not being distributed, and the lack of interconnectivity and backlinks. But when you look closely at their arguments, you will see that this is a zero-sum game that not everyone can play. These campaigns don’t say how people should transition from the old web to the new one, how to empower everyone to use it, or how to protect those who can’t catch up.

These are not their concern. Rebuilding the web is the biggest moat they can find for their business, and it attracts a lot of attention and even brings in VC money. We need to step aside from this trend and rethink what we need.

Thinking of the web as a whole as a replaceable component is very attractive. It gives people a power fantasy that they are the ones who see through the mist. But this view of the web is not even partially true. The web is more like an organism that changes according to how you interact with it. It’s too hard and too risky to pull out everything you don’t like at once and push in lots of things you think are helpful. Besides, no one has that kind of power.⁵

My approach is not to solve all the issues at once with faith in a single technology, but to find the very specific problem I want to focus on and build a limited set of functions that can push progress forward. ⁶

Gather redundantly but display wisely

People need insight, not data. If software can only gather data but can’t cultivate insight automatically, it will fail in this market and won’t benefit people much.

The system will aggressively ingest data and store as much of it as possible. This benefits greatly from the emergence of company APIs. Five years ago, there weren’t many fast, well-maintained, well-documented APIs for people to access. But the maturity of microservices and cloud-native languages ⁷ has lowered the cost of building and maintaining an API. Together with the trend of people wanting their data back and the rise of various automation toolsets, this has produced an environment in which companies consider an API an advantage for their services.

This is a great environment for reconnecting the clustered material. We can access the data through companies’ APIs (like Discord, Slack, and GitHub), store it elsewhere, compute our own insights from it, and display them with a different UI.

The thinking here is that we should gather redundantly from APIs or crawlers, but consider carefully what to display. Besides that, the system should have a good search algorithm that achieves efficiency similar to a major search engine, or even better. ⁸
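To make “gather redundantly, display wisely” a bit more concrete, here is a minimal sketch. It assumes an in-memory, append-only store and a separate display function; the names `rawStore`, `gather`, and `display`, the record fields, and the example.com URLs are all hypothetical, and a real system would call actual APIs or crawlers instead of hard-coded records.

```javascript
// Hypothetical sketch: keep everything that is gathered, decide separately what to show.
const rawStore = []; // append-only; duplicates across sources are kept on purpose

function gather(source, record) {
  rawStore.push({ source, gatheredAt: Date.now(), ...record });
}

// The display layer is only a projection over the raw store,
// so it can be replaced without touching what was gathered.
function display(store, { keyword }) {
  const seen = new Set();
  return store
    .filter((r) => r.text.toLowerCase().includes(keyword.toLowerCase()))
    .filter((r) => {
      if (seen.has(r.url)) return false; // show each URL only once
      seen.add(r.url);
      return true;
    })
    .map((r) => ({ url: r.url, text: r.text }));
}

gather('api', { url: 'https://example.com/a', text: 'Notes on remark plugins' });
gather('crawler', { url: 'https://example.com/a', text: 'Notes on remark plugins' });
gather('api', { url: 'https://example.com/b', text: 'Release checklist' });

const view = display(rawStore, { keyword: 'remark' });
console.log(rawStore.length, view.length);
```

The raw store keeps both copies of the first page, while the view shows it once. Keeping the two layers apart is what lets the display be redesigned later without gathering again.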

Think like an AST (abstract syntax tree)

The point above covers collect -> display; the missing part is how we achieve maximum flexibility, especially for plain-text content like blog posts and discussions. The solution is to leverage what remarkjs ⁹ offers. It parses a markdown file into an AST-like structure. You can easily loop through the structure and get the data you want.

# I am a header 1

I am some paragraph

A markdown file like the one above will produce the structure below.

[{
  type: 'heading',
  depth: 1,
  children: [
    {type: 'text', value: 'I am a header 1'},
  ]
}, 
{
  type: 'paragraph',
  children: [
    {type: 'text', value: 'I am some paragraph'}
  ]
}]

After converting a normal article, a blog post, or even a Discord thread to a markdown file, we have a steady way to treat an article as a set of data. We can easily get all the headers, or loop through the structure and get all the links. We can even turn this into a hook: every time a word or a particular structure shows up, the hook is activated and we get informed.

All of this functionality will be exposed as plugins. They should work similarly to remarkjs’s plugins, which are very easy to use.
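Here is a small sketch of that idea: walking a tree shaped like the one above to collect headings and links, and firing a simple hook on a watched word. It deliberately does not call remark itself. The `visit` and `textOf` helpers, the `watchWord` hook, and the added link node are assumptions made for illustration; only the node shapes (`heading`, `paragraph`, `text`, and `link` with a `url` field) follow the markdown AST convention.

```javascript
// Tree shaped like the structure shown above, extended with one illustrative link node.
const tree = [
  { type: 'heading', depth: 1, children: [{ type: 'text', value: 'I am a header 1' }] },
  {
    type: 'paragraph',
    children: [
      { type: 'text', value: 'I am some paragraph with ' },
      { type: 'link', url: 'https://example.com', children: [{ type: 'text', value: 'a link' }] },
    ],
  },
];

// Hypothetical helper: depth-first walk that calls `fn` on every node.
function visit(nodes, fn) {
  for (const node of nodes) {
    fn(node);
    if (node.children) visit(node.children, fn);
  }
}

// Concatenate all text found under a node.
function textOf(node) {
  if (node.type === 'text') return node.value;
  return (node.children || []).map(textOf).join('');
}

const headings = [];
const links = [];
const alerts = [];

// A very small "hook": it fires whenever a text node mentions the watched word.
const watchWord = 'paragraph';

visit(tree, (node) => {
  if (node.type === 'heading') headings.push(textOf(node));
  if (node.type === 'link') links.push(node.url);
  if (node.type === 'text' && node.value.includes(watchWord)) alerts.push(node.value);
});

console.log(headings, links, alerts);
```

The same walk serves all three purposes, which is why treating an article as a tree makes it cheap to add new kinds of extraction later.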

Experiment first

The spirit of all these attempts is a set of experiments. The structure should be clear and well discussed with the stakeholders. So that everyone can join this journey, there should be many entry points for users to build their own plugins. For example, the markdown parser, the entry hooks on every lifecycle stage, and even the displayed UI should act as replaceable components. Users can rebuild them or leverage others’ work to further enhance their experience.
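One way to picture replaceable components with lifecycle hooks is the sketch below. It is an assumption about how such a system could look, not a description of an existing one: `createPipeline`, the hook names `beforeParse`, `afterParse`, and `beforeRender`, and the toy parser and renderer are all hypothetical. The chained `use` call only loosely echoes the plugin style of remark and is not its API.

```javascript
// Hypothetical sketch of replaceable components plus lifecycle hooks.
function createPipeline({ parse, render }) {
  const hooks = { beforeParse: [], afterParse: [], beforeRender: [] };

  return {
    // A plugin is a function that receives the hook registry and subscribes to stages.
    use(plugin) {
      plugin(hooks);
      return this;
    },
    // Run the input through every stage, giving each hook a chance to transform it.
    run(input) {
      const text = hooks.beforeParse.reduce((acc, fn) => fn(acc), input);
      const tree = hooks.afterParse.reduce((acc, fn) => fn(acc), parse(text));
      const ready = hooks.beforeRender.reduce((acc, fn) => fn(acc), tree);
      return render(ready);
    },
  };
}

// Toy parser and renderer; either can be swapped for another implementation.
const lineParser = (text) => text.split('\n').map((value) => ({ type: 'line', value }));
const plainRenderer = (tree) => tree.map((n) => n.value).join(' | ');

// Example plugin: drop empty lines right after parsing.
const dropEmptyLines = (hooks) => {
  hooks.afterParse.push((tree) => tree.filter((n) => n.value.trim() !== ''));
};

const pipeline = createPipeline({ parse: lineParser, render: plainRenderer }).use(dropEmptyLines);
const output = pipeline.run('first\n\nsecond');
console.log(output);
```

Because the parser and renderer are passed in rather than hard-coded, a user could replace either one, or add a plugin at any stage, without changing the pipeline itself.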

Export at any time you need

The data should be portable, and users should have a reliable way to clone it or transfer it to another container. But that’s not enough. Portability covers not only the data but also how the data is processed. Nowadays, software companies let users export data only in complicated structures ¹⁰ and call it a day. What I imagine is a product that can export not only the data but also a standalone, simple version of the software itself. This standalone server can run on any computer without further building or compiling.

This is just the start

The Internet has come a long way. It has given rise to a tremendous amount of content and people. It has also produced a set of problems that no one has solved before. I feel excited to live in this era, where the problems remain but we have the tools and ability to solve them, or at least push humanity closer to the answer.

I look forward to any conversation related to this topic; you can find me on Twitter or email me directly. I am currently exploring this concept with Curioucity, which is open-source. If you are interested, you are welcome to join the development.

1. Goodwin, Charles; Duranti, Alessandro, eds. (1992). “Rethinking context: an introduction” (PDF). Rethinking context: Language as an interactive phenomenon.
2. For example, China. It becomes a fortification for the group that holds the power.
3. The bullet-point-like UI strengthens the feeling too. People can only judge by a single ordering, so they tend to trust the first result.
4. Each company has its own policy. But usually, you have no hope of retrieving all of your data.
5. If you live in an autocratic country that has the power and wealth to reshape how the internet interacts with people, that is another story. I hope that in a democratic society, the internet can slowly but steadily march toward a more independent position.
6. I am mostly focusing on “Tech Documentation” and “The Context gathering/connection” right now. And I believe what we need to do for these issues is not to rebuild something but to reconnect materials.
7. New cloud-native languages like Go keep gaining popularity and building an ecosystem around them. The result is that building a well-designed API is not as hard as it used to be, and it is much more reliable and cost-effective for a small company to release its own API too.
8. Google’s major insight is that it doesn’t rely on people’s input but observes how people interact with a specific web page. But in the end, its data won’t be as valuable as people directly pointing out where to find the information.
9. Remark.js and Rehype.js are very powerful toolsets. They not only make markdown parsing much easier but also open up a new world for us to explore. Imagine having multiple entry points for examining an article, not just its beginning. remarkjs/remark
10. Look at the way Notion exports its data. It’s a mess and very hard to process.