Making an LLM-Based Streaming Chatbot with Go and WebSocket

Last week we created an LLM-based chatbot with a knowledge base, and to my surprise many people were interested in it. I thought it would be nice to continue on that topic, so let's make the chatbot a bit more advanced with a sprinkle of WebSocket magic. This will demonstrate how easy it is to make a usable LLM application using GChain.

The default experience people expect from a chatbot today is a streaming response a la ChatGPT, but that is not a given. Not all off-the-shelf models support it; so far, OpenAI is the only provider I know of that offers streaming responses. Delivering the response to the user/client is also more involved, since a usual RESTful API is not sufficient, so in this case we are going to use WebSocket to stream the response.

The plan is:

  1. Prepare a WebSocket server and a basic handler
  2. Make a conversation chain with a streaming model session
  3. Get user’s input and stream the response thru WebSocket
In the end, it will look like this:

Prepare the WebSocket boilerplate

I hope the WebSocket (ws) term here doesn't intimidate you; it will be simple and fun :). The ws boilerplate is not much different from a usual HTTP server and handler, and here github.com/gorilla/websocket is used to make things much easier. The two big differences are: 1.) the HTTP request will be upgraded to ws, and 2.) since the connection is long-lived, input and output are streamed using the WriteMessage and ReadMessage functions.

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

// upgrader upgrades the incoming HTTP request to a WebSocket connection
var upgrader = websocket.Upgrader{}

func main() {
	// websocket route
	http.HandleFunc("/chat", wshandler)

	log.Println("http server started on :8000")
	err := http.ListenAndServe(":8000", nil)
	if err != nil {
		log.Fatal("ListenAndServe: ", err)
	}

	fmt.Println("Program exited.")
}
func wshandler(w http.ResponseWriter, r *http.Request) {
	// Upgrade initial GET request to a websocket
	ws, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println(err)
		return
	}
	// Make sure we close the connection when the function returns
	defer ws.Close()
}

Make a conversation chain with a streaming model session

This part is where GChain comes in to help us. We're creating a conversation chain within the ws session, so the conversation memory is kept as long as the session is alive.

// all communication thru ws will be in this format
type message struct {
	Text     string `json:"text"`
	Finished bool   `json:"finished"`
}

func wshandler(w http.ResponseWriter, r *http.Request) {
	log.Println("serving ws connection")
	// new conversation memory created
	memory := []model.ChatMessage{}
	// streamingChannel to get response from model
	streamingChannel := make(chan model.ChatMessage, 100)

	// setup new conversation chain
	convoChain := conversation.NewConversationChain(chatModel, memory, callback.NewManager(), "You're helpful chatbot that answer human question very concisely, response with formatted html.", false)
	// append greeting to the memory
	convoChain.AppendToMemory(model.ChatMessage{Role: model.ChatMessageRoleAssistant, Content: "Hi, My name is GioAI"})
......
	// send greetings to user
	m, err := json.Marshal(message{Text: "Hi, My name is GioAI", Finished: true})
	if err != nil {
		log.Println(err)
		return
	}
	ws.WriteMessage(websocket.TextMessage, m)
}

Handle Messages

With the WebSocket boilerplate and the conversation chain set up, we're ready to receive and send messages to the client. All messages will use the message struct defined above. There are 4 main activities here: 1.) get the user's message from ws, 2.) send the user's message to the model, 3.) get the model's response, and 4.) stream the response back to the user.

A main loop is created within wshandler. Notice that the call to the model happens inside a goroutine, as we don't want the model call to block the loop. The streamingChannel is then used to handle the model's response.

func wshandler(w http.ResponseWriter, r *http.Request) {
.....
	// main loop to handle user's input
	for {
		// Read in a new requestMessage as JSON and map it to a Message object
		_, requestMessage, err := ws.ReadMessage()
		if err != nil {
			log.Printf("error: %v", err)
			break
		}

		// whole model output will be kept here
		var output string
		// send request to model as a go routine, as we don't want to block here
		go func() {
			var err error
			output, err = convoChain.SimpleRun(context.Background(), string(requestMessage[:]), model.WithIsStreaming(true), model.WithStreamingChannel(streamingChannel))
			if err != nil {
				fmt.Println("error " + err.Error())
				return
			}
		}()
		// handle the response streaming
		for {
			value, ok := <-streamingChannel // chunk handling is shown in the next snippet
		}
	}

For every streamed chunk from the model, we put it into the message struct, marshal it as JSON, and send it to the client. Since we're sending the response in small chunks, it's not obvious to the client where the message ends, hence the finished field is important to have.

	// the main loop
	for {
.....
		// handle the response streaming
		for {
			value, ok := <-streamingChannel
			if ok && !model.IsStreamFinished(value) {
				m, err := json.Marshal(message{Text: value.Content, Finished: false})
				if err != nil {
					log.Println(err)
					continue
				}
				ws.WriteMessage(websocket.TextMessage, m)
			} else {
				// Finished field true at the end of the message
				m, err := json.Marshal(message{Finished: true})
				if err != nil {
					log.Println(err)
					continue
				}
				ws.WriteMessage(websocket.TextMessage, m)
				break
			}
		}

		// put user message and model response to conversation memory
		convoChain.AppendToMemory(model.ChatMessage{Role: model.ChatMessageRoleUser, Content: string(requestMessage[:])})
		convoChain.AppendToMemory(model.ChatMessage{Role: model.ChatMessageRoleAssistant, Content: output})
	} // the end of main loop

The complete code is just over 100 lines; you can find it in GChain's examples, along with a simple HTML client.
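If you're curious how a client consumes the stream, here is a minimal Go sketch using gorilla/websocket (the address and the question below are placeholders for illustration; the real example ships with an HTML/JavaScript client). It reads the greeting first, sends one question, then prints chunks until a message arrives with finished set to true:

package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/gorilla/websocket"
)

// message mirrors the struct used by the server
type message struct {
	Text     string `json:"text"`
	Finished bool   `json:"finished"`
}

func main() {
	// connect to the chat endpoint defined earlier (address assumed)
	conn, _, err := websocket.DefaultDialer.Dial("ws://localhost:8000/chat", nil)
	if err != nil {
		log.Fatal("dial: ", err)
	}
	defer conn.Close()

	// the server sends a greeting first, read and print it
	_, raw, err := conn.ReadMessage()
	if err != nil {
		log.Fatal("read: ", err)
	}
	var greeting message
	if err := json.Unmarshal(raw, &greeting); err != nil {
		log.Fatal(err)
	}
	fmt.Println("AI :", greeting.Text)

	// send a single question to the bot
	if err := conn.WriteMessage(websocket.TextMessage, []byte("What is a goroutine?")); err != nil {
		log.Fatal("write: ", err)
	}

	// print chunks as they arrive until the server marks the message as finished
	for {
		_, raw, err := conn.ReadMessage()
		if err != nil {
			log.Fatal("read: ", err)
		}
		var msg message
		if err := json.Unmarshal(raw, &msg); err != nil {
			log.Println(err)
			continue
		}
		if msg.Finished {
			break
		}
		fmt.Print(msg.Text)
	}
	fmt.Println()
}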

Building LLM Based Chatbot with a knowledge base in Go

You can feel the craze around Large Language Model (LLM) based solutions today. People are racing to find a good product-market fit, or even a disruptive one. However, the scene is mainly dominated by Python-based implementations, so building a solution on top of Go is rare today, and hopefully interesting to many people.

Here we will use github.com/wejick/gchain to build the chatbot. It's a framework heavily inspired by LangChain, and it has enough tools to build a chatbot with a connection to a vector DB as the knowledge base.

What are we building today?

  1. An indexer to put information into a vector DB
  2. A chatbot backed by the OpenAI Chat API
  3. The chatbot will query the vector DB to answer the user's questions

Building Indexer

We're going to use the English Wikipedia pages for Indonesia and History of Indonesia as the knowledge and put them into Weaviate as our vector DB. Since both pages contain a lot of text and the LLM has constraints on its context window size, chunking the text into smaller pieces will help our application.

In the indexer there are several main steps:

  1. Load the text
  2. Cut the texts into chunks
  3. Get each chunk embedding
  4. Store the text chunk + embedding to the vector DB

Loading the text

	// source holds a document's filename and its loaded content
	type source struct {
		filename string
		doc      string
	}

	sources := []source{
		{filename: "indonesia.txt"},
		{filename: "history_of_indonesia.txt"},
	}
	for idx, s := range sources {
		data, err := os.ReadFile(s.filename)
		if err != nil {
			log.Println(err)
			continue
		}
		sources[idx].doc = string(data)
	}

Cut the texts into chunks using the textsplitter package

	// create text splitter for embedding model
	indexingplitter, err := textsplitter.NewTikTokenSplitter(openai.AdaEmbeddingV2.String())

	// split the documents by 500 token
	var docs []string
	docs = indexingplitter.SplitText(sources[0].doc, 500, 0)
	docs = append(docs, indexingplitter.SplitText(sources[1].doc, 500, 0)...)

Create the Weaviate and embedding instances, then store the text chunks in the DB

	// create the embedding model instance
	embeddingModel = _openai.NewOpenAIEmbedModel(OAIauthToken, "", openai.AdaEmbeddingV2)
	// create a new weaviate client
	wvClient, err = weaviateVS.NewWeaviateVectorStore(wvhost, wvscheme, wvApiKey, embeddingModel, nil)
	// store the document to weaviate db
	bErr, err := wvClient.AddDocuments(context.Background(), "Indonesia", docs)
	// handle error here

Create the ChatBot

GChain has a premade chain named conversational_retrieval, which gives us the ability to easily create a chatbot that connects to a datastore. As we prepared the knowledge base above, it's just a matter of wiring everything together and creating a text interface to handle user interaction.

Set up the chain

	// memory to store conversation history
	memory := []model.ChatMessage{}
	// Creating text splitter for GPT3Dot5Turbo0301 model
	textplitter, err := textsplitter.NewTikTokenSplitter(_openai.GPT3Dot5Turbo0301)

	// create the chain
	chain := conversational_retrieval.NewConversationalRetrievalChain(chatModel, memory, wvClient, "Indonesia", textplitter, callback.NewManager(), "", 2000, false)

Create a simple text-based user interface, basically a loop that keeps reading the user's input

	fmt.Println("AI : How can I help you, I know many things about indonesia")
	scanner := bufio.NewScanner(os.Stdin)
	for {
		fmt.Print("User : ")
		scanner.Scan()
		input := scanner.Text()
		output, err := chain.Run(context.Background(), map[string]string{"input": input})
		if err != nil {
			log.Println(err)
			return
		}
		fmt.Println("AI : " + output["output"])
	}

You can find the full code in the GChain examples. It will look like this:

Can I have a smaller Prometheus


For the past few weeks, I've been thinking about having a time series DB and visualization tool that is easy to use and light on resources, so it can be used locally for development. I tried several times to find one with keywords such as "Grafana alternative" and "Prometheus alternative", but nothing came up that was close to my liking. At one stage I almost wrote my own tool, which would be too big an undertaking for me.

So why not just use Prometheus? Well, not until you see its binary size: a whopping 103.93 MB. Not surprising for a statically linked executable, but still, it's big. Yesterday morning I told myself, why not make it smaller? How hard can it be, right? Well, at least smaller in binary size, as I'm not yet informed enough to do a runtime utilization measurement on it.

The first thing I did after getting the code was to simply remove whatever I deemed unnecessary. At one point I completely removed the tracing and alerting functionality, but that only shaved off at most 2 MB, and it came with a bunch of modified files and broken test cases. Definitely not the right way; we need to do it smarter.

The weight

After browsing for a binary size analysis tool I found goweight. It helps determine what's contributing to the binary size:

No surprise to see that the main contributors are cloud provider SDKs, but it's still disappointing. At this stage, it was just a matter of finding where they are used. A simple text search revealed that the SDKs are mainly there for Service Discovery (SD); for local development, file-based SD is more than enough.

After tinkering here and there, the very minimum change to exclude all unused SD mechanisms is to comment them out in discovery/install/install.go. With this change, all tests pass except one related to Zookeeper initialization. The binary shrank to just 35.69 MB, about 34% of the original.
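For a rough idea of what the change looks like: install.go essentially registers every SD mechanism through a blank import, so the edit is just commenting out the ones you don't need. Below is a sketch of the idea, not the actual file; the import paths are from memory and may differ from the current source, see the linked commit at the end for the real diff.

package install

import (
	// cloud provider SDs commented out, which drops their SDKs from the binary
	// _ "github.com/prometheus/prometheus/discovery/aws"
	// _ "github.com/prometheus/prometheus/discovery/azure"
	// _ "github.com/prometheus/prometheus/discovery/consul"
	// _ "github.com/prometheus/prometheus/discovery/kubernetes"

	// keep file-based SD, which is enough for local development
	_ "github.com/prometheus/prometheus/discovery/file"
)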

Wrap Up

After stripping the debugging info using -ldflags="-s -w", the size got reduced quite a bit more. In the end we brought it down to 28.30% of the original, just 29.42 MB.

It's crazy when you think that more than 50% (and I'm being generous) of the size is something most people will never use; maybe this is the price of modern computation, where 50 MB means nothing. To be fair, the Prometheus developers are somewhat aware of this and discuss it in the service discovery README, but the binary size is not mentioned there.

There are several other potential ways to reduce the file size, like using upx to pack the binary. However, it wasn't readily available on brew, so I skipped it. Another way is to cut more functionality, like tracing, alerting, and remote storage, but that would take more effort to maintain and to keep all the tests passing.

The related changes can be found here https://github.com/wejick/prometheus/commit/6dd776019a315764a9f70147f83f1235b9b13189

Restlessness

At a certain point, life can feel like it simply flows by. No plans, no hopes, no thought given to how it is lived: a fitting definition of the phrase "Living the present". What happens usually comes in only two kinds. The first are the things that are already programmed, the routines: waking up, showering, eating, paying the monthly dues and the electricity bill. The second, the unprogrammed things, are simply handled impromptu, which is to say they get thought about only when something needs to be thought about.

It all seems calm enough, yet there is a restlessness that wants to be let out. A restlessness about the stagnation of mind and knowledge, which long to be sharpened, watered, trained, and cultivated. Trained, cultivated, and watered, in both soul and body.

Somehow, perhaps since a few months into the pandemic, the lack of physical activity and the confinement of human interaction have weakened the drive to cultivate feeling and thought. The desire is always smaller than the pull of instant entertainment, YouTube and Instagram style. Exhausting and boring.

In Between

As humans we often dwell on questions about purpose, as if purpose were a prerequisite of existence. Purpose reflects some final state, a desired condition. A point of light at the end of the corridor: happiness, understanding, wealth, justice, and heaven are all examples of the purposes that live in our minds. There is a hope in them. On the other hand, purpose also carries the contradiction of a final state different from the one hoped for. That contradiction is what many people call failure: dead ends, sorrow, poverty, injustice, and hell.

But I do not want to be trapped in the dichotomy that purpose brings, attainment or failure. There is a very wide spectrum between attainment and failure, and often the two do not form a straight line. We cannot simply grade it as achieved, somewhat achieved, not achieved, very much not achieved, all the way down to total failure; that gradation is a brutal simplification of what a purpose means. There are thousand-dimensional curves formed between the two, in which even the meaning of the purpose itself turns out not to be singular.

Among those curves, entwined with one another on a thousand-dimensional chart, the most terrifying thing is not arriving at the final state. The terrifying thing is understanding that we will forever remain in between those curves, hoping to find a single end point called purpose.

When God created the world in six days, and when His creations went through life, death, resurrection from the grave, and then the end of days, what, one wonders, was His purpose? If God can create anything with nothing more than Kun Fayakun, why did He not simply create the day of judgment right away?