[Illustration: a human head in profile, filled with gears and circuits, blending mechanical and electronic elements.]


Seven Years Later: What I Learned from Building an AI Chatbot – Part 1

Seven years ago, I embarked on an ambitious attempt to build a rudimentary rule-based AI chatbot. Frustrated by the limitations of Apple’s Siri and motivated by exciting updates to Apple’s Natural Language Processing APIs, I dreamed of building something that could understand complex queries, construct mental models of objects, and interact with users seamlessly through voice alone. Siri was merely a fancy voice-control toy; I wanted to perform calculations, manipulate data and leverage the dynamic power of language to build a tiny virtual world inside my iPhone, just like an intelligent computer system from Star Trek.

Now, I just want to preface this by saying that, obviously, I wasn’t trying to build something like ChatGPT! For a start, in 2017, the transformer technology powering GPTs was still just a Google research paper. I didn’t have the huge data set needed to train a model, nor the algorithms, money or computing power. What I wanted to build was a simple rule-based chatbot that leveraged Apple’s machine-learning-based natural language APIs, along with traditional programming techniques, to create a mental “sandbox” for manipulating any data I told it about. I wanted it to work the same way a human might intelligently “remember” and use abstract mental constructs in conversation.

While I realised it would be a good challenge, little did I know that this journey would be riddled with the frustrating limitations of the technology of the time, yet reveal unexpected insights and invaluable lessons that help me get the most out of generative AI today; lessons which I’d like to share with you. Now, with the recent advent of technologies like multi-modal GPT models, many of my sticking points are finally being resolved, so it feels like the right time to revisit and reflect on my journey.

This is the first part of the story of my ambitious yet ultimately unsuccessful AI chatbot attempt, which I humorously nicknamed ‘Siri-ously Advanced’, and of the learnings that came from its failure; learnings that I believe can give you a deeper understanding of language and still apply to modern-day AI.

Inspiration: How Siri Sparked Off This AI Project

In truth, my inspiration for this project started many years before, in 2011, when the iPhone 4S came out. Siri was brand new, and I remember being eager to get my new phone out of the box. I held down the big home button and asked Siri, “What is the weather in London?”. It blurted out how much rain was forecast. I glanced out of the droplet-covered window and quickly asked it another question.

“So how about Cupertino?”

Siri paused for a bit and then disappointingly said, “Sorry, I don’t know how to do that. Would you like me to search that for you on the web?”.

It was the anti-climax to a so-far-impressive tech demo. To me it was obvious that Siri didn’t remember anything you said. It listened for key words or phrases in a voice snippet, delivered its result, and then forgot everything you were talking about. When I said, “So how about Cupertino?”, it didn’t know what I was talking about. There was nothing in that sentence to indicate I wanted to know about the weather, and so it gave up. How could this be an “intelligent” voice assistant when it couldn’t remember anything I had said before?!

It was clear to me that Siri needed a working memory… somewhere to keep track of the context of the conversation. If it had known I was previously talking about the weather, surely it could then infer I wanted to know about the weather when mentioning a place name?!
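Even a toy “working memory” changes the interaction completely. Here’s a minimal sketch of the idea in Swift; it is purely illustrative (the names and logic are mine, not Siri’s actual design):

```swift
// A toy "working memory": remember the last topic of conversation so a
// bare follow-up like "So how about Cupertino?" can inherit it.
struct ConversationContext {
    var lastTopic: String?   // e.g. "weather"
}

func respond(to utterance: String, context: inout ConversationContext) -> String {
    if utterance.lowercased().contains("weather") {
        context.lastTopic = "weather"
        return "Fetching the weather…"
    }
    if let topic = context.lastTopic {
        // Nothing explicit in this utterance, so fall back on the context.
        return "Fetching the \(topic) for the new place…"
    }
    return "Sorry, I don't know how to do that."
}

var context = ConversationContext()
print(respond(to: "What is the weather in London?", context: &context))
print(respond(to: "So how about Cupertino?", context: &context))   // inherits "weather"
```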

Developing the Proof-of-Concept: Challenges and Successes

My proof-of-concept demo was simple: tell Siri-ously Advanced that you wanted it to imagine a primitive shape such as a rectangle or circle. It would understand the properties of the shape (width, height, radius, circumference), and you could build upon it with more statements until you asked for a property you hadn’t given it, such as the area, which it would try to work out from the knowledge it had.
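In spirit, the whole demo reduces to a tiny rule engine over a bag of known properties. A minimal sketch (the names and structure are illustrative, not my original code):

```swift
// Toy knowledge store: a shape is a bag of named properties, plus a
// derivation rule that either computes a missing property or reports
// what it still needs to know.
var rectangle: [String: Double] = [:]

func area(of shape: [String: Double]) -> (value: Double?, missing: [String]) {
    // Rule: area = width × height. Report whichever inputs are absent.
    let needed = ["width", "height"].filter { shape[$0] == nil }
    guard needed.isEmpty else { return (nil, needed) }   // ask the user for these
    return (shape["width"]! * shape["height"]!, [])
}

rectangle["width"] = 10
print(area(of: rectangle).missing)   // ["height"] → "I need to know the height…"
rectangle["height"] = 5
print(area(of: rectangle).value!)    // 50.0
```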

NLP and Speech Recognition Challenges

Every year at WWDC (Apple’s main developer event, where they reveal new software), I would pore over the Siri and NLP APIs. Each year I would get excited and then frustrated with the technology. For example, one year I tried the natural language APIs, but they just didn’t work with the phrases I was going to use for my prototype, incorrectly identifying the verbs.
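To give a flavour of those experiments, here is roughly the kind of check I was running, as a minimal sketch using NSLinguisticTagger (the pre-Natural Language framework API) to tag each word’s lexical class; a misidentified verb here would derail any rule-based grammar built on top:

```swift
import Foundation

// Tag each word's part of speech in a sample phrase.
let text = "The width of a rectangle is 10 centimetres"
let tagger = NSLinguisticTagger(tagSchemes: [.lexicalClass], options: 0)
tagger.string = text

let range = NSRange(location: 0, length: text.utf16.count)
tagger.enumerateTags(in: range, unit: .word, scheme: .lexicalClass,
                     options: [.omitWhitespace, .omitPunctuation]) { tag, tokenRange, _ in
    let word = (text as NSString).substring(with: tokenRange)
    print("\(word): \(tag?.rawValue ?? "?")")   // e.g. "is: Verb"
}
```

I even had great trouble with the cornerstone of my vision – speech recognition.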

In 2016, with the release of the Speech Recognition APIs in iOS 10, I tried coupling together some of the components I would use. Soon I had a demo that would let me speak into my iPhone and dump out text.
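The heart of that demo looked roughly like this; a minimal sketch of the Speech framework wiring, with authorization prompts and error handling trimmed for brevity:

```swift
import AVFoundation
import Speech

// Minimal live-transcription loop: stream microphone buffers into
// SFSpeechRecognizer and print the running transcription.
let audioEngine = AVAudioEngine()
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-GB"))
let request = SFSpeechAudioBufferRecognitionRequest()

let inputNode = audioEngine.inputNode
inputNode.installTap(onBus: 0, bufferSize: 1024,
                     format: inputNode.outputFormat(forBus: 0)) { buffer, _ in
    request.append(buffer)   // feed each microphone buffer to the request
}
audioEngine.prepare()
try audioEngine.start()

recognizer?.recognitionTask(with: request) { result, _ in
    if let result = result {
        print(result.bestTranscription.formattedString)
    }
}
```

This was when it started to get frustrating.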

“The width of a rectangle is 10 centimetres”, I said.

The spinner danced around and dumped out some words to the console. To my disappointment, the words “The ‘with’ of a rectangle is 10 centimetres” appeared. Why was it hearing “with” instead of “width”?

I grew frustrated but calmly tried again, this time at a slower pace. “The WIDTH of a RECTANGLE is 10 centimetres”.

Again, the app’s spinner swirled around and then the console proudly spurted out something slightly different: “The ‘wife’ of a rectangle is 10 centimetres”. I let out a disappointed gasp; it was frustrating. Was it my English accent? Was I pronouncing the word correctly? How was this going to work if I couldn’t rely on voice input?!

Exploring Approaches to Fixing Speech Recognition

Frustrated, I tried to fix the output. I even tried string-replacing “wife” and “with” with the correct word, but sometimes the word would be interpreted differently again, breaking everything. It felt buggy. I was hugely disappointed. To me it seemed obvious that, once again, it was context: why didn’t the speech recognition engine understand that I was talking about geometry and measurements, and factor that in when processing the audio?

I could try and post-process it myself with a statistical approach, but it all felt very broken, like I was trying to magically improve accuracy when the data just wasn’t there to work with. Ultimately I had to accept defeat for 2016 and try again a year later when Apple released updates to their APIs.

Today, speech models like OpenAI’s Whisper use transformers to infer context seamlessly, and more. You can even use a prompt to give the model additional context, helping it pick out words like my cursed “width”, or supply a domain-specific dictionary of terms.
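Apple’s Speech framework has a comparable hint: contextualStrings lets you bias recognition toward expected vocabulary. A minimal sketch (the word list is just an example):

```swift
import Speech

// Nudge the recognizer toward domain vocabulary so "width" wins over
// "with" or "wife" when the audio is ambiguous.
let request = SFSpeechAudioBufferRecognitionRequest()
request.contextualStrings = ["width", "height", "radius", "rectangle", "centimetres"]
```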

My AI Chatbot Came to Life!

It wasn’t until WWDC 2017 that many of the components needed to build Siri-ously Advanced matured and aligned with my vision.

I was excited to finally make it work after pausing the project for a number of years, and carried on coding Siri-ously Advanced over the last few months of that year. It became a mild obsession. Sometimes I was so “in the zone” that hours would fly by. I committed code early in the morning and even kept a few dev-diary audio recordings. Slowly but surely, everything started to come together. Finally, it was time to test it. I started speaking into my iPhone using the basic grammar and syntax I’d constructed.

“Imagine a rectangle,” I said in a clear voice.

The app’s spinner animated and finally spurted out “I’m imagining a rectangle” on the screen. I was overjoyed at such a simple result, but the technology working behind it was fascinating for the time. Speech-to-text was one of those technologies that always promised so much but usually failed to deliver.

“The rectangle has a width of 10”, I said. “Understood”, came the reply.

Next, I asked it, “What is the area of the rectangle?” It paused for a moment and then informed me, “I need to know the height for me to tell you that.” I couldn’t believe it!

“The rectangle has a height of 5”, I said next. “What is the area of the rectangle?”

“The area of the rectangle is 50,” came back the robotic reply.

I was over the moon: here I had a working prototype that actually understood me! OK, the concept was simple, but it proved the idea was possible! It reminded me of earlier work I had done on an interpreter for a basic programming language; I was essentially programming an object with my voice!

[Image: text output from an early version of Siri-ously Advanced]

Realising the Importance of Inference in Language

As I developed object models such as the rectangle, assigning properties to them became interesting. Do you assign numbers as you would in a programming language, or do you assign a quantity and an association with another object?

It soon became apparent that with a phrase like “The width of a rectangle is 10 centimetres”, what you were really saying was that the rectangle’s width was made up of 10 units of 1 centimetre. Here, the numeral 10 is acting as an adjective to the noun, centimetre. That sounds obvious so far. However, if you said, “The width of a rectangle is 10”, then the numeral 10 is now a noun. What changed? It seems strange that in language we should have one set of rules for physical properties and another for abstract properties such as numbers on their own. Should I have a special set of properties in my object model too?

After much thinking, I realised what you were really saying was that the rectangle’s width was made up of 10 units of anything: the lack of a unit implied an abstract one, and by being abstract, you were essentially setting a placeholder for a physical object. In many ways, a property could be broken down as a given quantity of another object.
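In the object model, that realisation reduces to something like this; a minimal sketch where the names are illustrative, not my original code:

```swift
// A property is a quantity of another object: "10 centimetres" is
// 10 units of the centimetre object, while a bare "10" has no unit,
// leaving an abstract placeholder to be inferred or filled in later.
struct Quantity {
    let amount: Double
    let unit: String?   // nil = abstract unit, a placeholder
}

let width  = Quantity(amount: 10, unit: "centimetre")  // "…is 10 centimetres"
let height = Quantity(amount: 10, unit: nil)           // "…is 10"
```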

My other big realisation was that most things in language are inferred, which is itself an efficiency in communication. Why communicate something if it can be inferred from previous knowledge? If you had previously said that the height of the rectangle was 10 centimetres, you might infer that the width would also be in centimetres. Back in the earlier, context-unaware Siri, we had to tell it that we wanted the weather for London, and then that we wanted the weather for Cupertino; otherwise it didn’t understand us, when it should have inferred the topic of conversation.

Having a context or “working memory” is like having a cache of knowledge that compresses the communication. If we leave something out, the listener can use the context or other previous knowledge to infer it, and using memory often takes less energy than physically moving your mouth and diaphragm.

If you’re still not convinced, here is another example. Imagine I said, “The house has a red door”. I could have explained what a house was, that I meant the front door, and even what the colour red is; but you infer what a house is from your previous knowledge, and you also infer that I’m probably talking about the front door. You’ve seen red so often that you know exactly the colour I’m talking about. I don’t need to send you a colour code.

In summary, much of human language relies on inference, which streamlines communication by reducing redundancy.

Applying My Insights to Generative AI

While inference can compress communication, it can also be lossy. That is, something is lost in generalising, and you can’t work out what the original meaning was meant to be, either because you don’t have the knowledge or because you lack the context of the conversation. The way to combat this is to be very specific.

Take a simple prompt. Let’s say, “Here is a piece of text…, improve this and make it better”. You might think that seems simple enough, but there are quite a lot of inferred assumptions here. For example, what does “improve this and make it better” really mean? The answer could vary massively depending on who you were talking to or what text you were trying to improve.

If you wanted to give it to an 8-year-old, you might make it better by using easier-to-read words and simplifying complex sentences. If you wanted to give it to your boss, you might give it a professional tone for company documentation. Perhaps if you’re writing a motivational speech, making it better might mean using grand imagery and sentiment. The thing is, most people just say, “make it better”. If you broke down all the different ways you could improve a piece of text, I’m sure you could fill a large book or two.

Ideally, give the prompt more to work with, such as background information, audience details, and hints about what you are expecting.
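For example, instead of “improve this and make it better”, you might say: “Improve this text for company documentation. The audience is my boss, so keep the tone professional and concise, and preserve the technical details.”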

Finally, remember there are many ways to interpret the same question from different perspectives. Here are just a few of my favourite snippets to pin to the end of a prompt to make it better:

  • ‘Use everything you know about me to personalise the response to me’
  • ‘Consider this from a number of mainstream psychological perspectives’
  • ‘Analyse this from an engineering point of view’
  • ‘How would this problem be tackled from a number of popular motivational techniques?’
  • ‘Review this for readability, factor in the audience being a tech professional on LinkedIn and consider their attention span in their busy lives’

Summary

Inferring from context or previous knowledge is very useful in language. It is a form of compression, or efficiency, and can help keep conversations light. Once you notice it, you start to see it in everyday communication.

For generative AI to be more “intuitive”, it needs a similar “outlook”, or understanding of what we are trying to infer, when we ask it a question. You can help it out by being very specific or by providing more information to give it context and get targeted answers.

In addition, I realised the properties of most mental objects are actually collections of physical objects, or at least can be treated as such. In truth, when people think of an object in their head, the properties they associate with it are usually things, feelings, places or times, which probably has much to do with our early development as a species; but I’ll talk more about that in another part.

Reflection on My “Intelligent” Chatbot

Looking back, it’s interesting to see how generative AI like ChatGPT does everything I was trying to achieve, plus so much more. In many ways, my effort was futile, but at the same time, I feel I learned a considerable amount about language, the psychology of how people think about objects, and artificial ‘intelligence’. Revisiting this project has not only provided valuable insights into language and AI but also inspired new ideas and approaches for my current and future projects.

For those keen to go on similar experimental journeys, I encourage you to enjoy both the challenges and the successes. Stay curious, keep experimenting, and remember that every experiment, successful or not, can help you down the line in unexpected ways. Keep an eye out for part 2 and share this post if you found it insightful.
