Large Language Model (LLM) Architecture



Large language models are AI systems designed to process and analyze vast amounts of natural language data and then use that information to generate responses to user prompts. These systems are trained on massive data sets using advanced machine learning algorithms to learn the patterns and structures of human language, and are capable of generating natural language responses to a wide range of written inputs. Large language models are becoming increasingly important in a variety of applications, such as natural language processing, machine translation, code and text generation, and more.


  • History
  • Common Use Cases
  • Working with an LLM
  • Top Current LLMs
  • LLM Precursors




Common Use Cases

Here are just a few examples of common use cases for large language models:

  1. CHATBOTS AND VIRTUAL ASSISTANTS – One of the most common implementations, LLMs can be used by organizations to provide help with things like customer support, troubleshooting, or even open-ended conversations driven by user-provided prompts.
  2. CODE GENERATION AND DEBUGGING – LLMs can be trained on large amounts of code examples and give useful code snippets in response to a request written in natural language. With the proper techniques, LLMs can also be built to reference other relevant data that they may not have been trained on, such as a company’s documentation, to help provide more accurate responses.
  3. SENTIMENT ANALYSIS – Emotion and opinion are often hard to quantify, but LLMs can take a piece of text and gauge the sentiment it expresses. This can help organizations gather the data and feedback needed to improve customer satisfaction.
  4. TEXT CLASSIFICATION AND CLUSTERING – The ability to categorize and sort large volumes of data enables the identification of common themes and trends, supporting informed decision-making and more targeted strategies.
  5. LANGUAGE TRANSLATION – Globalize all your content without hours of painstaking work by simply feeding your web pages through the proper LLMs and translating them to different languages. As more LLMs are trained in other languages, quality and availability will continue to improve.
  6. SUMMARIZATION AND PARAPHRASING – Entire customer calls or meetings could be efficiently summarized so that others can more easily digest the content. LLMs can take large amounts of text and boil it down to just the most important points.
  7. CONTENT GENERATION – Start with a detailed prompt and have an LLM develop an outline for you. Then continue with follow-up prompts and the LLM can generate a good first draft for you to build on. Use LLMs to brainstorm ideas, and ask them questions that help you draw inspiration.

Note: Most LLMs are not trained to be fact machines. They know how to use language, but they might not know who won the big sporting event last year. It’s always important to fact check and understand the responses before using them as a reference.
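Several of the use cases above boil down to the same pattern: wrap the input text in a task-specific prompt and send it to a model. The sketch below illustrates that idea; the prompt templates and the `call_llm` stub are hypothetical illustrations, not any particular vendor's API.

```python
# Hypothetical sketch: many LLM use cases reduce to building a task-specific
# prompt and sending it to a model endpoint. The templates and `call_llm`
# stub are illustrative assumptions, not a real vendor's API.

TEMPLATES = {
    "summarize": "Summarize the following text in three sentences:\n\n{text}",
    "sentiment": "Classify the sentiment of this text as positive, negative, or neutral:\n\n{text}",
    "translate": "Translate the following text into {language}:\n\n{text}",
}

def build_prompt(task, text, **kwargs):
    """Fill in the template for a given task with the user's text."""
    return TEMPLATES[task].format(text=text, **kwargs)

def call_llm(prompt):
    """Placeholder for a real model call (hosted API or self-hosted model)."""
    raise NotImplementedError("wire this up to your LLM of choice")
```

The same `build_prompt` helper serves summarization, sentiment analysis, and translation; only the template changes, which is why a single model can cover so many of these tasks.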


Working with an LLM

It does currently take a little bit more work to grab an open-source model and start using it, but progress is moving very quickly to make them more accessible to users. On Databricks, for example, we’ve made improvements to open-source frameworks like MLflow to make it very easy for someone with a bit of Python experience to pull any Hugging Face transformer model and use it as a Python object. Oftentimes, you can find an open-source model that solves your specific problem that is orders of magnitude smaller than ChatGPT, allowing you to bring the model into your environment and host it yourself. This means that you can keep the data in your control for privacy and governance concerns as well as manage your costs.
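As a rough sketch of the workflow described above, the following assumes the `transformers` and `mlflow` packages are installed; the model name is just an example choice, and MLflow's transformers flavor requires a fairly recent MLflow release.

```python
# Rough sketch of pulling a Hugging Face transformer as a Python object and
# logging it with MLflow. Assumes `transformers` and `mlflow` are installed;
# the model name below is an example choice, not a recommendation.

def load_pipeline(model_name="distilbert-base-uncased-finetuned-sst-2-english"):
    # Imported lazily because transformers is a heavy, optional dependency.
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=model_name)

def log_with_mlflow(pipe):
    import mlflow
    with mlflow.start_run():
        # Logs the pipeline so it can be reloaded later for serving.
        return mlflow.transformers.log_model(
            transformers_model=pipe, artifact_path="model"
        )

# Example usage (requires network access to download the model):
#   pipe = load_pipeline()
#   pipe("MLflow makes this surprisingly easy.")
#   log_with_mlflow(pipe)
```

Because the pipeline is an ordinary Python object, it can be hosted in your own environment, which is what keeps the data under your control.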

Another huge upside to using open-source models is the ability to fine-tune them to your own data. Since you’re not dealing with a black box of a proprietary service, there are techniques that let you take open source models and train them to your specific data, greatly improving their performance on your specific domain. We believe the future of language models is going to move in this direction, as more and more organizations will want full control and understanding of their LLMs.


Top Current LLMs

Below are some of the most relevant large language models today. Each performs natural language processing, and several have influenced the architecture of future models.

  • BERT – BERT is a family of LLMs that Google introduced in 2018. BERT is a transformer-based model whose architecture is a stack of transformer encoders, with 342 million parameters. BERT was pre-trained on a large corpus of data, then fine-tuned to perform specific tasks such as natural language inference and sentence text similarity. It was used to improve query understanding in the 2019 iteration of Google search.
  • Claude – The Claude LLM focuses on constitutional AI, which shapes AI outputs according to a set of principles intended to make the AI assistant it powers helpful, harmless and accurate. Claude was created by the company Anthropic and powers its two main product offerings: Claude Instant and Claude 2. Claude 2 excels at complex reasoning, according to Anthropic.
  • Cohere – Cohere is an enterprise LLM that can be custom-trained and fine-tuned to a specific company’s use case. The company that created the Cohere LLM was founded by one of the authors of “Attention Is All You Need.” One of Cohere’s strengths is that it is not tied to one single cloud, unlike OpenAI, which is bound to Microsoft Azure.
  • Falcon 40B – Falcon 40B is a transformer-based, causal decoder-only model developed by the Technology Innovation Institute. It is open source and was trained on English data. The model is available in two smaller variants as well: Falcon 1B and Falcon 7B (1 billion and 7 billion parameters). Amazon has made Falcon 40B available on Amazon SageMaker. It is also available for free on GitHub.
  • Galactica – Galactica was Meta’s LLM designed specifically for scientists, trained on a collection of academic material: 48 million papers, lecture notes, textbooks and websites. As most models do, it produced AI “hallucinations,” which members of the scientific community deemed unsafe because they sounded authoritative and were therefore hard to detect quickly, in a domain that leaves little margin for error. Meta pulled Galactica just three days into its public release in November 2022.
  • GPT-3 – GPT-3 is OpenAI’s large language model with more than 175 billion parameters, released in 2020. GPT-3 uses a decoder-only transformer architecture. In September 2020, Microsoft announced it had exclusively licensed GPT-3’s underlying model. At release, GPT-3 was roughly 10 times larger than any previous non-sparse language model. GPT-3’s training data includes Common Crawl, WebText2, Books1, Books2 and Wikipedia. GPT-3 is the last of the GPT series of models for which OpenAI made the parameter counts publicly available. The GPT series was first introduced in 2018 with OpenAI’s paper “Improving Language Understanding by Generative Pre-Training.”
  • GPT-3.5 – GPT-3.5 is an upgraded version of GPT-3 with fewer parameters. GPT-3.5 was fine-tuned using reinforcement learning from human feedback, and it is the version of GPT that powers ChatGPT. There are several models, with GPT-3.5 Turbo being the most capable, according to OpenAI. GPT-3.5’s training data extends to September 2021. It was also integrated into the Bing search engine but has since been replaced with GPT-4.
  • GPT-4 – GPT-4 is the largest model in OpenAI’s GPT series, released in 2023. Like the others, it’s a transformer-based model. Unlike the others, its parameter count has not been released to the public, though there are rumors that the model has more than 170 trillion parameters. OpenAI describes GPT-4 as a multimodal model, meaning it can process and generate both language and images, as opposed to being limited to only language. GPT-4 also introduced a system message, which lets users specify tone of voice and task. GPT-4 demonstrated human-level performance on multiple academic exams. At the model’s release, some speculated that GPT-4 came close to artificial general intelligence (AGI), meaning intelligence equal to or greater than a human’s. GPT-4 powers Microsoft Bing search, is available in ChatGPT Plus and will eventually be integrated into Microsoft Office products.
  • LaMDA – LaMDA (Language Model for Dialogue Applications) is a family of LLMs developed by Google Brain and announced in 2021. LaMDA uses a decoder-only transformer architecture and was pre-trained on a large corpus of text, building on Google’s earlier Seq2Seq work. In 2022, LaMDA gained widespread attention when then-Google engineer Blake Lemoine went public with claims that the program was sentient.
  • Llama – Large Language Model Meta AI (Llama) is Meta’s LLM released in 2023. The largest version is 65 billion parameters in size. Llama was originally released to approved researchers and developers but is now open source. Llama comes in smaller sizes that require less computing power to use, test and experiment with. Llama uses a transformer architecture and was trained on a variety of public data sources, including webpages from CommonCrawl, GitHub, Wikipedia and Project Gutenberg. Llama was effectively leaked and spawned many descendants, including Vicuna and Orca.
  • Orca – Orca was developed by Microsoft and has 13 billion parameters, meaning it’s small enough to run on a laptop. It aims to improve on advancements made by other open source models by imitating the reasoning processes of larger LLMs. According to Microsoft, Orca achieves the same performance as GPT-4 with significantly fewer parameters and is on par with GPT-3.5 for many tasks. Orca is built on top of the 13-billion-parameter version of Llama.
  • PaLM – The Pathways Language Model (PaLM) is a 540-billion-parameter transformer-based model from Google powering its AI chatbot Bard. It was trained across multiple TPU v4 Pods, Google’s custom hardware for machine learning. PaLM specializes in reasoning tasks such as coding, math, classification and question answering, and it excels at decomposing complex tasks into simpler subtasks. PaLM gets its name from Pathways, a Google research initiative aimed at building a single model that serves as a foundation for multiple use cases. There are several fine-tuned versions of PaLM, including Med-PaLM 2 for life sciences and medical information and Sec-PaLM for cybersecurity deployments to speed up threat analysis.
  • Phi-1 – Phi-1 is a transformer-based language model from Microsoft. At just 1.3 billion parameters, Phi-1 was trained for four days on a collection of textbook-quality data. Phi-1 is an example of a trend toward smaller models trained on better quality data and synthetic data. “We’ll probably see a lot more creative scaling down work: prioritizing data quality and diversity over quantity, a lot more synthetic data generation, and small but highly capable expert models,” wrote Andrej Karpathy, former director of AI at Tesla and OpenAI employee, in a tweet. Phi-1 specializes in Python coding and has fewer general capabilities because of its smaller size.
  • StableLM – StableLM is a series of open-source language models developed by Stability AI, the company behind image generator Stable Diffusion. There are 3 billion and 7 billion parameter models available and 15 billion, 30 billion, 65 billion and 175 billion parameter models in progress at time of writing. StableLM aims to be transparent, accessible and supportive.
  • Vicuna 33B – Vicuna is another influential open source LLM derived from Llama. It was developed by LMSYS and was fine-tuned on user-shared conversation data. It is smaller and less capable than GPT-4 according to several benchmarks, but does well for a model of its size. Vicuna has only 33 billion parameters, whereas GPT-4 is rumored to have trillions.


LLM Precursors

Although LLMs are a recent phenomenon, their precursors go back decades. Learn how recent precursor Seq2Seq and distant precursor Eliza set the stage for modern LLMs.

  • Seq2Seq – Seq2Seq is a deep learning approach used for machine translation, image captioning and natural language processing. It was developed by Google and underlies some of their modern LLMs, including LaMDA. Seq2Seq also underlies AlexaTM 20B, Amazon’s large language model. It uses a mix of encoders and decoders.
  • Eliza – Eliza was an early natural language processing program created in 1966 and one of the earliest examples of a language model. Eliza simulated conversation using pattern matching and substitution. Running a certain script, Eliza could parody the interaction between a patient and therapist by applying weights to certain keywords and responding to the user accordingly. Eliza’s creator, Joseph Weizenbaum, wrote a book on the limits of computation and artificial intelligence.
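Eliza's pattern-matching-and-substitution loop can be sketched in a few lines. The rules below are illustrative stand-ins, not Weizenbaum's original DOCTOR script.

```python
import re

# A minimal sketch of Eliza-style pattern matching and substitution.
# These rules are illustrative, not the original DOCTOR script.
RULES = [
    (re.compile(r"\bi need (.+)", re.I), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"\bmy (\w+)", re.I), "Tell me more about your {0}."),
]
FALLBACK = "Please go on."

# Pronouns are reflected so the echoed fragment reads naturally
# ("my" becomes "your", "i" becomes "you", and so on).
REFLECTIONS = {"my": "your", "i": "you", "me": "you", "am": "are"}

def reflect(fragment):
    return " ".join(REFLECTIONS.get(w.lower(), w) for w in fragment.split())

def respond(text):
    # First matching rule wins; otherwise fall back to a generic prompt.
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*(reflect(g) for g in match.groups()))
    return FALLBACK
```

For example, `respond("I need a vacation")` echoes the keyword back as a question, which is essentially all Eliza did: no understanding, just substitution.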

NITIN UCHIL, Founder, CEO & Technical Evangelist

