Is ChatGPT Inherently Platonic?
OpenAI has released a new version of ChatGPT with expanded capabilities for text, video, and audio. Along with the new version come persistent questions about the AI itself. Last night, I had a chat with the new ChatGPT about how AI works and how good it is. (I do the same with people. How do they work, and how good are they?) My curiosity about this technology that is going to save or end the world—depending on who you talk to—was piqued by an interview I did a month or so ago exploring the state and challenges of music-generating AI.
In the interview I learned that generative AI works by gathering large amounts of data—yes, this we knew already. What I didn’t know is that the next step in the process is called “tokenization” (for language) or “canonization” (for music). The program first reduces—some might say deconstructs—all the data, and then, in the final stage of output, the model attempts to use these tokens or canons to generate a response to the request. This architecture is generally referred to as a “transformer” model.
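To make that reduction step concrete before getting to the conversation, here is a toy sketch of tokenization in Python. It is nothing like OpenAI’s actual tokenizer (their models use learned subword vocabularies with on the order of 100,000 entries); the splitter and the tiny vocabulary below are invented for illustration. But it shows what it means to turn a sentence into discrete tokens that a model can count and predict.

```python
import re

# Invented mini-vocabulary mapping tokens to integer IDs. Real systems learn
# subword vocabularies with roughly 100,000 entries; this is just for show.
VOCAB = {"<unk>": 0, "this": 1, "chair": 2, "is": 3, "the": 4, "most": 5,
         "uncomfortable": 6, "i": 7, "'ve": 8, "ever": 9, "sat": 10,
         "on": 11, ".": 12}

def tokenize(text: str) -> list[str]:
    """Split text into lowercase word and punctuation tokens."""
    return re.findall(r"'\w+|\w+|[^\w\s]", text.lower())

def encode(tokens: list[str]) -> list[int]:
    """Map each token to its ID, falling back to the unknown token."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]

sentence = "This chair is the most uncomfortable chair I've ever sat on."
tokens = tokenize(sentence)
print(tokens)          # ['this', 'chair', 'is', ..., 'sat', 'on', '.']
print(encode(tokens))  # the sentence, reduced to a list of integers
```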
I’m pasting my conversation with ChatGPT as the body of this article, but let me preface it by saying that while I’m highly impressed—as I think we all are—by the large datasets AI is using, I’m also skeptical of the second step, that of deconstruction. Can language, or especially music, ever be deconstructed from its context? The French philosopher Jacques Derrida would say no. I refer to him in the conversation below. However, other philosophers, such as Russell, Putnam, or Kripke, may argue that language, music, or images can be reduced to core tokens or canons independent of context. For them, this is not a problem. They would argue this is why language, music, and images work in the first place: there is something independent of the context that is being related.
This latter group of philosophers might insist on Plato’s theory of forms. Plato argued that for each concept, an ideal exists somewhere in a heaven of concepts—the heaven of ideal forms. For example, I might refer to “chair” in the sentence, “This chair is the most uncomfortable chair I’ve ever sat on.” There are many chairs around my dining table, the chair at the Resolute Desk in the Oval Office, and a lounge chair at a bar. We all know what the word “chair” means. Plato argued that there must be some ideal form behind all of these chairs: the core concept of chair, chairness itself. Or is there? Perhaps there are only the individual chairs and the simple words in our vocabulary that we use to refer to them. Is there really some intermediate “chairness”?
This has long been a debate among philosophers, and it seems relevant to thinking about the new generative AI. Is the AI product based simply on objects in the world, or does it truly have—as it says it does—this Platonic intermediate step, this Platonic bias built in by the programmers? If it is just based on objects in the world, then why does it need to reduce or deconstruct them? Furthermore, if ChatGPT and other generative AI are biased towards Platonism, how limiting is this bias?
It might be argued that if ChatGPT is Platonically biased, it is because humans are. But not everyone would agree that they are.
ChatGPT could give us all a lesson in politeness. Does it believe in a heaven of polite forms? If so, the concept of the adverb “certainly” must be counted among them.
Theral: How does ChatGPT work?
ChatGPT: ChatGPT, like other language models, operates based on a deep learning architecture called the transformer model. It has been trained on vast amounts of text data to understand and generate human-like text. Here's a simplified overview of how it works:
“Human-like.” How do we humans like that?
• Tokenization: When you input text into ChatGPT, it first breaks down the text into smaller units called tokens. Depending on the model's configuration, these tokens could be words, subwords, or characters.
• Understanding Context: ChatGPT then processes these tokens while considering the context of the conversation. It looks at the words before and after each token to understand the meaning and context of the text.
• Generating Response: ChatGPT generates a response based on the input received after understanding the context. It predicts the most probable next token or sequence of tokens given the input and its training.
• Iterative Improvement: The model continuously refines its understanding and responses based on the feedback it receives. This feedback can come from human trainers who guide the learning process or from the ongoing training process with large datasets.
This sounds pretty good. But what does it mean to “improve”?
• Output: ChatGPT provides a response based on its understanding of the input and the patterns it has learned from the training data.
It's important to note that while ChatGPT can generate responses that seem human-like, it doesn't truly understand the context or have consciousness. Its responses are generated based on statistical patterns learned from the data it was trained on.
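Before I react to that disclaimer, it is worth making the “predicts the most probable next token” step concrete. The candidate words and their scores below are invented for illustration; in the real model they come out of billions of learned parameters. But the mechanics of turning a score for every word in the vocabulary into one chosen word are roughly this simple.

```python
import math
import random

def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into a probability distribution that sums to 1."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores a model might assign to candidate next tokens after
# the prompt "This chair is the most" -- the numbers are made up.
candidates = ["uncomfortable", "comfortable", "expensive", "chair"]
logits = [2.1, 1.7, 0.4, -1.0]
probs = softmax(logits)

for token, p in zip(candidates, probs):
    print(f"{token}: {p:.2f}")

# Greedy decoding picks the single most probable token...
print("greedy:", candidates[probs.index(max(probs))])

# ...while sampling draws from the distribution, which is why the same
# prompt can yield different completions on different runs.
print("sampled:", random.choices(candidates, weights=probs, k=1)[0])
```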
Thank you for the clarification. That final paragraph about not truly understanding or having consciousness is the bullshit language put in there by the programmers to protect against liability. Then, I move on to my core criticism that deconstruction could be destructive.
Theral: Could a potential criticism of this process be that language is first deconstructed before it is reconstructed or transformed? Is something lost in the deconstruction?
ChatGPT: Yes, that's a valid point and a potential criticism of the process. When language is deconstructed into tokens or smaller units for processing by models like ChatGPT, there's a risk that some nuances or context may be lost in the process. Here are some aspects that could be considered:
• Contextual Understanding: While language models like ChatGPT are trained to understand the context to some extent, breaking down language into tokens may limit their ability to fully grasp the broader context of a conversation. Some contextual cues may be lost when sentences are broken down into individual tokens.
• Semantic Relationships: Language is rich in semantic relationships between words and phrases, which can be complex and subtle. Deconstructing language into tokens may not fully capture these relationships, leading to potential misunderstandings or misinterpretations.
• Idioms and Expressions: Languages often contain idiomatic expressions and cultural nuances that may not be fully captured when language is deconstructed into tokens. As a result, language models may struggle to interpret or generate these expressions accurately.
• Ambiguity: Language can be inherently ambiguous, with words and phrases having multiple meanings depending on context. Deconstructing language into tokens may increase the likelihood of ambiguity, as individual tokens may not fully capture the intended meaning of a sentence.
While deconstructing language into tokens is a necessary step for processing by models like ChatGPT, it's important to be aware of the potential limitations and the possibility that some information or nuances may be lost in the process. Researchers are continually working on improving language models to better capture and understand the complexities of human language.
Theral: Can you give me some concrete examples of this criticism?
ChatGPT: Certainly! Here are some examples illustrating how deconstructing language into tokens can potentially lead to loss of context or misunderstanding:
• Ambiguity in Pronouns:
• Input: "He saw the man with the telescope."
• Deconstructed tokens: ["He", "saw", "the", "man", "with", "the", "telescope", "."]
• In this example, it's unclear whether "he" or "the man" is the one with the telescope, leading to ambiguity in interpretation.
• Loss of Semantic Relationships:
• Input: "I saw a bat flying over the baseball field."
• Deconstructed tokens: ["I", "saw", "a", "bat", "flying", "over", "the", "baseball", "field", "."]
• The word "bat" could be interpreted as a flying mammal or a baseball bat. Without considering the semantic relationship with "baseball field," the intended meaning may be lost.
• Idiomatic Expressions:
• Input: "It's raining cats and dogs."
• Deconstructed tokens: ["It's", "raining", "cats", "and", "dogs", "."]
• This idiomatic expression, meaning heavy rain, may not be fully captured by considering individual tokens.
• Cultural Nuances:
• Input: "That's a piece of cake."
• Deconstructed tokens: ["That's", "a", "piece", "of", "cake", "."]
• While this phrase means something is easy, the cultural nuance or origin of the expression may not be apparent from the deconstructed tokens alone.
In each of these examples, deconstructing language into tokens may obscure the intended meaning or context, leading to potential misinterpretations or loss of nuance. This highlights the challenge of accurately capturing the richness of human language in machine processing.
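The “bat” example is easy to see in code. With a simple tokenizer, the word gets exactly the same ID no matter which sentence it appears in, so any disambiguation has to happen later, inside the model, from the surrounding tokens. The vocabulary and the second sentence below are invented for illustration.

```python
# Toy illustration with a made-up vocabulary: the same surface token gets
# the same ID in both sentences, so nothing in the token itself separates
# the flying mammal from the piece of sports equipment.
vocab = {"i": 1, "saw": 2, "a": 3, "bat": 4, "flying": 5, "over": 6,
         "the": 7, "baseball": 8, "field": 9, "swung": 10, "wooden": 11,
         ".": 12}

animal_sense    = "i saw a bat flying over the baseball field .".split()
equipment_sense = "i swung a wooden bat over the baseball field .".split()

print([vocab[w] for w in animal_sense])     # 'bat' -> 4
print([vocab[w] for w in equipment_sense])  # 'bat' -> 4 again
# Whatever "bat" ends up meaning is decided downstream, by attention over
# neighboring tokens -- exactly the contextual step the criticism targets.
```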
Theral: What might Jacques Derrida, the deconstructionist theorist, say about this process?
ChatGPT: Jacques Derrida, as a deconstructionist theorist, might offer a nuanced critique of the process of deconstructing language into tokens. Derrida's philosophy emphasizes the inherent instability and multiplicity of meaning in language, suggesting that meaning is not fixed but is contingent upon context and interpretation. Here's how Derrida might approach the topic:
• Deconstruction of Binary Oppositions: Derrida famously deconstructs binary oppositions such as presence/absence, speech/writing, and signifier/signified. He might argue that the process of deconstructing language into tokens reinforces a binary understanding of language, where meaning is reduced to discrete units. Derrida might question the validity of this binary distinction and advocate for a more fluid conception of language that embraces its inherent ambiguity and multiplicity.
• Contextual Fluidity: Derrida would likely emphasize the importance of context in understanding language. He might argue that the process of breaking down language into tokens fails to capture the full complexity of context-dependent meaning. Derrida might advocate for a more nuanced approach that acknowledges the interplay between language and context, where meaning emerges through dynamic interactions rather than static analysis.
• Play of Signifiers: Derrida's notion of the "play of signifiers" suggests that meaning is constantly deferred and can never be fully fixed. He might critique the idea that language can be neatly divided into discrete tokens, arguing that meaning is always in flux and cannot be contained within rigid boundaries. Derrida might encourage an approach to language that embraces its inherent ambiguity and openness to interpretation.
• Critique of Logocentrism: Derrida critiques logocentrism, the belief in a fixed, stable foundation of meaning. He might argue that the process of deconstructing language into tokens reflects a logocentric bias, where meaning is reduced to a series of stable, predetermined units. Derrida might challenge this assumption, advocating for an understanding of language that recognizes its radical contingency and openness to interpretation.
Overall, Derrida's approach to language would likely involve a critique of the process of deconstructing language into tokens, emphasizing the need for a more fluid, contextually sensitive understanding of meaning. He would likely encourage scholars and practitioners to remain mindful of language's inherent complexity and ambiguity, resisting the temptation to reduce it to simplistic, binary categories.
Theral: What are some other criticisms of the process of reduction to tokens used by ChatGPT?
ChatGPT: Certainly! Here are some additional criticisms of the process of reducing language to tokens used by ChatGPT:
• Loss of Semantic Information: Tokenization can result in the loss of semantic information encoded in the structure of sentences. Complex relationships between words and phrases may not be fully captured when language is broken down into tokens, leading to a loss of meaning.
• Inability to Handle Contextual Ambiguity: Language often contains ambiguous phrases or expressions whose meaning can only be inferred from context. Tokenization may not adequately capture these nuances, leading to misunderstandings or misinterpretations.
• Limited Representational Capacity: Tokenization imposes a constraint on the representational capacity of language models like ChatGPT. Longer sentences or more complex syntactic structures may be truncated or oversimplified, limiting the model's ability to accurately understand or generate text.
• Dependency on Predefined Vocabulary: Tokenization relies on a predefined vocabulary of tokens, which may not encompass all possible words or phrases in a language. This can result in out-of-vocabulary (OOV) words or phrases that are not properly represented, impacting the model's performance.
• Difficulty in Handling Morphologically Rich Languages: Morphologically rich languages, which use inflections and word formations extensively, may pose challenges for tokenization. Breaking down words into individual tokens may not adequately capture their morphological structure, leading to inaccuracies in processing.
• Overemphasis on Surface-Level Features: Tokenization primarily focuses on surface-level features of language, such as individual words or characters. This may neglect deeper semantic or pragmatic aspects of language, resulting in a shallow understanding or generation of text.
• Inability to Capture Non-Verbal Communication: Language models like ChatGPT are primarily trained on textual data and may struggle to capture non-verbal aspects of communication, such as tone of voice, body language, or facial expressions.
These criticisms highlight the limitations and challenges associated with the process of reducing language to tokens in natural language processing tasks. While tokenization is a necessary step for computational processing, it's essential to recognize its shortcomings and explore ways to address them effectively.
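The predefined-vocabulary point deserves a concrete illustration. Modern systems do not simply drop unknown words; they fall back to smaller pieces. The greedy longest-match splitter below is a rough sketch in the spirit of subword tokenizers such as BPE or WordPiece, with an invented mini-vocabulary. It is not the algorithm ChatGPT actually uses, but it shows how a word the system has never seen gets shredded into fragments.

```python
# A rough, hypothetical subword splitter in the spirit of BPE/WordPiece:
# greedily match the longest known piece, otherwise fall back to characters.
SUBWORDS = {"token", "ization", "un", "comfort", "able", "chair", "ness",
            "de", "construct", "ion"}

def subword_split(word: str) -> list[str]:
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in SUBWORDS:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])           # unknown: character fallback
            i += 1
    return pieces

print(subword_split("tokenization"))    # ['token', 'ization']
print(subword_split("chairness"))       # ['chair', 'ness'] -- Plato, tokenized
print(subword_split("deconstruction"))  # ['de', 'construct', 'ion']
print(subword_split("derrida"))         # ['de', 'r', 'r', 'i', 'd', 'a']
```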
Theral: I’m interested in further reading on the criticism of tokenization.
ChatGPT: For further reading, specifically on criticisms of tokenization in natural language processing, you might want to explore academic papers, articles, and books that discuss the limitations, challenges, and potential biases associated with tokenization. Here are some resources to consider:
• "Challenges in Representation Learning: A Report on Three Machine Learning Contests" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville: This report discusses challenges in representation learning, including tokenization-related issues such as out-of-vocabulary words and handling of rare or unseen tokens.
• "The Mythos of Model Interpretability" by Zachary C. Lipton: This paper explores the limitations and challenges of interpreting machine learning models, including tokenization-related biases and the impact on model interpretability.
• "What Can We Learn from Distributional Word Representations?" by Hinrich Schütze and Geoffrey E. Hinton: This paper examines distributional word representations, discussing tokenization challenges and their implications for downstream natural language processing tasks.
• "Language (Technology) Is Power: A Critical Survey of 'Bias' in NLP" by Su Lin Blodgett et al.: This survey paper explores issues of bias in natural language processing, including biases introduced through tokenization and preprocessing steps.
• "Ethics in NLP: An Evaluation Framework for Assessing the Ethical Impact of NLP Research" by Emily M. Bender et al.: This paper proposes an evaluation framework for assessing the ethical impact of natural language processing research, including considerations related to tokenization and data preprocessing.
• "Bias and Fairness in Natural Language Processing" edited by Dirk Hovy, Shannon L. Spruit, and Isabelle Augenstein: This book provides a comprehensive overview of bias and fairness issues in natural language processing, including discussions on tokenization-related biases and their implications for downstream applications.
• "Algorithmic Bias in Autonomous Systems" edited by Mary-Anne Williams and Amitava Datta: This edited volume explores algorithmic bias in autonomous systems, including discussions on bias introduced through tokenization and preprocessing steps in natural language processing pipelines.
These resources offer critical insights into the challenges and limitations associated with tokenization in natural language processing and provide valuable perspectives for further research and discussion.
Then, I get to the core question.
Theral: Is the tokenization used by ChatGPT inherently biased toward Platonic ideals?
ChatGPT: The tokenization process used by ChatGPT, as well as other natural language processing models, is not inherently biased towards Platonic ideals. Tokenization is a technical process that involves breaking down language into smaller units, such as words, subwords, or characters, to facilitate computational processing. While tokenization is influenced by linguistic theories and computational considerations, it is not directly based on philosophical frameworks like Platonic ideas.

Platonic ideals, rooted in the philosophy of Plato, emphasize the existence of abstract, ideal forms or concepts that transcend the material world. While Plato's ideas have had a profound influence on Western thought, including philosophy and linguistics, they are not directly relevant to the technical process of tokenization in natural language processing.

However, it's worth noting that the design and implementation of tokenization algorithms may reflect certain linguistic assumptions or biases, which can impact the performance of language models like ChatGPT. For example, tokenization algorithms may be designed based on linguistic theories that prioritize certain linguistic features or structures over others. Additionally, tokenization may inadvertently introduce biases related to language usage, cultural factors, or the distribution of training data.
This! No, it is not biased toward Platonic forms, says ChatGPT. But it’s worth noting that it might be.
Therefore, while tokenization itself is not biased towards Platonic ideas, it is essential to critically examine the design and implementation of tokenization algorithms to identify and mitigate potential biases that may affect the performance and fairness of natural language processing models.
Theral: What linguistic theories are ChatGPT biased toward?
ChatGPT: ChatGPT, like other natural language processing models, is not explicitly biased toward specific linguistic theories. However, its design and training data can indirectly reflect linguistic principles and assumptions that are prevalent in the field of computational linguistics. Some linguistic theories and principles that may influence ChatGPT's behavior include:
• Distributional Semantics: This theory posits that the meaning of words can be inferred from their distributional patterns in large corpora of text. ChatGPT's training involves exposure to vast amounts of text data, which allows it to learn associations between words and their contexts, aligning with distributional semantics.
• Syntax and Grammar: While ChatGPT does not explicitly parse sentences for syntax or grammar, its training data inherently contains syntactic and grammatical structures. The model learns patterns of word order, phrase structure, and agreement through exposure to natural language text.
• Word Sense Disambiguation: This area of linguistics deals with resolving the meaning of ambiguous words based on context. ChatGPT may implicitly perform word sense disambiguation by predicting the most likely meaning of a word given its surrounding context.
• Pragmatics: Pragmatics concerns the study of language in context, including implicature, presupposition, and speech acts. While ChatGPT does not explicitly model pragmatic phenomena, its responses may reflect pragmatic considerations learned from the training data.
• Sociolinguistics and Register Variation: ChatGPT may exhibit patterns of language use that reflect sociolinguistic factors such as dialectal variation, register, and style. Its responses may vary depending on the characteristics of the input text and the training data.
It's important to note that ChatGPT's behavior is primarily shaped by the patterns it learns from its training data rather than by adherence to specific linguistic theories. As a result, the model may exhibit biases or limitations that reflect the characteristics of the data it was trained on, including linguistic biases present in the training corpus. Ongoing research aims to address these biases and improve the robustness and fairness of natural language processing models.
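Of the theories in that list, distributional semantics is the one that can be shown in a few lines. The tiny corpus below is invented for illustration, but the idea is the real one: represent each word by the words that occur near it, and words used in similar contexts end up with similar vectors. Notice that no heaven of ideal forms is consulted anywhere.

```python
from collections import Counter
from math import sqrt

# A tiny invented corpus; real models are trained on billions of sentences.
corpus = [
    "the chair stood by the table",
    "she sat on the chair",
    "he sat on the stool",
    "the stool stood by the table",
    "the dog slept under the table",
]

def context_vector(word: str, window: int = 2) -> Counter:
    """Count which words appear within `window` positions of `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

chair, stool, dog = (context_vector(w) for w in ("chair", "stool", "dog"))
print("chair ~ stool:", round(cosine(chair, stool), 2))  # high: similar company
print("chair ~ dog:  ", round(cosine(chair, dog), 2))    # lower: different company
```

Whether those vectors amount to a form of chairness, or just the company the word keeps, is the question this whole conversation circles.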
Certainly!