Can LLMs Identify Korean Adverbs? (Part 3)

With rounds 1 and 2 done, I was in a rhythm. It took some struggle, but I had 300+ new vocab words in my Anki deck; the previous year, I had spent weeks formatting data and entering it into Anki. I wanted to do more. I had a list of adverbs that I hadn't included in the original list because I had already added them manually to Quizlet. I considered simply exporting the list from Quizlet and importing it into Anki, but that would only give me the Korean word and English definition. It wouldn't give me the sentences as hints. So, a few days after finishing round 2, I decided to start round 3.

This time I took a different approach. Instead of making the LLM split the words into two tables, I decided to have all the words in one list. I thought that maybe my prompt was not concise enough, or maybe a little confusing, so I wrote a prompt and then asked Claude 3.7 Sonnet to refine it.

The following is a list of words believed to be Korean adverbs. Create a table that has 6 columns. Column 1 is labeled "Korean" and is for the Korean word. Column 2 is for the English translation and is labeled "English". Column 3 is labeled "Hint" and is for a sentence that uses the word exactly as provided along with its English translation. Do not conjugate the word to a different form and use it for the sentence. Fill out the table using the list according to the columns. Column 4 is labeled "Is Adverb?". If the word is an adverb add "yes" in the cell. Column 5 is labeled "Part of Speech". Column 6 is labeled "Reason".

If the word is not an adverb or is an adverbial phrase containing 2 or more separate words, in the column labeled "Is Adverb" add "no" to the cell. In the column labeled "Part of Speech" you need to provide the part of speech for the item. The column labeled "Reason" is where you need to provide the reason why the item is not considered an adverb.

The prompt it gave me was the following:

# Korean Adverbs Analysis Task

## Instructions: Analyze the following list of potential Korean adverbs and create a detailed table with the following columns:

1. **Korean**: The original Korean word as provided
2. **English**: Direct English translation
3. **Example Usage**: A Korean sentence using the word in its original form (not conjugated), followed by English translation
4. **Adverb Classification**: "YES" if the term is a true adverb; "NO" if not
5. **Actual Part of Speech**: Identify the correct grammatical classification if not an adverb
6. **Reasoning**: If not classified as an adverb, explain why with grammatical justification


## Important Rules:
- Do not modify, conjugate, or change the original Korean term in the example sentence
- For non-adverbs, clearly identify if the term is:
     * A different part of speech (specify which one)
     * A phrase containing multiple words
     * A conjugated form of another word type
- Provide specific grammatical reasoning for classifications, not just translations
- Format the table with clear column headers and consistent alignment


## List of Terms to Analyze:
[Insert your Korean word list here]

ChatGPT

ChatGPT was the worst at this task. Even though I simplified the task to just one table, it still had problems.

It only did the first 30 or so words and then I had to tell it to continue.

In the middle of processing, it stopped. There was a button that I had to press to make it continue processing. Even then I had to keep prompting it to continue the list.

The original list had 163 words. Why did you only provide 118?

You're absolutely right. I apologize for the oversight in not covering the entire list of 163 words. It seems the analysis was cut off prematurely. I'll work through the rest of the list and make sure to cover all the remaining terms to complete the table for you.

Let me continue with the rest of the terms right now.

Thanks for your patience! Let's continue from where we left off, and I'll make sure to cover all the remaining terms. Here's the continuation and completion of the analysis for the rest of the list:

When it processed the rest of the list, it stopped again, and I had to press the continue-processing button once more.

Why can't it just process the full list instead of stopping, messing up, and then having to continue prompting it to continue?

Spot checking, I saw that it marked a couple of adverbial phrases as adverbs, yet listed their part of speech as "adverbial phrase", contradicting itself.

Google Gemini 2.5 Pro (experimental)

2.5 Pro took a bit longer to analyze each word, but I assume that is because it is a thinking model. It then displayed the tables on screen and provided a link to send them to Google Sheets.

Claude 3.7 Sonnet

Claude didn't hide its work in a separate window. It added words to the table, line by line, until it was done. However, it was missing words.

The original list had 163 entries but your table only has 151. Why is that?

It responded with:

You're absolutely right, and I apologize for the discrepancy. My table contains 151 entries while your original list had 163 items. This happened because I inadvertently missed several terms during my analysis.

Looking back at the list, I skipped some terms in my analysis, particularly when processing the longer sections of the list. This was an oversight on my part.

Would you like me to:

Identify the specific missing entries from your original list?
Complete the analysis for the missing terms?
Provide a complete table with all 163 entries?

I can address this discrepancy whichever way would be most helpful to you.

I told it

Identify the specific missing entries from the original list and complete the analysis for the missing terms.

I thought the missing words were probably all duplicates, but they weren't. Along with the table of 12 missing words it stated:

The discrepancy in numbers comes from:
  • I missed some unique entries (like "언제" and "그렇게")
  • I failed to account for duplicates in your original list ("보통", "실제로", "자주", "거의", "대담하게")
  • I missed the typo entry "E결국"
  • I missed the comma variant "이상하게,"
Some entries appeared twice in your original list, which is why the difference between 163 and 151 is 12 entries. I've now analyzed all missing terms to complete the full set of 163 entries from your original list.

So 5 of the words were duplicates? Thank you for removing them. That was the right thing to do, but you should have mentioned that in your first response. Please print out the table of 12 missing words minus the 5 duplicates.
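Both of these discrepancies, missing entries and hidden duplicates, can be caught deterministically instead of by asking the model to recount. A minimal sketch in Python (the toy lists here just stand in for the real 163-entry list and the Korean column copied out of the LLM's table):

```python
from collections import Counter

def check_completeness(original, processed):
    """Compare an original word list against the words an LLM returned."""
    missing = sorted(set(original) - set(processed))
    duplicates = {w: n for w, n in Counter(original).items() if n > 1}
    return missing, duplicates

# Toy example mirroring the 163-vs-151 discrepancy on a small scale.
original = ["보통", "자주", "언제", "보통", "결국"]
processed = ["보통", "자주", "결국"]

missing, duplicates = check_completeness(original, processed)
print(missing)     # → ['언제']
print(duplicates)  # → {'보통': 2}
```

A check like this would have answered "why 151 instead of 163?" immediately, without burning any of the message quota on follow-up prompts.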

Conclusion

ChatGPT 4o could not make it to the second round because I couldn't get it to process the list. The first major problem was that I uploaded the file, which caused it to use the Python data analysis tool. That tool probably has restrictions because I have a free account, and because of the number of other users on the app. The next major problem was that the AI kept telling me it could do something when it couldn't. Each prompt to correct or question it used up my allowed messages for the time frame, and I was then locked out for 4-5 hours. The next major problem was that I copied and pasted the list into the prompt but didn't start a new chat. This meant the file was still being sent with every prompt, and the AI kept trying to refer to the file instead of the list I pasted. The last major problem was that the AI suddenly stopped using my list and just filled the table with random adverbs. It really felt like ChatGPT was set to dumb. I've never had so many problems with it before; this is the kind of behavior I'd expect from Gemini 2.0 Flash and under.

Google Gemini 2.0 Flash never made it to the second round because I couldn't get it to process the list correctly. The first major problem was that it only gave me one table when it was done. I asked about the second table, which it provided, but when I counted the total for both tables, there were only 285 entries, so 106 entries were missing. When I pointed this out, it went into some completely behind-the-scenes analysis mode with no way to show the user its progress, give periodic updates, or anything. 48 hours later, it stated it was still not done.

I know it is not completely fair to change models during a comparison, but I switched to Google Gemini 2.5 Pro (experimental) for round two. Besides, ChatGPT kept switching me back and forth between 4o and 4o mini, so I was already using two different models with ChatGPT. Gemini 2.5 Pro (experimental) is labeled as a "thinking model designed for complex tasks, including coding, math, STEM problems, and analyzing large datasets." So I decided to test it out on this task. I didn't think my request was a large task, involved a large dataset, or was something the regular models couldn't do, but I guess I was misunderstanding something (this is explained later in the post). This model exceeded my expectations. I even thought I was going to hit some internal limit, but it executed the task with little to no problems.

Claude 3.7 Sonnet has a limited number of messages you can send on the free account. The limit is based on the number of tokens used in a prompt and the number of messages sent. Once you use up a certain number of tokens (the context window gets too large), you have to start a new chat. That is problematic when running a task that needs the LLM to keep context. There were a couple of times I needed to start a new chat. Then the message limit kicked in, and I couldn't do anything else until 11pm that night or the next day.

For round 1, Claude 3.7 Sonnet did the best, but for rounds 2 and 3, Gemini 2.5 Pro (experimental) was the best.

Why?

So why did I do all this? I didn't plan for this to turn into a project. I had a list of words I collected and I didn't want to manually go through each one making sure it was an adverb, finding the best definition, and then thinking of a sentence/finding a sentence to help learn the word. I thought this would be a fairly easy task for AI to do.

What Would I Have Done Differently

Remove Duplicates

Getting an accurate count of whether all the entries were accounted for was difficult, since sometimes duplicates were removed and sometimes they weren't. If I had already known the number of duplicates, then counting the total entries processed would have been easier. However, that would have been more work for me to do, and I expected the AI to be able to account for it.

I could have specified, as part of the instructions, to remove duplicates and provide a count of duplicates removed, along with a count of total entries and a count of entries in each table.
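Better yet, deduplication is something that can be done before the list ever reaches the LLM. A minimal sketch (the short word list here is just a stand-in for the real one):

```python
def dedupe_preserving_order(words):
    """Remove duplicates while keeping each word's first-occurrence position."""
    seen = set()
    unique = []
    for w in words:
        if w not in seen:
            seen.add(w)
            unique.append(w)
    return unique

words = ["보통", "실제로", "보통", "자주", "자주"]
unique = dedupe_preserving_order(words)
print(f"{len(words)} entries, {len(unique)} unique, "
      f"{len(words) - len(unique)} duplicates removed")
# → 5 entries, 3 unique, 2 duplicates removed
```

With a known, duplicate-free count going in, any shortfall in the LLM's output table is immediately a processing error rather than an ambiguity.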

Provide a Definition of What an Adverb Is

Each AI seems to have a different definition of how to classify an entry as an adverb or not. Even two different instances of the same AI model gave different classifications. While I did say that an adverbial phrase should be classified as not an adverb, I never said what an adverb was. This is because my own knowledge was shaky. I know there are native Korean adverbs, and I don't think there is a pattern to their endings. I know that descriptive verbs can be turned into adverbs by changing their ending to -게, -히, or -이. I also knew that adding -으로 to certain words can turn them into adverbs. Since my own knowledge was shaky, I didn't want to give it my definition and have the list skewed by my lack of knowledge. I actually considered this project a form of studying. I now have a better understanding of what is an adverb and what isn't, and maybe next time I'll give it explicit instructions.

Provide a More Succinct Prompt

When I first started this project, I didn't fully understand the power behind prompt engineering. Being an engineer myself, I knew how to articulate specifically what I wanted it to do. However, I didn't fully understand the small nuances that go into prompt engineering.

During round three, I asked Claude 3.7 Sonnet to refine my prompt. This was a good idea, but it wasn't until later that I noticed it had lost some of my specificity in order to be more specific in other areas. So in the future, I need to fully review the refined prompt to make sure it covers the major points I need done. I will also consider breaking the task up into several prompts instead of one big prompt.

Use a Smaller List; Process the List in Chunks

I kept seeing videos and posts about how people were getting LLMs to write code, solve puzzles, and build apps. So I didn't think asking an LLM to find and sort Korean adverbs and then put them into sentences in English and Korean would be a lot of work. However, I had to do two to three rounds just to get the task done. Even with Claude, I had to feed it the same list twice to make sure it caught everything. So I asked Gemini about the difficulty of the task I gave the LLMs, and whether, in terms of data processing, it was normal to have to process the same list twice. According to Gemini 2.5 Flash:

... However, you've hit upon a key characteristic of current LLMs: they aren't perfect, deterministic data processors in the way a traditional sorting algorithm or a database query is. They can sometimes:
  1. Lose Focus: With very long lists, they might occasionally "lose track" of items, miss one or two, or fail to process the entire list consistently.
  2. Misinterpret or Hallucinate: They might misread a word, apply a label based on a misunderstanding, or even sometimes "hallucinate" items not present in the list.
  3. Lack Consistency: They might apply slightly different criteria to similar items across a long list without realizing it.
So, while the task (sorting/categorizing) doesn't inherently require you to feed the LLM the list multiple times, the limitations of the LLM might lead to an incorrect or incomplete result on the first attempt.

… So, while the multi-pass nature is normal for algorithms, needing to interact with the LLM multiple times or re-feed data because of its errors is a sign of its current limitations, not a standard procedural requirement for these types of tasks. It's more about troubleshooting the tool than the task itself!

It was basically telling me that, because of the current limitations of LLMs, it was expected or "normal" to have to ask the LLM to process the list several times and check the differences between the outputs. That sounds like more work than just writing a script to do what I was asking. The only problem would be getting the sentences with translations. It also said:

From an LLM Perspective (Relevant to our previous chat): This is likely where your question is coming from. For an LLM, a list becomes "long" when its size starts to challenge the model's ability to process it accurately and reliably. This is related to:

  • Context Window: LLMs have a limited "context window" – the amount of text they can effectively pay attention to at once. If the list, plus your instructions, exceeds this window, the LLM might simply not "see" parts of the list.
  • Attention and Consistency: Even within the context window, maintaining perfect attention and applying instructions consistently across a very long block of text (like a list) can be difficult for LLMs. This is when they might start missing items, making mistakes, or losing track, as you mentioned.
  • Specific Model: What's "long" for one LLM model (perhaps an older one with a small context window) might be quite "short" for a newer model designed with a much larger context.

...

The difficulty in giving a single, guaranteed number is that it genuinely varies significantly based on:

  1. The specific LLM model you are using: Different models (GPT-4, Gemini, Claude, etc., and even different versions) have different context window sizes and handle long inputs with varying degrees of robustness.
  2. The complexity of the task: Sorting and categorizing is more complex than just listing items. Asking it to perform multiple operations or complex logic on each item decreases the effective reliable length it can handle.
  3. The length/complexity of each item in the list: A list of 300 single words is different from a list of 300 complex sentences or paragraphs. The total token count matters more than just the item count.
  4. The length of the rest of your prompt: Your instructions and any examples also consume part of the LLM's context window.

However, to give you a practical range based on common experience with many current general-purpose LLMs:

  1. Lists with a few dozen items (say, under 50-100 simple items) are generally very reliably processed for tasks like sorting or categorization by most capable models.
  2. When you get into the hundreds of items (like 200 to 500 items), you are absolutely entering a zone where reliability can start to decrease. It's not a guaranteed failure point, but it's where you become significantly more likely to see issues like:
    1. Missing items from the output.
    2. Inconsistent labeling or categorization across the list.
    3. Incomplete processing (stopping before the end).

Addressing your specific question: Is 300 considered long?

Yes, 300 items is often considered "long" enough that you should be cautious when using an LLM for a task like sorting and categorizing. It's not necessarily guaranteed to fail, especially if the items are short and the task is simple, but it is definitely in the range where you might encounter the types of errors you're trying to avoid (missing or mislabeled items).

Practical Recommendation:

To be safe and avoid extensive testing yourself, if your list has more than ~200-300 items, particularly if the items themselves are more than just a word or two, or if the sorting/categorization logic is complex, it is a very good strategy to break the list down into smaller chunks. Processing the list in batches of, say, 100-200 items per prompt is a much more reliable approach for LLMs.

Think of ~200-300 items as a practical threshold where you should start considering breaking down the task for current general-purpose LLMs to ensure higher accuracy and completeness.
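The batching Gemini recommends is simple to automate on the user's side: split the list before pasting it into separate prompts. A minimal sketch (the batch size of 100 follows Gemini's suggested range, and the generated `word_list` is just a stand-in for the real 163-entry list):

```python
def chunk(items, size):
    """Yield consecutive slices of at most `size` items each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

word_list = [f"word{i}" for i in range(163)]  # stand-in for the real list
batches = list(chunk(word_list, 100))
print([len(b) for b in batches])  # → [100, 63]
```

Each batch then goes into its own prompt (or its own fresh chat, which also sidesteps the context-window growth that forced me to restart chats with Claude).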

In summary, Gemini 2.5 Flash stated that my original prompt was actually complex. Wanting a deeper explanation, I asked it to break down the complexity of the original prompt:

Okay, let's break down that prompt from the perspective of an LLM like GPT-4, Gemini, or Claude. Based on our previous discussion about list length, task complexity, and LLM limitations, this prompt is actually moderately to highly complex, especially as the list of words grows.

Here's why:

  1. Input Volume (List Length): As we've discussed, the most basic challenge is simply processing the input list. If this list is "long" (hundreds of items or more, especially if the items themselves are long), the LLM's ability to reliably read, retain, and process every single item without missing any or getting mixed up is already strained.
  2. Linguistic Analysis & Categorization (Core Task): The LLM isn't just moving data around; it has to perform sophisticated Korean linguistic analysis on each item.
    • It needs to determine the precise part of speech of each Korean word or phrase. This requires deep knowledge of Korean grammar and morphology.
    • Specifically, it needs to distinguish single-word adverbs from nouns, verbs, adjectives, particles, and multi-word adverbial phrases. This fine-grained distinction can be challenging, as context often plays a role in part-of-speech tagging, and here the words are presented out of sentence context initially.
  3. Conditional Logic: The prompt requires significant "if/else" branching logic applied per item:
    • Condition 1: Is it exactly a single-word Korean adverb? -> Go to Table 1 process.
    • Condition 2: Is it not a single-word adverb, and is it a phrase (2+ words) or another part of speech? -> Go to Table 2 process. The LLM must correctly evaluate this condition for every single item and route it to the right path. Failure here means putting an item in the wrong table.
  4. Differing Output Requirements per Condition: The task isn't uniform.
    • For adverbs (Table 1), it needs to generate a creative output (a sentence) and provide a translation.
    • For non-adverbs/phrases (Table 2), it needs to provide analytical outputs (part of speech, reason) and a translation. Switching between these modes correctly for each item adds complexity compared to a single output format.
  5. The "Hint" Sentence Constraint (Particularly Difficult): This is a major source of complexity and potential errors.
    • LLMs are trained to use words naturally in sentences. Korean grammar involves significant conjugation and changes to word forms based on usage.
    • The instruction "Do not conjugate the word to a different form and use it for the sentence. Use the word exactly as provided" is a negative constraint and goes directly against the LLM's natural tendency. It forces the model to do something unnatural and potentially ungrammatical in a standard sentence, while still needing to make the sentence itself understandable enough to translate. This is a very difficult tightrope for an LLM to walk consistently over many items. Expecting perfect adherence to this constraint is challenging.
  6. Generating Explanations ("Reason"): Providing a clear, accurate grammatical reason why something isn't a single adverb requires the LLM to not just identify the part of speech but also access and articulate the specific rule or observation (e.g., "it is a noun meaning...", "it is a verb stem needing conjugation," "it is a phrase not a single word").
  7. Structured Output (Tables): While LLMs can generate markdown tables, maintaining perfect table structure (correct columns, alignment, no escaped characters causing formatting issues) for potentially hundreds of rows split across two separate tables adds a formatting layer of complexity that can sometimes lead to errors, especially for long outputs.

In Summary:

This prompt requires the LLM to:

  • Process a potentially long list (volume).
  • Perform sophisticated Korean grammatical analysis on each item (linguistic skill).
  • Apply complex conditional logic per item (decision making).
  • Generate different types of output (creative vs. analytical) based on that logic.
  • Adhere to a difficult negative constraint (no conjugation in hint sentence).
  • Generate grammatical explanations.
  • Maintain complex output structure (two tables) over a long generation.

Each of these steps increases the chance of an error occurring for any given item, and these chances accumulate over a long list. It's far more demanding than simply reading a list and generating a new list based on a single, simple transformation (like just capitalizing each word).

Therefore, for lists of the length where reliability becomes a concern (as we discussed, likely in the hundreds of items), this prompt significantly increases the probability of the LLM missing an item, miscategorizing it, failing the "no conjugation" rule in the hint, or messing up the table formatting. It's a robust test of the LLM's capabilities!

When I first read this response, it felt like Gemini was complaining about its job and making it sound waaay harder than it really was. At the time, I kept thinking, "why is a computer complaining about computing?" As I learned more about LLMs, I've come to understand that LLMs are made up of lots of statistical algorithms that are meant to predict the next word. Asking a next-word predictor to translate, process, and sort Korean words is a lot, especially if the LLM wasn't specifically trained to do such work. General LLM chatbots are trained on massive amounts of data in pre-training. During post-training/fine-tuning, they are taught skills like coding, summarizing, data processing, and many others. It all depends on what the model developers felt was important for the LLM to learn. The LLM may also have skills that are not known to the model developers, which is why people are always finding new ways to use LLMs.

What experiment should I do next?

If you want to see the output from round 3, the links are below:
