The article evaluates the performance of GPT4(V)-Turbo in many-shot in-context learning (ICL) across various datasets, finding mixed results. While substantial improvements were observed on datasets like HAM10000 and EuroSAT, the model ran into timeout errors and was constrained by its shorter context window. The impact of prompt selection was also explored: variations in wording produce only minor performance deviations and do not affect the overall improvement trend. This analysis sheds light on the strengths and limitations of GPT4(V)-Turbo relative to other models such as Gemini 1.5 Pro.
GPT4(V)-Turbo shows mixed results for many-shot in-context learning (ICL), improving performance significantly on some datasets while struggling with timeout errors and a limited context window.
Performance is only mildly sensitive to prompt selection: despite variations in prompt wording, a consistent log-linear improvement trend holds across the tested datasets.
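The log-linear trend mentioned above means accuracy grows roughly linearly in the logarithm of the number of in-context demonstrations. A minimal sketch of how such a trend can be fit is shown below; the shot counts and accuracy values are hypothetical placeholders, not results from the article.

```python
import math

# Hypothetical (shot count, accuracy) pairs illustrating a log-linear
# scaling curve; these are NOT measurements from the article.
shots = [1, 2, 4, 8, 16, 32, 64, 128]
accuracy = [0.52, 0.55, 0.58, 0.61, 0.64, 0.67, 0.70, 0.73]

# Ordinary least squares for: accuracy ≈ slope * ln(shots) + intercept
xs = [math.log(s) for s in shots]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(accuracy) / n
slope = sum((x - mean_x) * (y - mean_y)
            for x, y in zip(xs, accuracy)) / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
print(f"slope={slope:.4f}, intercept={intercept:.4f}")
```

A positive fitted slope indicates that each doubling of demonstrations yields a roughly constant accuracy gain, which is the signature of log-linear scaling.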