Evaluate and compare LLM results through JSON submissions
Evaluate LLMs on constraint satisfaction problems