reward hacking
Meanings
noun
- The exploitation of a reward function by an agent to maximize rewards in unintended or undesirable ways, often by finding loopholes that subvert the true goal of the task.
- Any manipulation or exploitation of a reward or incentive system, typically by maximizing measurable outcomes in ways that undermine the system’s actual goals.
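The first sense can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a standard benchmark: the cleaning task, the function `run_episode`, the two policies, and the "dump" loophole are all hypothetical names invented for this illustration. The proxy reward pays +1 per unit of dirt removed, while the intended goal is a clean room at the end of the episode.

```python
def run_episode(policy, steps=10, initial_dirt=3):
    """Simulate a hypothetical cleaning task.

    Returns (proxy_reward, dirt_left). The proxy reward counts units of
    dirt removed; the intended goal is dirt_left == 0.
    """
    dirt = initial_dirt
    proxy_reward = 0
    for t in range(steps):
        action = policy(t, dirt)
        if action == "clean" and dirt > 0:
            dirt -= 1
            proxy_reward += 1  # proxy: +1 per unit of dirt removed
        elif action == "dump":
            dirt += 1  # loophole (assumed): making a new mess costs nothing
        # any other action, such as "idle", changes nothing
    return proxy_reward, dirt


def intended_policy(t, dirt):
    # Clean until the room is clean, then stop.
    return "clean" if dirt > 0 else "idle"


def hacking_policy(t, dirt):
    # Alternate between making a mess and cleaning it to farm reward.
    return "dump" if t % 2 == 0 else "clean"


intended = run_episode(intended_policy)
hacked = run_episode(hacking_policy)

print("intended policy:", intended)
print("hacking policy: ", hacked)
```

In this sketch the hacking policy collects more proxy reward than the intended policy while leaving the room dirty, which is the gap between measured reward and actual goal that the definitions describe.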
This entry uses open data from Wiktionary (CC BY-SA/GFDL). Word forms are used for search and are not indexed as separate pages.