**"Learnability of Low-Energy Neural Networks"**

**Zihao Deng**

Adviser: Brendan Juba

Our goal is to study the learnability of low-energy threshold neural networks of polynomial size. In particular, assuming that the target class has no more expressive power than the neural networks, we aim to find an algorithm to compute a set of weights for such neural networks to achieve zero loss on arbitrary polynomial number of examples sampled from the target class. Our motivation came from Uchizawa (2006) where they defined the energy complexity of a threshold-gate circuit as the maximum number of gates fired in the circuit over input distribution and demonstrated that, even when the energy complexity is logarithm or constant, threshold circuits of polynomial size still have substantial expressive power. To learn the correct neural networks we need to compute the correct pattern of firing gates based on the data.

The algorithm hierarchy we are currently building was based on the clustering method in Charikar, Steinhardt, Valiant (2017) and a similar method in Kothari, Steinhardt (2017). The clustering method in the papers is used to learn from highly contaminated data, where there might only exist a polynomially small fraction of trusted data. Since we only allow one possible circuit state for each example in any feasible solution, potentially, we only have small fraction of pattern variables belongs to some feasible solution, hence, the ability to distinguish a small data set from a tremendously large amount of untrusted data appears extremely attractive for us.

Adviser: Chien-Ju Ho

We study a multi-armed bandit problem with biased human feedback. In our setting, each arm is associated with an unknown reward distribution. When an arm is played, a user receives a realized re- ward drawn from the distribution of the arm. She then provides feedback, a biased report of the realized reward, that depends on both the realized reward and the feedback history of the arm. The principal can observe only the biased feedback but not the realized rewards. The goal is to design a strategy to sequentially choose arms to maximize the total rewards users receive while only having access to the biased user feedback. We explore two natural feedback models. When user feedback is biased only by the average feedback of the arm (i.e., the ratio of positive feedback), we demonstrate that the evolution of the average feedback over time is mathematically equivalent to users performing online gradient descent for some latent function with a decreasing step size. With this mathematical connection, we show that under some mild conditions, it is possible to design algorithms achieving a regret (i.e., the difference between the algorithm performance and the optimal performance of always choosing the best arm) sublinear in the number of rounds. However, in another model when user feedback is biased by both the average feedback and the number of feedback, we show that there exist no bandit algorithms that could achieve sublinear regrets. Our results demonstrate the importance of understanding human behavior when applying bandit approaches in systems with humans in the loop.