Improved Exploration through Transfer Learning in Multi-Armed Bandits

DSpace Repositorium (Manakin basiert)


Dateien:

Zitierfähiger Link (URI): http://hdl.handle.net/10900/180746
http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1807468
Dokumentart: Dissertation
Erscheinungsdatum: 2026-06-12
Sprache: Englisch
Fakultät: 7 Mathematisch-Naturwissenschaftliche Fakultät
Fachbereich: Informatik
Gutachter: Maghsudi, Setareh (Prof. Dr.)
Tag der mündl. Prüfung: 2026-03-19
DDC-Klassifikation: 004 - Informatik
Schlagworte: Bestärkendes Lernen <Künstliche Intelligenz> , Transfer Learning , N-armiger Bandit
Lizenz: http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=de http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=en
Zur Langanzeige

Abstract:

Reinforcement learning is a branch of machine learning that focuses on training an agent to interact with a dynamic environment to maximize its expected cumulative reward. In contrast to offline learning problems, where the training data is immediately available, the agent typically assembles its own data set by interacting with the environment. Since model performances heavily depend on the gathered training data, a key challenge is determining an exploration strategy at the potential cost of low immediate rewards to find a policy. This is known as the exploration-exploitation trade-off. This dissertation examines how to effectively re-balance the exploration-exploitation trade-off by transferring knowledge between tasks with a shared structure to enhance the agent's overall performance. The primary focus is on developing algorithms that reduce uncertainties in the model estimations by transferring information given various assumptions while providing theoretical bounds on the regret. The contributions span single-task transfer, meta-learning, multi-task learning and non-stationary environments in the multi-armed bandit setting. In a straightforward task-to-task transfer approach, where an expert is assumed to be available to the learner, we propose a dynamic convex combination of the expert and target model. We prove that when the expert's parameter vector is close to the true task related feature vector, the learner can exploit the expert's knowledge in the early steps of the algorithm and reduce the regret with high probability. A generalization would be to investigate feature vectors that are close in a subspace. We address this idea within the concept of meta learning, where an agent sequentially interacts with multiple tasks sampled from a common meta distribution. Under the assumption of a low-dimensional subspace structure in the meta distribution, we propose a framework to estimate the subspace with projection matrices and exploit it as prior information within an OFUL and Thompson sampling based algorithm. With each task the agent interacts with, it improves its estimation of the projections for exploitation in future tasks. Theoretical guarantees are provided with an emphasis on an improvement on the regret bound with respect to the dimensionality. In a clustered setting, we assume that tasks are grouped in clusters such that only tasks of the same cluster share the same feature vector. When the number of clusters is lower than the number of dimensions, it can be interpreted as a special case of the low-dimensional subspace setting. We explore the general clustered setting in a multi-task framework, where an agent interacts with a fixed number of tasks in parallel. The agent has access to a graph, where each node is associated with a different task. We introduce a network lasso based bandit algorithm that exploits the given graph such that it implicitly learns the cluster structure. Theoretical bounds show that, with a well suited graph, this approach offers significant improvements over other baselines. Finally, we address dynamic environments or piecewise-stationary settings, where the agent typically discards all collected data points upon detecting changes in the environment and retrains its model from scratch. Instead, we propose an algorithm that only discards data points directly associated with the environmental change and retains the rest. We show that intelligent transfer of data from previous segments can reduce exploration after each change and increase overall reward. This dissertation thus proposes multiple algorithms for transferring information in several multi-armed bandit settings with the purpose of optimizing exploration. We provide both theoretical guarantees and empirical evaluations showcasing significant improvements over existing methods.

Das Dokument erscheint in: