The motivation behind this thesis is to further investigate different approaches to image-based localization. Similar to PoseNet, we formulate the problem as pose regression, and we improve upon it by introducing quaternion algebra for a proper attitude representation. In addition, we combine two recently developed approaches: (1) a multi-task loss function that learns the optimal weighting between the position and orientation regression tasks, and (2) a CNN followed by a spatial LSTM network for better structured feature correlation. Furthermore, we fine-tune only a small portion of the pretrained CNN feature extractor. Lastly, we extend the problem to videos and employ a sequence-to-sequence regression model based on LSTMs. We evaluate the models on the 7Scenes dataset and introduce a new Airframe dataset, in which localization is performed with respect to an object that changes position and orientation in the environment. We achieve results that are at least competitive with, and in some cases better than, those of existing methods, while requiring considerably less computational power to train the models.