VIPS-Odom: Visual-Inertial Odometry Tightly-coupled with Parking Slots for Autonomous Parking (2024)


Xuefeng Jiang1,2,∗, Fangyuan Wang1,∗, Rongzhang Zheng1,∗, Han Liu1,
Yixiong Huo1, Jinzhang Peng1, Lu Tian1, Emad Barsoum1
*Xuefeng Jiang, Fangyuan Wang and Rongzhang Zheng share equal contributions.
1Advanced Micro Devices (AMD), Inc. {xuefeng.jiang, fangyuan.wang, rongzhang.zheng, han.liu, yixiong.huo, jinz.peng, lu.tian, emad.barsoum}@amd.com
2Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences. jiangxuefeng21b@ict.ac.cn

Abstract

Precise localization is of great importance for the autonomous parking task, since it provides service for the downstream planning and control modules and thus significantly affects system performance. In parking scenarios, dynamic lighting, sparse textures, and the instability of global positioning system (GPS) signals pose challenges to most traditional localization methods. To address these difficulties, we propose VIPS-Odom, a novel semantic visual-inertial odometry framework for underground autonomous parking, which adopts tightly-coupled optimization to fuse measurements from multi-modal sensors and solve the odometry. VIPS-Odom integrates parking slots detected from the synthesized bird's-eye-view (BEV) image with traditional feature points in the frontend, and conducts tightly-coupled optimization in the backend with joint constraints introduced by measurements from the inertial measurement unit, wheel speed sensor and parking slots. We develop a multi-object tracking framework to robustly track parking slots' states. To prove the superiority of our method, we equip an electric vehicle with the related sensors and build an experimental platform based on the ROS2 system. Extensive experiments demonstrate the efficacy and advantages of our method compared with other baselines for parking scenarios.

I Introduction

Simultaneous localization and mapping (SLAM) is a significant task in many fields, including virtual reality and autonomous driving. In the automotive field, SLAM modules aim to achieve precise and real-time localization so that other modules like planning and control can respond in time. There are many popular visual-inertial localization methods [1] that obtain feature points from camera images, incorporate other sensors such as the inertial measurement unit (IMU) and wheel speed sensor (WSS), and achieve state-of-the-art performance.

[Fig. 1: challenging feature tracking conditions in parking environments]

However, as shown in Fig. 1, traditional visual methods suffer from unstable feature tracking, which is often caused by light reflections and frequently moving objects. These methods usually cannot extract enough effective feature points in texture-less, dynamically lit parking environments, which brings challenges to autonomous parking tasks, including automated parking assistance (APA) and automated valet parking (AVP). APA aims to assist parking the vehicle into a nearby vacant parking slot, while AVP focuses on long-distance navigation around the parking lot to find a vacant parking slot. Both tasks have high requirements on localization precision. In practice, insufficient and unstable feature points in complicated parking environments cause accumulated drifts, and even failures in extreme cases [10]. Though there are methods to mitigate accumulated drifts, such as loop closure detection [5], designing localization methods with high precision and stability in such complicated scenarios remains an open problem [10].

To improve localization performance in parking scenarios, a feasible strategy is to enable and fuse more sensors' measurements to accomplish more accurate and robust localization. In recent years, it has become a trend to exploit semantic elements from the scene to optimize the localization estimation of the vehicle [27, 26, 28]. The primary advantage of these methods is that they compensate for the weakness of traditional feature point detectors and provide more constraints for the localization estimation.

In this work, we propose VIPS-Odom, a semantic visual-inertial optimization-based odometry system that reduces accumulated localization errors by dynamically incorporating detected parking slot corner points as extra robust feature points into the SLAM frontend, while simultaneously exploiting parking slot observations as semantic constraints of the optimization in the SLAM backend. Meanwhile, different from previous works, we robustly track and maintain all parking slots' states under a multi-object tracking framework. Our method achieves higher localization precision than other baseline methods for autonomous parking. Our main technical contributions can be summarized as follows:

  1. To tackle difficulties in underground environments, we propose to exploit parking slots as extra high-level semantic observations to assist SLAM. To the best of our knowledge, VIPS-Odom is the first to provide a holistic solution that explicitly fuses parking slot observations into both the SLAM frontend and backend.

  2. We equip an electric vehicle with the related sensors as our experimental platform and conduct extensive real-world experiments over both long-distance and short-distance trajectories in different underground parking lots, a more extensive evaluation than previous related works on autonomous parking scenarios.

  3. Experimental results reflect the superiority and stability of VIPS-Odom against other baselines, including the inertial-based EKF, the traditional visual-inertial optimization-based VINS, the semantic visual-inertial optimization-based VISSLAM, and the LiDAR-based A-LOAM.

II Related Work

II-A Multi-sensor SLAM

There are several well-known localization methods, including the Extended Kalman filter (EKF) [16], MSCKF [31], VINS [1], ORB-SLAM [29] and LOAM [25, 23]. EKF [16] is a classical loosely-coupled framework to fuse measurements from various sensors. A-LOAM [23] calculates the odometry via LiDAR point cloud registration [25], which requires an expensive LiDAR sensor; this high cost makes LiDAR an impractical choice for standard L2 parking applications. For commercial-grade autonomous parking systems, it is preferable to use the camera, inertial measurement unit (IMU) and wheel speed sensor (WSS) instead of the expensive LiDAR. Thus, visual-inertial SLAM (VI-SLAM) is considered more promising for commercial use. According to how they fuse multi-sensor measurements, VI-SLAM systems can be categorized into loosely-coupled and tightly-coupled methods. The former updates the pose estimation with measurements from various sensors independently, while the latter tightly associates those measurements to jointly optimize the odometry, which often leads to more precise localization [26]. VINS [1] is a representative state-of-the-art open-source method, which utilizes the front-view camera and IMU to jointly optimize the odometry within a sliding window composed of several keyframes. VI-SLAM systems depend on low-level visual feature points or directly utilize pixel intensities. However, both low-level visual features and image pixels suffer from instability and inconsistency when navigating parking environments [28, 26].

II-B Semantic SLAM for Autonomous Parking

There are some attempts to incorporate semantic features into VI-SLAM systems. SLAM++ [18] is an early attempt to build a real-time object-oriented SLAM system which constructs an explicit graph of objects. [17] proposes a SLAM system in which semantic objects in the environment are incorporated into a bundle-adjustment framework. Mask-SLAM [2] improves feature point selection with the assistance of semantic segmentation to exclude unstable objects. These methods usually consider semantic features from conventional objects to assist the SLAM task. However, they may not be suitable for complicated parking environments, since parking lots are decorated with scene-specific, unconventional semantic markings (such as parking slot lines [26], parking slot numbers [27] and bumper lines [28]).

For autonomous parking scenarios, AVP-SLAM [28] is a representative semantic SLAM system, which includes a mapping procedure and a localization procedure. In the mapping process, it constructs a semantic map including parking lines and other semantic elements segmented by a modified U-Net [3]. Based on this constructed map and inertial sensors, the poses of the vehicle can be estimated in the localization procedure in a loosely-coupled way. However, the localization precision drops if the constructed map is not frequently updated, since the appearance of the parking lot changes over time.

Different from AVP-SLAM, there are some attempts to utilize tightly-coupled approaches by exploiting semantic objects. VISSLAM [26] exploits parking slots as semantic features in the SLAM backend optimization for the first time. However, it utilizes the Matlab-based DeepPS to detect parking slots (PSs) and resorts to a hard-match strategy to associate previously observed PSs, which makes it prone to mismatching. To build a more robust SLAM system for parking and overcome the limitations of VISSLAM, we build our VIPS-Odom system: we utilize the ubiquitous YOLOv5 [6] to perform real-time PS detection, and flexibly and robustly track and maintain all PS states via a multi-object tracking framework instead of the hard-match strategy. To overcome the difficulties of parking environments, we further adopt the observed PSs as robust semantic constraints in the SLAM backend and add their corner points as robust feature points to the frontend to achieve more stable and robust localization. We additionally add wheel encoder measurements into the backend optimization to further improve localization accuracy. We conduct a more diverse comparison than the two most relevant works, AVP-SLAM and VISSLAM, as shown in Tab. I.

Table I: Compared baseline methods for localization evaluation.

Methods | Compared Baseline Methods for Localization Evaluation
AVP-SLAM [28] | ORB-SLAM2 (Camera)
VIS-SLAM [26] | -
Ours | EKF (IMU+WSS), A-LOAM (LiDAR), VINS (IMU+Camera+WSS), VIS-SLAM (IMU+Cameras)

III Methodology

III-A Sensor Equipment & Calibration

Multi-modal sensor data of VIPS-Odom include visual feature points derived from corner detection, IMU and WSS measurements between two consecutive keyframes [1], and parking slot (PS) observations from BEV images.

Sensor Configuration. Our system (Fig. 2) includes an inertial measurement unit (IMU, mounted at the center of the vehicle), a wheel speed sensor (WSS, mounted near the vehicle's right rear wheel), and four surround-view fisheye cameras mounted at the front, rear, left and right sides. Images are recorded at a resolution of 1920×1080 pixels. The BEV images, with a resolution of 576×576 pixels, cover an area of 11.32 m × 11.32 m. A 360° LiDAR is also mounted on the vehicle roof, producing point cloud data to support the LiDAR-SLAM evaluation of A-LOAM [23].

Sensor Calibration. Since various sensor measurements are tightly coupled in VIPS-Odom, we carefully calibrate the related parameters. We utilize an open-source toolbox [21] to optimize the intrinsics of the four fisheye cameras. For extrinsic calibration, we follow [20] to refine the camera-IMU and IMU-WSS transformations from a coarse initialization.

[Fig. 2: sensor configuration of the experimental platform]

III-B BEV Image Generation

We adopt inverse perspective mapping (IPM) [28] to construct the bird's-eye-view (BEV) image from the four surround-view fisheye cameras. The relationship between the body coordinate, image coordinate and BEV coordinate can be formulated as:

$$
\begin{bmatrix} x_v \\ y_v \\ 1 \end{bmatrix}
= T_c^{v}\,\pi^{-1}\!\left(\begin{bmatrix} u_f \\ v_f \\ 1 \end{bmatrix}\right)
= H_b^{v}\begin{bmatrix} u_b \\ v_b \\ 1 \end{bmatrix}
\tag{1}
$$

where $\pi^{-1}(\cdot)$ is the inverse of the fisheye projection model, $T_c^{v}$ is the extrinsic matrix of the camera with respect to the body coordinate, and $H_b^{v}$ denotes the transformation from the BEV coordinate to the body coordinate. $[x_v, y_v]^{\top}$, $[u_f, v_f]^{\top}$ and $[u_b, v_b]^{\top}$ denote locations in the body, fisheye camera and BEV coordinates, respectively.
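To make Eq. (1) concrete, below is a minimal Python sketch of the BEV-to-body mapping and of warping one undistorted camera image onto the BEV canvas. The axis conventions, the placeholder homography `H_cam_to_bev` and the stand-in image are illustrative assumptions (only the 576-pixel / 11.32 m BEV geometry comes from Sec. III-A); the real system uses the calibrated $T_c^{v}$ and $H_b^{v}$.

```python
# Minimal IPM sketch (assumptions: planar ground, BEV centred on the vehicle;
# the homography below is a placeholder, not a calibrated matrix).
import numpy as np
import cv2

BEV_SIZE = 576                         # BEV image is 576 x 576 px (Sec. III-A)
BEV_RANGE_M = 11.32                    # covers 11.32 m x 11.32 m
M_PER_PX = BEV_RANGE_M / BEV_SIZE      # ~0.0197 m per BEV pixel

def bev_px_to_body(u_b: float, v_b: float) -> np.ndarray:
    """Map a BEV pixel (u_b, v_b) to body-frame metres (x_v, y_v),
    i.e. an axis-aligned instance of H_b^v in Eq. (1); axis signs are assumed."""
    x_v = (BEV_SIZE / 2 - v_b) * M_PER_PX   # forward axis (assumed convention)
    y_v = (BEV_SIZE / 2 - u_b) * M_PER_PX   # left axis (assumed convention)
    return np.array([x_v, y_v])

# Warping one undistorted camera image onto the BEV canvas with a pre-computed
# 3x3 homography (identity here as a placeholder for the calibrated matrix).
H_cam_to_bev = np.eye(3)
img = np.zeros((1080, 1920, 3), np.uint8)     # stand-in for an undistorted frame
bev_patch = cv2.warpPerspective(img, H_cam_to_bev, (BEV_SIZE, BEV_SIZE))
```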

III-C Optimization Formulation

[Fig. 3: overall framework of VIPS-Odom]

Generally, VIPS-Odom is a tightly-coupled optimization-based SLAM system which integrates measurements from multi-modal sensors. The overall framework of VIPS-Odom is shown in Fig. 3. Herein we illustrate the optimization framework of VIPS-Odom. $\mathcal{Z}$ denotes the visual feature points captured by the front-view fisheye camera, $\mathcal{O}$ denotes the observations of parking slot centers in the world frame, and $\mathcal{M}$ and $\mathcal{W}$ denote the IMU and WSS measurements. We aim to solve for the poses $\mathcal{T}$, the map keypoints $\mathcal{P}$ matched with $\mathcal{Z}$, and the locations of PS centers $\mathcal{L}$. The optimization objective is defined as follows:

$$
\{\mathcal{L},\mathcal{T},\mathcal{P}\}^{*}=\arg\max_{\mathcal{L},\mathcal{T},\mathcal{P}} p(\mathcal{L},\mathcal{T},\mathcal{P}\mid\mathcal{O},\mathcal{Z},\mathcal{M},\mathcal{W}).
\tag{2}
$$

Our main target is to obtain a more precise estimation of $\mathcal{T}$ via multi-sensor observations. Following [26], we reformulate $p$ with Bayes' theorem [9] as follows:

$$
p(\mathcal{L},\mathcal{T},\mathcal{P}\mid\mathcal{O},\mathcal{Z},\mathcal{M},\mathcal{W})\propto p(\mathcal{L},\mathcal{T},\mathcal{P})\,p(\mathcal{O},\mathcal{Z},\mathcal{M},\mathcal{W}\mid\mathcal{L},\mathcal{T},\mathcal{P}).
\tag{3}
$$

We further factorize $p$ by separating $\mathcal{O}$ from the other modalities, since the visual keypoints $\mathcal{Z}$ and PS observations $\mathcal{O}$ are obtained from two independent sensor measurements:

$$
\begin{aligned}
p(\mathcal{L},\mathcal{T},\mathcal{P}\mid\mathcal{O},\mathcal{Z},\mathcal{M},\mathcal{W})
&\propto p(\mathcal{L})\,p(\mathcal{T},\mathcal{P})\,p(\mathcal{O}\mid\mathcal{L},\mathcal{T},\mathcal{P})\,p(\mathcal{Z},\mathcal{M},\mathcal{W}\mid\mathcal{L},\mathcal{T},\mathcal{P})\\
&= p(\mathcal{L})\,p(\mathcal{T},\mathcal{P})\,p(\mathcal{O}\mid\mathcal{L},\mathcal{T})\,p(\mathcal{Z},\mathcal{M},\mathcal{W}\mid\mathcal{T},\mathcal{P})\\
&= \underbrace{p(\mathcal{T})\,p(\mathcal{P})\,p(\mathcal{Z}\mid\mathcal{T},\mathcal{P})\,p(\mathcal{M},\mathcal{W}\mid\mathcal{T})}_{\text{visual-inertial term}}\;\underbrace{p(\mathcal{L})\,p(\mathcal{O}\mid\mathcal{L},\mathcal{T})}_{\text{PS term}},
\end{aligned}
\tag{4}
$$

where the first underbraced factor is related to visual features and IMU/WSS motion data, and the remainder denotes the PS-related term. Given the above analysis, we can perform optimization by jointly minimizing the visual reprojection error, inertial motion error, and PS error. VIPS-Odom thus deals with low-level geometric/motion data and higher-level semantic information simultaneously, which facilitates more stable and accurate localization in parking scenarios. The backend optimization is discussed in Sec. III-F.
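Assuming zero-mean Gaussian measurement noise and taking the negative logarithm of Eq. (4), the MAP problem above reduces to a nonlinear least-squares problem (a standard step, made explicit here for clarity):

$$
\{\mathcal{L},\mathcal{T},\mathcal{P}\}^{*}=\arg\min_{\mathcal{L},\mathcal{T},\mathcal{P}}\;\Big(\underbrace{\mathcal{F}(\mathbf{x})}_{\text{visual-inertial term, Eq.\,(5)}}+\underbrace{\mathcal{G}(\mathbf{x})}_{\text{PS term, Eq.\,(9)}}\Big),
$$

where $\mathcal{F}$ and $\mathcal{G}$ are the residual sums defined in Sec. III-F.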

III-D Parking Slot (PS) Detection & Management

PS Detection. Different from the feature line detection in AVP-SLAM [28] and the Matlab-based model in VISSLAM [26], we exploit a CNN for PS detection, motivated by previous successful attempts to detect PSs with CNNs [8, 11]. Following [11], we develop a light-weight PS detector based on YOLOv5 [6] to detect PSs in BEV images and predict the state of each PS (i.e., occupied or vacant, see Fig. 3). The detected PS states later assist the autonomous parking planner in choosing a vacant PS. Training details are introduced in Sec. IV-A.
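The paper trains a custom light-weight detector; as a hedged illustration only, the sketch below uses the public ultralytics/yolov5 hub interface as a stand-in. The weight path `weights/ps_bev.pt` is a placeholder, and only the 50% confidence cut-off comes from Sec. IV-A.

```python
# Sketch of PS detection on a BEV image with a YOLOv5-style detector
# (illustrative stand-in for the custom detector described in the paper).
import torch

model = torch.hub.load('ultralytics/yolov5', 'custom', path='weights/ps_bev.pt')
model.conf = 0.5                        # keep detections with probability > 50%

def detect_parking_slots(bev_image):
    """Return [x1, y1, x2, y2, conf, cls] rows for PS detections on one BEV frame."""
    results = model(bev_image)          # bev_image: HxWx3 numpy array (RGB)
    return results.xyxy[0].cpu().numpy()
```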

PS Management. We utilize the classic multi-object tracking framework SORT [15] to manage PS states. Parking slots are originally detected in the BEV image frame, and the detected PS corner points are then converted to the body frame of the ego vehicle (see Eq. 1). According to the timestamp at which the PS is detected in the BEV images, we fetch the nearest vehicle pose among the keyframes in the sliding window (see Sec. III-F); the PS pose can then be converted from the body frame to the world frame. The world frame is initialized at the beginning of the trajectory (see Sec. III-A). Each parking slot can be viewed as a rectangle located at a distinct position in the world frame. The slots are also plotted on the constructed map and BEV images (see Fig. 7), which assists downstream tasks in autonomous parking. To assign detections to existing parking slots, we update the tracker states in a timely manner, as sketched below. Each target's bounding box geometry is estimated by predicting its new location in the world frame with a Kalman filter [14]. The assignment cost matrix is then computed as the intersection-over-union (IoU) distance between each detection and all predicted bounding boxes of existing targets, and the assignment is solved with the Hungarian algorithm [4, 15]. Additionally, a minimum IoU is imposed to reject assignments where the overlap between detection and target is less than 0.3. A PS detection matched to an existing PS shares its id. We also implement and evaluate the hard-match association method used in VIS-SLAM [26], which pre-defines thresholds for associating previous PSs and creating new PSs; the related analysis is discussed in Sec. IV-F.
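A minimal sketch of the association step described above, assuming axis-aligned boxes in the world frame: IoU costs between predicted track boxes and new PS detections, solved with the Hungarian algorithm and rejecting matches below the 0.3 IoU threshold. Function names are our own; the actual tracker follows SORT [15].

```python
# IoU + Hungarian association sketch for PS tracking (SORT-style).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, det_boxes, min_iou=0.3):
    """Return (matches, unmatched_dets); matches are (track_idx, det_idx) pairs."""
    if len(track_boxes) == 0 or len(det_boxes) == 0:
        return [], list(range(len(det_boxes)))
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)       # Hungarian assignment
    matches, unmatched = [], set(range(len(det_boxes)))
    for r, c in zip(rows, cols):
        if 1.0 - cost[r, c] >= min_iou:            # reject weak overlaps
            matches.append((r, c))
            unmatched.discard(c)
    return matches, sorted(unmatched)
```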

III-E Frontend Integration with PS

For frontend visual odometry, visual features are detected via the Shi-Tomasi algorithm [7], and we maintain roughly one hundred feature points for each new image. Visual features are tracked by KLT sparse optical flow [12], as sketched below. Note that this feature point number is much smaller than in other optimization-based SLAM methods, because parking lots are often texture-less with dynamic lighting [28], so fewer feature points can be detected. Correspondingly, traditional SLAM methods like ORB-SLAM [29] usually show unsatisfying performance in parking scenarios [13].
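A minimal sketch of this detect-and-track pipeline with OpenCV; the quality and distance thresholds are illustrative assumptions, only the cap of about one hundred features comes from the text.

```python
# Shi-Tomasi corner detection plus KLT sparse optical flow tracking (sketch).
import cv2
import numpy as np

MAX_FEATURES = 100

def detect_features(gray):
    """Shi-Tomasi corners, capped at roughly one hundred points per image."""
    pts = cv2.goodFeaturesToTrack(gray, maxCorners=MAX_FEATURES,
                                  qualityLevel=0.01, minDistance=20)
    return pts if pts is not None else np.empty((0, 1, 2), np.float32)

def track_features(prev_gray, curr_gray, prev_pts):
    """KLT pyramidal optical flow; returns the successfully tracked point pairs."""
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   prev_pts, None)
    ok = status.reshape(-1) == 1
    return prev_pts[ok], curr_pts[ok]
```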

[Fig. 4: frontend integration of PS corner points]

As shown in Fig. 4, we introduce the detected parking slot (PS) corner points as extra robust feature points to mitigate the aforementioned issue. We integrate the fisheye camera images from multiple perspectives into one BEV image through IPM (Sec. III-B). The points on each BEV image correspond one-to-one to points in the original fisheye cameras; therefore, each PS feature point in BEV space is paired with a pixel in the front-view fisheye camera through the transformation matrix. After being predicted by the PS detector (see Sec. III-D), these PS feature points in the BEV image frame can be inversely projected into the camera frame as extra feature points, according to the transformation between the BEV coordinate and the fisheye camera coordinate (see Sec. III-B and the sketch below). Aiming to maintain higher-quality and more robust feature points, we dynamically merge these extra PS feature points into the sequence of detected Shi-Tomasi visual feature points. These extra feature points benefit localization in the challenging texture-less environment of parking scenes. 2D features are first undistorted and then projected onto a unit sphere after passing outlier rejection, following VINS [1]. Keyframe selection is conducted in the SLAM frontend following two commonly-used criteria from VINS [1], mainly according to the parallax between the current frame and previous frames. We also conduct experiments to examine how the number of feature points affects our system's precision in Sec. IV-D.
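For illustration, the sketch below lifts one PS corner from body-frame metres (obtained from the BEV pixel via Eq. 1) into the front fisheye image so that it can be appended to the tracked feature list. The intrinsics `K`, distortion `D` and body-to-camera extrinsics are placeholders, not the vehicle's calibration.

```python
# Sketch: project a PS corner (body frame, ground plane z=0) into the front
# fisheye image. All calibration values below are placeholders.
import numpy as np
import cv2

K = np.array([[400.0, 0.0, 960.0],
              [0.0, 400.0, 540.0],
              [0.0, 0.0, 1.0]])                    # placeholder intrinsics
D = np.zeros((4, 1))                               # placeholder fisheye distortion
R_vc = np.array([[0.0, -1.0, 0.0],                 # placeholder body->camera rotation
                 [0.0, 0.0, -1.0],                 # (camera looks along body +x)
                 [1.0, 0.0, 0.0]])
t_vc = np.zeros((3, 1))                            # placeholder body->camera translation

def ps_corner_to_front_image(corner_body_xy):
    """Return the (u, v) pixel of a PS corner given in body-frame metres."""
    p_body = np.array([[[corner_body_xy[0], corner_body_xy[1], 0.0]]])  # 1x1x3
    rvec, _ = cv2.Rodrigues(R_vc)
    img_pts, _ = cv2.fisheye.projectPoints(p_body, rvec, t_vc, K, D)
    return img_pts.reshape(2)

u, v = ps_corner_to_front_image([5.0, 1.0])   # a corner 5 m ahead, 1 m to the left
```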

III-F Backend Optimization with PS

We elaborate on the backend non-linear optimization that exploits the multi-modal sensors' measurements. Based on VINS [1], VIPS-Odom performs optimization within a sliding window. The overall optimization objective can be roughly divided into a visual-inertial term and a PS term (see Sec. III-C).

Visual-inertial Term. The visual-inertial term is constructed following the formulation of VINS, and we further exploit the wheel speed sensor (WSS) to enrich the inertial measurements for complicated environments. Specifically, in each sliding window, VIPS-Odom maintains a set of $K$ keyframes $f_0,\ldots,f_{K-1}$ for backend optimization. Each keyframe $f_k$ is associated with its timestamp $t_k$ and camera pose $\mathcal{T}^{k}=[R_k, t_k]\in\mathbb{SE}(3)$. The motion (change of orientation, position and velocity) between two consecutive keyframes can be determined either by pre-integrated IMU/WSS data or by visual odometry [1]. Referring to [24], we synthesize the IMU and WSS measurements by pre-integration. The overall visual-inertial optimization function is formulated as follows:

$$
\mathcal{F}(\mathbf{x})=\sum_{p}\sum_{j}{\mathbf{e}_{r}^{p,j}}^{T}\mathbf{W}_{r}\,\mathbf{e}_{r}^{p,j}+\sum_{k=0}^{K-2}{\mathbf{e}_{s}^{k}}^{T}\boldsymbol{\Sigma}_{k,k+1}^{-1}\mathbf{e}_{s}^{k}+\mathbf{e}_{m}^{T}\mathbf{e}_{m},
\tag{5}
$$

where $\mathcal{F}(\mathbf{x})$ denotes the visual-inertial backend loss function, $p$ is the index of observed feature points (landmarks), and $j$ is the index of the image on which landmark $p$ appears. $K$ is the number of images in the sliding window. $\mathbf{e}_{r}^{p,j}$ is the bundle adjustment [22] reprojection residual, $\mathbf{e}_{s}^{k}$ is the inertial residual, and $\mathbf{e}_{m}$ is the marginalization residual. $\mathbf{W}_{r}$ is the uniform information matrix for all reprojection terms, and $\boldsymbol{\Sigma}_{k,k+1}$ denotes the forward propagation of covariance [1, 24]. The vector $\mathbf{x}$ incorporates the motion states of the keyframes and each landmark's inverse depth [1], and can be formulated as:

$$
\mathbf{x}=\left[\mathbf{x}_{0},\mathbf{x}_{1},\ldots,\mathbf{x}_{K-1},\lambda_{0},\lambda_{1},\ldots,\lambda_{m-1}\right],
\tag{6}
$$
$$
\mathbf{x}_{k}=\left[\mathbf{p}_{k}^{w},\mathbf{v}_{k}^{w},\mathbf{q}_{k}^{w},\mathbf{b}_{a_{k}},\mathbf{b}_{\boldsymbol{\omega}_{k}}\right],
\tag{7}
$$

where $\mathbf{x}_{k}$ is the state of the $k$-th keyframe, $\lambda_{0},\ldots,\lambda_{m-1}$ denote the inverse depths of the landmarks in the camera frame, $\mathbf{p}_{k}^{w},\mathbf{v}_{k}^{w},\mathbf{q}_{k}^{w}$ denote the position, velocity and orientation of the $k$-th frame (which correlate to $\mathcal{T}^{k}$), and $\mathbf{b}_{a_{k}},\mathbf{b}_{\boldsymbol{\omega}_{k}}$ denote the IMU accelerometer and gyroscope biases corresponding to the $k$-th image. Other details of our optimization procedure are basically in accordance with VINS [1].
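As a much-simplified illustration of how IMU and WSS data can be pre-integrated between two keyframes, the planar sketch below integrates heading from the gyroscope and translation from the wheel speed. The real system follows the full 3D IMU/WSS pre-integration of [24]; this only conveys the idea.

```python
# Simplified planar pre-integration between two keyframes (illustrative only).
import numpy as np

def preintegrate_planar(gyro_z, wheel_speed, dt):
    """gyro_z: z-axis angular rates (rad/s); wheel_speed: speeds (m/s), both
    sampled at period dt. Returns (dx, dy, dyaw) in the first keyframe's frame."""
    x = y = yaw = 0.0
    for w, v in zip(gyro_z, wheel_speed):
        yaw_mid = yaw + 0.5 * w * dt          # midpoint integration of heading
        x += v * np.cos(yaw_mid) * dt
        y += v * np.sin(yaw_mid) * dt
        yaw += w * dt
    return x, y, yaw
```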

[Fig. 5: illustration of the PS term]

PS Term. We incorporate PSs into the backend explicitly. In each sliding window, each PS with the same id is associated with multiple observations, which naturally forms a registration constraint between each observed PS and its maintained state in the world frame. Each maintained state and its PS observations are associated via the PS id provided by our PS management module (Sec. III-D). Our PS term is illustrated in Fig. 5. The registration error of parking slot $i$ observed at keyframe $k$ is defined as:

$$
\mathbf{e}_{\mathrm{reg}}^{k,i}=\mathcal{T}^{k}\mathcal{L}^{i,k}-\mathcal{O}_{i},
\tag{8}
$$

where $\mathcal{O}_{i}$ denotes the currently maintained location state of the PS with id $i$ in the world frame, $\mathcal{T}^{k}$ denotes the vehicle pose of the $k$-th keyframe in the world frame (given by the position and orientation in Eq. 7), and $\mathcal{L}^{i,k}$ denotes the observation of PS $i$ in the body frame. The PS error term is derived as:

$$
\mathcal{G}(\mathbf{x})=\sum_{k=1}^{K}\sum_{i\in\mathcal{S}}\rho\!\left(\left(\mathbf{e}_{\mathrm{reg}}^{k,i}\right)^{T}\alpha_{k}^{i}\,\mathbf{e}_{\mathrm{reg}}^{k,i}\right),
\tag{9}
$$

where $\mathcal{S}$ is the set of PSs seen in this sliding window and $\rho(\cdot)$ denotes the robust kernel function [32]. $\alpha_{k}^{i}$ is a reweighting term based on the $k$-th keyframe's observation of parking slot $i$; if PS $i$ is not observed in the $k$-th keyframe, $\alpha_{k}^{i}$ is $0$. The reason for this reweighting term is that, due to perspective and lens distortions, the outer regions of the generated BEV images are often stretched or compressed, which introduces more errors into PS detection (see Fig. 5). For the $k$-th keyframe (supposing it incorporates $N_{k}$ PSs), we first compute the pixel distance $d_{k}^{i}$ between the center point of the observed PS $i$ and the center point of the BEV image in the BEV image frame, normalized by half of the BEV image size (Sec. III-A):

$$
\alpha_{k}^{i}=\frac{N_{k}\,e^{-d_{k}^{i}}}{\sum_{i=1}^{N_{k}}e^{-d_{k}^{i}}}.
\tag{10}
$$

Given the above, $\mathcal{F}(\mathbf{x})$ covers the visual-inertial constraint term while $\mathcal{G}(\mathbf{x})$ provides a semantic registration constraint term, and their combination follows the formulation in Sec. III-C.
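For illustration, the small sketch below evaluates the reweighting of Eq. (10) and the registration residual of Eq. (8) outside the solver; in the actual system this term is implemented as Ceres residual blocks (Sec. IV-A), and the 4×4 homogeneous pose representation here is an assumption of the sketch.

```python
# PS reweighting (Eq. 10) and registration residual (Eq. 8), evaluated standalone.
import numpy as np

def ps_weights(bev_centers_px, bev_size=576):
    """alpha_k^i for the N_k parking slots seen in one keyframe, from their
    BEV-pixel distances to the BEV image center, normalized by half the size."""
    c = bev_size / 2.0
    d = np.array([np.linalg.norm(np.asarray(p) - c) / c for p in bev_centers_px])
    w = np.exp(-d)
    return len(d) * w / w.sum()          # weights sum to N_k, as in Eq. (10)

def ps_registration_residual(T_wk, L_body, O_world):
    """Eq. (8): transform the observed PS center L_body (body frame, 3-vector)
    by the keyframe pose T_wk (4x4) and compare with the maintained world state."""
    return (T_wk @ np.append(L_body, 1.0))[:3] - O_world
```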

IV Experiments

IV-A Experiment Setup

We deploy our system on an electric vehicle along with a workstation running the Ubuntu operating system, and all modules communicate via ROS2 [30]. The CPU and GPU are an i7-8700 CPU@3.20GHz and an A5000, respectively. The overall optimization backend is implemented with the Ceres Solver [32], and we adopt the commonly-used Cauchy function as the robust kernel (see Eq. 9). All experiments are repeated three times and averaged. Our experiments are conducted in two parking lots (see Fig. 6).
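For reference, the Cauchy robust kernel provided by the Ceres Solver with scale parameter $a$ takes the form

$$
\rho(s)=a^{2}\log\!\left(1+\frac{s}{a^{2}}\right),
$$

which grows only logarithmically in the squared residual $s$ and therefore limits the influence of outlier PS observations in Eq. (9).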

Baselines. Our baselines include four classical methods that are suitable for deployment on a commercial-grade vehicle: (i) EKF [16]: a classic loosely-coupled approach to fusing IMU and WSS measurements. (ii) A-LOAM [23]: a LiDAR-based method which utilizes ICP [25] with a point-to-edge distance. (iii) VISSLAM [26]: utilizes the IMU, feature points and matched parking slots to jointly optimize the odometry; since its source code is not released, we refer to VINS [1] to process the IMU and camera measurements and additionally add the PS measurements into the backend optimization. (iv) VINS [1]: provides a tightly-coupled optimization-based approach to fusing IMU measurements and feature points from the front-view fisheye camera; in addition, we refer to [24] to add WSS measurements into VINS to improve the accuracy and stability of localization in parking environments. We compare these methods in Tab. II, where the Geometric map is constructed from feature points, the Point Cloud map from LiDAR point clouds, and the Semantic map from both feature points and parking slots in our method.

Table II: Sensors, map types and PS usage of all methods.

Method | Sensor | Map | PS
EKF [16] | IMU + WSS | × | ×
A-LOAM [23] | LiDAR | Point Cloud | ×
VISSLAM [26] | Camera + IMU | Semantic | ✓
VINS [1] | Camera + IMU + WSS | Geometric | ×
Ours | Camera + IMU + WSS | Semantic | ✓

PS Detector. Our PS detector is a YOLOv5-based network, trained with nearly 50k BEV images collected in underground and outdoor parking lots for better generalization. Our PSDet module views each parking slot (PS) as a planar object with a corresponding human-annotated label (i.e., PS shape, PS occupation state, four corner point locations and their visibility states); see Figs. 3, 5 and 7. We only keep PS detection results with output probabilities greater than 50%. The training process lasts for 28 epochs with a fixed batch size of 64. The trained model is then deployed to support online inference via TensorRT [33].

Evaluation Trajectories. We collect a comprehensive group of short-distance trajectories (APA scenarios) and another group of long-distance trajectories (AVP scenarios) in two real-world parking lots (see Tab. IV and Fig. 6; #1 denotes the first lot and #2 the second). The vehicle moves at about 4 km/h for short-distance trajectories and about 10 km/h for long-distance ones. During each trajectory, we record all sensors' measurements into ROS bags [30] for offline evaluation. Tab. III shows the lengths and durations of the trajectories.

[Fig. 6: the two experimental parking lots and evaluation trajectories]

Table III: Lengths and durations of the evaluation trajectories.

 | straight 0 | left 45 | right 45 | left 90 | right 90 | left parallel | right parallel | round
Traj. length (meter; deviation in parentheses), Parking lot #1 | 4.94 (0.13) | 5.95 (0.13) | 10.92 (0.11) | 11.29 (0.23) | 10.24 (0.06) | 8.25 (0.05) | 13.08 (0.18) | 121.62 (1.09)
Traj. length (meter; deviation in parentheses), Parking lot #2 | 4.64 (0.05) | 5.81 (0.01) | 6.00 (0.07) | 8.33 (0.29) | 12.35 (0.21) | 6.53 (0.07) | 7.91 (0.17) | 133.50 (1.27)
Duration (second), Parking lot #1 | 24.1 | 28.7 | 52.3 | 49.1 | 52.2 | 69.1 | 74.5 | 92.0
Duration (second), Parking lot #2 | 24.9 | 28.8 | 39.7 | 36.3 | 56.5 | 31.5 | 38.2 | 102.2

Ground Truth Acquisition & Metric. We use the Root Mean Square Error (RMSE), a commonly used metric for measuring trajectory errors. Most methods use the GPS/RTK signal as ground truth in open outdoor areas, but this is not suitable for underground environments, which have poor coverage of GPS/RTK measurements. Therefore, we utilize a motion capture system to acquire the trajectories' ground truth, following previous works [19].
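For reference, a minimal sketch of the RMSE computation, assuming the estimated and ground-truth trajectories are already time-synchronized and expressed in the same frame (the alignment itself is not shown).

```python
# RMSE over matched trajectory positions.
import numpy as np

def trajectory_rmse(est_xy, gt_xy):
    """est_xy, gt_xy: (N, 2) arrays of planar positions at matched timestamps."""
    err = np.linalg.norm(np.asarray(est_xy) - np.asarray(gt_xy), axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```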

IV-B Experimental Results

We compare our method with the other baseline methods on various parking trajectories, including short-distance and long-distance parking scenarios. Experimental results are shown in Tab. IV, and the corresponding parking trajectories are shown in Fig. 6.

For short-distance scenarios, our VIPS-Odom achieves better overall localization performance: merging parking slot features into the visual-inertial odometry effectively reduces the localization error. Meanwhile, A-LOAM has results comparable to ours even though LiDAR is much more costly than our sensor configuration, which reflects the potential of optimization-based methods with low-cost sensors. For long-distance scenarios, the optimization-based methods VISSLAM, VINS and VIPS-Odom achieve better localization performance. VISSLAM has a failure case on one long-distance trajectory; we analyze the reason in Sec. IV-F. With the semantic constraints provided by PSs, our VIPS-Odom has lower localization error than VINS, which indicates that drifts are efficiently reduced by incorporating PSs.

Table IV: Localization RMSE (m) of all compared methods.

Trajectory | EKF | A-LOAM | VISSLAM | VINS | Ours
straight 0 (#1) | 0.14 | 0.48 | 0.39 | 0.11 | 0.13
straight 0 (#2) | 0.11 | 0.12 | 0.08 | 0.08 | 0.05
left 45 (#1) | 0.17 | 0.11 | 0.20 | 0.16 | 0.13
left 45 (#2) | 0.13 | 0.16 | 0.03 | 0.16 | 0.01
right 45 (#1) | 0.12 | 0.06 | 0.12 | 0.18 | 0.11
right 45 (#2) | 0.16 | 0.12 | 0.07 | 0.07 | 0.07
left 90 (#1) | 0.26 | 0.04 | 0.27 | 0.18 | 0.23
left 90 (#2) | 0.25 | 0.18 | 0.36 | 0.35 | 0.29
right 90 (#1) | 0.10 | 0.13 | 0.09 | 0.07 | 0.06
right 90 (#2) | 0.11 | 0.07 | 0.23 | 0.23 | 0.21
left parallel (#1) | 0.11 | 0.05 | 0.03 | 0.10 | 0.05
left parallel (#2) | 0.17 | 0.11 | 0.13 | 0.08 | 0.07
right parallel (#1) | 0.16 | 0.04 | 0.22 | 0.19 | 0.18
right parallel (#2) | 0.17 | 0.14 | 0.23 | 0.16 | 0.17
Short-distance mean | 0.154 | 0.129 | 0.175 | 0.151 | 0.126
round (#1) | 0.35 | 7.60 | fail | 2.21 | 1.09
round (#2) | 7.86 | 2.73 | 1.73 | 1.85 | 1.27
Long-distance mean | 4.105 | 5.165 | - | 2.030 | 1.180
Overall mean | 0.648 | 0.759 | - | 0.386 | 0.258

Table V: Ablation on fusing PS into the frontend and backend (RMSE, m).

Trajectory | VINS | Ours (+front.) | Ours (+back.) | Ours (+both)
Straight 0 (#1) | 0.11 | 0.03 | 0.12 | 0.13
Straight 0 (#2) | 0.08 | 0.07 | 0.07 | 0.05
Left 45 (#1) | 0.15 | 0.15 | 0.14 | 0.13
Left 45 (#2) | 0.04 | 0.02 | 0.04 | 0.01
Right 45 (#1) | 0.18 | 0.17 | 0.13 | 0.11
Right 45 (#2) | 0.07 | 0.07 | 0.06 | 0.07
Left 90 (#1) | 0.18 | 0.26 | 0.31 | 0.23
Left 90 (#2) | 0.35 | 0.30 | 0.32 | 0.29
Right 90 (#1) | 0.07 | 0.03 | 0.11 | 0.06
Right 90 (#2) | 0.23 | 0.20 | 0.12 | 0.21
Left parallel (#1) | 0.10 | 0.04 | 0.09 | 0.05
Left parallel (#2) | 0.08 | 0.07 | 0.06 | 0.07
Right parallel (#1) | 0.19 | 0.20 | 0.20 | 0.18
Right parallel (#2) | 0.16 | 0.14 | 0.19 | 0.17
Short-distance mean | 0.142 | 0.125 | 0.140 | 0.126
Round (#1) | 2.21 | 2.22 | 1.60 | 1.09
Round (#2) | 1.85 | 1.81 | 1.26 | 1.27
Long-distance mean | 2.03 | 2.02 | 1.43 | 1.18
Overall mean | 0.378 | 0.361 | 0.301 | 0.258

Table VI: Sensitivity analysis on the feature point number #N and keyframe number #K (RMSE, m; std. over the three settings in parentheses).

(a) Feature point number #N (fixed K=10)

Trajectory | VINS: N=50 / N=110 / N=200 (std.) | Ours: N=50 / N=110 / N=200 (std.)
Straight 0 (#1) | 0.08 / 0.11 / 0.11 (0.02) | 0.08 / 0.13 / 0.12 (0.03)
Straight 0 (#2) | 0.05 / 0.08 / 0.08 (0.02) | 0.05 / 0.05 / 0.06 (0.01)
Left 45 (#1) | 0.13 / 0.15 / 0.18 (0.03) | 0.14 / 0.13 / 0.13 (0.01)
Left 45 (#2) | 0.04 / 0.04 / 0.10 (0.04) | 0.02 / 0.01 / 0.09 (0.04)
Right 45 (#1) | 0.11 / 0.18 / 0.10 (0.04) | 0.12 / 0.11 / 0.14 (0.02)
Right 45 (#2) | 0.07 / 0.07 / 0.08 (0.01) | 0.06 / 0.07 / 0.06 (0.01)
Left 90 (#1) | 0.14 / 0.18 / 0.19 (0.03) | 0.20 / 0.23 / 0.28 (0.04)
Left 90 (#2) | 0.28 / 0.35 / 0.33 (0.04) | 0.32 / 0.29 / 0.34 (0.03)
Right 90 (#1) | 0.04 / 0.07 / 0.07 (0.02) | 0.09 / 0.06 / 0.08 (0.02)
Right 90 (#2) | 0.23 / 0.23 / 0.22 (0.01) | 0.28 / 0.21 / 0.16 (0.06)
Left parallel (#1) | 0.10 / 0.10 / 0.14 (0.02) | 0.04 / 0.05 / 0.01 (0.02)
Left parallel (#2) | 0.08 / 0.08 / 0.07 (0.01) | 0.01 / 0.07 / 0.05 (0.03)
Right parallel (#1) | 0.18 / 0.19 / 0.15 (0.02) | 0.16 / 0.18 / 0.24 (0.04)
Right parallel (#2) | 0.07 / 0.16 / 0.15 (0.05) | 0.13 / 0.17 / 0.15 (0.02)
Short-distance mean | 0.114 / 0.142 / 0.141 (0.026) | 0.121 / 0.126 / 0.136 (0.027)
Round (#1) | 2.27 / 2.21 / 2.64 (0.23) | 1.33 / 1.09 / 1.25 (0.12)
Round (#2) | 1.93 / 1.85 / 1.65 (0.14) | 1.29 / 1.27 / 1.33 (0.03)
Long-distance mean | 2.10 / 2.03 / 2.145 (0.185) | 1.31 / 1.18 / 1.29 (0.075)
Overall mean | 0.363 / 0.378 / 0.391 (0.046) | 0.270 / 0.258 / 0.281 (0.033)

(b) Keyframe number #K (fixed N=110)

Trajectory | VINS: K=7 / K=10 / K=13 (std.) | Ours: K=7 / K=10 / K=13 (std.)
Straight 0 (#1) | 0.11 / 0.11 / 0.14 (0.02) | 0.11 / 0.13 / 0.11 (0.01)
Straight 0 (#2) | 0.08 / 0.08 / 0.07 (0.01) | 0.07 / 0.05 / 0.07 (0.01)
Left 45 (#1) | 0.20 / 0.15 / 0.12 (0.04) | 0.13 / 0.13 / 0.12 (0.01)
Left 45 (#2) | 0.07 / 0.04 / 0.04 (0.02) | 0.04 / 0.01 / 0.01 (0.02)
Right 45 (#1) | 0.11 / 0.18 / 0.27 (0.08) | 0.17 / 0.11 / 0.14 (0.03)
Right 45 (#2) | 0.07 / 0.07 / 0.08 (0.01) | 0.06 / 0.07 / 0.07 (0.01)
Left 90 (#1) | 0.18 / 0.18 / 0.15 (0.02) | 0.24 / 0.23 / 0.24 (0.01)
Left 90 (#2) | 0.31 / 0.35 / 0.32 (0.02) | 0.30 / 0.29 / 0.30 (0.01)
Right 90 (#1) | 0.05 / 0.07 / 0.07 (0.01) | 0.15 / 0.06 / 0.12 (0.05)
Right 90 (#2) | 0.19 / 0.23 / 0.24 (0.03) | 0.21 / 0.21 / 0.23 (0.01)
Left parallel (#1) | 0.43 / 0.10 / 0.12 (0.19) | 0.01 / 0.05 / 0.01 (0.02)
Left parallel (#2) | 0.10 / 0.08 / 0.10 (0.01) | 0.08 / 0.07 / 0.10 (0.02)
Right parallel (#1) | 0.15 / 0.19 / 0.15 (0.02) | 0.20 / 0.18 / 0.20 (0.01)
Right parallel (#2) | 0.14 / 0.16 / 0.11 (0.03) | 0.16 / 0.17 / 0.15 (0.01)
Short-distance mean | 0.156 / 0.142 / 0.141 (0.036) | 0.138 / 0.126 / 0.134 (0.016)
Round (#1) | 2.53 / 2.21 / 2.08 (0.23) | 0.87 / 1.09 / 1.01 (0.11)
Round (#2) | 1.74 / 1.85 / 1.78 (0.06) | 1.09 / 1.27 / 1.30 (0.11)
Long-distance mean | 2.135 / 2.03 / 1.93 (0.145) | 0.98 / 1.18 / 1.155 (0.110)
Overall mean | 0.404 / 0.378 / 0.365 (0.050) | 0.243 / 0.258 / 0.261 (0.028)

IV-C Ablation Study

In this work, we fuse the parking slot (PS) information into both the SLAM frontend and backend, and herein we conduct an ablation study to investigate the role of PS information in each. The experimental results are shown in Tab. V. Based on these results, we find that incorporating PSs in the frontend benefits localization on short-distance parking trajectories, while incorporating PSs in the backend reduces the accumulated drift over long-distance trajectories and improves localization precision. In short-distance parking scenarios (near the target parking slot), the parking slot features supplemented in the frontend improve the overall quality of the feature points, helping the algorithm to better track feature points and estimate poses in the backend. In long-distance scenarios, the vehicle cruises at a relatively steady speed; there, adding parking slots as extra semantic objects in the backend optimization is particularly important for reducing long-term accumulated drift.

IV-D Sensitivity Analysis

When deployed in practice, optimization-based SLAM methods have two important hyper-parameters: the number of keyframes in each sliding window and the number of feature points tracked per image. Herein we conduct a sensitivity analysis on these two hyper-parameters to compare the robustness of our VIPS-Odom with the optimization-based VINS [1]. Related results are shown in Tab. VI. For feature point number selection, we choose different maximum numbers of feature points in the front-view camera to investigate the impact on localization accuracy. With the assistance of PSs, our VIPS-Odom is more robust to the feature point number on long-distance trajectories; we also observe that VIPS-Odom is slightly less robust than VINS on short-distance trajectories. Meanwhile, the number of keyframes in the optimization sliding window is significant for optimization-based methods. We choose different numbers of keyframes in the backend optimization to investigate the impact on localization accuracy. As the results show, our VIPS-Odom achieves better robustness to the keyframe number, which helps VIPS-Odom generalize to more scenarios.

[Fig. 7: qualitative visualization of a long-distance trajectory]

IV-E Qualitative Evaluation and Real-time Analysis

Using the Rviz toolkit, we visualize one long-distance trajectory (#1) and provide a qualitative evaluation of VIPS-Odom in Fig. 7. Our real-time PS detection module detects parking slots in the current frame, and our PS management module tracks all observed parking slots under a multi-object tracking framework. With this tracking framework, we robustly track all PS states to handle real-time PS observations, so that well-maintained PS information (Fig. 7(a)) is considered in the backend. Regarding frequencies, the IMU/WSS readings arrive at 100 Hz and the fisheye cameras at 20 Hz; BEV image generation and PS detection on the generated BEV images run at 10 Hz. The odometry is solved in real time at 25 Hz.

IV-F PS Association Analysis

We compare the performance (RMSE errors) of our SORT-based PS association method and the hard-match policy used in VISSLAM [26]. For a fair comparison, backend optimization is performed by VIPS-Odom in both cases. Our PS association is achieved by maintaining PS states in a SORT-based real-time multi-object tracking framework. The hard-match policy strictly matches against the averaged center points of previous PS observations, and is thus prone to suffering from noisy long-distance observations. We show the experimental results in Tab. VIII. On the long-distance Round (#1) trajectory, the hard-match policy suffers from severe PS mismatches, creates numerous new parking slots in the matching module, and fails to form reliable PS factors (see Sec. III-F). These experiments demonstrate the flexibility and benefit of our PS association method over the hard-match policy.

Table VIII: PS association comparison (RMSE, m).

Trajectory | Hard Match Policy | SORT
Short-distance (Mean) | 0.165 | 0.126
Long-distance Round (#1) | fail | 1.09
Long-distance Round (#2) | 1.49 | 1.27

IV-G Backend PS reweighting Analysis

Herein we conduct an ablation study on the reweighting of the PS error terms in Eq. 9. We set $\alpha_{k}^{i}$ in Eq. 9 to the constant 1 to disable the reweighting terms. From the experimental results, we find that the reweighting terms help to improve localization accuracy.

Trajectory | without Reweighting | with Reweighting
Short-distance (Mean) | 0.138 | 0.126
Long-distance Round (Mean) | 1.35 | 1.18

V Conclusion

In this work, we propose VIPS-Odom, a novel tightly-coupled SLAM system that fuses measurements from an inertial measurement unit, a wheel speed sensor and four surround-view fisheye cameras to achieve high-precision and stable localization. We detect parking slots from BEV images in real time, robustly maintain the states of observed parking slots, and simultaneously exploit parking slot observations as both frontend visual features and backend optimization factors to improve localization precision. We also develop an experimental platform with the related sensor suite to evaluate the performance of VIPS-Odom. Extensive real-world experiments covering both short-distance and long-distance parking scenarios reflect the effectiveness and stability of our method against other baseline methods. For future work, we consider improving VIPS-Odom in the following aspects: (1) We do not include a loop detection module in our system, since loop detection is not necessary for short-distance APA scenarios and some long-distance AVP scenarios; we can extend our system with a loop detection module to further improve the performance for long-distance scenarios. (2) Besides parking slots, we can incorporate more semantic objects into our system.

References

  • [1] Qin T, Li P, Shen S. Vins-mono: A robust and versatile monocular visual-inertial state estimator[J]. IEEE Transactions on Robotics, 2018, 34(4): 1004-1020.
  • [2] Kaneko M, Iwami K, Ogawa T, et al. Mask-SLAM: Robust feature-based monocular SLAM by masking using semantic segmentation[C]//Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2018: 258-266.
  • [3] Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation[C]//Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer International Publishing, 2015: 234-241.
  • [4] Jonker R, Volgenant T. Improving the Hungarian assignment algorithm[J]. Operations research letters, 1986, 5(4): 171-175.
  • [5] Labbe M, Michaud F. Online global loop closure detection for large-scale multi-session graph-based SLAM[C]//IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2014.
  • [6] Jocher G, et al. YOLOv5, 2021.
  • [7] Shi J. Good features to track[C]//1994 Proceedings of IEEE conference on computer vision and pattern recognition. IEEE, 1994: 593-600.
  • [8] Zhang L, Huang J, Li X, et al. Vision-based parking-slot detection: A DCNN-based approach and a large-scale benchmark dataset[J]. IEEE Transactions on Image Processing, 2018, 27(11): 5350-5364.
  • [9] Bernardo J M, et al. Bayesian theory[M]. John Wiley & Sons, 2009.
  • [10] Xiao J, Zhou Z, Yi Y, et al. A survey on wireless indoor localization from the device perspective[J]. ACM Computing Surveys (CSUR), 2016, 49(2): 1-31.
  • [11] Li W, Cao L, Yan L, et al. Vacant parking slot detection in the around view image based on deep learning[J]. Sensors, 2020, 20(7): 2138.
  • [12] Lucas B D, Kanade T. An iterative image registration technique with an application to stereo vision[C]//IJCAI’81: 7th international joint conference on Artificial intelligence. 1981, 2: 674-679.
  • [13] Etxeberria-Garcia, Mikel, et al. ”Visual Odometry in challenging environments: an urban underground railway scenario case.” IEEE Access 10 (2022): 69200-69215.
  • [14] Kalman R E. A new approach to linear filtering and prediction problems[J]. 1960.
  • [15] Wojke N, Bewley A, Paulus D. Simple online and realtime tracking with a deep association metric[C]//2017 IEEE international conference on image processing (ICIP). IEEE, 2017: 3645-3649.
  • [16] Huang S, Dissanayake G. Convergence and consistency analysis for extended Kalman filter based SLAM[J]. IEEE Transactions on robotics, 2007, 23(5): 1036-1049.
  • [17] Frost D, Prisacariu V, Murray D. Recovering stable scale in monocular SLAM using object-supplemented bundle adjustment[J]. IEEE Transactions on Robotics, 2018, 34(3): 736-747.
  • [18] Salas-Moreno, Renato F., et al. ”Slam++: Simultaneous localisation and mapping at the level of objects.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2013.
  • [19] Schubert D, et al. The TUM VI benchmark for evaluating visual-inertial odometry[C]//2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018: 1680-1687.
  • [20] Urban S, Leitloff J, Hinz S. Improved wide-angle, fisheye and omnidirectional camera calibration[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2015, 108: 72-79.
  • [21] Scaramuzza D, et al. A toolbox for easily calibrating omnidirectional cameras[C]//2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2006: 5695-5701.
  • [22] Triggs, Bill, et al. ”Bundle adjustment—a modern synthesis.” Vision Algorithms: Theory and Practice: International Workshop on Vision Algorithms Corfu, Greece, September 21–22, 1999 Proceedings. Springer Berlin Heidelberg, 2000.
  • [23] Qin T, Cao S. A-LOAM. https://github.com/HKUST-Aerial-Robotics/A-LOAM, 2019.
  • [24] Liu, Jinxu, Wei Gao, and Zhanyi Hu. ”Visual-inertial odometry tightly coupled with wheel encoder adopting robust initialization and online extrinsic calibration.” 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019.
  • [25] Zhang J, Singh S. LOAM: Lidar odometry and mapping in real-time[C]//Robotics: Science and systems. 2014, 2(9): 1-9.
  • [26] Shao, Xuan, et al. ”A tightly-coupled semantic SLAM system with visual, inertial and surround-view sensors for autonomous indoor parking.” Proceedings of the 28th ACM International Conference on Multimedia. 2020.
  • [27] Cui, Li, et al. ”Monte-Carlo localization in underground parking lots using parking slot numbers.”//IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021.
  • [28] Qin, Tong, et al. ”Avp-slam: Semantic visual mapping and localization for autonomous vehicles in the parking lot.”//IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
  • [29] Mur-Artal R, Montiel J M M, Tardos J D. ORB-SLAM: a versatile and accurate monocular SLAM system[J]. IEEE transactions on robotics, 2015, 31(5): 1147-1163.
  • [30] Quigley M, et al. ROS: an open-source Robot Operating System[C]//ICRA workshop on open source software. 2009, 3(3.2).
  • [31] Mourikis, Anastasios I., et al. ”A multi-state constraint Kalman filter for vision-aided inertial navigation.” Proceedings of IEEE international conference on robotics and automation. IEEE, 2007.
  • [32] Agarwal S, Mierle K. Ceres solver: Tutorial & reference[J]. Google Inc, 2012, 2(72): 8.
  • [33] Vanholder H. Efficient inference with tensorrt[C]//GPU Technology Conference. 2016, 1(2).