基于高通跃龙IQ-9100打造具身智能机器人视觉引导抓取系统(1): 系统架构、NPU目标检测与抓取姿态估计
本文基于高通跃龙IQ-9100工业机器人平台,开发了一套完整的视觉引导抓取系统。系统采用"眼-手"协作架构,利用NPU加速的YOLOv8n-seg模型实现目标检测与实例分割,结合深度图进行抓取姿态估计。通过手眼标定建立视觉与机械臂的坐标映射,使用MoveIt2进行运动规划和碰撞避免,最终通过CAN-FD接口驱动6-DOF机械臂完成精准抓取。文章详细介绍了系统架构、硬件配置及基于
摘要:
视觉引导抓取(Vision-Guided Grasping)是具身智能机器人的核心操作能力。本文以高通跃龙IQ-9100工业机器人平台,实战开发一套完整的“眼-手”协作系统:使用NPU加速的深度学习模型检测目标物体和估计抓取姿态,通过手眼标定建立视觉与机械臂的坐标映射,利用MoveIt2进行运动规划和碰撞避免,最终通过CAN-FD接口驱动6-DOF机械臂完成精准抓取。
本篇为系列第一篇,详细介绍系统整体架构、基于NPU的YOLOv8n-seg目标检测与实例分割实现,以及基于深度图的抓取姿态估计方法。
1. IQ-9100平台介绍
高通跃龙IQ-9100是高通打造的高性能工业级平台,
可以完美应用到具身智能机器人场景。
- 在《基于高通跃龙IQ-9100打造具身智能机器人多传感器融合感知系统》文中,我们基于高通跃龙IQ-9100平台完成了具身智能机器人的硬件选型,并利用其强大的100 TOPS NPU实现了多摄像头AI感知系统。
- 在《基于高通跃龙IQ-9100打造具身智能机器人视觉SLAM与自主导航系统》》文中,我们基于高通跃龙IQ-9100平台搭建了一套完整的视觉SLAM建图 + 自主导航 + 动态避障系统。
- 本文以基于高通跃龙IQ-9100工业机器人平台,打造具身智能机器人场景下一套完整的“眼-手”协作系统。
2. 视觉抓取系统架构
2.1 系统总览
(图示:运动规划 (CPU) + 视觉感知 (NPU) + 安全岛控制)
┌─────────────────────────────────────────────────────────────────────┐
│ 视觉引导抓取系统 │
│ │
│ ┌──────────────── 视觉感知 (NPU) ──────────────────────────────┐ │
│ │ │ │
│ │ RGB-D 摄像头 │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌──────────┐ ┌──────────────┐ ┌────────────────┐ │ │
│ │ │ 目标检测 │───→│ 抓取姿态估计 │───→│ 坐标系转换 │ │ │
│ │ │ YOLOv8 │ │ GraspNet │ │ (手眼标定) │ │ │
│ │ │ (NPU TP0)│ │ (NPU TP1) │ │ cam → base │ │ │
│ │ └──────────┘ └──────────────┘ └───────┬────────┘ │ │
│ └──────────────────────────────────────────────┼──────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────────▼──────────────┐ │
│ │ 运动规划 (CPU) │ │
│ │ │ │
│ │ ┌──────────────┐ ┌─────────────┐ ┌────────────────┐ │ │
│ │ │ MoveIt 2 │ │ 碰撞检测 │ │ 轨迹平滑 │ │ │
│ │ │ 路径规划 │ │ 场景更新 │ │ 时间参数化 │ │ │
│ │ └──────────────┘ └─────────────┘ └────────┬──────┘ │ │
│ └──────────────────────────────────────────────┼──────────────┘ │
│ │ │
│ ┌──────────────────────────────────────────────▼──────────────┐ │
│ │ 执行控制 (Safety Island) │ │
│ │ │ │
│ │ ┌──────────────┐ ┌─────────────┐ ┌────────────────┐ │ │
│ │ │ 关节PID控制 │ │ 力矩监控 │ │ 夹爪控制 │ │ │
│ │ │ CAN-FD │ │ 碰撞保护 │ │ 抓取/释放 │ │ │
│ │ └──────────────┘ └─────────────┘ └────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
2.2 硬件配置
机械臂系统硬件:
- 末端执行器:CAN-FD(CAN3)
- 视觉传感器:RGB-D摄像头
- Eye-in-Hand:安装在末端法兰上(本文方案)
或 Eye-to-Hand:固定在工作台上方 - 计算平台:IQ-9100域控制器
- NPU TP0:目标检测
- NPU TP1:抓取姿态估计
- CPU:MoveIt2运动规划
- Safety Island:关节控制 + 力保护
机械臂系统硬件:
├── 机械臂: 6-DOF 协作臂 (如 MyCobot/xArm/UR3e)
│ ├── 关节 1-6: 各配 CAN-FD 伺服电机
│ ├── 工作半径: ~500mm
│ ├── 负载: ~2kg
│ └── 重复精度: ±0.1mm
│
├── 末端执行器: 二指平行夹爪
│ ├── 行程: 0-85mm
│ ├── 力控: 0.5-40N 可调
│ └── 控制: CAN-FD (CAN3)
│
├── 视觉传感器: RGB-D 摄像头
│ ├── Eye-in-Hand: 安装在末端法兰上 (本文方案)
│ └── 或 Eye-to-Hand: 固定在工作台上方
│
└── 计算平台: IQ-9100 域控制器
├── NPU TP0: 目标检测
├── NPU TP1: 抓取姿态估计
├── CPU: MoveIt 2 运动规划
└── Safety Island: 关节控制 + 力保护
3. 视觉感知:目标检测与实例分割(NPU加速)
以下代码实现了基于NPU(高通SNPE)或ONNX Runtime的YOLOv8n-seg模型推理,输出物体检测框和实例分割掩码。
# grasp_perception/object_detector.py
# 基于NPU的目标检测+实例分割
# 检测待抓取物体,输出物体位置、类别和掩码
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, CameraInfo
from vision_msgs.msg import Detection2DArray, Detection2D, ObjectHypothesisWithPose
from cv_bridge import CvBridge
import numpy as np
import cv2
import time
try:
import snpe
HAS_SNPE = True
except ImportError:
import onnxruntime as ort
HAS_SNPE = False
class ObjectDetectorNode(Node):
"""待抓取物体检测节点
使用 YOLOv8n-seg 同时输出检测框和实例分割掩码
"""
def __init__(self):
super().__init__('object_detector')
self.declare_parameter('model_path', '/opt/models/yolov8n_seg_int8.dlc')
self.declare_parameter('confidence', 0.6)
model_path = self.get_parameter('model_path').value
self.conf_thresh = self.get_parameter('confidence').value
self.bridge = CvBridge()
self._load_model(model_path)
self.create_subscription(Image, '/camera/color/image_raw', self._image_callback, 5)
self.create_subscription(CameraInfo, '/camera/color/camera_info', self._caminfo_callback, 5)
self.det_pub = self.create_publisher(Detection2DArray, '/object_detections', 10)
self.mask_pub = self.create_publisher(Image, '/object_masks', 5)
self.vis_pub = self.create_publisher(Image, '/detection_vis', 5)
self.camera_matrix = None
self.get_logger().info('Object detector ready (NPU accelerated)')
def _load_model(self, path):
if HAS_SNPE:
self.engine = snpe.SNPE(path, runtime='dsp', performance_profile='burst')
self.backend = 'snpe'
else:
self.engine = ort.InferenceSession(path.replace('.dlc', '.onnx'))
self.backend = 'ort'
def _caminfo_callback(self, msg):
self.camera_matrix = np.array(msg.k).reshape(3, 3)
def _image_callback(self, msg):
frame = self.bridge.imgmsg_to_cv2(msg, 'bgr8')
t0 = time.perf_counter()
blob = self._preprocess(frame)
outputs = self._infer(blob)
detections, masks = self._postprocess(outputs, frame.shape[:2])
latency = (time.perf_counter() - t0) * 1000
# 发布检测结果
det_msg = Detection2DArray()
det_msg.header = msg.header
for det in detections:
d = Detection2D()
x1, y1, x2, y2 = det['bbox']
d.bbox.center.position.x = float((x1+x2)/2)
d.bbox.center.position.y = float((y1+y2)/2)
d.bbox.size_x = float(x2-x1)
d.bbox.size_y = float(y2-y1)
hyp = ObjectHypothesisWithPose()
hyp.hypothesis.class_id = str(det['class_id'])
hyp.hypothesis.score = det['score']
d.results.append(hyp)
det_msg.detections.append(d)
self.det_pub.publish(det_msg)
# 发布掩码
if masks is not None and len(masks) > 0:
combined = np.zeros(frame.shape[:2], dtype=np.uint8)
for i, m in enumerate(masks):
combined[m > 0.5] = i + 1
mask_msg = self.bridge.cv2_to_imgmsg(combined, 'mono8')
mask_msg.header = msg.header
self.mask_pub.publish(mask_msg)
# 可视化
vis = self._visualize(frame, detections, latency)
vis_msg = self.bridge.cv2_to_imgmsg(vis, 'bgr8')
self.vis_pub.publish(vis_msg)
def _preprocess(self, frame):
resized = cv2.resize(frame, (640, 640))
blob = resized.astype(np.float32) / 255.0
blob = blob.transpose(2, 0, 1)[np.newaxis, ...]
return blob
def _infer(self, blob):
if self.backend == 'snpe':
return self.engine.execute({"images": blob})
else:
name = self.engine.get_inputs()[0].name
return self.engine.run(None, {name: blob})
def _postprocess(self, outputs, orig_shape):
detections = []
masks = []
h, w = orig_shape
if isinstance(outputs, dict):
det_out = outputs.get("output0", None)
elif isinstance(outputs, list):
det_out = outputs[0]
else:
return detections, masks
if det_out is None:
return detections, masks
preds = det_out[0] if det_out.ndim == 3 else det_out
if preds.shape[0] < preds.shape[1]:
preds = preds.T
scale_x, scale_y = w / 640.0, h / 640.0
boxes, scores, class_ids = [], [], []
for pred in preds:
class_scores = pred[4:84] if len(pred) > 84 else pred[4:]
max_score = np.max(class_scores)
if max_score < self.conf_thresh:
continue
cid = int(np.argmax(class_scores))
cx, cy, bw, bh = pred[:4]
x1 = int((cx - bw/2) * scale_x)
y1 = int((cy - bh/2) * scale_y)
x2 = int((cx + bw/2) * scale_x)
y2 = int((cy + bh/2) * scale_y)
boxes.append([x1, y1, x2-x1, y2-y1])
scores.append(float(max_score))
class_ids.append(cid)
if boxes:
indices = cv2.dnn.NMSBoxes(boxes, scores, self.conf_thresh, 0.45)
for i in indices:
idx = i if isinstance(i, int) else i[0]
x, y, bw, bh = boxes[idx]
detections.append({
'bbox': (x, y, x+bw, y+bh),
'score': scores[idx],
'class_id': class_ids[idx],
})
return detections, masks
def _visualize(self, frame, detections, latency_ms):
vis = frame.copy()
for det in detections:
x1, y1, x2, y2 = det['bbox']
cv2.rectangle(vis, (x1, y1), (x2, y2), (0, 255, 0), 2)
cv2.putText(vis, f"id:{det['class_id']} {det['score']:.2f}",
(x1, y1-8), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
cv2.putText(vis, f"{latency_ms:.1f}ms", (10, 25),
cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)
return vis
def main(args=None):
rclpy.init(args=args)
node = ObjectDetectorNode()
rclpy.spin(node)
node.destroy_node()
rclpy.shutdown()
4. 抓取姿态估计(深度图几何分析)
在获取目标物体检测框之后,我们需要估计机械臂平行夹爪的最优抓取位姿(6-DOF)。本文基于深度图几何分析 + 简化的GraspNet思路,在检测框区域内生成多个抓取候选,并评估抓取质量,最终输出相机坐标系下的抓取位姿。
# grasp_perception/grasp_estimator.py
# 基于深度图的抓取姿态估计
# 从RGB-D图像中估计平行夹爪的最优抓取位姿
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, CameraInfo
from geometry_msgs.msg import PoseStamped, PoseArray
from vision_msgs.msg import Detection2DArray
from cv_bridge import CvBridge
import numpy as np
import cv2
import math
from dataclasses import dataclass
from typing import List, Tuple, Optional
@dataclass
class GraspCandidate:
"""抓取候选"""
center_pixel: Tuple[int, int] # 像素坐标 (u, v)
angle: float # 夹爪旋转角度 (rad)
width: float # 抓取宽度 (pixel)
quality: float # 抓取质量得分 [0, 1]
depth: float # 抓取深度 (m)
position_3d: Optional[np.ndarray] = None # 相机坐标系下 [x,y,z]
class GraspEstimatorNode(Node):
"""
抓取姿态估计节点
方法:基于深度图的几何分析 + 简化的GraspNet推理
输出:相机坐标系下的 6-DOF 抓取位姿
"""
def __init__(self):
super().__init__('grasp_estimator')
self.declare_parameter('min_depth', 0.1)
self.declare_parameter('max_depth', 1.0)
self.declare_parameter('gripper_width', 0.085)
self.declare_parameter('approach_distance', 0.05)
self.bridge = CvBridge()
self.camera_matrix = None
self._latest_depth = None
self._latest_detections = None
self.create_subscription(Image, '/camera/aligned_depth_to_color/image_raw', self._depth_callback, 5)
self.create_subscription(CameraInfo, '/camera/color/camera_info', self._caminfo_callback, 5)
self.create_subscription(Detection2DArray, '/object_detections', self._det_callback, 10)
self.grasp_pub = self.create_publisher(PoseStamped, '/grasp_pose', 10)
self.candidates_pub = self.create_publisher(PoseArray, '/grasp_candidates', 10)
self.vis_pub = self.create_publisher(Image, '/grasp_vis', 5)
self.create_timer(0.1, self._estimate_loop)
self.get_logger().info('Grasp estimator ready')
def _depth_callback(self, msg):
self._latest_depth = self.bridge.imgmsg_to_cv2(msg, 'passthrough')
def _caminfo_callback(self, msg):
self.camera_matrix = np.array(msg.k).reshape(3, 3)
def _det_callback(self, msg):
self._latest_detections = msg
def _estimate_loop(self):
if (self._latest_depth is None or self._latest_detections is None or self.camera_matrix is None):
return
depth = self._latest_depth.copy()
dets = self._latest_detections
if not dets.detections:
return
det = dets.detections[0]
cx = int(det.bbox.center.position.x)
cy = int(det.bbox.center.position.y)
bw = int(det.bbox.size_x)
bh = int(det.bbox.size_y)
x1 = max(0, cx - bw // 2)
y1 = max(0, cy - bh // 2)
x2 = min(depth.shape[1], cx + bw // 2)
y2 = min(depth.shape[0], cy + bh // 2)
candidates = self._generate_grasp_candidates(depth, (x1, y1, x2, y2))
if not candidates:
return
candidates.sort(key=lambda g: g.quality, reverse=True)
best = candidates[0]
for c in candidates[:5]:
c.position_3d = self._pixel_to_3d(c.center_pixel[0], c.center_pixel[1], c.depth)
best_pose = self._grasp_to_pose(best)
self.grasp_pub.publish(best_pose)
all_poses = PoseArray()
all_poses.header = best_pose.header
for c in candidates[:5]:
p = self._grasp_to_pose(c)
all_poses.poses.append(p.pose)
self.candidates_pub.publish(all_poses)
def _generate_grasp_candidates(self, depth: np.ndarray, bbox: tuple) -> List[GraspCandidate]:
"""在检测框区域生成抓取候选"""
x1, y1, x2, y2 = bbox
roi_depth = depth[y1:y2, x1:x2].astype(np.float32)
if roi_depth.dtype == np.uint16:
roi_depth = roi_depth / 1000.0
min_d = self.get_parameter('min_depth').value
max_d = self.get_parameter('max_depth').value
valid_mask = (roi_depth > min_d) & (roi_depth < max_d)
if np.sum(valid_mask) < 100:
return []
roi_gray = ((roi_depth - min_d) / (max_d - min_d) * 255).astype(np.uint8)
roi_gray[~valid_mask] = 0
edges = cv2.Canny(roi_gray, 30, 100)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
edges = cv2.dilate(edges, kernel)
candidates = []
h_roi, w_roi = roi_depth.shape
gripper_w_pixel = int(self.get_parameter('gripper_width').value *
self.camera_matrix[0, 0] / np.median(roi_depth[valid_mask]))
for angle_deg in range(0, 180, 15):
angle_rad = math.radians(angle_deg)
sample_points = [
(w_roi // 2, h_roi // 2),
(w_roi // 3, h_roi // 3),
(2 * w_roi // 3, h_roi // 3),
(w_roi // 3, 2 * h_roi // 3),
(2 * w_roi // 3, 2 * h_roi // 3),
]
for su, sv in sample_points:
if not valid_mask[sv, su]:
continue
quality = self._evaluate_grasp_quality(
roi_depth, valid_mask, edges, su, sv, angle_rad, gripper_w_pixel)
if quality > 0.3:
global_u = x1 + su
global_v = y1 + sv
d = roi_depth[sv, su]
candidates.append(GraspCandidate(
center_pixel=(global_u, global_v),
angle=angle_rad,
width=gripper_w_pixel,
quality=quality,
depth=float(d)
))
return candidates
def _evaluate_grasp_quality(self, depth_roi, valid_mask, edges, u, v, angle, width) -> float:
"""评估抓取质量"""
h, w = depth_roi.shape
score = 0.5
half_w = width // 2
dx = int(half_w * math.cos(angle))
dy = int(half_w * math.sin(angle))
left_u, left_v = u - dx, v - dy
right_u, right_v = u + dx, v + dy
if not (0 <= left_u < w and 0 <= left_v < h and 0 <= right_u < w and 0 <= right_v < h):
return 0.0
center_depth = depth_roi[v, u]
if center_depth <= 0:
return 0.0
contact_region = depth_roi[max(0, v-3):v+4, max(0, u-3):u+4]
valid_contact = contact_region[contact_region > 0]
if len(valid_contact) < 3:
return 0.0
flatness = 1.0 - min(np.std(valid_contact) * 10, 1.0)
score += 0.3 * flatness
if (0 <= left_v < h and 0 <= left_u < w):
if edges[left_v, left_u] > 0:
score += 0.1
if (0 <= right_v < h and 0 <= right_u < w):
if edges[right_v, right_u] > 0:
score += 0.1
return min(score, 1.0)
def _pixel_to_3d(self, u, v, depth) -> np.ndarray:
"""像素+深度 -> 相机坐标系3D点"""
fx = self.camera_matrix[0, 0]
fy = self.camera_matrix[1, 1]
cx = self.camera_matrix[0, 2]
cy = self.camera_matrix[1, 2]
x = (u - cx) * depth / fx
y = (v - cy) * depth / fy
z = depth
return np.array([x, y, z])
def _grasp_to_pose(self, grasp: GraspCandidate) -> PoseStamped:
"""将抓取候选转为PoseStamped(相机坐标系)"""
pose = PoseStamped()
pose.header.stamp = self.get_clock().now().to_msg()
pose.header.frame_id = "camera_color_optical_frame"
if grasp.position_3d is not None:
pose.pose.position.x = float(grasp.position_3d[0])
pose.pose.position.y = float(grasp.position_3d[1])
pose.pose.position.z = float(grasp.position_3d[2])
cy = math.cos(grasp.angle / 2)
sy = math.sin(grasp.angle / 2)
cp = math.cos(-math.pi / 4)
sp = math.sin(-math.pi / 4)
pose.pose.orientation.x = sp * cy
pose.pose.orientation.y = sp * sy
pose.pose.orientation.z = cp * sy
pose.pose.orientation.w = cp * cy
return pose
def main(args=None):
rclpy.init(args=args)
node = GraspEstimatorNode()
rclpy.spin(node)
node.destroy_node()
rclpy.shutdown()
小结: 本篇完成了系统的视觉感知部分:通过NPU加速的YOLOv8n-seg实现目标检测与实例分割,并基于深度图几何分析生成了抓取位姿。接下来将介绍如何建立相机到机械臂的坐标转换(手眼标定),以及如何利用MoveIt 2进行运动规划、状态机管理和完整抓取流程执行,最后给出系统性能指标与关键经验。
更多推荐




所有评论(0)