基于高通跃龙IQ-9100打造具身智能机器人视觉引导抓取系统(1): 系统架构、NPU目标检测与抓取姿态估计

本文基于高通跃龙IQ-9100工业机器人平台，开发了一套完整的视觉引导抓取系统。系统采用"眼-手"协作架构，利用NPU加速的YOLOv8n-seg模型实现目标检测与实例分割，结合深度图进行抓取姿态估计。通过手眼标定建立视觉与机械臂的坐标映射，使用MoveIt2进行运动规划和碰撞避免，最终通过CAN-FD接口驱动6-DOF机械臂完成精准抓取。文章详细介绍了系统架构、硬件配置及基于

weixin_38498942

396人浏览 · 2026-05-14 10:17:28

weixin_38498942 · 2026-05-14 10:17:28 发布

摘要：
视觉引导抓取（Vision-Guided Grasping）是具身智能机器人的核心操作能力。本文以高通跃龙IQ-9100工业机器人平台，实战开发一套完整的“眼-手”协作系统：使用NPU加速的深度学习模型检测目标物体和估计抓取姿态，通过手眼标定建立视觉与机械臂的坐标映射，利用MoveIt2进行运动规划和碰撞避免，最终通过CAN-FD接口驱动6-DOF机械臂完成精准抓取。
本篇为系列第一篇，详细介绍系统整体架构、基于NPU的YOLOv8n-seg目标检测与实例分割实现，以及基于深度图的抓取姿态估计方法。

1. IQ-9100平台介绍

高通跃龙IQ-9100是高通打造的高性能工业级平台，
在这里插入图片描述
可以完美应用到具身智能机器人场景。

在《基于高通跃龙IQ-9100打造具身智能机器人多传感器融合感知系统》文中，我们基于高通跃龙IQ-9100平台完成了具身智能机器人的硬件选型，并利用其强大的100 TOPS NPU实现了多摄像头AI感知系统。
在《基于高通跃龙IQ-9100打造具身智能机器人视觉SLAM与自主导航系统》》文中，我们基于高通跃龙IQ-9100平台搭建了一套完整的视觉SLAM建图 + 自主导航 + 动态避障系统。
本文以基于高通跃龙IQ-9100工业机器人平台，打造具身智能机器人场景下一套完整的“眼-手”协作系统。

2. 视觉抓取系统架构

2.1 系统总览

（图示：运动规划 (CPU) + 视觉感知 (NPU) + 安全岛控制）

┌─────────────────────────────────────────────────────────────────────┐
│                   视觉引导抓取系统                                    │
│                                                                     │
│  ┌──────────────── 视觉感知 (NPU) ──────────────────────────────┐    │
│  │                                                             │    │
│  │  RGB-D 摄像头                                                │    │
│  │       │                                                     │    │
│  │       ▼                                                     │    │
│  │  ┌──────────┐    ┌──────────────┐    ┌────────────────┐    │    │
│  │  │ 目标检测  │───→│ 抓取姿态估计  │───→│ 坐标系转换      │    │    │
│  │  │ YOLOv8   │    │ GraspNet     │    │ (手眼标定)      │    │    │
│  │  │ (NPU TP0)│    │ (NPU TP1)    │    │ cam → base     │    │    │
│  │  └──────────┘    └──────────────┘    └───────┬────────┘    │    │
│  └──────────────────────────────────────────────┼──────────────┘    │
│                                                 │                   │
│  ┌──────────────────────────────────────────────▼──────────────┐    │
│  │                运动规划 (CPU)                                │    │
│  │                                                             │    │
│  │  ┌──────────────┐  ┌─────────────┐  ┌────────────────┐      │    │
│  │  │ MoveIt 2     │  │ 碰撞检测     │  │ 轨迹平滑       │      │    │
│  │  │ 路径规划      │  │ 场景更新     │  │ 时间参数化     │      │    │
│  │  └──────────────┘  └─────────────┘  └────────┬──────┘      │    │
│  └──────────────────────────────────────────────┼──────────────┘    │
│                                                 │                   │
│  ┌──────────────────────────────────────────────▼──────────────┐    │
│  │              执行控制 (Safety Island)                        │    │
│  │                                                             │    │
│  │  ┌──────────────┐  ┌─────────────┐  ┌────────────────┐      │    │
│  │  │ 关节PID控制   │  │ 力矩监控    │  │ 夹爪控制        │      │    │
│  │  │ CAN-FD       │  │ 碰撞保护    │  │ 抓取/释放       │      │    │
│  │  └──────────────┘  └─────────────┘  └────────────────┘      │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

2.2 硬件配置

机械臂系统硬件：

末端执行器：CAN-FD（CAN3）
视觉传感器：RGB-D摄像头
Eye-in-Hand：安装在末端法兰上（本文方案）
或 Eye-to-Hand：固定在工作台上方
计算平台：IQ-9100域控制器
- NPU TP0：目标检测
- NPU TP1：抓取姿态估计
- CPU：MoveIt2运动规划
- Safety Island：关节控制 + 力保护

机械臂系统硬件:
├── 机械臂: 6-DOF 协作臂 (如 MyCobot/xArm/UR3e)
│   ├── 关节 1-6: 各配 CAN-FD 伺服电机
│   ├── 工作半径: ~500mm
│   ├── 负载: ~2kg
│   └── 重复精度: ±0.1mm
│
├── 末端执行器: 二指平行夹爪
│   ├── 行程: 0-85mm
│   ├── 力控: 0.5-40N 可调
│   └── 控制: CAN-FD (CAN3)
│
├── 视觉传感器: RGB-D 摄像头
│   ├── Eye-in-Hand: 安装在末端法兰上 (本文方案)
│   └── 或 Eye-to-Hand: 固定在工作台上方
│
└── 计算平台: IQ-9100 域控制器
    ├── NPU TP0: 目标检测
    ├── NPU TP1: 抓取姿态估计
    ├── CPU: MoveIt 2 运动规划
    └── Safety Island: 关节控制 + 力保护

3. 视觉感知：目标检测与实例分割（NPU加速）

以下代码实现了基于NPU（高通SNPE）或ONNX Runtime的YOLOv8n-seg模型推理，输出物体检测框和实例分割掩码。

# grasp_perception/object_detector.py
# 基于NPU的目标检测+实例分割
# 检测待抓取物体，输出物体位置、类别和掩码

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, CameraInfo
from vision_msgs.msg import Detection2DArray, Detection2D, ObjectHypothesisWithPose
from cv_bridge import CvBridge
import numpy as np
import cv2
import time

try:
    import snpe
    HAS_SNPE = True
except ImportError:
    import onnxruntime as ort
    HAS_SNPE = False

class ObjectDetectorNode(Node):
    """待抓取物体检测节点
       使用 YOLOv8n-seg 同时输出检测框和实例分割掩码
    """
    def __init__(self):
        super().__init__('object_detector')
        self.declare_parameter('model_path', '/opt/models/yolov8n_seg_int8.dlc')
        self.declare_parameter('confidence', 0.6)
        model_path = self.get_parameter('model_path').value
        self.conf_thresh = self.get_parameter('confidence').value
        self.bridge = CvBridge()
        self._load_model(model_path)

        self.create_subscription(Image, '/camera/color/image_raw', self._image_callback, 5)
        self.create_subscription(CameraInfo, '/camera/color/camera_info', self._caminfo_callback, 5)

        self.det_pub = self.create_publisher(Detection2DArray, '/object_detections', 10)
        self.mask_pub = self.create_publisher(Image, '/object_masks', 5)
        self.vis_pub = self.create_publisher(Image, '/detection_vis', 5)

        self.camera_matrix = None
        self.get_logger().info('Object detector ready (NPU accelerated)')

    def _load_model(self, path):
        if HAS_SNPE:
            self.engine = snpe.SNPE(path, runtime='dsp', performance_profile='burst')
            self.backend = 'snpe'
        else:
            self.engine = ort.InferenceSession(path.replace('.dlc', '.onnx'))
            self.backend = 'ort'

    def _caminfo_callback(self, msg):
        self.camera_matrix = np.array(msg.k).reshape(3, 3)

    def _image_callback(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, 'bgr8')
        t0 = time.perf_counter()

        blob = self._preprocess(frame)
        outputs = self._infer(blob)
        detections, masks = self._postprocess(outputs, frame.shape[:2])

        latency = (time.perf_counter() - t0) * 1000

        # 发布检测结果
        det_msg = Detection2DArray()
        det_msg.header = msg.header
        for det in detections:
            d = Detection2D()
            x1, y1, x2, y2 = det['bbox']
            d.bbox.center.position.x = float((x1+x2)/2)
            d.bbox.center.position.y = float((y1+y2)/2)
            d.bbox.size_x = float(x2-x1)
            d.bbox.size_y = float(y2-y1)
            hyp = ObjectHypothesisWithPose()
            hyp.hypothesis.class_id = str(det['class_id'])
            hyp.hypothesis.score = det['score']
            d.results.append(hyp)
            det_msg.detections.append(d)
        self.det_pub.publish(det_msg)

        # 发布掩码
        if masks is not None and len(masks) > 0:
            combined = np.zeros(frame.shape[:2], dtype=np.uint8)
            for i, m in enumerate(masks):
                combined[m > 0.5] = i + 1
            mask_msg = self.bridge.cv2_to_imgmsg(combined, 'mono8')
            mask_msg.header = msg.header
            self.mask_pub.publish(mask_msg)

        # 可视化
        vis = self._visualize(frame, detections, latency)
        vis_msg = self.bridge.cv2_to_imgmsg(vis, 'bgr8')
        self.vis_pub.publish(vis_msg)

    def _preprocess(self, frame):
        resized = cv2.resize(frame, (640, 640))
        blob = resized.astype(np.float32) / 255.0
        blob = blob.transpose(2, 0, 1)[np.newaxis, ...]
        return blob

    def _infer(self, blob):
        if self.backend == 'snpe':
            return self.engine.execute({"images": blob})
        else:
            name = self.engine.get_inputs()[0].name
            return self.engine.run(None, {name: blob})

    def _postprocess(self, outputs, orig_shape):
        detections = []
        masks = []
        h, w = orig_shape

        if isinstance(outputs, dict):
            det_out = outputs.get("output0", None)
        elif isinstance(outputs, list):
            det_out = outputs[0]
        else:
            return detections, masks

        if det_out is None:
            return detections, masks

        preds = det_out[0] if det_out.ndim == 3 else det_out
        if preds.shape[0] < preds.shape[1]:
            preds = preds.T

        scale_x, scale_y = w / 640.0, h / 640.0
        boxes, scores, class_ids = [], [], []

        for pred in preds:
            class_scores = pred[4:84] if len(pred) > 84 else pred[4:]
            max_score = np.max(class_scores)
            if max_score < self.conf_thresh:
                continue
            cid = int(np.argmax(class_scores))

            cx, cy, bw, bh = pred[:4]
            x1 = int((cx - bw/2) * scale_x)
            y1 = int((cy - bh/2) * scale_y)
            x2 = int((cx + bw/2) * scale_x)
            y2 = int((cy + bh/2) * scale_y)

            boxes.append([x1, y1, x2-x1, y2-y1])
            scores.append(float(max_score))
            class_ids.append(cid)

        if boxes:
            indices = cv2.dnn.NMSBoxes(boxes, scores, self.conf_thresh, 0.45)
            for i in indices:
                idx = i if isinstance(i, int) else i[0]
                x, y, bw, bh = boxes[idx]
                detections.append({
                    'bbox': (x, y, x+bw, y+bh),
                    'score': scores[idx],
                    'class_id': class_ids[idx],
                })
        return detections, masks

    def _visualize(self, frame, detections, latency_ms):
        vis = frame.copy()
        for det in detections:
            x1, y1, x2, y2 = det['bbox']
            cv2.rectangle(vis, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(vis, f"id:{det['class_id']} {det['score']:.2f}",
                        (x1, y1-8), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
        cv2.putText(vis, f"{latency_ms:.1f}ms", (10, 25),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)
        return vis

def main(args=None):
    rclpy.init(args=args)
    node = ObjectDetectorNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

4. 抓取姿态估计（深度图几何分析）

在获取目标物体检测框之后，我们需要估计机械臂平行夹爪的最优抓取位姿（6-DOF）。本文基于深度图几何分析 + 简化的GraspNet思路，在检测框区域内生成多个抓取候选，并评估抓取质量，最终输出相机坐标系下的抓取位姿。

# grasp_perception/grasp_estimator.py
# 基于深度图的抓取姿态估计
# 从RGB-D图像中估计平行夹爪的最优抓取位姿

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image, CameraInfo
from geometry_msgs.msg import PoseStamped, PoseArray
from vision_msgs.msg import Detection2DArray
from cv_bridge import CvBridge
import numpy as np
import cv2
import math
from dataclasses import dataclass
from typing import List, Tuple, Optional

@dataclass
class GraspCandidate:
    """抓取候选"""
    center_pixel: Tuple[int, int]   # 像素坐标 (u, v)
    angle: float                    # 夹爪旋转角度 (rad)
    width: float                    # 抓取宽度 (pixel)
    quality: float                  # 抓取质量得分 [0, 1]
    depth: float                    # 抓取深度 (m)
    position_3d: Optional[np.ndarray] = None  # 相机坐标系下 [x,y,z]

class GraspEstimatorNode(Node):
    """
    抓取姿态估计节点
    方法：基于深度图的几何分析 + 简化的GraspNet推理
    输出：相机坐标系下的 6-DOF 抓取位姿
    """
    def __init__(self):
        super().__init__('grasp_estimator')

        self.declare_parameter('min_depth', 0.1)
        self.declare_parameter('max_depth', 1.0)
        self.declare_parameter('gripper_width', 0.085)
        self.declare_parameter('approach_distance', 0.05)

        self.bridge = CvBridge()
        self.camera_matrix = None
        self._latest_depth = None
        self._latest_detections = None

        self.create_subscription(Image, '/camera/aligned_depth_to_color/image_raw', self._depth_callback, 5)
        self.create_subscription(CameraInfo, '/camera/color/camera_info', self._caminfo_callback, 5)
        self.create_subscription(Detection2DArray, '/object_detections', self._det_callback, 10)

        self.grasp_pub = self.create_publisher(PoseStamped, '/grasp_pose', 10)
        self.candidates_pub = self.create_publisher(PoseArray, '/grasp_candidates', 10)
        self.vis_pub = self.create_publisher(Image, '/grasp_vis', 5)

        self.create_timer(0.1, self._estimate_loop)
        self.get_logger().info('Grasp estimator ready')

    def _depth_callback(self, msg):
        self._latest_depth = self.bridge.imgmsg_to_cv2(msg, 'passthrough')

    def _caminfo_callback(self, msg):
        self.camera_matrix = np.array(msg.k).reshape(3, 3)

    def _det_callback(self, msg):
        self._latest_detections = msg

    def _estimate_loop(self):
        if (self._latest_depth is None or self._latest_detections is None or self.camera_matrix is None):
            return
        depth = self._latest_depth.copy()
        dets = self._latest_detections
        if not dets.detections:
            return

        det = dets.detections[0]
        cx = int(det.bbox.center.position.x)
        cy = int(det.bbox.center.position.y)
        bw = int(det.bbox.size_x)
        bh = int(det.bbox.size_y)
        x1 = max(0, cx - bw // 2)
        y1 = max(0, cy - bh // 2)
        x2 = min(depth.shape[1], cx + bw // 2)
        y2 = min(depth.shape[0], cy + bh // 2)

        candidates = self._generate_grasp_candidates(depth, (x1, y1, x2, y2))
        if not candidates:
            return

        candidates.sort(key=lambda g: g.quality, reverse=True)
        best = candidates[0]

        for c in candidates[:5]:
            c.position_3d = self._pixel_to_3d(c.center_pixel[0], c.center_pixel[1], c.depth)

        best_pose = self._grasp_to_pose(best)
        self.grasp_pub.publish(best_pose)

        all_poses = PoseArray()
        all_poses.header = best_pose.header
        for c in candidates[:5]:
            p = self._grasp_to_pose(c)
            all_poses.poses.append(p.pose)
        self.candidates_pub.publish(all_poses)

    def _generate_grasp_candidates(self, depth: np.ndarray, bbox: tuple) -> List[GraspCandidate]:
        """在检测框区域生成抓取候选"""
        x1, y1, x2, y2 = bbox
        roi_depth = depth[y1:y2, x1:x2].astype(np.float32)

        if roi_depth.dtype == np.uint16:
            roi_depth = roi_depth / 1000.0

        min_d = self.get_parameter('min_depth').value
        max_d = self.get_parameter('max_depth').value
        valid_mask = (roi_depth > min_d) & (roi_depth < max_d)
        if np.sum(valid_mask) < 100:
            return []

        roi_gray = ((roi_depth - min_d) / (max_d - min_d) * 255).astype(np.uint8)
        roi_gray[~valid_mask] = 0

        edges = cv2.Canny(roi_gray, 30, 100)
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
        edges = cv2.dilate(edges, kernel)

        candidates = []
        h_roi, w_roi = roi_depth.shape
        gripper_w_pixel = int(self.get_parameter('gripper_width').value *
                              self.camera_matrix[0, 0] / np.median(roi_depth[valid_mask]))

        for angle_deg in range(0, 180, 15):
            angle_rad = math.radians(angle_deg)
            sample_points = [
                (w_roi // 2, h_roi // 2),
                (w_roi // 3, h_roi // 3),
                (2 * w_roi // 3, h_roi // 3),
                (w_roi // 3, 2 * h_roi // 3),
                (2 * w_roi // 3, 2 * h_roi // 3),
            ]
            for su, sv in sample_points:
                if not valid_mask[sv, su]:
                    continue
                quality = self._evaluate_grasp_quality(
                    roi_depth, valid_mask, edges, su, sv, angle_rad, gripper_w_pixel)
                if quality > 0.3:
                    global_u = x1 + su
                    global_v = y1 + sv
                    d = roi_depth[sv, su]
                    candidates.append(GraspCandidate(
                        center_pixel=(global_u, global_v),
                        angle=angle_rad,
                        width=gripper_w_pixel,
                        quality=quality,
                        depth=float(d)
                    ))
        return candidates

    def _evaluate_grasp_quality(self, depth_roi, valid_mask, edges, u, v, angle, width) -> float:
        """评估抓取质量"""
        h, w = depth_roi.shape
        score = 0.5
        half_w = width // 2
        dx = int(half_w * math.cos(angle))
        dy = int(half_w * math.sin(angle))
        left_u, left_v = u - dx, v - dy
        right_u, right_v = u + dx, v + dy

        if not (0 <= left_u < w and 0 <= left_v < h and 0 <= right_u < w and 0 <= right_v < h):
            return 0.0
        center_depth = depth_roi[v, u]
        if center_depth <= 0:
            return 0.0

        contact_region = depth_roi[max(0, v-3):v+4, max(0, u-3):u+4]
        valid_contact = contact_region[contact_region > 0]
        if len(valid_contact) < 3:
            return 0.0

        flatness = 1.0 - min(np.std(valid_contact) * 10, 1.0)
        score += 0.3 * flatness

        if (0 <= left_v < h and 0 <= left_u < w):
            if edges[left_v, left_u] > 0:
                score += 0.1
        if (0 <= right_v < h and 0 <= right_u < w):
            if edges[right_v, right_u] > 0:
                score += 0.1

        return min(score, 1.0)

    def _pixel_to_3d(self, u, v, depth) -> np.ndarray:
        """像素+深度 -> 相机坐标系3D点"""
        fx = self.camera_matrix[0, 0]
        fy = self.camera_matrix[1, 1]
        cx = self.camera_matrix[0, 2]
        cy = self.camera_matrix[1, 2]
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        z = depth
        return np.array([x, y, z])

    def _grasp_to_pose(self, grasp: GraspCandidate) -> PoseStamped:
        """将抓取候选转为PoseStamped（相机坐标系）"""
        pose = PoseStamped()
        pose.header.stamp = self.get_clock().now().to_msg()
        pose.header.frame_id = "camera_color_optical_frame"

        if grasp.position_3d is not None:
            pose.pose.position.x = float(grasp.position_3d[0])
            pose.pose.position.y = float(grasp.position_3d[1])
            pose.pose.position.z = float(grasp.position_3d[2])

        cy = math.cos(grasp.angle / 2)
        sy = math.sin(grasp.angle / 2)
        cp = math.cos(-math.pi / 4)
        sp = math.sin(-math.pi / 4)

        pose.pose.orientation.x = sp * cy
        pose.pose.orientation.y = sp * sy
        pose.pose.orientation.z = cp * sy
        pose.pose.orientation.w = cp * cy

        return pose

def main(args=None):
    rclpy.init(args=args)
    node = GraspEstimatorNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()

小结： 本篇完成了系统的视觉感知部分：通过NPU加速的YOLOv8n-seg实现目标检测与实例分割，并基于深度图几何分析生成了抓取位姿。接下来将介绍如何建立相机到机械臂的坐标转换（手眼标定），以及如何利用MoveIt 2进行运动规划、状态机管理和完整抓取流程执行，最后给出系统性能指标与关键经验。