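This article describes a multimodal human landmark detection and behavior recognition system built on MediaPipe. The system detects pose, hand, and face landmarks in real time and recognizes a range of human behaviors from them.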

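System Overview

The system combines three independent MediaPipe models into a single pipeline for human behavior analysis:

  1. Pose detection: locates the main body joints and the skeletal structure

  2. Hand detection: tracks 21 landmarks on each of up to two hands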

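  3. Face detection: detects the 468 face-mesh landmarks (the Face Landmarker task also returns 10 iris landmarks, for 478 in total)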

By analyzing the three models' outputs together, the system recognizes behaviors such as blinking, hand gestures, and limb movements.

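Core Architecture

Multi-Model Integration

class HumanMultiLandmarker:
    def __init__(self, pose_model, hand_model, face_model):
        # Initialize three independent MediaPipe models
        self.pose_detector = vision.PoseLandmarker.create_from_options(...)
        self.hand_detector = vision.HandLandmarker.create_from_options(...)
        self.face_detector = vision.FaceLandmarker.create_from_options(...)

The system uses one model per body region: each model handles a specific part of the body, and the results are merged for analysis afterwards. The detectors do not depend on one another, so they can run in parallel, although the code in this article calls them one after another on the same frame.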
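Because the three detectors are independent, one way to run them concurrently is a thread pool. The following is a minimal sketch, not the article's implementation: `run_detectors` and the stand-in detector functions are hypothetical names, and whether threads give a real speedup depends on how much of each `detect` call runs outside the Python interpreter lock, which is not verified here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_detectors(detectors, image):
    """Run every detector callable on the same image and return results by name."""
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        # Submit all detectors first so they can overlap, then collect results.
        futures = {name: pool.submit(fn, image) for name, fn in detectors.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Stand-in detectors (hypothetical). In the real system these would be
# pose_detector.detect, hand_detector.detect, and face_detector.detect.
fake_detectors = {
    "pose": lambda img: f"pose({img})",
    "hand": lambda img: f"hand({img})",
    "face": lambda img: f"face({img})",
}
results = run_detectors(fake_detectors, "frame0")
```

Keying the results by name keeps the later fusion step independent of the order in which the detectors finish.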

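Real-Time Processing Flow

def do(self, frame, device):
    # Convert the OpenCV BGR frame to the RGB image MediaPipe expects
    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)

    # Run the three detectors on the same image, one after another
    pose_res = self.pose_detector.detect(mp_image)
    hand_res = self.hand_detector.detect(mp_image)
    face_res = self.face_detector.detect(mp_image)

Each frame is converted once, and the same mp.Image is passed to all three detectors, which keeps the per-frame preprocessing cost low.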

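Landmark Detection and Visualization

Skeleton Drawing

def _draw_landmarks(self, frame, landmarks, connections=None, color=(0, 255, 0)):
    """Generic landmark drawing function"""
    h, w, _ = frame.shape
    for lm in landmarks:
        # Landmarks are normalized to [0, 1]; scale them to pixel coordinates
        cx, cy = int(lm.x * w), int(lm.y * h)
        cv2.circle(frame, (cx, cy), self.point_size, color, -1)

    if connections:
        for start, end in connections:
            # Draw the connecting line between two landmarks
            x1, y1 = int(landmarks[start].x * w), int(landmarks[start].y * h)
            x2, y2 = int(landmarks[end].x * w), int(landmarks[end].y * h)
            cv2.line(frame, (x1, y1), (x2, y2), color, self.line_thickness)

The system provides two visualization modes:

  1. Skeleton-only mode: draws the skeleton on a black background

  2. Overlay mode: draws landmarks and connections on top of the original frame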
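The drawing code scales normalized landmark coordinates by the frame width and height. A landmark for a body part that is partly out of view may fall outside the [0, 1] range, so when pixel coordinates are used to index into the image array it can help to clamp them. This is a small sketch of that conversion; `to_pixel` is a hypothetical helper, not part of the original class.

```python
def to_pixel(x_norm, y_norm, width, height):
    """Map normalized coordinates to integer pixel coordinates, clamped to the frame."""
    px = min(max(int(x_norm * width), 0), width - 1)
    py = min(max(int(y_norm * height), 0), height - 1)
    return px, py


center = to_pixel(0.5, 0.25, 640, 480)      # a point inside the frame
outside = to_pixel(1.2, -0.1, 640, 480)     # a point outside the frame, clamped to the edge
```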

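Behavior Recognition Algorithms

Eye Action Detection

def _is_eye_closed(self, eye_landmarks):
    """Detect whether an eye is closed"""
    # Eyelid and eye-corner points (positions within the 6-point eye list)
    upper_lid, lower_lid = eye_landmarks[1], eye_landmarks[4]
    left_corner, right_corner = eye_landmarks[0], eye_landmarks[3]

    # Eyelid opening and eye width
    vertical_dist = self._calculate_distance(upper_lid, lower_lid)
    horizontal_dist = self._calculate_distance(left_corner, right_corner)

    # Eye aspect ratio
    ear = vertical_dist / horizontal_dist
    return ear < self.eye_closed_threshold

The check is based on the eye aspect ratio (EAR), the ratio of eyelid opening to eye width. The ratio drops toward zero as the eye closes, so comparing it with a threshold detects blinks and closed eyes.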
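The snippet above uses a single eyelid distance. The commonly used six-point EAR formulation averages two vertical distances, which makes it less sensitive to noise in any one landmark. The sketch below assumes the six points are ordered corner, upper, upper, corner, lower, lower, and takes plain (x, y) tuples rather than MediaPipe landmark objects; `eye_aspect_ratio` is a hypothetical helper name.

```python
import math


def eye_aspect_ratio(p):
    """Six-point EAR: p[0] and p[3] are the eye corners,
    (p[1], p[5]) and (p[2], p[4]) are the upper/lower lid pairs."""
    v1 = math.dist(p[1], p[5])
    v2 = math.dist(p[2], p[4])
    h = math.dist(p[0], p[3])
    if h == 0:
        return 0.0  # degenerate eye width: treat as closed
    return (v1 + v2) / (2.0 * h)


# Synthetic eye outlines for illustration only.
open_eye = [(0, 0), (1, 1), (3, 1), (4, 0), (3, -1), (1, -1)]
closed_eye = [(0, 0), (1, 0.1), (3, 0.1), (4, 0), (3, -0.1), (1, -0.1)]
ear_open = eye_aspect_ratio(open_eye)
ear_closed = eye_aspect_ratio(closed_eye)
```

For these synthetic outlines the open eye gives a ratio of 0.5 and the nearly closed one about 0.05, so a threshold between the two separates them.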

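Hand Gesture Recognition

def _is_hand_touching_face(self, hand_landmarks, face_landmarks):
    """Detect whether a hand is touching the face"""
    fingertip_indices = [4, 8, 12, 16, 20]  # Fingertip landmarks
    face_center_indices = [1, 5, 6, 10, 152, 234]  # Face landmarks used for the contact test

    for tip_idx in fingertip_indices:
        hand_point = hand_landmarks[tip_idx]
        for face_idx in face_center_indices:
            face_point = face_landmarks[face_idx]
            if self._calculate_distance(hand_point, face_point) < self.face_touch_threshold:
                return True
    return False

By measuring the distance between each fingertip and a set of face landmarks, the system recognizes a hand touching the face. Both sets of landmarks are in normalized image coordinates, so the threshold is expressed as a fraction of the frame size.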
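A fixed threshold in normalized coordinates behaves differently when the person is near or far from the camera. One possible refinement, shown here only as a sketch, scales the threshold by the apparent face width. The function name, the `ratio` default, and the choice of the two face-edge points are all assumptions for illustration; the points are passed in as plain (x, y) tuples.

```python
import math


def is_touching(fingertips, face_points, face_left, face_right, ratio=0.5):
    """True if any fingertip lies within ratio * face width of any face point."""
    face_width = math.dist(face_left, face_right)
    limit = ratio * face_width
    return any(math.dist(f, p) < limit for f in fingertips for p in face_points)


# Synthetic coordinates: a face about 0.2 wide, nose at (0.5, 0.5).
near = is_touching([(0.5, 0.55)], [(0.5, 0.5)], (0.4, 0.5), (0.6, 0.5))
far = is_touching([(0.9, 0.9)], [(0.5, 0.5)], (0.4, 0.5), (0.6, 0.5))
```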

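Limb Bend Detection

def _is_limb_bent(self, joint_landmarks):
    """Detect whether a limb is bent"""
    # Three joints in order, for example shoulder, elbow, wrist
    a, b, c = (np.array([lm.x, lm.y]) for lm in joint_landmarks[:3])

    # Angle at the middle joint b between vectors b->a and b->c
    ba = a - b
    bc = c - b
    cosine_angle = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    angle = np.arccos(np.clip(cosine_angle, -1.0, 1.0))

    return angle < self.bend_threshold  # An angle below the threshold means bent

The joint angle is computed with vector geometry as the angle at the middle joint between the two adjacent segments. A straight limb gives an angle near π radians, and a smaller angle means a sharper bend.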
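To make the angle convention concrete, here is a small worked example using only the standard library. `joint_angle` is a hypothetical stand-alone version of the same calculation that takes (x, y) tuples, and treating coincident points as a straight limb is an assumption of this sketch.

```python
import math


def joint_angle(a, b, c):
    """Angle at b, in radians, between segments b->a and b->c (2D points)."""
    bax, bay = a[0] - b[0], a[1] - b[1]
    bcx, bcy = c[0] - b[0], c[1] - b[1]
    norm = math.hypot(bax, bay) * math.hypot(bcx, bcy)
    if norm == 0:
        return math.pi  # coincident points: angle undefined, treated as straight
    cosine = (bax * bcx + bay * bcy) / norm
    return math.acos(max(-1.0, min(1.0, cosine)))


straight = joint_angle((0, 0), (1, 0), (2, 0))  # collinear points
right = joint_angle((0, 0), (1, 0), (1, 1))     # perpendicular segments
```

A straight limb yields π and a right angle yields π/2 (about 1.571). With the bend_threshold of 1.5 radians used in the full listing, a right-angle elbow is therefore just above the threshold and would not be reported as bent.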

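Supported Behavior Types

1. Eye behaviors

  • Blink detection: recognizes momentary blinks using temporal context

  • Eye closure: detects eyes that stay closed for a longer time

  • Single-eye actions: distinguishes independent actions of the left and right eye

2. Hand behaviors

  • Face touching: detects a hand touching the face

  • Two-hand coordination: recognizes both hands acting at the same time

  • Gesture recognition: recognizes simple gestures from fingertip positions

3. Limb behaviors

  • Arm bending: detects a bent elbow

  • Leg actions: recognizes hand-to-leg contact

  • Body posture: analyzes overall body posture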
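Telling a blink apart from a longer eye closure needs state across frames. The following sketch shows one way to do that with a small state machine: a closure counts as a blink only if the eye reopens within a short time. `BlinkCounter` and the 0.4 second limit are assumptions for illustration, not values taken from the article's code.

```python
class BlinkCounter:
    """Counts a blink when the eye reopens after a short closure."""

    def __init__(self, max_blink_seconds=0.4):
        self.max_blink_seconds = max_blink_seconds  # assumed upper bound for a blink
        self.closed_since = None                    # time the current closure started
        self.blinks = 0

    def update(self, eye_closed, timestamp):
        """Feed one frame; returns "blink", "eyes closed", or None."""
        if eye_closed:
            if self.closed_since is None:
                self.closed_since = timestamp
            if timestamp - self.closed_since > self.max_blink_seconds:
                return "eyes closed"
            return None
        # Eye is open: a closure that just ended and was short counts as a blink.
        event = None
        if self.closed_since is not None:
            if timestamp - self.closed_since <= self.max_blink_seconds:
                self.blinks += 1
                event = "blink"
            self.closed_since = None
        return event


counter = BlinkCounter()
frames = [(False, 0.0), (True, 0.1), (True, 0.2), (False, 0.3),
          (True, 1.0), (True, 1.3), (True, 1.6), (False, 1.7)]
events = [counter.update(closed, t) for closed, t in frames]
```

In this sequence the first closure lasts about 0.2 seconds and is counted as a blink, while the second one exceeds the limit and is reported as a sustained closure instead.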

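Three-Panel Display

The system renders three views of each frame side by side:

# Create the three views
display_frame = frame.copy()          # Original frame + behavior labels
skeleton_only = np.zeros_like(frame)  # Skeleton only
skeleton_overlay = frame.copy()       # Original frame with skeleton overlay

# Concatenate the three views horizontally
triple_frame = np.concatenate([display_frame, skeleton_only, skeleton_overlay], axis=1)

  1. Left: original video frame with live behavior labels

  2. Middle: skeleton-only view

  3. Right: original frame with the skeleton overlaid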

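Performance Optimization Strategies

Computational Efficiency

  • Model parallelism: the three models are independent, so they can run concurrently to make full use of the available compute

  • Landmark sampling: only a subset of the face landmarks is drawn, which avoids overdrawing

  • Distance computation: compare squared distances to avoid unnecessary square roots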
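The squared-distance comparison mentioned in the last bullet can be sketched as follows. The `_calculate_distance` method in the full listing takes a square root. This hypothetical `within_threshold` helper shows the alternative, assuming a non-negative threshold and points given as (x, y) tuples.

```python
def within_threshold(p, q, threshold):
    """True if 2D points p and q are closer than threshold, without a square root."""
    dx, dy = p[0] - q[0], p[1] - q[1]
    # Comparing squared values is equivalent because both sides are non-negative.
    return dx * dx + dy * dy < threshold * threshold


close = within_threshold((0, 0), (0.03, 0.04), 0.1)   # true distance 0.05
apart = within_threshold((0, 0), (0.3, 0.4), 0.1)     # true distance 0.5
```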

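Memory Management

  • Image reuse: avoid unnecessary image copies

  • Result caching: manage the lifetime of detection results deliberately

  • Lazy resource loading: initialize model resources on demand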
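Lazy loading can be sketched with a small registry that builds each detector the first time it is requested. `LazyDetectors` and the factory functions are hypothetical; in the real system each factory would wrap a call such as `vision.PoseLandmarker.create_from_options(...)`.

```python
class LazyDetectors:
    """Creates each detector on first use and reuses it afterwards."""

    def __init__(self, factories):
        self._factories = factories  # name -> zero-argument function that builds a detector
        self._instances = {}

    def get(self, name):
        if name not in self._instances:
            self._instances[name] = self._factories[name]()
        return self._instances[name]


# Stand-in factory that records how often it is called.
calls = []
lazy = LazyDetectors({"pose": lambda: calls.append("pose") or "pose-detector"})
first = lazy.get("pose")
second = lazy.get("pose")
```

The factory runs only on the first `get`, so a model that is never requested is never loaded.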

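Application Scenarios

1. Human-computer interaction

  • Gesture-controlled interfaces

  • Facial expression recognition

  • Body posture control

2. Health monitoring

  • Drowsy driving detection (blink rate)

  • Sitting posture reminders

  • Rehabilitation exercise guidance

3. Security monitoring

  • Abnormal behavior detection

  • Fall detection and alerts

  • Restricted-area intrusion detection

4. Sports analysis

  • Movement posture analysis

  • Training form correction

  • Athletic performance assessment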

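Technical Challenges and Solutions

Challenge 1: Multi-model coordination

Problem: the three models run independently, so their outputs must be coordinated
Solution: use a shared timestamp and coordinate system so that the results stay in sync

Challenge 2: Real-time performance

Problem: running several models at once is computationally expensive
Solution: streamline image preprocessing and postprocessing to cut unnecessary computation

Challenge 3: Recognition accuracy

Problem: lighting conditions, occlusion, and similar factors reduce recognition accuracy
Solution: fuse multiple features and adapt thresholds dynamically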
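Adaptive thresholding can take many forms. The sketch below shows one simple option: track a slow-moving baseline of a signal (for example the eye aspect ratio while the eye is open) and set the threshold to a fraction of that baseline. The class name and the `ratio` and `alpha` defaults are assumptions for illustration, not values from the article.

```python
class AdaptiveThreshold:
    """Tracks a slow-moving baseline of a signal and derives a threshold from it."""

    def __init__(self, initial_baseline, ratio=0.6, alpha=0.05):
        self.baseline = initial_baseline
        self.ratio = ratio    # threshold as a fraction of the baseline
        self.alpha = alpha    # smoothing factor for the exponential moving average

    @property
    def threshold(self):
        return self.ratio * self.baseline

    def update(self, value):
        """Return True if value is below the threshold; otherwise fold it into the baseline."""
        if value < self.threshold:
            return True
        self.baseline += self.alpha * (value - self.baseline)
        return False


tracker = AdaptiveThreshold(0.30)
below = tracker.update(0.10)   # below 0.6 * 0.30, baseline unchanged
above = tracker.update(0.40)   # above threshold, baseline moves toward 0.40
```

Values below the threshold are left out of the baseline so that a long closure does not drag the threshold down with it.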

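Future Directions

1. Deep learning enhancements

  • Model temporal behavior patterns with RNNs/LSTMs

  • Add attention mechanisms to improve detection accuracy in key regions

  • Use Transformer architectures for long behavior sequences

2. Multimodal fusion

  • Combine audio information to improve behavior understanding

  • Integrate data from environmental sensors

  • Add depth camera support for 3D pose estimation

3. Application extensions

  • VR/AR interaction control

  • Smart-home gesture control

  • Remote medical monitoring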

Usage Example

import cv2
import numpy as np
import mediapipe as mp
from mediapipe import solutions
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
import time


class HumanMultiLandmarker:
    def __init__(self,
                 pose_model="path/to/pose_landmarker_heavy.task",
                 hand_model="path/to/hand_landmarker.task",
                 face_model="path/to/face_landmarker.task",
                 point_size=5,
                 line_thickness=2):
        """Load pose, hand, and face models"""
        try:
            base_pose = python.BaseOptions(model_asset_path=pose_model)
            base_hand = python.BaseOptions(model_asset_path=hand_model)
            base_face = python.BaseOptions(model_asset_path=face_model)

            self.pose_detector = vision.PoseLandmarker.create_from_options(
                vision.PoseLandmarkerOptions(
                    base_options=base_pose,
                    num_poses=1,
                    running_mode=vision.RunningMode.IMAGE
                )
            )

            self.hand_detector = vision.HandLandmarker.create_from_options(
                vision.HandLandmarkerOptions(
                    base_options=base_hand,
                    num_hands=2,
                    running_mode=vision.RunningMode.IMAGE
                )
            )

            self.face_detector = vision.FaceLandmarker.create_from_options(
                vision.FaceLandmarkerOptions(
                    base_options=base_face,
                    num_faces=1,
                    running_mode=vision.RunningMode.IMAGE
                )
            )

            print("All models loaded successfully!")
        except Exception as e:
            print(f"Model loading failed: {e}")
            # Create dummy detectors to avoid subsequent errors
            self.pose_detector = None
            self.hand_detector = None
            self.face_detector = None

        # Drawing parameters
        self.point_size = point_size
        self.line_thickness = line_thickness
        self.pose_connections = solutions.pose.POSE_CONNECTIONS
        self.hand_connections = solutions.hands.HAND_CONNECTIONS

        # Action detection related variables
        self.eye_closed_threshold = 0.2  # Adjusted eye closure threshold
        self.blink_threshold = 0.3  # Blink threshold
        self.blink_cooldown = 0.5  # Blink detection cooldown time (seconds)
        self.last_blink_time = 0  # Last blink time
        self.face_touch_threshold = 0.1  # Hand-face contact threshold
        self.leg_touch_threshold = 0.15  # Hand-leg contact threshold
        self.bend_threshold = 1.5  # Limb bending threshold (radians)

    def _draw_landmarks(self, frame, landmarks, connections=None, color=(0, 255, 0)):
        """General drawing function"""
        h, w, _ = frame.shape
        for lm in landmarks:
            cx, cy = int(lm.x * w), int(lm.y * h)
            cv2.circle(frame, (cx, cy), self.point_size, color, -1)

        if connections:
            for start, end in connections:
                if start < len(landmarks) and end < len(landmarks):
                    x1, y1 = int(landmarks[start].x * w), int(landmarks[start].y * h)
                    x2, y2 = int(landmarks[end].x * w), int(landmarks[end].y * h)
                    cv2.line(frame, (x1, y1), (x2, y2), color, self.line_thickness)

    def _calculate_distance(self, point1, point2):
        """Calculate Euclidean distance between two points"""
        return ((point1.x - point2.x) ** 2 + (point1.y - point2.y) ** 2) ** 0.5

    def _is_eye_closed(self, eye_landmarks):
        """Detect if eye is closed - using a simpler method"""
        if len(eye_landmarks) < 6:
            return False

        # Calculate distance between upper and lower eyelid keypoints
        upper_lid = eye_landmarks[1]  # Upper eyelid
        lower_lid = eye_landmarks[4]  # Lower eyelid

        # Calculate vertical distance
        vertical_dist = self._calculate_distance(upper_lid, lower_lid)

        # Calculate eye width
        left_corner = eye_landmarks[0]  # Left eye corner
        right_corner = eye_landmarks[3]  # Right eye corner
        horizontal_dist = self._calculate_distance(left_corner, right_corner)

        # Calculate eye aspect ratio (guard against a degenerate zero-width eye)
        if horizontal_dist == 0:
            return False
        ear = vertical_dist / horizontal_dist
        return ear < self.eye_closed_threshold

    def _is_hand_touching_face(self, hand_landmarks, face_landmarks):
        """Detect if hand is touching face"""
        if not hand_landmarks or not face_landmarks or len(face_landmarks) < 10:
            return False

        # Only check distance between fingertips and face center area
        fingertip_indices = [4, 8, 12, 16, 20]  # Fingertip keypoint indices
        face_center_indices = [1, 5, 6, 10, 152, 234]  # Face center area keypoints

        for tip_idx in fingertip_indices:
            if tip_idx >= len(hand_landmarks):
                continue
            hand_point = hand_landmarks[tip_idx]

            for face_idx in face_center_indices:
                if face_idx >= len(face_landmarks):
                    continue
                face_point = face_landmarks[face_idx]

                if self._calculate_distance(hand_point, face_point) < self.face_touch_threshold:
                    return True
        return False

    def _is_hand_touching_leg(self, hand_landmarks, pose_landmarks):
        """Detect if hand is touching leg"""
        if not hand_landmarks or not pose_landmarks or len(pose_landmarks) < 27:
            return False

        # Leg keypoint indices (MediaPipe Pose model)
        left_hip = pose_landmarks[23]  # Left hip
        right_hip = pose_landmarks[24]  # Right hip
        left_knee = pose_landmarks[25]  # Left knee
        right_knee = pose_landmarks[26]  # Right knee

        leg_points = [left_hip, right_hip, left_knee, right_knee]

        # Check distance between hand fingertip keypoints and leg keypoints
        fingertip_indices = [4, 8, 12, 16, 20]  # Fingertip keypoint indices

        for tip_idx in fingertip_indices:
            if tip_idx >= len(hand_landmarks):
                continue
            hand_point = hand_landmarks[tip_idx]

            for leg_point in leg_points:
                if self._calculate_distance(hand_point, leg_point) < self.leg_touch_threshold:
                    return True
        return False

    def _is_limb_bent(self, joint_landmarks):
        """Detect if limb is bent"""
        if len(joint_landmarks) < 3:
            return False

        # Calculate joint angle
        a = np.array([joint_landmarks[0].x, joint_landmarks[0].y])
        b = np.array([joint_landmarks[1].x, joint_landmarks[1].y])
        c = np.array([joint_landmarks[2].x, joint_landmarks[2].y])

        ba = a - b
        bc = c - b

        norm = np.linalg.norm(ba) * np.linalg.norm(bc)
        if norm == 0:
            return False  # Coincident landmarks: the angle is undefined

        cosine_angle = np.dot(ba, bc) / norm
        angle = np.arccos(np.clip(cosine_angle, -1.0, 1.0))

        # If angle is less than threshold, consider limb bent
        return angle < self.bend_threshold

    def detect_actions(self, frame, pose_res, hand_res, face_res):
        """Detect various actions"""
        actions = []

        # Check if detection results are valid
        if not face_res or not hasattr(face_res, 'face_landmarks') or not face_res.face_landmarks:
            return ["No face detected"]

        # Eye action detection
        if face_res.face_landmarks:
            face_landmarks = face_res.face_landmarks[0]

            # Use simpler method to detect eyes
            # Left eye keypoints (simplified version)
            left_eye_indices = [33, 160, 158, 133, 153, 144]
            right_eye_indices = [362, 385, 387, 263, 373, 380]

            # Ensure indices don't go out of bounds
            left_eye = [face_landmarks[i] for i in left_eye_indices if i < len(face_landmarks)]
            right_eye = [face_landmarks[i] for i in right_eye_indices if i < len(face_landmarks)]

            if len(left_eye) >= 6 and len(right_eye) >= 6:
                left_eye_closed = self._is_eye_closed(left_eye)
                right_eye_closed = self._is_eye_closed(right_eye)

                if left_eye_closed and right_eye_closed:
                    actions.append("Both eyes closed")
                elif left_eye_closed:
                    actions.append("Left eye closed")
                elif right_eye_closed:
                    actions.append("Right eye closed")

                # Blink detection (requires time context)
                current_time = time.time()
                if (left_eye_closed or right_eye_closed) and current_time - self.last_blink_time > self.blink_cooldown:
                    actions.append("Blinking")
                    self.last_blink_time = current_time

        # Hand action detection
        left_hand_touching_face = False
        right_hand_touching_face = False

        if hand_res and hand_res.hand_landmarks and face_res.face_landmarks:
            for i, hand_landmarks in enumerate(hand_res.hand_landmarks):
                if i < len(hand_res.handedness) and len(hand_res.handedness[i]) > 0:
                    handedness = hand_res.handedness[i][0].category_name
                    is_touching_face = self._is_hand_touching_face(hand_landmarks, face_res.face_landmarks[0])

                    if handedness == "Left" and is_touching_face:
                        left_hand_touching_face = True
                    elif handedness == "Right" and is_touching_face:
                        right_hand_touching_face = True

        if left_hand_touching_face and right_hand_touching_face:
            actions.append("Both hands touching face")
        elif left_hand_touching_face:
            actions.append("Left hand touching face")
        elif right_hand_touching_face:
            actions.append("Right hand touching face")

        # Limb bending detection
        if pose_res and pose_res.pose_landmarks:
            pose_landmarks = pose_res.pose_landmarks[0]

            # Ensure keypoints exist
            if len(pose_landmarks) >= 29:
                # Left arm bending detection
                left_arm_bent = self._is_limb_bent([pose_landmarks[11], pose_landmarks[13], pose_landmarks[15]])
                # Right arm bending detection
                right_arm_bent = self._is_limb_bent([pose_landmarks[12], pose_landmarks[14], pose_landmarks[16]])

                if left_arm_bent:
                    actions.append("Left arm bent")
                if right_arm_bent:
                    actions.append("Right arm bent")

        # If no actions detected, show prompt
        if not actions:
            actions.append("No action detected")

        return actions

    def do(self, frame, device):
        """Three-model detection + action detection + three-screen display"""
        if frame is None:
            return None

        # Create copy for drawing
        display_frame = frame.copy()

        # Convert image format
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)

        # Initialize detection results
        pose_res, hand_res, face_res = None, None, None

        # Detect pose
        if self.pose_detector:
            try:
                pose_res = self.pose_detector.detect(mp_image)
            except Exception as e:
                print(f"Pose detection failed: {e}")

        # Detect hands
        if self.hand_detector:
            try:
                hand_res = self.hand_detector.detect(mp_image)
            except Exception as e:
                print(f"Hand detection failed: {e}")

        # Detect face
        if self.face_detector:
            try:
                face_res = self.face_detector.detect(mp_image)
            except Exception as e:
                print(f"Face detection failed: {e}")

        # Create skeleton images
        skeleton_only = np.zeros_like(frame)
        skeleton_overlay = frame.copy()

        # Draw pose
        if pose_res and pose_res.pose_landmarks:
            for pose_landmarks in pose_res.pose_landmarks:
                self._draw_landmarks(skeleton_only, pose_landmarks,
                                     self.pose_connections, (255, 255, 255))
                self._draw_landmarks(skeleton_overlay, pose_landmarks,
                                     self.pose_connections, (255, 255, 255))

        # Draw hands
        if hand_res and hand_res.hand_landmarks:
            for hand_landmarks in hand_res.hand_landmarks:
                self._draw_landmarks(skeleton_only, hand_landmarks,
                                     self.hand_connections, (0, 255, 255))
                self._draw_landmarks(skeleton_overlay, hand_landmarks,
                                     self.hand_connections, (0, 255, 255))

        # Draw face (only points, avoid being too dense)
        if face_res and face_res.face_landmarks:
            for face_landmarks in face_res.face_landmarks:
                # Only draw some keypoints to avoid being too dense
                for i in range(0, len(face_landmarks), 10):
                    if i < len(face_landmarks):
                        self._draw_landmarks(skeleton_only, [face_landmarks[i]], None, (0, 0, 255))
                        self._draw_landmarks(skeleton_overlay, [face_landmarks[i]], None, (0, 0, 255))

        # Action detection
        actions = self.detect_actions(frame, pose_res, hand_res, face_res)

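        # Display detected actions on the left original video frame.
        # At font scale 2.0 each text line is roughly taller than 40 px,
        # so lines are spaced 60 px apart to keep them from overlapping.
        y_offset = 60
        for action in actions:
            cv2.putText(display_frame, action, (10, y_offset),
                        cv2.FONT_HERSHEY_SIMPLEX, 2.0, (0, 0, 255), 5)
            y_offset += 60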

        # Concatenate three screens
        triple_frame = np.concatenate([display_frame, skeleton_only, skeleton_overlay], axis=1)
        return triple_frame

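Conclusion

The multimodal landmark detection and behavior recognition system described in this article shows what computer vision can offer for human-computer interaction and behavior analysis. By combining three core MediaPipe models, it provides a broad set of behavior-understanding features.

The system has the following strengths:

  1. Coverage: detects the body, hands, and face together

  2. Real-time operation: a lightweight per-frame pipeline built for real-time processing

  3. Accuracy: combines several features to improve behavior recognition

  4. Extensibility: a modular design that makes new features easy to add

As computer vision continues to advance, multimodal behavior analysis systems of this kind are likely to see use in more areas, supporting applications such as human-computer interaction, health monitoring, and security.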

