PiscCode: Real-Time Multimodal Human Landmark Detection and Behavior Recognition with MediaPipe
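This article presents a MediaPipe-based multimodal human behavior recognition system. It combines three landmark detection models (pose, hand, and face) to recognize behaviors such as blinking, hand gestures, and limb movements in real time. The system runs all three models on every frame, keeps the per-frame computation light, and renders a three-panel visualization. Its key techniques are blink detection based on the eye aspect ratio (EAR), limb-bend recognition from joint angles, and touch detection from hand-to-face distances. Potential applications include human-computer interaction, health monitoring, and security surveillance; the design aims to be comprehensive, real-time, and extensible.
The sections below walk through the system in detail: how it detects pose, hand, and face landmarks in real time and how it turns those landmarks into recognized behaviors.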
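System Overview
The system combines three independent MediaPipe models into a single human behavior analysis pipeline:
- Pose detection: locates the body's main joints and skeletal structure
- Hand detection: tracks 21 landmarks on each hand
- Face detection: detects 468 face mesh landmarks (the FaceLandmarker task model also returns 10 iris landmarks, 478 in total)
By analyzing the three models' results together, the system recognizes behavior patterns such as blinking, hand gestures, and limb movements.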
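Core Architecture
Multi-Model Integration
class HumanMultiLandmarker:
    def __init__(self, pose_model, hand_model, face_model):
        # Initialize three independent MediaPipe models (options are elided
        # here; the complete listing appears at the end of the article)
        self.pose_detector = vision.PoseLandmarker.create_from_options(...)
        self.hand_detector = vision.HandLandmarker.create_from_options(...)
        self.face_detector = vision.FaceLandmarker.create_from_options(...)
Each model handles one part of the body, and the results are fused for analysis afterward. In the listing the three detectors are called one after another on the same frame; because they do not depend on each other, they could also be dispatched concurrently.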
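The concurrent dispatch mentioned above can be sketched with a thread pool. This is only an illustration under assumptions: the three lambdas are hypothetical stand-ins for `pose_detector.detect`, `hand_detector.detect`, and `face_detector.detect`, and whether threads actually speed up MediaPipe inference depends on how much of each call runs outside the Python interpreter lock, which should be measured rather than assumed.

```python
from concurrent.futures import ThreadPoolExecutor


def detect_all(image, detectors, executor):
    """Submit every detector on the same image and collect {name: result}."""
    futures = {name: executor.submit(fn, image) for name, fn in detectors.items()}
    # result() blocks until each detector has finished
    return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for the three MediaPipe detect() methods
detectors = {
    "pose": lambda img: f"pose({img})",
    "hand": lambda img: f"hand({img})",
    "face": lambda img: f"face({img})",
}

with ThreadPoolExecutor(max_workers=3) as pool:
    results = detect_all("frame-0", detectors, pool)

print(results["pose"])
```

In a real pipeline the executor would be created once and reused for every frame, since creating threads per frame would cost more than it saves.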
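Real-Time Processing Flow
def do(self, frame, device):
    # Convert the OpenCV BGR frame to the RGB layout MediaPipe expects
    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)
    # Run the three detectors on the same image (sequential calls)
    pose_res = self.pose_detector.detect(mp_image)
    hand_res = self.hand_detector.detect(mp_image)
    face_res = self.face_detector.detect(mp_image)
The frame is converted once and the resulting image is shared by all three detectors, which keeps the per-frame overhead low.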
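Landmark Detection and Visualization
Skeleton Drawing
def _draw_landmarks(self, frame, landmarks, connections=None, color=(0, 255, 0)):
    """Generic landmark drawing function."""
    h, w, _ = frame.shape
    # Landmark coordinates are normalized, so scale them to pixels
    for lm in landmarks:
        cx, cy = int(lm.x * w), int(lm.y * h)
        cv2.circle(frame, (cx, cy), self.point_size, color, -1)
    if connections:
        for start, end in connections:
            # Draw each connection, skipping indices outside the landmark list
            if start < len(landmarks) and end < len(landmarks):
                x1, y1 = int(landmarks[start].x * w), int(landmarks[start].y * h)
                x2, y2 = int(landmarks[end].x * w), int(landmarks[end].y * h)
                cv2.line(frame, (x1, y1), (x2, y2), color, self.line_thickness)
The system offers two visualization modes:
- Skeleton-only mode: draws the skeleton on a black background
- Overlay mode: draws landmarks and connections on top of the original frame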
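The drawing code scales normalized landmark coordinates to pixels. A small helper can make that step explicit and skip landmarks that fall outside the frame, which may happen when a body part is only partly visible. The sketch below assumes plain `(x, y)` floats instead of MediaPipe landmark objects, and `to_pixel` is a hypothetical name, not part of the original class.

```python
def to_pixel(x_norm, y_norm, width, height):
    """Map normalized coordinates to integer pixels, or None if off-frame."""
    if not (0.0 <= x_norm <= 1.0 and 0.0 <= y_norm <= 1.0):
        return None
    # Clamp so that a coordinate of exactly 1.0 maps to the last pixel
    cx = min(int(x_norm * width), width - 1)
    cy = min(int(y_norm * height), height - 1)
    return cx, cy


center = to_pixel(0.5, 0.5, 640, 480)
corner = to_pixel(1.0, 1.0, 640, 480)
off_frame = to_pixel(1.2, 0.5, 640, 480)
print(center, corner, off_frame)
```

Skipping off-frame points is a design choice: it avoids drawing skeleton lines toward extrapolated positions, at the cost of dropping connections near the frame edge.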
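Behavior Recognition Algorithms
Eye Action Detection
def _is_eye_closed(self, eye_landmarks):
    """Detect whether an eye is closed."""
    # Six points per eye, ordered: corner, upper, upper, corner, lower, lower.
    # Index 5 is the lower-lid point below index 1, so the pair measures the
    # vertical opening (pairing index 1 with index 4 would measure a diagonal)
    upper_lid = eye_landmarks[1]
    lower_lid = eye_landmarks[5]
    left_corner = eye_landmarks[0]
    right_corner = eye_landmarks[3]
    vertical_dist = self._calculate_distance(upper_lid, lower_lid)
    horizontal_dist = self._calculate_distance(left_corner, right_corner)
    # Eye aspect ratio: eyelid opening relative to eye width
    ear = vertical_dist / horizontal_dist
    return ear < self.eye_closed_threshold
The check is based on the eye aspect ratio (EAR), the eyelid opening divided by the eye width. The ratio drops toward zero as the eye closes, so comparing it with a threshold detects blinks and closed eyes.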
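For comparison, the commonly cited form of the EAR averages two vertical eyelid distances instead of using a single pair, which makes it less sensitive to noise in any one landmark. The sketch below assumes six `(x, y)` points ordered p1..p6 (corner, two upper-lid points, corner, two lower-lid points); the sample coordinates are made up for illustration and are not MediaPipe output.

```python
import math


def dist(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def eye_aspect_ratio(eye):
    """EAR = (|p2 - p6| + |p3 - p5|) / (2 * |p1 - p4|)."""
    p1, p2, p3, p4, p5, p6 = eye
    width = dist(p1, p4)
    if width == 0:
        return 0.0  # degenerate input; treat as closed
    return (dist(p2, p6) + dist(p3, p5)) / (2.0 * width)


# Synthetic eyes: 3 units wide, opening of 1.0 (open) and 0.1 (nearly closed)
open_eye = [(0, 0), (1, 0.5), (2, 0.5), (3, 0), (2, -0.5), (1, -0.5)]
closed_eye = [(0, 0), (1, 0.05), (2, 0.05), (3, 0), (2, -0.05), (1, -0.05)]

print(round(eye_aspect_ratio(open_eye), 3))
print(round(eye_aspect_ratio(closed_eye), 3))
```

With these synthetic points the open eye gives a ratio of about 0.333 and the nearly closed eye about 0.033, so a threshold of 0.2 like the one in the listing separates them.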
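Gesture Recognition
def _is_hand_touching_face(self, hand_landmarks, face_landmarks):
    """Detect whether a hand is touching the face."""
    fingertip_indices = [4, 8, 12, 16, 20]  # fingertip landmarks, thumb to pinky
    face_center_indices = [1, 5, 6, 10, 152, 234]  # face center area
    for tip_idx in fingertip_indices:
        hand_point = hand_landmarks[tip_idx]
        for face_idx in face_center_indices:
            face_point = face_landmarks[face_idx]
            if self._calculate_distance(hand_point, face_point) < self.face_touch_threshold:
                return True
    return False
By measuring the distance between each fingertip and a set of face landmarks, the system recognizes when a hand touches the face. The distances are in normalized image coordinates, so the threshold (0.1 in the listing) is a fraction of the frame size, not a physical distance.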
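One property of normalized coordinates is worth illustrating: x is normalized by the frame width and y by the frame height, so on a non-square frame the same normalized offset covers a different number of pixels horizontally and vertically. A minimal sketch, assuming plain `(x, y)` tuples and a hypothetical helper `pixel_distance` that is not part of the original class:

```python
import math


def pixel_distance(p, q, width, height):
    """Distance in pixels between two normalized (x, y) points."""
    return math.hypot((p[0] - q[0]) * width, (p[1] - q[1]) * height)


# The same normalized offset of 0.1 on a 1280x720 frame
horizontal = pixel_distance((0.4, 0.5), (0.5, 0.5), 1280, 720)
vertical = pixel_distance((0.5, 0.4), (0.5, 0.5), 1280, 720)

print(round(horizontal), round(vertical))
```

Here the horizontal offset spans about 128 pixels and the vertical one about 72, although both are 0.1 in normalized units. Comparing distances in pixel space (or scaling one axis by the aspect ratio) makes a touch threshold behave the same in every direction.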
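Limb Bend Detection
def _is_limb_bent(self, joint_landmarks):
    """Detect whether a limb is bent."""
    # Three joints in order, e.g. shoulder (a), elbow (b), wrist (c)
    a = np.array([joint_landmarks[0].x, joint_landmarks[0].y])
    b = np.array([joint_landmarks[1].x, joint_landmarks[1].y])
    c = np.array([joint_landmarks[2].x, joint_landmarks[2].y])
    # Angle at the middle joint between the two limb segments
    ba = a - b
    bc = c - b
    cosine_angle = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
    angle = np.arccos(np.clip(cosine_angle, -1.0, 1.0))
    return angle < self.bend_threshold  # an angle below the threshold means bent
The joint angle comes from the dot product of the two segment vectors that meet at the middle joint. With the listing's threshold of 1.5 rad (about 86°), a limb counts as bent only when the angle at the joint is sharper than roughly a right angle.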
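The angle computation can be checked by hand on a few simple configurations. The sketch below is a pure-Python equivalent of the NumPy version, assuming 2D `(x, y)` tuples; `joint_angle` is a hypothetical helper name, and the sample points are illustrative.

```python
import math


def joint_angle(a, b, c):
    """Angle at b in radians between segments b->a and b->c, or None if degenerate."""
    bax, bay = a[0] - b[0], a[1] - b[1]
    bcx, bcy = c[0] - b[0], c[1] - b[1]
    norm = math.hypot(bax, bay) * math.hypot(bcx, bcy)
    if norm == 0:
        return None  # two joints coincide, so the angle is undefined
    cos_angle = max(-1.0, min(1.0, (bax * bcx + bay * bcy) / norm))
    return math.acos(cos_angle)


BEND_THRESHOLD = 1.5  # radians, the value used in the listing

straight = joint_angle((0, 0), (1, 0), (2, 0))       # fully extended arm
right_angle = joint_angle((0, 0), (1, 0), (1, 1))    # 90-degree elbow
folded = joint_angle((0, 0), (1, 0), (0.2, 0.3))     # forearm folded back

for name, angle in [("straight", straight), ("right angle", right_angle), ("folded", folded)]:
    print(name, round(math.degrees(angle), 1), angle < BEND_THRESHOLD)
```

The extended arm gives π, the right angle gives π/2 (slightly above the 1.5 rad threshold, so it is not reported as bent), and the folded arm gives a small angle that is.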

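Supported Behavior Types
1. Eye behaviors
- Blink detection: recognizes momentary blinks using temporal context
- Eye closure: detects prolonged eye-closed states
- Single-eye actions: distinguishes independent left-eye and right-eye actions
2. Hand behaviors
- Face touching: detects a hand touching the face
- Two-hand coordination: recognizes both hands acting at the same time
- Gesture recognition: recognizes simple gestures from fingertip positions
3. Limb behaviors
- Arm bending: detects elbow flexion
- Leg actions: recognizes hand-to-leg contact
- Body posture: analyzes overall body posture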
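The difference between a momentary blink and a prolonged closure, listed under eye behaviors above, comes down to how long the eye stays closed. One way to express that temporal context is a small state machine that reports a blink on the closed-to-open transition. This is a sketch under assumptions, not the listing's method: `BlinkCounter` and the 0.4 s limit are hypothetical, and timestamps are passed in explicitly so the logic can be tested without a camera.

```python
class BlinkCounter:
    """Reports a blink when the eye reopens within max_blink_s seconds."""

    def __init__(self, max_blink_s=0.4):
        self.max_blink_s = max_blink_s
        self.closed_since = None  # time at which the current closure started
        self.blinks = 0

    def update(self, eye_closed, t):
        """Feed one frame; returns 'blink', 'eyes closed', or None."""
        if eye_closed:
            if self.closed_since is None:
                self.closed_since = t
            if t - self.closed_since > self.max_blink_s:
                return "eyes closed"  # closure has lasted too long for a blink
            return None
        event = None
        if self.closed_since is not None:
            if t - self.closed_since <= self.max_blink_s:
                self.blinks += 1
                event = "blink"
            self.closed_since = None
        return event


counter = BlinkCounter()
frames = [(False, 0.0), (True, 0.1), (True, 0.2), (False, 0.3),
          (True, 0.4), (True, 0.9), (False, 1.1)]
events = [counter.update(closed, t) for closed, t in frames]
print(events, counter.blinks)
```

In this sequence the first closure lasts 0.2 s and is counted as a blink, while the second lasts 0.7 s and is reported as closed eyes instead.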
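Three-Panel Display
The system renders three views side by side:
# Create the three views
display_frame = frame.copy()  # original frame + behavior labels
skeleton_only = np.zeros_like(frame)  # skeleton on a black background
skeleton_overlay = frame.copy()  # skeleton drawn over the original frame
# Concatenate the three panels horizontally
triple_frame = np.concatenate([display_frame, skeleton_only, skeleton_overlay], axis=1)
- Left: original video frame with real-time behavior labels
- Middle: skeleton only
- Right: skeleton overlaid on the original frame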
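Performance Optimization
Computational Efficiency
- Model parallelism: the three models do not depend on each other, so their inference calls can run concurrently to make fuller use of the available compute (the listing calls them sequentially)
- Landmark subsampling: only every tenth face landmark is drawn, which avoids overdraw
- Distance computation: a threshold can be compared against the squared distance, which avoids an unnecessary square root (the listing's _calculate_distance still takes the root)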
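The squared-distance comparison from the last item can be sketched as follows. Because both sides of the inequality are non-negative, `d < t` and `d*d < t*t` give the same answer, so the square root is not needed. The helper name `within` and the tuple inputs are assumptions for illustration.

```python
def within(p, q, threshold):
    """True if points p and q are closer than threshold, without a square root."""
    dx = p[0] - q[0]
    dy = p[1] - q[1]
    return dx * dx + dy * dy < threshold * threshold


near = within((0.0, 0.0), (0.05, 0.05), 0.1)  # distance is about 0.071
far = within((0.0, 0.0), (0.1, 0.1), 0.1)     # distance is about 0.141
print(near, far)
```

For the few dozen distance checks per frame in this system the saving is small compared with model inference, so this matters mainly when the number of comparisons grows.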
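Memory Management
- Image reuse: avoid unnecessary image copies
- Result caching: manage the lifetime of detection results deliberately
- Lazy resource loading: initialize model resources on demand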
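Application Scenarios
1. Human-computer interaction
- Gesture-controlled interfaces
- Facial expression recognition
- Body-posture control
2. Health monitoring
- Driver fatigue detection (blink rate)
- Sitting-posture reminders
- Rehabilitation exercise guidance
3. Security monitoring
- Abnormal behavior detection
- Fall detection and alerts
- Restricted-area intrusion detection
4. Sports analysis
- Movement posture analysis
- Training form correction
- Athletic performance evaluation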
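Technical Challenges and Solutions
Challenge 1: Coordinating multiple models
Problem: the three models run independently, so their outputs have to be coordinated
Solution: use a shared timestamp and a shared coordinate system so that the results stay in sync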
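A simple way to keep the three results in sync is to bundle them into one record that carries the frame index and timestamp. The sketch below is an assumption about how this could look: `FrameResult` and `bundle` are hypothetical names, and the timestamp is derived from the frame index and a nominal frame rate.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class FrameResult:
    """Results of all three models for one frame, stamped with a single time."""
    frame_index: int
    timestamp_ms: int
    pose: Optional[Any] = None
    hand: Optional[Any] = None
    face: Optional[Any] = None


def bundle(frame_index, fps, pose=None, hand=None, face=None):
    """Attach one timestamp (derived from the frame index) to all three results."""
    timestamp_ms = int(frame_index * 1000 / fps)
    return FrameResult(frame_index, timestamp_ms, pose, hand, face)


result = bundle(30, 30.0, pose="pose-result", face="face-result")
print(result.frame_index, result.timestamp_ms, result.hand)
```

Since all three models receive the same image, their normalized coordinates already share one coordinate system; the record only has to guarantee that downstream code never mixes results from different frames. A millisecond timestamp of this kind is also what the landmarkers' video running mode expects for each frame.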
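Challenge 2: Real-time performance
Problem: running several models at once is computationally expensive
Solution: streamline image pre-processing and post-processing to cut unnecessary work
Challenge 3: Recognition accuracy
Problem: lighting conditions, occlusion, and similar factors reduce recognition accuracy
Solution: fuse multiple features and adapt the thresholds dynamically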
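Threshold adaptation can take many forms. One possible sketch, not taken from the original system, tracks a running baseline of the open-eye EAR for the current user and sets the closed-eye threshold to a fraction of it. `AdaptiveEarThreshold`, the 0.75 ratio, and the smoothing factor are all illustrative assumptions.

```python
class AdaptiveEarThreshold:
    """Closed-eye threshold that follows a running baseline of open-eye EAR."""

    def __init__(self, ratio=0.75, alpha=0.1, initial_baseline=0.3):
        self.ratio = ratio        # threshold as a fraction of the baseline
        self.alpha = alpha        # smoothing factor of the moving average
        self.baseline = initial_baseline

    @property
    def threshold(self):
        return self.ratio * self.baseline

    def update(self, ear):
        """Classify one EAR sample; only open-eye samples move the baseline."""
        closed = ear < self.threshold
        if not closed:
            self.baseline += self.alpha * (ear - self.baseline)
        return closed


detector = AdaptiveEarThreshold()
for _ in range(50):
    detector.update(0.4)  # a user whose open-eye EAR is around 0.4

half_closed = detector.update(0.25)
print(round(detector.threshold, 3), half_closed)
```

After the baseline settles near 0.4 the threshold is close to 0.3, so an EAR of 0.25 is classified as closed, whereas a fixed threshold of 0.2 would miss it. Updating the baseline only on open-eye samples keeps long closures from dragging the threshold down.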
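Future Directions
1. Deep learning enhancements
- Model temporal behavior patterns with RNNs/LSTMs
- Add attention mechanisms to improve detection accuracy in key regions
- Use Transformer architectures for long behavior sequences
2. Multimodal fusion
- Combine audio to improve behavior understanding
- Integrate environmental sensor data
- Add depth cameras for 3D pose estimation
3. Application extensions
- VR/AR interaction control
- Smart-home gesture control
- Remote medical monitoring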
Usage Example
import time

import cv2
import mediapipe as mp
import numpy as np
from mediapipe import solutions
from mediapipe.tasks import python
from mediapipe.tasks.python import vision


class HumanMultiLandmarker:
    def __init__(self,
                 pose_model="path/to/pose_landmarker_heavy.task",
                 hand_model="path/to/hand_landmarker.task",
                 face_model="path/to/face_landmarker.task",
                 point_size=5,
                 line_thickness=2):
        """Load the pose, hand, and face models."""
        try:
            base_pose = python.BaseOptions(model_asset_path=pose_model)
            base_hand = python.BaseOptions(model_asset_path=hand_model)
            base_face = python.BaseOptions(model_asset_path=face_model)
            self.pose_detector = vision.PoseLandmarker.create_from_options(
                vision.PoseLandmarkerOptions(
                    base_options=base_pose,
                    num_poses=1,
                    running_mode=vision.RunningMode.IMAGE
                )
            )
            self.hand_detector = vision.HandLandmarker.create_from_options(
                vision.HandLandmarkerOptions(
                    base_options=base_hand,
                    num_hands=2,
                    running_mode=vision.RunningMode.IMAGE
                )
            )
            self.face_detector = vision.FaceLandmarker.create_from_options(
                vision.FaceLandmarkerOptions(
                    base_options=base_face,
                    num_faces=1,
                    running_mode=vision.RunningMode.IMAGE
                )
            )
            print("All models loaded successfully!")
        except Exception as e:
            print(f"Model loading failed: {e}")
            # Fall back to None so that do() skips detection instead of crashing
            self.pose_detector = None
            self.hand_detector = None
            self.face_detector = None
        # Drawing parameters
        self.point_size = point_size
        self.line_thickness = line_thickness
        self.pose_connections = solutions.pose.POSE_CONNECTIONS
        self.hand_connections = solutions.hands.HAND_CONNECTIONS
        # Action detection parameters
        self.eye_closed_threshold = 0.2  # EAR below this counts as a closed eye
        self.blink_threshold = 0.3  # Blink threshold
        self.blink_cooldown = 0.5  # Minimum time between reported blinks (seconds)
        self.last_blink_time = 0  # Time of the last reported blink
        self.face_touch_threshold = 0.1  # Hand-face contact threshold (normalized units)
        self.leg_touch_threshold = 0.15  # Hand-leg contact threshold (normalized units)
        self.bend_threshold = 1.5  # Limb bending threshold (radians)
    def _draw_landmarks(self, frame, landmarks, connections=None, color=(0, 255, 0)):
        """Generic landmark drawing function."""
        h, w, _ = frame.shape
        for lm in landmarks:
            cx, cy = int(lm.x * w), int(lm.y * h)
            cv2.circle(frame, (cx, cy), self.point_size, color, -1)
        if connections:
            for start, end in connections:
                if start < len(landmarks) and end < len(landmarks):
                    x1, y1 = int(landmarks[start].x * w), int(landmarks[start].y * h)
                    x2, y2 = int(landmarks[end].x * w), int(landmarks[end].y * h)
                    cv2.line(frame, (x1, y1), (x2, y2), color, self.line_thickness)

    def _calculate_distance(self, point1, point2):
        """Calculate the Euclidean distance between two points (normalized x, y)."""
        return ((point1.x - point2.x) ** 2 + (point1.y - point2.y) ** 2) ** 0.5
    def _is_eye_closed(self, eye_landmarks):
        """Detect whether an eye is closed using a simplified eye aspect ratio."""
        if len(eye_landmarks) < 6:
            return False
        # Points are ordered: corner, upper, upper, corner, lower, lower.
        # Index 5 is the lower-lid point below index 1, so this pair measures
        # the vertical opening (index 1 with index 4 would measure a diagonal)
        upper_lid = eye_landmarks[1]
        lower_lid = eye_landmarks[5]
        vertical_dist = self._calculate_distance(upper_lid, lower_lid)
        # Eye width: distance between the two eye corners
        left_corner = eye_landmarks[0]
        right_corner = eye_landmarks[3]
        horizontal_dist = self._calculate_distance(left_corner, right_corner)
        if horizontal_dist == 0:
            return False  # degenerate landmarks; avoid dividing by zero
        # Eye aspect ratio: eyelid opening relative to eye width
        ear = vertical_dist / horizontal_dist
        return ear < self.eye_closed_threshold
    def _is_hand_touching_face(self, hand_landmarks, face_landmarks):
        """Detect whether a hand is touching the face."""
        if not hand_landmarks or not face_landmarks or len(face_landmarks) < 10:
            return False
        # Only check distances between fingertips and the face center area
        fingertip_indices = [4, 8, 12, 16, 20]  # Fingertip landmark indices
        face_center_indices = [1, 5, 6, 10, 152, 234]  # Face center area landmarks
        for tip_idx in fingertip_indices:
            if tip_idx >= len(hand_landmarks):
                continue
            hand_point = hand_landmarks[tip_idx]
            for face_idx in face_center_indices:
                if face_idx >= len(face_landmarks):
                    continue
                face_point = face_landmarks[face_idx]
                if self._calculate_distance(hand_point, face_point) < self.face_touch_threshold:
                    return True
        return False

    def _is_hand_touching_leg(self, hand_landmarks, pose_landmarks):
        """Detect whether a hand is touching a leg."""
        if not hand_landmarks or not pose_landmarks or len(pose_landmarks) < 27:
            return False
        # Leg landmark indices (MediaPipe Pose model)
        left_hip = pose_landmarks[23]
        right_hip = pose_landmarks[24]
        left_knee = pose_landmarks[25]
        right_knee = pose_landmarks[26]
        leg_points = [left_hip, right_hip, left_knee, right_knee]
        # Check distances between fingertips and the leg landmarks
        fingertip_indices = [4, 8, 12, 16, 20]  # Fingertip landmark indices
        for tip_idx in fingertip_indices:
            if tip_idx >= len(hand_landmarks):
                continue
            hand_point = hand_landmarks[tip_idx]
            for leg_point in leg_points:
                if self._calculate_distance(hand_point, leg_point) < self.leg_touch_threshold:
                    return True
        return False
    def _is_limb_bent(self, joint_landmarks):
        """Detect whether a limb is bent, given three joints in order."""
        if len(joint_landmarks) < 3:
            return False
        # Angle at the middle joint (b) between segments b->a and b->c
        a = np.array([joint_landmarks[0].x, joint_landmarks[0].y])
        b = np.array([joint_landmarks[1].x, joint_landmarks[1].y])
        c = np.array([joint_landmarks[2].x, joint_landmarks[2].y])
        ba = a - b
        bc = c - b
        norm_product = np.linalg.norm(ba) * np.linalg.norm(bc)
        if norm_product == 0:
            return False  # coincident joints; the angle is undefined
        cosine_angle = np.dot(ba, bc) / norm_product
        angle = np.arccos(np.clip(cosine_angle, -1.0, 1.0))
        # An angle below the threshold means the limb is bent
        return angle < self.bend_threshold
    def detect_actions(self, frame, pose_res, hand_res, face_res):
        """Detect actions from the three models' results."""
        actions = []
        # Require a face; note that this early return also skips the hand
        # and limb checks below when no face is found
        if not face_res or not hasattr(face_res, 'face_landmarks') or not face_res.face_landmarks:
            return ["No face detected"]
        # Eye action detection
        face_landmarks = face_res.face_landmarks[0]
        # Six landmarks per eye, ordered: corner, upper, upper, corner, lower, lower
        left_eye_indices = [33, 160, 158, 133, 153, 144]
        right_eye_indices = [362, 385, 387, 263, 373, 380]
        # Ensure indices don't go out of bounds
        left_eye = [face_landmarks[i] for i in left_eye_indices if i < len(face_landmarks)]
        right_eye = [face_landmarks[i] for i in right_eye_indices if i < len(face_landmarks)]
        if len(left_eye) >= 6 and len(right_eye) >= 6:
            left_eye_closed = self._is_eye_closed(left_eye)
            right_eye_closed = self._is_eye_closed(right_eye)
            if left_eye_closed and right_eye_closed:
                actions.append("Both eyes closed")
            elif left_eye_closed:
                actions.append("Left eye closed")
            elif right_eye_closed:
                actions.append("Right eye closed")
            # Blink detection: report at most one blink per cooldown window
            # while an eye is closed
            current_time = time.time()
            if (left_eye_closed or right_eye_closed) and current_time - self.last_blink_time > self.blink_cooldown:
                actions.append("Blinking")
                self.last_blink_time = current_time
        # Hand action detection
        left_hand_touching_face = False
        right_hand_touching_face = False
        if hand_res and hand_res.hand_landmarks and face_res.face_landmarks:
            for i, hand_landmarks in enumerate(hand_res.hand_landmarks):
                if i < len(hand_res.handedness) and len(hand_res.handedness[i]) > 0:
                    handedness = hand_res.handedness[i][0].category_name
                    is_touching_face = self._is_hand_touching_face(hand_landmarks, face_res.face_landmarks[0])
                    if handedness == "Left" and is_touching_face:
                        left_hand_touching_face = True
                    elif handedness == "Right" and is_touching_face:
                        right_hand_touching_face = True
        if left_hand_touching_face and right_hand_touching_face:
            actions.append("Both hands touching face")
        elif left_hand_touching_face:
            actions.append("Left hand touching face")
        elif right_hand_touching_face:
            actions.append("Right hand touching face")
        # Limb bending detection
        if pose_res and pose_res.pose_landmarks:
            pose_landmarks = pose_res.pose_landmarks[0]
            # Ensure the landmarks exist
            if len(pose_landmarks) >= 29:
                # Left arm: shoulder (11), elbow (13), wrist (15)
                left_arm_bent = self._is_limb_bent([pose_landmarks[11], pose_landmarks[13], pose_landmarks[15]])
                # Right arm: shoulder (12), elbow (14), wrist (16)
                right_arm_bent = self._is_limb_bent([pose_landmarks[12], pose_landmarks[14], pose_landmarks[16]])
                if left_arm_bent:
                    actions.append("Left arm bent")
                if right_arm_bent:
                    actions.append("Right arm bent")
        # If no actions were detected, show a placeholder message
        if not actions:
            actions.append("No action detected")
        return actions
    def do(self, frame, device):
        """Run the three detectors, detect actions, and build the three-panel frame."""
        if frame is None:
            return None
        # Copy of the frame for the labeled left panel
        display_frame = frame.copy()
        # Convert the OpenCV BGR frame to the RGB layout MediaPipe expects
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_frame)
        # Initialize detection results
        pose_res, hand_res, face_res = None, None, None
        # Detect pose; a failure leaves the result as None for this frame
        if self.pose_detector:
            try:
                pose_res = self.pose_detector.detect(mp_image)
            except Exception:
                pass
        # Detect hands
        if self.hand_detector:
            try:
                hand_res = self.hand_detector.detect(mp_image)
            except Exception:
                pass
        # Detect face
        if self.face_detector:
            try:
                face_res = self.face_detector.detect(mp_image)
            except Exception:
                pass
        # Create the skeleton panels
        skeleton_only = np.zeros_like(frame)
        skeleton_overlay = frame.copy()
        # Draw pose
        if pose_res and pose_res.pose_landmarks:
            for pose_landmarks in pose_res.pose_landmarks:
                self._draw_landmarks(skeleton_only, pose_landmarks,
                                     self.pose_connections, (255, 255, 255))
                self._draw_landmarks(skeleton_overlay, pose_landmarks,
                                     self.pose_connections, (255, 255, 255))
        # Draw hands
        if hand_res and hand_res.hand_landmarks:
            for hand_landmarks in hand_res.hand_landmarks:
                self._draw_landmarks(skeleton_only, hand_landmarks,
                                     self.hand_connections, (0, 255, 255))
                self._draw_landmarks(skeleton_overlay, hand_landmarks,
                                     self.hand_connections, (0, 255, 255))
        # Draw face: points only, every tenth landmark, to avoid clutter
        if face_res and face_res.face_landmarks:
            for face_landmarks in face_res.face_landmarks:
                sampled = face_landmarks[::10]
                self._draw_landmarks(skeleton_only, sampled, None, (0, 0, 255))
                self._draw_landmarks(skeleton_overlay, sampled, None, (0, 0, 255))
        # Action detection
        actions = self.detect_actions(frame, pose_res, hand_res, face_res)
        # Label the left panel; scale-2.0 text is taller than 40 px, so use a
        # 60 px line step to keep the labels from overlapping
        y_offset = 60
        for action in actions:
            cv2.putText(display_frame, action, (10, y_offset),
                        cv2.FONT_HERSHEY_SIMPLEX, 2.0, (0, 0, 255), 5)
            y_offset += 60
        # Concatenate the three panels horizontally
        triple_frame = np.concatenate([display_frame, skeleton_only, skeleton_overlay], axis=1)
        return triple_frame
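The listing defines the class but does not show how frames reach `do()`. A minimal driver loop might look like the sketch below. It is written against injected callables so that it runs without a camera: `run_stream`, `EchoProcessor`, and the fake frame source are hypothetical, and `None` is passed for `device` because the listing's `do()` never reads that argument.

```python
def run_stream(read_frame, processor, show, max_frames=None):
    """Feed frames to processor.do() until the source ends; return the frame count."""
    count = 0
    while max_frames is None or count < max_frames:
        ok, frame = read_frame()
        if not ok:
            break  # the source is exhausted or the read failed
        output = processor.do(frame, None)
        if output is not None:
            show(output)
        count += 1
    return count


class EchoProcessor:
    """Hypothetical stand-in for HumanMultiLandmarker: returns the frame unchanged."""

    def do(self, frame, device):
        return frame


# Fake source mimicking the (ok, frame) pairs a video reader returns
frames = iter([(True, "f0"), (True, "f1"), (False, None)])
shown = []
count = run_stream(lambda: next(frames), EchoProcessor(), shown.append)
print(count, shown)
```

With OpenCV, `read_frame` could be the `read` method of a `cv2.VideoCapture` object and `show` a function that calls `cv2.imshow` followed by `cv2.waitKey(1)`, with `HumanMultiLandmarker()` as the processor.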
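Conclusion
The multimodal landmark detection and behavior recognition system described here shows what computer vision can offer for human-computer interaction and behavior analysis. By combining three of MediaPipe's core models, it provides a broad view of human behavior.
The system has the following strengths:
- Comprehensive: covers the body, hands, and face
- Real-time: lightweight per-frame processing keeps it responsive
- Accurate: fusing multiple features improves recognition accuracy
- Extensible: the modular design makes it easy to add features
As computer vision continues to advance, multimodal behavior analysis systems of this kind will find use in more areas, supporting applications in human-computer interaction, health monitoring, and safety.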
Interested in PiscTrace or PiscCode? Visit the official PiscTrace site for more.