Traditionally, robot control algorithms have been written for specific tasks and machines on a case-by-case basis. Teaching robots to perform actions with a single generic algorithm or technology is an aspiration that could revolutionize the field. To advance towards that objective, we applied the Synthetic Cognition algorithm to OpenAI Gym’s environment for robotic simulations, which uses the MuJoCo physics engine.
The most complex of the available tasks was chosen: Fetch, Pick And Place.
As its name suggests, the task requires completing a 3-step process: a robotic arm must move its gripper to an object (a black cube), pick it up, and then carry it to a goal location (marked with a red sphere). In every episode the target location is initialized at a different 3D position, and the episode is considered successful if the object is brought to the target within the time limit.
At every time step, an agent in this environment can observe the 3D positions of the gripper, the object and the target location, and use actuators to perform two actions: moving the arm’s gripper in a specific direction (defined by a 3D vector whose magnitude determines the movement speed), and changing the speed at which the two gripper end-effectors open or close, which is critical for holding and carrying the object. Note that the environment does not provide any image as part of the agent’s observation, although we can assume that in an industrial setting these 3D positions would be obtained from images, e.g. using techniques such as LIDAR and point clouds.
The task was solved following a supervised learning approach, more specifically as a multi-value regression task, where the target variables are the 3D movement vector and the opening/closing speed of the end-effectors (simplified as a single real-valued number). To achieve generalization, the agent uses relative positions, namely the positions of the object and the goal relative to the gripper’s location. For the sake of efficiency, the agent perceives only one of those two relative positions at a time: the position of the object if it is not being carried, and the position of the goal otherwise. The intuition is that the agent only needs to focus on the next relevant position (not both), depending on its state. Such contextual perception can be seen as a proprio-cognition or internal sensor which, based on the magnitude of the vector representing the relative position of the object, can tell whether the object is being carried (i.e. it is carried when there is almost no distance to the object), analogously to a categorical or boolean value.
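The contextual perception described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names and the `CARRY_THRESHOLD` value are hypothetical, and the only logic taken from the text is that the object is treated as carried when its distance to the gripper is almost zero.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Hypothetical threshold: below this distance the object is treated as carried.
CARRY_THRESHOLD = 0.02


def relative(target: Sequence[float], gripper: Sequence[float]) -> Vec3:
    """Position of `target` expressed relative to the gripper."""
    return tuple(t - g for t, g in zip(target, gripper))


def contextual_perception(
    gripper: Sequence[float],
    obj: Sequence[float],
    goal: Sequence[float],
    threshold: float = CARRY_THRESHOLD,
) -> Tuple[Vec3, bool]:
    """Return the single relative position the agent perceives, plus the carried flag.

    The agent sees the object's relative position while it is not carried,
    and the goal's relative position once it is.
    """
    rel_obj = relative(obj, gripper)
    # Magnitude of the relative vector acts as the internal "carried" sensor.
    carried = math.sqrt(sum(c * c for c in rel_obj)) < threshold
    perceived = relative(goal, gripper) if carried else rel_obj
    return perceived, carried
```

With the gripper far from the object the function returns the object's relative position and `False`; once the gripper is within the threshold it returns the goal's relative position and `True`.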
Both the relative positions and the movement vector are 3D numeric values that can be translated into the essential activated dimensions format of Synthetic Cognition using a voxel encoder. The voxel encoder discretizes a 3D space (bounded by minimum and maximum x, y and z values for positions or vectors) by performing a series of divisions along each axis, so that any real-valued 3D point can be mapped to one of the resulting cuboids or voxels. An additional overlap feature allows the model to infer positions (or vectors) located in voxels that are close in space, since they are mapped to common essential activated dimensions.
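To make the idea of a voxel encoder with overlap concrete, here is a small sketch under stated assumptions. The source does not specify how the overlap is computed or how essential activated dimensions are represented, so this example assumes that a point activates its own voxel plus every voxel within `overlap` cells of it, and represents the result as a set of flat voxel indices. The class name, parameters and the equal number of divisions per axis are all hypothetical.

```python
from itertools import product
from typing import FrozenSet, Sequence, Tuple


class VoxelEncoder:
    """Illustrative voxel encoder; not the actual Synthetic Cognition encoder."""

    def __init__(self, lower: Sequence[float], upper: Sequence[float],
                 divisions: int, overlap: int = 0) -> None:
        self.lower = tuple(lower)    # minimum x, y, z of the bounded space
        self.upper = tuple(upper)    # maximum x, y, z of the bounded space
        self.divisions = divisions   # number of cells along each axis
        self.overlap = overlap       # neighbourhood radius, in voxels (assumption)

    def voxel_index(self, point: Sequence[float]) -> Tuple[int, ...]:
        """Map a real-valued 3D point to the (ix, iy, iz) voxel containing it."""
        idx = []
        for p, lo, hi in zip(point, self.lower, self.upper):
            i = int((p - lo) / (hi - lo) * self.divisions)
            # Clamp points on or outside the boundary to the nearest voxel.
            idx.append(min(max(i, 0), self.divisions - 1))
        return tuple(idx)

    def encode(self, point: Sequence[float]) -> FrozenSet[int]:
        """Return the set of activated dimensions (flat voxel ids) for a point."""
        cx, cy, cz = self.voxel_index(point)
        n = self.divisions
        offsets = range(-self.overlap, self.overlap + 1)
        active = set()
        for dx, dy, dz in product(offsets, repeat=3):
            x, y, z = cx + dx, cy + dy, cz + dz
            if 0 <= x < n and 0 <= y < n and 0 <= z < n:
                active.add((x * n + y) * n + z)  # flatten the 3D index
        return frozenset(active)
```

With `VoxelEncoder((-1, -1, -1), (1, 1, 1), divisions=10, overlap=1)`, the points `(0, 0, 0)` and `(0.25, 0, 0)` fall in adjacent voxels and share 18 of their 27 activated dimensions, whereas `(0.9, 0.9, 0.9)` shares none with `(0, 0, 0)`. With `overlap=0`, adjacent voxels share nothing, which is the case the overlap feature is meant to avoid.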
Multiple alternatives existed for the training phase, including recording the experiences of a human controlling the robot simulator. In the end, to facilitate prototyping, an automated guiding algorithm was developed to solve the task: it first moves the gripper above the object (with a small vertical margin), then descends while closing the end-effectors (which are initially open) to grab the object, and finally moves the gripper (now holding the object) to the target goal. The expert algorithm was run for 200 episodes, which were sufficient to record, as a training dataset, roughly 19,000 time steps of perceptions and their associated micro-movements performed by the arm.
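The three phases of the guiding algorithm can be sketched as a simple rule-based policy. This is only an illustration of the description above, not the expert actually used: the constants `MARGIN`, `ALIGN_TOL` and `GRASP_TOL` are hypothetical, the phase is inferred from positions alone, and the sign convention for the gripper command (positive opens, negative closes) is an assumption.

```python
import math
from typing import Sequence, Tuple

# Hypothetical tuning constants, in the same units as the positions.
MARGIN = 0.05      # vertical clearance kept above the object in phase 1
ALIGN_TOL = 0.01   # horizontal distance at which the gripper counts as "on top"
GRASP_TOL = 0.01   # distance at which the object counts as grasped


def expert_action(
    gripper: Sequence[float],
    obj: Sequence[float],
    goal: Sequence[float],
) -> Tuple[Tuple[float, float, float], float]:
    """Return (movement vector, gripper command) for the current time step.

    Gripper command: +1.0 opens, -1.0 closes (assumed convention).
    """
    rel = [o - g for o, g in zip(obj, gripper)]
    distance = math.sqrt(sum(c * c for c in rel))

    if distance < GRASP_TOL:
        # Phase 3: the object is held, so carry it towards the goal.
        move = [t - g for t, g in zip(goal, gripper)]
        grip = -1.0
    elif math.hypot(rel[0], rel[1]) < ALIGN_TOL:
        # Phase 2: aligned above the object, so descend while closing.
        move = rel
        grip = -1.0
    else:
        # Phase 1: move to a point slightly above the object, gripper open.
        move = [rel[0], rel[1], rel[2] + MARGIN]
        grip = 1.0
    return (move[0], move[1], move[2]), grip
```

Calling this function at every time step and recording each perception together with the returned action would yield a dataset of the kind described above. In practice the movement vector would likely also need scaling or clipping to the simulator's action range, which is omitted here.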
Thanks to the guided training period, Synthetic Cognition learned to perform suitable actions from perceptions and generalized to unseen configurations. In the testing phase, Synthetic Cognition solves the task on its own (within the time limit) with a success rate of about 75% after training on just 200 episodes, a figure expected to grow beyond 90% when longer training periods are provided. Even in episodes where the object is accidentally dropped, the agent is capable of going back to pick it up again, even though such behaviour was never explicitly coded.
Note that, at the time this article was written, Synthetic Cognition was using only low-cognition capabilities, while this task has a clear episodic component in which multiple sequential phases are required to complete the objective. In this case, the challenge was solved using a context perception mimicking a proprio-cognition sensor, but once the high-cognition features of the model are available, we expect to solve this type of episodic problem in an even more effective and efficient manner.