Southern University of Science and Technology's Black Tech: Erase People from Video with One Click, a Savior for Special Effects Artists
This video segmentation model from Southern University of Science and Technology can track anything in a video.
It can not only "watch" but also "cut": removing a person from a video is easy for it.
As for operation, all you need is a few mouse clicks.
Special effects artists reacted to the news as if they had found a savior, with one saying bluntly that this product will change the rules of the game in the CGI industry.
The model is called TAM (Track Anything Model). Does the name remind you of SAM, Meta's image segmentation model?
It should: TAM extends SAM to video and unlocks a new skill, dynamic object tracking.
Video segmentation models are not a new technology, but traditional segmentation models do little to reduce human effort.
The training data for these models must be annotated by hand, and the models even need to be initialized with a mask of the specific object before use.
The emergence of SAM provides a prerequisite for solving this problem: at least the initialization data no longer has to be produced manually.
Of course, TAM does not simply run SAM frame by frame and stack the results. It also has to build the spatiotemporal relationship between frames.
The team integrated SAM with a memory module called XMem.
SAM only needs to generate the initial mask on the first frame; XMem then guides the subsequent tracking.
TAM can track many targets at once, as in this demo on the painting Along the River During the Qingming Festival:
Even a scene change does not affect TAM's performance:
We tried it out and found that TAM has an interactive user interface that is simple and friendly to use.
In terms of raw capability, TAM's tracking is indeed good:
However, the removal function still needs to be more accurate in some details.
As mentioned above, TAM is built on SAM and adds memory capabilities to establish spatiotemporal associations.
Specifically, the first step is to initialize the model using SAM's static image segmentation capabilities.
With just one click, SAM generates the initial mask of the target object, replacing the complex initialization process of traditional segmentation models.
The initial mask is then handed to XMem, which carries out semi-supervised segmentation on the following frames, greatly reducing human workload.
During this process, some manually produced predictions are compared with XMem's output.
In practice, the longer the video runs, the harder it becomes for XMem to produce accurate segmentation results.
When the results deviate too far from expectations, a re-segmentation step is triggered. This step is again handled by SAM.
After SAM's refinement, most of the output is relatively accurate, but some frames still require manual adjustment.
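The control flow of the steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not TAM's actual code: `track_with_refinement`, `propagate`, `refine`, the numeric quality score, and the `quality_threshold` value are all hypothetical names and assumptions introduced here. In TAM, the role of `propagate` would be played by XMem and the role of `refine` by SAM.

```python
from typing import Callable, List, Sequence, Tuple

Mask = List[List[int]]  # binary mask as nested lists, 1 = object pixel


def track_with_refinement(
    frames: Sequence[object],
    init_mask: Mask,
    propagate: Callable[[object, Mask], Tuple[Mask, float]],
    refine: Callable[[object, Mask], Mask],
    quality_threshold: float = 0.5,  # assumed value, for illustration only
) -> Tuple[List[Mask], List[int]]:
    """Propagate a first-frame mask through a clip, refining weak frames.

    propagate(frame, previous_mask) -> (mask, quality score in [0, 1])
        stands in for the memory-based tracker.
    refine(frame, coarse_mask) -> mask
        stands in for the re-segmentation step.
    Returns the per-frame masks and the indices of the refined frames.
    """
    masks = [init_mask]
    refined = []
    for index in range(1, len(frames)):
        mask, score = propagate(frames[index], masks[-1])
        if score < quality_threshold:
            # Tracker output looks unreliable: re-segment this frame.
            mask = refine(frames[index], mask)
            refined.append(index)
        masks.append(mask)
    return masks, refined


# Toy stand-ins so the sketch runs without any model weights.
def toy_propagate(frame, previous_mask):
    # Copy the previous mask forward; the quality score is stored on the frame.
    return [row[:] for row in previous_mask], frame["quality"]


def toy_refine(frame, coarse_mask):
    # Pretend re-segmentation recovers the correct mask.
    return frame["true_mask"]


frames = [
    {"quality": 1.0, "true_mask": [[1, 0]]},
    {"quality": 0.9, "true_mask": [[1, 0]]},
    {"quality": 0.2, "true_mask": [[0, 1]]},  # object moved, tracker drifts
    {"quality": 0.8, "true_mask": [[0, 1]]},
]
masks, refined = track_with_refinement(frames, [[1, 0]], toy_propagate, toy_refine)
print("refined frames:", refined)
```

In this toy run only the third frame falls below the threshold, so it is the only one sent to re-segmentation, and the corrected mask is what gets propagated to the last frame. The final manual-adjustment step is left out of the sketch.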
That is roughly how TAM's pipeline works. The object removal skill mentioned at the beginning comes from combining TAM with E2FGVI.
E2FGVI is itself a tool for removing elements from video. With the support of TAM's precise segmentation, its work becomes more targeted.
To test TAM, the team evaluated it on the DAVIS-16 and DAVIS-17 datasets.
Subjectively the results look very good, and the numbers bear this out.
Although TAM does not require a manually provided mask, its scores on the two metrics J (region similarity) and F (boundary accuracy) are very close to those of models that are initialized by hand.
On DAVIS-17 it even performs slightly better than STM.
Among methods that use other kinds of initialization, SiamMask falls well short of TAM.
Another method, MiVOS, does perform better than TAM, but it goes through 8 rounds of interaction to get there.
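For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of what J and F measure. J is the intersection over union of the predicted and ground-truth masks. The F shown here is a simplification and an assumption of this sketch: it takes the harmonic mean of boundary precision and recall using exact pixel matches, whereas the official DAVIS evaluation matches boundaries with a distance tolerance. The function names are hypothetical.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]  # binary mask, 1 = object pixel


def region_similarity(pred: Mask, truth: Mask) -> float:
    """J: intersection over union of two binary masks of equal size."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else intersection / union


def boundary_pixels(mask: Mask) -> Set[Tuple[int, int]]:
    """Object pixels with a 4-neighbour that is background or off-image."""
    height, width = len(mask), len(mask[0])
    edge = set()
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                outside = not (0 <= ny < height and 0 <= nx < width)
                if outside or not mask[ny][nx]:
                    edge.add((y, x))
                    break
    return edge


def boundary_f_measure(pred: Mask, truth: Mask) -> float:
    """Simplified F: boundary precision/recall with exact pixel matching."""
    pred_edge = boundary_pixels(pred)
    truth_edge = boundary_pixels(truth)
    if not pred_edge and not truth_edge:
        return 1.0
    if not pred_edge or not truth_edge:
        return 0.0
    matched = len(pred_edge & truth_edge)
    precision = matched / len(pred_edge)
    recall = matched / len(truth_edge)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],  # prediction spills one column to the right
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
print(f"J = {region_similarity(pred, truth):.3f}")
print(f"F = {boundary_f_measure(pred, truth):.3f}")
```

In this toy case the prediction covers the whole object plus two extra pixels, so J is 4/6 (about 0.667), and the simplified F is 0.8 because all four true boundary pixels are found but two of the six predicted boundary pixels are spurious.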
TAM is from the Visual Intelligence and Perception (VIP) Laboratory of Southern University of Science and Technology.
The laboratory's research directions include text-image-sound multimodal learning, multimodal perception, reinforcement learning and visual defect detection.
To date, the team has published more than 30 papers and obtained 5 patents.
The team is led by Associate Professor Zheng Feng of Southern University of Science and Technology. He received his doctorate from the University of Sheffield in the UK and has worked at the Institute of Advanced Studies of the Chinese Academy of Sciences, Tencent Youtu and other institutions. He joined Southern University of Science and Technology in 2018 and was later promoted to Associate Professor.
Paper address:
https://arxiv.org/abs/2304.11968
GitHub page:
https://github.com/gaomingqi/Track-Anything
Reference link:
https://twitter.com/bilawalsidhu/status/1650710123399233536?s=20