{"id":52320,"date":"2023-06-16T16:44:25","date_gmt":"2023-06-16T20:44:25","guid":{"rendered":"https:\/\/coinscreed.com\/staging\/?p=52320"},"modified":"2023-06-16T16:44:28","modified_gmt":"2023-06-16T20:44:28","slug":"metas-voicebox-generalizes-text-to-speech","status":"publish","type":"post","link":"https:\/\/coinscreed.com\/staging\/metas-voicebox-generalizes-text-to-speech\/","title":{"rendered":"Meta\u2019s \u2018Voicebox\u2019 Generalizes Text-to-Speech"},"content":{"rendered":"\n<p>Meta AI Voicebox is a text-to-speech (TTS) tool that generates results up to 20 times faster than comparable state-of-the-art <a href=\"https:\/\/coinscreed.com\/staging\/mistral-ai-raises-113m-in-seed-funding.html\" target=\"_blank\" rel=\"noreferrer noopener\">artificial intelligence models<\/a>.\u00a0<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img fetchpriority=\"high\" decoding=\"async\" src=\"https:\/\/coinscreed.com\/staging\/wp-content\/uploads\/2023\/03\/image-31.png\" alt=\"Meta\u2019s \u2018Voicebox\u2019 Generalizes Text-to-Speech\" class=\"wp-image-46459\" width=\"794\" height=\"460\" srcset=\"https:\/\/coinscreed.com\/staging\/wp-content\/uploads\/2023\/03\/image-31.png 791w, https:\/\/coinscreed.com\/staging\/wp-content\/uploads\/2023\/03\/image-31-300x174.png 300w, https:\/\/coinscreed.com\/staging\/wp-content\/uploads\/2023\/03\/image-31-768x445.png 768w, https:\/\/coinscreed.com\/staging\/wp-content\/uploads\/2023\/03\/image-31-150x87.png 150w, https:\/\/coinscreed.com\/staging\/wp-content\/uploads\/2023\/03\/image-31-750x434.png 750w\" sizes=\"(max-width: 794px) 100vw, 794px\" \/><figcaption class=\"wp-element-caption\">Meta\u2019s \u2018Voicebox\u2019 Generalizes Text-to-Speech<\/figcaption><\/figure>\n\n\n\n<p>Voicebox eschews traditional TTS architecture in favor of a model analogous to OpenAI's ChatGPT or Google's Bard.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube 
wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe title=\"Meet Voicebox \u2013 the first ever generative AI speech model #voicebox #ai #meta\" width=\"800\" height=\"450\" src=\"https:\/\/www.youtube.com\/embed\/vjqK031bgQQ?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>Meta's offering can generalize through in-context learning, one of the primary distinctions between Voicebox and similar TTS models, such as ElevenLabs Prime Voice AI.<\/p>\n\n\n\n<p>Like ChatGPT and other transformer models, Voicebox employs massive training datasets. Previous attempts to use vast amounts of audio data produced severely degraded audio outputs. As a result, most TTS systems use limited, highly curated, labeled datasets.<\/p>\n\n\n\n<p>Meta's training scheme surmounts this limitation by eschewing labels and curation in favor of an architecture capable of &#8220;in-filling&#8221; audio data.<\/p>\n\n\n\n<p>Voicebox is the &#8220;first model that can generalize to speech-generation tasks it was not specifically trained to perform with state-of-the-art performance,&#8221; according to a blog post published by <a href=\"https:\/\/coinscreed.com\/staging\/senators-question-metas-ai-model-llama-over-ethical-risks.html\" target=\"_blank\" rel=\"noreferrer noopener\">Meta AI<\/a> on June 16.<\/p>\n\n\n\n<p>This generalization enables Voicebox to convert text to speech, eliminate unwanted background noise by synthesizing substitute speech, and apply a speaker's voice to output in other languages.<\/p>\n\n\n\n<p>According to a research paper published by Meta, its pre-trained Voicebox system can perform all these tasks using only the desired output text and a three-second audio sample.<\/p>\n\n\n\n<p>The arrival of robust speech generation comes 
at a particularly sensitive moment, as social media companies continue to struggle with moderation, and the upcoming presidential election in the United States threatens to test the limits of online misinformation detection once again.<\/p>\n\n\n\n<p>For instance, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Donald_Trump#:~:text=Donald%20John%20Trump%20(born%20June,States%20from%202017%20to%202021.\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">former U.S. President Donald Trump<\/a> is accused of mishandling 
sensitive government documents after leaving office. The evidence against him includes audio recordings in which he allegedly admitted to potential misconduct.<\/p>\n\n\n\n<p>While there are currently no indications that the former president will dispute the contents of the audio recordings, his case demonstrates that data integrity is fundamental to the U.S. legal system and, by extension, democracy.<\/p>\n\n\n\n<p>Voicebox is not the first tool of its kind, but it is one of the most powerful. As a result, Meta has developed a tool that, according to the company, can &#8220;trivially detect&#8221; the difference between real and fake audio. According to the blog post:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><em>\u201cAs with other powerful new AI innovations, we recognize that this technology brings the potential for misuse and unintended harm. In our paper, we detail how we built a highly effective classifier that can distinguish between authentic speech and audio generated with Voicebox to mitigate these possible future risks.\u201d<\/em><\/p>\n<\/blockquote>\n\n\n\n<p>In the cryptocurrency industry, AI has become as indispensable to daily operations as the internet and electricity. 
The largest exchanges rely on artificial intelligence chatbots for customer interactions and sentiment analysis, and trading bots have become widespread.<\/p>\n\n\n\n<p>The advent of robust text-to-speech systems such as Voicebox, combined with <a href=\"https:\/\/coinscreed.com\/staging\/how-ai-and-machine-learning-are-revolutionizing-trading-strategies-in-the-cryptocurrency-market.html\" target=\"_blank\" rel=\"noreferrer noopener\">automated trading<\/a>, could assist cryptocurrency traders whose current TTS systems struggle with crypto jargon or lack multilingual support.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Meta AI Voicebox is a text-to-speech (TTS) tool that generates results up to 20 times faster than comparable state-of-the-art artificial intelligence models.\u00a0 Voicebox eschews traditional TTS architecture in favor of a model analogous to OpenAI&#8217;s ChatGPT or Google&#8217;s Bard. Meta&#8217;s offering can generalize through in-context learning, one of the primary distinctions between Voicebox and similar 
[&hellip;]<\/p>\n","protected":false},"author":12,"featured_media":46459,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[9],"tags":[8149,14898,14900,14899],"class_list":["post-52320","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech","tag-meta-2","tag-text-to-speech","tag-tts","tag-voicebox"],"jetpack_featured_media_url":"https:\/\/coinscreed.com\/staging\/wp-content\/uploads\/2023\/03\/image-31.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/coinscreed.com\/staging\/wp-json\/wp\/v2\/posts\/52320","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/coinscreed.com\/staging\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/coinscreed.com\/staging\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/coinscreed.com\/staging\/wp-json\/wp\/v2\/users\/12"}],"replies":[{"embeddable":true,"href":"https:\/\/coinscreed.com\/staging\/wp-json\/wp\/v2\/comments?post=52320"}],"version-history":[{"count":0,"href":"https:\/\/coinscreed.com\/staging\/wp-json\/wp\/v2\/posts\/52320\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/coinscreed.com\/staging\/wp-json\/wp\/v2\/media\/46459"}],"wp:attachment":[{"href":"https:\/\/coinscreed.com\/staging\/wp-json\/wp\/v2\/media?parent=52320"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/coinscreed.com\/staging\/wp-json\/wp\/v2\/categories?post=52320"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/coinscreed.com\/staging\/wp-json\/wp\/v2\/tags?post=52320"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}