Bing Speech to Text API - Communication via WebSocket in C#

Question

I am trying to get the Bing Speech API to work in C# via WebSockets. I looked at the JavaScript implementation here and followed the protocol instructions here, but I have hit a complete brick wall. I cannot use the existing C# service because I am running in a Linux container, so I need an implementation that works on .NET Core. Annoyingly, the existing service is closed-source!

I can successfully connect to the web socket, but I cannot get the server to respond to my connection. I expect to receive a turn.start text message from the server, but I get booted off the server as soon as I have sent a few bytes of the audio file. I know the audio file is in the correct format because I took it directly from the C# service sample here.

I feel like I've run out of options here. The only thing I can think of now is that I am not sending the audio chunks correctly. Currently I'm just sending the audio file in sequential 4096-byte chunks. I know that the first audio message contains the RIFF header, which is only 36 bytes, and I just send it along with the next (4096-36) bytes.
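The framing of a single audio message can be sketched on its own. This is a minimal sketch, assuming the layout used by the code in this question: a 2-byte big-endian header length, then the ASCII headers, then the raw audio bytes. The class and method names (`AudioFraming`, `BuildAudioMessage`) are hypothetical and belong to no SDK.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AudioFraming
{
    // Hypothetical helper: frames `count` bytes of `audio`, starting at `offset`,
    // as one binary "audio" message. The header lines mirror those in the
    // question's code; the exact set the service requires is an assumption.
    public static byte[] BuildAudioMessage(string requestId, byte[] audio, int offset, int count)
    {
        string timestamp = DateTime.UtcNow.ToString(
            "yyyy-MM-ddTHH:mm:ss.fffK", CultureInfo.InvariantCulture);

        string headers =
            "path:audio\r\n" +
            $"x-requestid:{requestId}\r\n" +
            $"x-timestamp:{timestamp}\r\n" +
            "content-type:audio/x-wav";
        byte[] headerBytes = Encoding.ASCII.GetBytes(headers);

        var message = new byte[2 + headerBytes.Length + count];

        // 2-byte header length, most significant byte first (big-endian).
        message[0] = (byte)(headerBytes.Length >> 8);
        message[1] = (byte)(headerBytes.Length & 0xFF);

        // Headers, then the audio payload.
        Buffer.BlockCopy(headerBytes, 0, message, 2, headerBytes.Length);
        Buffer.BlockCopy(audio, offset, message, 2 + headerBytes.Length, count);
        return message;
    }
}
```

When reading chunks from a file, `Stream.Read` returns the number of bytes actually read; passing that value as `count` keeps the last chunk from being padded with zeros.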

Here is my full code. You should be able to run it as a .NET Core or .NET console application; you will need an audio file and an API key.

    using Newtonsoft.Json;
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Net.Http;
    using System.Net.WebSockets;
    using System.Text;
    using System.Threading;
    using System.Threading.Tasks;

    namespace ConsoleApp3
    {
        class Program
        {
            static void Main(string[] args)
            {
                Task.Run(async () =>
                {
                    var bingService = new BingSpeechToTextService();
                    var audioFilePath = @"FILEPATH GOES HERE";
                    var authenticationKey = @"BING AUTHENTICATION KEY GOES HERE";
                    await bingService.RegisterJob(audioFilePath, authenticationKey);
                }).Wait();
            }
        }

        public class BingSpeechToTextService
        {
            /* #region Private Static Methods */
            private static async Task Receiving(ClientWebSocket client)
            {
                var buffer = new byte[128];
                while (true)
                {
                    var result = await client.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
                    var res = Encoding.UTF8.GetString(buffer, 0, result.Count);
                    if (result.MessageType == WebSocketMessageType.Text)
                    {
                        Console.WriteLine(Encoding.UTF8.GetString(buffer, 0, result.Count));
                    }
                    else if (result.MessageType == WebSocketMessageType.Close)
                    {
                        Console.WriteLine($"Closing ... reason {client.CloseStatusDescription}");
                        var description = client.CloseStatusDescription;
                        //await client.CloseOutputAsync(WebSocketCloseStatus.NormalClosure, "", CancellationToken.None);
                        break;
                    }
                    else
                    {
                        Console.WriteLine("Other result");
                    }
                }
            }
            /* #endregion Private Static Methods */

            /* #region Public Static Methods */
            public static UInt16 ReverseBytes(UInt16 value)
            {
                return (UInt16)((value & 0xFFU) << 8 | (value & 0xFF00U) >> 8);
            }
            /* #endregion Public Static Methods */

            /* #region Interface: 'Unscrypt.Bing.SpeechToText.Client.Api.IBingSpeechToTextJobService' Methods */
            public async Task<int?> RegisterJob(string audioFilePath, string authenticationKeyStr)
            {
                var authenticationKey = new BingSocketAuthentication(authenticationKeyStr);
                var token = authenticationKey.GetAccessToken();

                /* #region Connect web socket */
                var cws = new ClientWebSocket();
                var connectionId = Guid.NewGuid().ToString("N");
                var lang = "en-US";
                cws.Options.SetRequestHeader("X-ConnectionId", connectionId);
                cws.Options.SetRequestHeader("Authorization", "Bearer " + token);
                Console.WriteLine("Connecting to web socket.");
                var url = $"wss://speech.platform.bing.com/speech/recognition/interactive/cognitiveservices/v1?format=simple&language={lang}";
                await cws.ConnectAsync(new Uri(url), new CancellationToken());
                Console.WriteLine("Connected.");
                /* #endregion*/

                /* #region Receiving */
                var receiving = Receiving(cws);
                /* #endregion*/

                /* #region Sending */
                var sending = Task.Run(async () =>
                {
                    /* #region Send speech.config */
                    dynamic speechConfig = new
                    {
                        context = new
                        {
                            system = new { version = "1.0.00000" },
                            os = new
                            {
                                platform = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
                                name = "Browser",
                                version = ""
                            },
                            device = new { manufacturer = "SpeechSample", model = "SpeechSample", version = "1.0.00000" }
                        }
                    };
                    var requestId = Guid.NewGuid().ToString("N");
                    var speechConfigJson = JsonConvert.SerializeObject(speechConfig, Formatting.None);
                    StringBuilder outputBuilder = new StringBuilder();
                    outputBuilder.Append("path:speech.config\r\n"); //Should this be \r\n
                    outputBuilder.Append($"x-timestamp:{DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffK")}\r\n");
                    outputBuilder.Append($"content-type:application/json\r\n");
                    outputBuilder.Append("\r\n\r\n");
                    outputBuilder.Append(speechConfigJson);
                    var strh = outputBuilder.ToString();
                    var encoded = Encoding.UTF8.GetBytes(outputBuilder.ToString());
                    var buffer = new ArraySegment<byte>(encoded, 0, encoded.Length);
                    if (cws.State != WebSocketState.Open) return;
                    Console.WriteLine("Sending speech.config");
                    await cws.SendAsync(buffer, WebSocketMessageType.Text, true, new CancellationToken());
                    Console.WriteLine("Sent.");
                    /* #endregion*/

                    /* #region Send audio parts. */
                    var fileInfo = new FileInfo(audioFilePath);
                    var streamReader = fileInfo.OpenRead();
                    for (int cursor = 0; cursor < fileInfo.Length; cursor++)
                    {
                        outputBuilder.Clear();
                        outputBuilder.Append("path:audio\r\n");
                        outputBuilder.Append($"x-requestid:{requestId}\r\n");
                        outputBuilder.Append($"x-timestamp:{DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffK")}\r\n");
                        outputBuilder.Append($"content-type:audio/x-wav");
                        var headerBytes = Encoding.ASCII.GetBytes(outputBuilder.ToString());
                        var headerbuffer = new ArraySegment<byte>(headerBytes, 0, headerBytes.Length);
                        var str = "0x" + (headerBytes.Length).ToString("X");
                        var headerHeadBytes = BitConverter.GetBytes((UInt16)headerBytes.Length);
                        var isBigEndian = !BitConverter.IsLittleEndian;
                        var headerHead = !isBigEndian
                            ? new byte[] { headerHeadBytes[1], headerHeadBytes[0] }
                            : new byte[] { headerHeadBytes[0], headerHeadBytes[1] };

                        //Audio should be pcm 16kHz, 16bps mono
                        var byteLen = 8192 - headerBytes.Length - 2;
                        var fbuff = new byte[byteLen];
                        streamReader.Read(fbuff, 0, byteLen);
                        var arr = headerHead.Concat(headerBytes).Concat(fbuff).ToArray();
                        var arrSeg = new ArraySegment<byte>(arr, 0, arr.Length);
                        Console.WriteLine($"Sending data from {cursor}");
                        if (cws.State != WebSocketState.Open) return;
                        cursor += byteLen;
                        var end = cursor >= fileInfo.Length;
                        await cws.SendAsync(arrSeg, WebSocketMessageType.Binary, true, new CancellationToken());
                        Console.WriteLine("Data sent");
                        var dt = Encoding.ASCII.GetString(arr);
                    }
                    await cws.SendAsync(new ArraySegment<byte>(), WebSocketMessageType.Binary, true, new CancellationToken());
                    streamReader.Dispose();
                    /* #endregion*/

                    {
                        var startWait = DateTime.UtcNow;
                        while ((DateTime.UtcNow - startWait).TotalSeconds < 30)
                        {
                            await Task.Delay(1);
                        }
                        if (cws.State != WebSocketState.Open) return;
                    }
                });
                /* #endregion*/

                /* #region Wait for tasks to complete */
                await Task.WhenAll(sending, receiving);
                if (sending.IsFaulted)
                {
                    var err = sending.Exception;
                    throw err;
                }
                if (receiving.IsFaulted)
                {
                    var err = receiving.Exception;
                    throw err;
                }
                /* #endregion*/

                return null;
            }
            /* #endregion Interface: 'Unscrypt.Bing.SpeechToText.Client.Api.IBingSpeechToTextJobService' Methods */

            public class BingSocketAuthentication
            {
                public static readonly string FetchTokenUri = "https://api.cognitive.microsoft.com/sts/v1.0";
                private string subscriptionKey;
                private string token;
                private Timer accessTokenRenewer;

                //Access token expires every 10 minutes. Renew it every 9 minutes.
                private const int RefreshTokenDuration = 9;

                public BingSocketAuthentication(string subscriptionKey)
                {
                    this.subscriptionKey = subscriptionKey;
                    this.token = FetchToken(FetchTokenUri, subscriptionKey).Result;

                    // renew the token on set duration.
                    accessTokenRenewer = new Timer(new TimerCallback(OnTokenExpiredCallback), this, TimeSpan.FromMinutes(RefreshTokenDuration), TimeSpan.FromMilliseconds(-1));
                }

                public string GetAccessToken()
                {
                    return this.token;
                }

                private void RenewAccessToken()
                {
                    this.token = FetchToken(FetchTokenUri, this.subscriptionKey).Result;
                    Console.WriteLine("Renewed token.");
                }

                private void OnTokenExpiredCallback(object stateInfo)
                {
                    try
                    {
                        RenewAccessToken();
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine(string.Format("Failed renewing access token. Details: {0}", ex.Message));
                    }
                    finally
                    {
                        try
                        {
                            accessTokenRenewer.Change(TimeSpan.FromMinutes(RefreshTokenDuration), TimeSpan.FromMilliseconds(-1));
                        }
                        catch (Exception ex)
                        {
                            Console.WriteLine(string.Format("Failed to reschedule the timer to renew access token. Details: {0}", ex.Message));
                        }
                    }
                }

                private async Task<string> FetchToken(string fetchUri, string subscriptionKey)
                {
                    using (var client = new HttpClient())
                    {
                        client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
                        UriBuilder uriBuilder = new UriBuilder(fetchUri);
                        uriBuilder.Path += "/issueToken";
                        var result = await client.PostAsync(uriBuilder.Uri.AbsoluteUri, null);
                        Console.WriteLine("Token Uri: {0}", uriBuilder.Uri.AbsoluteUri);
                        return await result.Content.ReadAsStringAsync();
                    }
                }
            }
        }
    }

c# microsoft-cognitive

Stephen ellis Aug 3 '17 at 19:25

1 answer

Stephen ellis · Accepted Answer · 2017-08-05T08:57:08+0000

I knew it would turn out to be something simple.

After several frustrating hours of coding, I found the problem: I forgot to send a request ID along with the speech.config call.
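As an illustration of that fix, here is a minimal sketch of the speech.config text message with the request ID header added. It reuses the header lines and separator from the code in the question and adds only the `x-requestid` line. The class and method names (`SpeechConfigMessage`, `Build`) are hypothetical, and the exact header set the service expects is an assumption based on this thread.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class SpeechConfigMessage
{
    // Hypothetical helper: builds the text payload for the speech.config message.
    // `requestId` is expected in "N" format (GUID without dashes), as generated
    // in the question's code.
    public static string Build(string requestId, string configJson, DateTime utcNow)
    {
        string timestamp = utcNow.ToString(
            "yyyy-MM-ddTHH:mm:ss.fffK", CultureInfo.InvariantCulture);

        var sb = new StringBuilder();
        sb.Append("path:speech.config\r\n");
        sb.Append($"x-requestid:{requestId}\r\n"); // the header missing from the question's code
        sb.Append($"x-timestamp:{timestamp}\r\n");
        sb.Append("content-type:application/json\r\n");
        sb.Append("\r\n\r\n"); // separator kept exactly as in the question's code
        sb.Append(configJson);
        return sb.ToString();
    }
}
```

In the question's code, `requestId` is already generated before the message is built, so the result of `SpeechConfigMessage.Build(requestId, speechConfigJson, DateTime.UtcNow)` could be UTF-8 encoded and sent as a text frame in place of the original `outputBuilder` output.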
