How do you quickly turn texts into labels for your machine learning?
Your manager and clients want you to use machine learning to predict an outcome. You know how to tackle the structured data you have, but what do you do with the column of text data (e.g., customer comments, technician notes)? What if the outcome itself needs to be coded based on the text? You need to code documents into categories quickly and accurately, but you don’t have the time or resources to read and code thousands of documents.
Thresher will help you quickly code documents by suggesting keywords to build accurate and precise queries to classify documents.
A data scientist used Thresher’s ‘Quick Code’ mode to create a query in less than 15 minutes that classified more than 5,500 SMS messages as spam or not spam. This classifier had 95% accuracy relative to human coders. Furthermore, the messages classified by the Thresher-built query were used to train a machine learning model to predict future spam texts. The model performed similarly to a model trained on human-coded labels. The ‘Quick Code’ approach was accurate and fast. The data scientist also found it valuable because:
1) They could use the query to easily explain to their managers what words and numbers were commonly found in spam text.
2) When spammers changed their patterns, the data scientist iterated again with Thresher’s ‘Quick Code’ to update the query and prediction models, which helped the team stay abreast of the changes in spam tactics.
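To make the idea of a keyword query scored against human coders concrete, here is a minimal sketch in Python. It is not Thresher's implementation or API: the query terms, the sample messages, and the helper names (tokenize, is_spam, accuracy) are all made up for illustration, and a real query would likely be richer than a flat list of terms.

```python
import re

# Hypothetical query: any one of these terms marks a message as spam.
SPAM_TERMS = {"free", "winner", "prize", "claim", "txt"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def is_spam(text, terms=SPAM_TERMS):
    """Return True if the message contains at least one query term."""
    return any(token in terms for token in tokenize(text))

def accuracy(messages, human_labels, terms=SPAM_TERMS):
    """Fraction of messages where the query agrees with the human coder."""
    agree = sum(is_spam(m, terms) == label
                for m, label in zip(messages, human_labels))
    return agree / len(messages)

# Made-up messages and human labels (True = spam), for illustration only.
messages = [
    "WINNER! Claim your free prize now",
    "Are we still on for lunch tomorrow?",
    "Txt STOP to 80082 to end free alerts",
    "Call me when you get home",
]
human_labels = [True, False, True, False]

print(accuracy(messages, human_labels))
```

Because the query is just a readable set of terms, it can be shown to a manager as-is and edited when spam patterns shift, which is the property the two points above describe.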
How do you conduct accurate and meaningful analyses when the stakes are high and events are moving fast?
As an analyst, you wrestle with an ever-growing array of text in multiple languages. From social media to technical documents, you need to collect the right data to conduct analysis that will give you the right answers. You can’t afford to miss out because someone used an unusual turn of phrase, slang or a codeword. You need to make your analyses easy to understand and transparent to your managers.
Thresher will help you quickly and transparently collect the right data so that you can produce high-quality products in a timely manner.
Researchers at Harvard used Thresher to find the codewords that Chinese netizens use to evade Chinese government censorship. The Chinese government heavily censored online conversations about Bo Xilai, a Chinese politician whose career ended in scandal. Online authors generated dozens of codewords to refer to various players in the political drama. Thresher discovered several of these otherwise unknown words, such as ‘bmelon’, a covert reference to Bo’s son Bo Guagua (gua = melon). With these words, researchers were better able to understand the conversation around this topic.
How do you find all the right documents for a case when people have an incentive to hide and obfuscate?
You have tremendous expertise and context to bring to bear on a case. Even with predictive analytics, you like to use keywords to find potentially relevant documents because keywords are:
Transparent - you can understand (and explain to the partner and client) how you got the documents.
Flexible - you can make changes as the case strategy evolves.
Precise - you can target exactly what you are looking for to reduce review time and effort.
But when you try to collect all documents relevant to a case, people can thwart your efforts if the language they used in their emails, social media accounts or memos was novel or intentionally evasive, leaving important documents undiscovered.
Thresher helps you collect the right documents for review in a manner that you can explain to the partners and change as the case strategy evolves. Please contact us for more e-Discovery information.
How do you collect good data from social media to ensure accurate research when so many authors use ever-evolving slang?
You often collect social media data to conduct studies about sentiment. With most social media sources, you need to use keywords to collect just the right set of data. You think of keywords that you hope will have enough signal. But thinking of good keywords is hard because of the cognitive biases we humans have. Indeed, the creators of our underlying technology knew this. They found that they didn’t have the keywords they needed to collect the signal they wanted.
"Computer-Assisted Keyword and Document Set Discovery from Unstructured Text"
King, Lam, Roberts. 2017
Use Thresher to cultivate signal-rich data sets so that your research produces more accurate results.
Researchers used Thresher to improve their queries by identifying words they could use to discover new content and eliminate irrelevant content. In one case, researchers were trying to find content about tobacco and marijuana use online. Thresher identified hashtags that often do not co-occur with the words ‘tobacco’ or ‘marijuana’ but still signal relevant content. Researchers also used Thresher to eliminate irrelevant content. For example, ‘pot’ is a synonym for marijuana, but the word is also associated with many food terms. Thresher helped identify words like pie, chicken, roast, crock, and belly as words that often appear with ‘pot’ but signal that the content is unrelated to the marijuana conversation.
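As a rough illustration of how inclusion and exclusion words work together in a query, here is a minimal Python sketch. It is an assumption-laden example, not Thresher's query syntax or the researchers' actual query: the term lists borrow the ‘pot’ words mentioned above, while the sample posts and the helper names (tokenize, matches) are invented.

```python
import re

# Hypothetical query terms, loosely based on the 'pot' example above.
INCLUDE = {"pot", "marijuana", "tobacco"}
EXCLUDE = {"pie", "chicken", "roast", "crock", "belly"}

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches(text, include=INCLUDE, exclude=EXCLUDE):
    """True if the text has at least one include term and no exclude term."""
    tokens = tokenize(text)
    return bool(tokens & include) and not (tokens & exclude)

# Made-up posts, for illustration only.
posts = [
    "Smoking pot should be legal",
    "Best chicken pot pie recipe ever",
    "Crock pot roast for dinner tonight",
    "New tobacco tax announced",
]

relevant = [p for p in posts if matches(p)]
print(relevant)
```

The exclusion list is what keeps the two cooking posts out of the result even though they contain ‘pot’. Adding a newly discovered hashtag to the inclusion set would widen the query in the same way.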
Government data scientists, analysts, lawyers and researchers share many of the same challenges as their private sector counterparts. But some of their needs are different. We love hard problems, and we know the government tackles some of the hardest.
So, from the beginning, we built in the infrastructure needed to support our government clients. This includes building our software on US-based AWS cloud servers to leverage their government-approved security protocols. We also offer on-premises solutions for government customers with sensitive data. We are registered with SAM, multiple employees have experience in the national security sector, and our accounting system is compatible with government contracting.
We are proud recipients of DARPA-sponsored Small Business Innovation Research contracts. Their support is an important part of our broader commitment to continuous innovation and rigorous testing of our core technologies.