hackGSU is a hackathon hosted twice a year by the GSU student chapter of the ACM and the GSU student branch of the IEEE. This semester, hackGSU will be a 500+ participant event taking place from March 31 to April 2, 2017. Free registration, free food, and cash prizes: register at https://my.mlh.io/register by March 25.
Machine Learning Methods for Finding Textual Features of Depression from Publications
Advisor: Dr. Yanqing Zhang
The sheer volume of medical publications is growing exponentially. The MEDLINE 2015 database contains over 23.5 million records, having grown by 1,000,000 new citations relative to MEDLINE 2014. At this growth rate, it is extremely challenging to keep up to date with all the new discoveries and theories, even within one’s own field of biomedical research. Therefore, we propose a machine-learning framework that analyzes publications and extracts important textual features, novel findings, and new contributions. In many ways texts are like data, but it is important to keep in mind that they are not data in the usual sense, since mathematical models cannot be applied to them directly. Even though texts follow certain formats and conventions, we still need to mold them into a shape fit for analysis. Text representation, which numerically describes unstructured text so that it becomes mathematically computable, is one of the most challenging problems in text mining and information retrieval. We propose two text representation methods and test them on document classification, clustering, and outlier detection. When downloading publications by keyword search, we find some outlier publications that contain the keyword but whose contents are not associated with it. Outlier publication removal is therefore a key step in our framework for improving the quality of information retrieval. We also develop an automatic text summarization method that not only reduces document size but also increases the dissimilarity between documents. TextRank, an extension of PageRank, is used to rank the importance of words in a document.
By integrating TextRank with Word2Vec, we develop a hybrid method that finds targeted textual features regarding a given topic such as “depression.” Next, we will further classify the extracted textual features into categories such as diagnosis methods, treatment methods, drugs/medicines, and biomarkers, using either graph analysis or named-entity recognition techniques.
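As a rough illustration of the TextRank idea mentioned in the abstract, the sketch below builds a word co-occurrence graph and runs a PageRank-style iteration over it. It is a minimal sketch under stated assumptions, not the method developed in the dissertation: the function name `textrank_keywords`, the tiny stopword list, the window size, and the damping factor are all illustrative choices, and the Word2Vec integration is not shown.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list used only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "with"}


def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_k=5):
    """Rank words by a PageRank-style score on a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Undirected graph: two words are linked if they occur within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != word:
                neighbors[word].add(words[j])
                neighbors[words[j]].add(word)

    # Each word repeatedly receives an equal share of its neighbors' scores.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

For example, calling `textrank_keywords(abstract_text, top_k=10)` on the text of a publication abstract returns its ten highest-scoring words; words that co-occur with many different terms tend to rank highest.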
Dr. Yanqing Zhang (chair)
Dr. Saeid Belkasim
Dr. Raj Sunderraman
Dr. Ruiyan Luo
Modern Big Data Pipelines with Dataflow Programming Models: The Open Source HPCC Systems Approach to Big Data Analytics
Dr. Flavio Villanustre
LexisNexis and HPCC Systems
Big data analytics can be a daunting field. The complexity of the analysis is usually compounded by the volume of data, which impairs the tractability of certain problems. The widespread explicit MapReduce programming model used by Hadoop and other big data platforms only makes things worse, burdening programmers with decomposing and translating algorithms into the basic Map, Shuffle/Distribute, and Reduce building blocks while representing data structures as simple key/value pairs. The open-source HPCC Systems platform takes a novel approach to this problem: it relies on an open dataflow programming language (ECL) equipped with all of the high-level data primitives that a programmer needs to implement high-level algorithms with little effort. ECL, a statically typed compiled language, combined with a distributed storage, workflow, and execution engine, provides a consistent and seamless environment for big data analytics. During this presentation, we’ll introduce the audience to the open-source HPCC Systems big data platform and its ECL programming language, and we’ll showcase different data analytics scenarios from the industry.
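To make the decomposition burden concrete, here is a minimal single-machine sketch of a word count written explicitly as Map, Shuffle, and Reduce steps over key/value pairs. It is an illustration only: the function names are hypothetical, nothing here uses Hadoop or HPCC Systems APIs, and a real platform would distribute each phase across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) key/value pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle_phase(pairs):
    """Shuffle/Distribute: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the grouped values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Chain the three phases by hand, as an explicit MapReduce job requires."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Even this trivial job needs three hand-written stages. A higher-level, dataflow-style formulation would state the intent ("group by word and count") as a single operation and leave the staging to the system; in plain Python the closest analogue is `collections.Counter(word for doc in documents for word in doc.lower().split())`.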
About the Speaker: Dr. Flavio Villanustre is VP Technology for LexisNexis and HPCC Systems. In this position, he is responsible for developing the open-source developer and research community for the HPCC Systems big data platform, as well as for information security and overall technology architecture. Prior to 2001, Dr. Villanustre served at several companies in a variety of roles in infrastructure, information security, and information technology. In addition, Dr. Villanustre has been involved with the open-source community for over 15 years through multiple initiatives. These include founding the first Linux user group in Buenos Aires (BALUG) in 1994, releasing several pieces of software under different open-source licenses, and evangelizing open source to different audiences through conferences, training, and education. Prior to his technology career, Dr. Villanustre was a neurosurgeon.
13th International Symposium on Bioinformatics Research and Applications
The International Symposium on Bioinformatics Research and Applications (ISBRA) provides a forum for the exchange of ideas and results among researchers, developers, and practitioners working on all aspects of bioinformatics and computational biology and their applications.