Ben Resume HTML Template

Yifei Sun

Master's Student @ University of Michigan

I am a data enthusiast pursuing a Master's degree in Information with a specialization in Big Data Analytics at the University of Michigan. With a strong academic foundation and hands-on experience as a Data Analyst Intern, I am passionate about using data to drive insights. I am actively seeking internship and full-time opportunities in Data Science and Data Analytics.

    Interests
  • Data Science
  • Machine Learning
  • Data Analytics
  • Data Visualization
    Education
  • MSc in Information Science (GPA 3.98/4.0), 2024
  • Master of Urban and Regional Planning (GPA 3.98/4.0), 2024
  •     University of Michigan
  • BSc in Civil and Environmental Engineering, 2021
  •     University of Illinois Urbana-Champaign

yifeisun0102@gmail.com

Unraveling Seattle’s Travel Patterns: A Data Science Exploration
    Big Data Analytics Capstone

    

              
  • -Employed Random Forest, Partial Least Squares Regression, and Gradient Boosting to predict residents' travel frequency from features including demographic characteristics and spatial properties.
  • -Devised a new method for clustering nodes in a flow graph that combines graph topology, Laplacian eigenmap embedding, and HodgeRank potential.
  • -Developed an interactive web application visualizing the results and deployed it on Replit.





Climatic Chronicles: Exploring Weather and Climate Data Across 430 Cities Globally with Innovative Visualizations and Decision Support
    Information Visualization

    

    
  • -Leveraged web crawling tools and API requests to gather weather and climate data for 430 cities from diverse sources.
  • -Employed R packages, including ggplot2, to craft innovative and informative visualizations of the weather and climate data.
  • -Implemented the Analytic Hierarchy Process to recommend cities based on different climate factors.
  • -Engineered a web application using Shiny in R, enabling users to search, retrieve, and visualize weather data interactively.
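
As an illustration of the Analytic Hierarchy Process step listed above, the following minimal sketch derives criterion weights from a pairwise comparison matrix using the row geometric-mean approximation. The criteria names and comparison values are hypothetical and are not taken from the project, which may have used a different weighting procedure.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights via normalized row geometric means."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical criteria and pairwise judgments (illustrative only).
# pairwise[i][j] states how much more important criterion i is than criterion j.
criteria = ["temperature", "rainfall", "humidity"]
pairwise = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
]

weights = ahp_weights(pairwise)
for name, weight in zip(criteria, weights):
    print(f"{name}: {weight:.3f}")
```

Cities could then be ranked by the weighted sum of their normalized scores on each criterion.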





Harvesting Hydrogen Insights: Unveiling Safety Incidents and Fueling Stations through Web Crawling and Streamlit Visualization
    Information Visualization

    

    
  • -Compiled data on 220 hydrogen safety incidents and 1789 hydrogen fueling stations using web crawling techniques.
  • -Translated text fields in the data using browser automation tools such as Selenium, accelerated with multithreading.
  • -Built customized data visualizations in a web application using Streamlit and Pyecharts for future use in industry reports.





Linguistic Family Tree Explorer: Unveiling Language Evolution and Relationships
    Intermediate Programming

    

    
  • -Applied web crawling techniques to gather linguistic data on 2,739 languages from Wikipedia and other websites.
  • -Constructed linguistic trees from the gathered data using Object-Oriented Programming.
  • -Employed treelib, clustering methods from SciPy, and NetworkX to visualize linguistic data in trees, dendrograms, and graphs.
  • -Developed a Streamlit web application to allow users to interact with the data and customize the visualizations.





Unraveling Amazon's E-Commerce Tapestry: A Network Analysis of Consumer Behavior and Market Trends
    Network

    

PDF

    
  • -Applied several centrality measures in NetworkX to identify the 10 most important categories in the co-appearance network.
  • -Verified the scale-free nature of the category network with a log-log plot of the degree distribution.
  • -Used spectral clustering, the Louvain method, and other algorithms to detect 10 clusters of categories in the co-purchase network.
  • -Implemented Logistic Regression with graph features to predict whether a viewed product will be purchased, achieving an F1 score of 0.78.





Spatiotemporal Analysis of Housing Prices in NYC: Integrating Predictive Modeling, Fast Fourier Transform, and Random Forest
    Data Mining

    

PDF

    
  • -Applied FFT, filtering, and IFFT to detrended daily property sales data to denoise and forecast the series, and identified a 7-day period in the data.
  • -Utilized GeoPandas and Matplotlib to visualize the yearly trend in maps and identified areas of interest.
  • -Applied H2O's random forest to predict property price from spatial, temporal, and inherent features of the property and to obtain feature importance rankings across years.
  • -Explored VARIMA models with different parameters to fit the daily number of property sales and obtain the feature importance map.
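
The period-detection step listed above can be sketched as follows. This is a minimal example on synthetic data, assuming a linear trend and NumPy's real FFT; the function name and the synthetic series are hypothetical and do not reproduce the project's actual pipeline or data.

```python
import numpy as np


def dominant_period(series, sample_spacing=1.0):
    """Return the period of the strongest non-zero frequency in a series."""
    x = np.asarray(series, dtype=float)
    t = np.arange(x.size)
    # Remove a fitted linear trend so it does not dominate the low frequencies.
    slope, intercept = np.polyfit(t, x, 1)
    detrended = x - (slope * t + intercept)
    spectrum = np.abs(np.fft.rfft(detrended))
    freqs = np.fft.rfftfreq(x.size, d=sample_spacing)
    # Skip index 0 (the zero-frequency term) when searching for the peak.
    peak = int(np.argmax(spectrum[1:])) + 1
    return 1.0 / freqs[peak]


# Synthetic daily series: linear trend plus a weekly cycle (illustrative only).
days = np.arange(364)
synthetic_sales = 100 + 0.05 * days + 10 * np.sin(2 * np.pi * days / 7)

period = dominant_period(synthetic_sales)
print(f"Dominant period: {period:.2f} days")
```

Denoising would follow the same pattern: zero out the weak frequency components and apply `np.fft.irfft` to return to the time domain.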





Approximate Frank-Wolfe Algorithms over Graph-structured Support Sets.
    Machine Learning (ECE)

    

PDF

    
  • -Reviewed a paper on solving the graph-structured convex optimization problem using the approximate Frank-Wolfe algorithm.
  • -Re-implemented and analyzed the original algorithm and proposed extensions, including the backtracking line-search method, which effectively reduced the number of iterations.
  • -Proposed a new DMO method (Top-g+ optimal visiting) to get an approximate IPO, which showed some improvement in convergence rate but with a significant increase in running time.
  • -Compared the objective function values for the FW method via DMO, the random PGD method, and the best PGD method and obtained good results.





Exploring the Evolution of Movie Genres, Revenues, and Industry Dynamics Over a Century: A Comprehensive Analysis Using Data Science and Visualization.
    Data Analysis and Manipulation

    

PDF

    
  • -Utilized MapReduce and SparkSQL in Python to analyze movie metadata, extracting genre, language, budget, revenue, and release dates from over 45,000 records spanning 100 years.
  • -Employed Pandas and Matplotlib in Jupyter Notebook to visualize genre trends, revealing a decline in the popularity of drama and romance genres, while action and adventure genres gained prominence.
  • -Conducted Spark and SparkSQL analysis to determine the adjusted revenue and profit of movies, factoring in inflation using CPI data, highlighting the impact of changing economic conditions.
  • -Identified Gone with the Wind as the most profitable movie in history when adjusted for inflation, emphasizing the historical significance of certain films despite the increasing diversity of entertainment options over the years.
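
The inflation adjustment mentioned above reduces to scaling a nominal amount by the ratio of two CPI values. The sketch below shows that arithmetic in plain Python; the function name and all figures are hypothetical placeholders, not values from the project's dataset or from official CPI tables.

```python
def adjust_for_inflation(nominal, cpi_then, cpi_now):
    """Convert a nominal amount into target-year money using a CPI ratio."""
    if cpi_then <= 0:
        raise ValueError("cpi_then must be positive")
    return nominal * cpi_now / cpi_then


# Hypothetical figures for illustration only.
adjusted = adjust_for_inflation(nominal=10_000_000, cpi_then=15.0, cpi_now=300.0)
print(f"Adjusted revenue: {adjusted:,.0f}")
```

In Spark, the same ratio could be applied as a column expression after joining the movie records with a CPI table on release year.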





Decoding Depreciation: Analyzing Used Car Trade Patterns in the U.S. for Informed Purchases.
    Data Analysis and Manipulation

    

PDF

    
  • -Analyzed used car trade data from Kaggle, comprising 420,000 records from April 4th, 2021, to May 5th, 2021, using Python.
  • -Explored age and mileage patterns, revealing significant differences in distribution across car conditions; identified Ford's durability based on mileage.
  • -Conducted regression analysis, showing a clear negative correlation between car age and price, and car mileage and price; highlighted variations among car models.
  • -Utilized PCA regression to model forecasts of secondhand prices, indicating positive relationships between depreciation rates and the original price with respect to age and mileage.





Livability Insights: A Multi-Dataset Analysis of U.S. Cities Using Python, Pandas, and Tableau.
    Information Visualization

    

    
  • -Analyzed five datasets, including Zillow Home Value Index for housing costs, A Countrywide Traffic Accident Dataset for traffic accidents, EPA's Walkability Index for walkability, Comparative Climatic Data for weather, and National Center for Health Statistics/USALEEP for life expectancy.
  • -Processed and cleaned datasets using Python and Pandas, ensuring consistency at the city level; joined datasets in Tableau to create a comprehensive dashboard.
  • -Discovered correlations between factors affecting livability, such as higher housing costs in walkable cities and potential weather-related challenges influencing life expectancy.
  • -Implemented user feedback to enhance visualizations, addressing issues with legends, organization, descriptions, and units, aiming for a more user-friendly experience.





Inequalities Unveiled: Exploring the Work and Health Dynamics of People with Disabilities through Interactive Visualizations.
    Information Visualization

    

    
  • -Explored income, work, and health dynamics of people with disabilities using Altair visualizations.
  • -Uncovered disparities, showcasing lower earnings, part-time work prevalence, and older age across industries.
  • -Utilized interactive scatter plots and pie charts for dynamic data exploration.
  • -Addressed challenges in finding specific data on medical barriers, emphasizing the importance of good medical care for individuals with disabilities.





Enhancing Movie Search: A Tip-of-the-Tongue Known-Item Retrieval Approach with Abstract Information.
    Information Retrieval

    

    
  • -Developed a vertical search engine for movie retrieval, focusing on Tip of the Tongue Known-Item Retrieval to assist users with vague memories.
  • -Utilized two datasets, one with movie metadata and another with scraped user reviews, to enhance search accuracy.
  • -Explored various retrieval models, including BM25, TF-IDF, DFIZ, and DirichletLM_df, combined with features like sequential dependence models and review lengths.
  • -Addressed challenges such as missing data and observed variations in performance based on dataset characteristics, emphasizing the importance of dataset quality for model effectiveness.





Analyzing Global Health Factors: A Multivariate Regression Approach.
    Regression Analysis

    

    
  • -Analyzed a dataset from the Global Health Observatory with 183 countries and 11 variables, using linear regression to understand factors influencing life expectancy.
  • -Cleaned the data by removing rows with missing values and potential data entry errors, resulting in a dataset of 141 rows for analysis.
  • -Explored collinearity among predictors, identified highly correlated variables, and performed variable selection to build a predictive model for life expectancy.
  • -Concluded that developing-country status is associated with significantly lower life expectancy, alcohol consumption has a negative effect, BMI and GDP have positive effects, and schooling has a significant positive effect, with a notable interaction between status and schooling.





Predicting Stock Prices: A Comparative Analysis of Deep Learning Models.
    Applied Machine Learning

    

    
  • -Implemented CNN, RNN, LSTM, and non-deep learning stacking models for stock price prediction.
  • -Conducted experiments and analyzed model performance, revealing that RNN outperformed other deep learning models with a Mean Square Error (MSE) of 2.1835e-06.
  • -Explored the randomness of stock price change rate, finding that it behaves like a random variable with limited correlation between values.
  • -Investigated the impact of model simplification, discovering that a dummy predictor using only the previous day's price achieved the best MSE of 3.120314.
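
The dummy predictor described above is a persistence baseline: each day's forecast is simply the previous day's price. A minimal sketch of how its MSE could be computed is shown below; the function name and the toy price series are hypothetical and unrelated to the project's data.

```python
def persistence_mse(prices):
    """MSE of a baseline that predicts each price as the previous day's price."""
    if len(prices) < 2:
        raise ValueError("need at least two prices")
    squared_errors = [
        (actual - previous) ** 2
        for previous, actual in zip(prices, prices[1:])
    ]
    return sum(squared_errors) / len(squared_errors)


# Toy price series for illustration only.
toy_prices = [10.0, 11.0, 13.0, 16.0]
baseline_mse = persistence_mse(toy_prices)
print(f"Persistence baseline MSE: {baseline_mse:.4f}")
```

Comparing any learned model against this baseline on the same scale shows whether the model adds predictive value beyond the last observed price.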





Identify Landcover Types by Satellite Remote Sensing Data.
     Intermediate GIS

    

    
  • -Established a consulting firm specializing in GIS services, tasked with creating a landcover map for the Southeast Michigan Council of Governments (SEMCOG) using NASA's Landsat satellite data.
  • -Developed a methodology involving the calculation of three indexes (NDVI, MNDWI, MBI) to identify vegetation coverage, water bodies, and urban/bare soil areas, respectively.
  • -Emphasized the importance of selecting appropriate seasonal satellite data to ensure accurate results, considering factors like snow cover, frozen water bodies, and their impact on index calculations.
  • -Explored the potential use of machine learning methods for landcover mapping but decided against it due to insufficient samples, opting for manual interpretation and mapping expertise over automated solutions.
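
Two of the indexes named above are normalized differences of band reflectances. The sketch below uses the standard definitions NDVI = (NIR - Red) / (NIR + Red) and MNDWI = (Green - SWIR1) / (Green + SWIR1); MBI is omitted, and the reflectance values are hypothetical rather than taken from the project's Landsat scenes.

```python
def normalized_difference(band_a, band_b):
    """Return (a - b) / (a + b), or 0.0 where both bands are zero."""
    total = band_a + band_b
    return 0.0 if total == 0 else (band_a - band_b) / total


def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return normalized_difference(nir, red)


def mndwi(green, swir1):
    """Modified Normalized Difference Water Index."""
    return normalized_difference(green, swir1)


# Hypothetical surface reflectance values for a single pixel.
vegetation_index = ndvi(nir=0.5, red=0.1)
water_index = mndwi(green=0.3, swir1=0.1)
print(f"NDVI: {vegetation_index:.3f}, MNDWI: {water_index:.3f}")
```

In practice the same formulas would be applied element-wise to whole raster bands, and thresholds on each index would separate vegetation, water, and other cover types.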





Geographic and Demographic Characteristics of U.S. Gun Violence.
    GIS

    

    
  • -Utilized GIS methodology to analyze U.S. gun violence, integrating data on gun incidents, county boundaries, and demographic information.
  • -Identified spatial patterns, revealing a positive correlation between gun violence and population density at the county level, while in New York City and surroundings, incidents clustered in areas with lower median household income, younger age, and higher female population proportion.
  • -Addressed limitations, such as tied geocoded records, emphasizing the need for further studies with larger datasets for more conclusive results.
  • -Demonstrated correlations through visualizations and statistical analyses, offering insights into the geographic and demographic characteristics associated with U.S. gun violence.