Data Science Tech Brief By HackerNoon
Technology
About
Learn the latest data science updates in the tech world.
Episodes
- How I Decoded My Apple Watch Metrics: Taking a Look At The Raw Numbers (Part 2)
This episode discusses decoding Apple Watch metrics by parsing Apple Health XML and GPX files. It covers using Python to stream large CDA files, extract workout kinematics, and convert raw data into clean CSVs for machine learning.
- Why AI Agents Are Creating a New Kind of Data Engineer
This episode discusses how AI agents are changing the role of data engineers, leading to the emergence of intelligence engineers who build and govern AI agents and data pipelines. It explores the evolving responsibilities and skills requir…
- The Architectural Limits of Data Lakes and the Rise of Lakehouses
This episode discusses the architectural limitations of data lakes, highlighting their inability to ensure reliability despite solving storage issues. It introduces lakehouse architecture as a solution, explaining how it adds transactions,…
- The Economic Case for Investing in Youth Education
This episode discusses the economic case for investing in youth education, highlighting the strong returns shown by causal studies, especially in early childhood and low-income countries. It covers topics such as data science, statistics,…
- HiveMQ and TimescaleDB: It Just Works!
This episode discusses how HiveMQ and MQTT facilitated real-time SCADA data streaming to power machine learning for optimizing an industrial dosing process. The combination with TimescaleDB allowed the system to scale and improve productio…
- 102 Blog Posts To Learn About Datasets
This episode highlights 102 blog posts from HackerNoon to help listeners learn about datasets. The posts cover data science topics and related exclusive content.
- Why More Data Doesn’t Guarantee Better Insights in Modern Data Systems
This episode explains that simply having more data does not lead to better insights in modern data systems. It highlights issues like poor data quality, bias, and pipeline problems that hinder analytics at scale, emphasizing that the goal…
- 500 Blog Posts To Learn About Data
This episode highlights 500 HackerNoon blog posts that offer comprehensive learning resources on data science. The content is curated from hackernoon.com and covers topics related to data, learning data, and data science.
- 228 Blog Posts To Learn About Data Visualization
This episode features 228 HackerNoon blog posts covering data visualization, curated from the HackerNoon data science section. The content is suitable for learning about the topic.
- The Hard Lessons of Managing a Data Science Team
A data science team lead shares four key lessons learned in managing a struggling team. By implementing a framework to fix output quality, protect focus, raise technical standards, and improve planning and recognition, the team's rework ra…
- 95 Blog Posts To Learn About Data Storage
This episode highlights 95 HackerNoon blog posts covering data storage. The posts offer a comprehensive guide to learning about data storage and related topics within data science.
- 70 Blog Posts To Learn About Data Scraping
This episode features a curated list of 70 HackerNoon blog posts covering data scraping. The posts offer comprehensive learning material on the subject for those interested in data science.
- 500 Blog Posts To Learn About Data Science
This episode highlights 500 free blog posts from HackerNoon to learn about Data Science. The content is categorized under data-science, learn, and learn-data-science, with the original story by @learn.
- 110 Blog Posts To Learn About Data Management
This episode highlights 110 blog posts from HackerNoon focused on data management. The posts cover essential information for learning about data management and related data science topics, all available for free.
- 402 Blog Posts To Learn About Data Analytics
This episode highlights 402 blog posts from HackerNoon focused on learning Data Analytics. The content is sourced from HackerNoon's data science section and covers topics related to data analytics.
- 50 Blog Posts To Learn About Data Collection
This episode features 50 HackerNoon blog posts covering data collection. The content is curated to help listeners learn about the topic and related subjects. More information can be found on hackernoon.com.
- 427 Blog Posts To Learn About Data Analysis
This episode highlights 427 blog posts from HackerNoon focused on learning data analysis. The content covers various aspects of the topic and is suitable for those looking to deepen their understanding.
- Your Dashboard Isn’t Wrong - Your KPI Logic Is
Dashboards often face trust issues due to poorly defined Key Performance Indicators (KPIs). The episode argues for correcting the underlying metric logic rather than solely focusing on the visual presentation layer.
- The Hidden Cost of Scraping Everything (and Why Datasets Win)
This episode discusses the hidden costs associated with data scraping, arguing that ready-to-use datasets offer a superior solution. Datasets provide clean, structured, and query-ready data, which is faster and cheaper than scraping, ultim…
- 500 Blog Posts To Learn About Big Data
This episode highlights 500 free HackerNoon blog posts curated for learning about Big Data. The content covers topics related to data science and is available on hackernoon.com.
- 263 Blog Posts To Learn About Analytics
This episode highlights 263 free HackerNoon blog posts curated to help listeners learn about analytics. The content, originally published on HackerNoon, also links to further data science resources and exclusive articles on analytics.
- They Got Lost in the Transformer, Episode 1: What Even Is an Embedding?
This episode introduces word embeddings and Transformers, explaining how language is converted into numerical vectors. It uses the example of 'King - Man + Woman = Queen' to illustrate how meaning is derived from relationships within a con…
- Kafka vs Azure Event Hubs: The Tradeoffs You Only See in Production
This episode compares Kafka and Azure Event Hubs using production experience, discussing their tradeoffs regarding throttling and exactly-once semantics. Kafka offers more control and guarantees, while Event Hubs provides operational simpl…
- Clarifying the Difference Between Data Strategy, Analytics, and AI Governance
This episode clarifies the differences between Data Strategy, Data Governance, and AI Governance, proposing a framework to prevent pilot sprawl and enable scalable, value-driven analytics across industries.
- The “Store Everything” Cloud Model Is Breaking Under Modern AI Workloads
"The "Store Everything" cloud model is failing due to modern AI workloads. The article suggests AI Edge Proxies as a solution to cut storage costs by 60% and address industrial latency, promoting the concept of Smart Data. It explains that…
- AI Belongs Inside DataOps, Not Just at the End of the Pipeline
This episode argues that AI should be integrated upstream within DataOps processes to automate enforcement, detect anomalies, and maintain documentation. AI-augmented DataOps enhances reliability and trust at scale, freeing engineers to fo…
- Stop Torturing Your Data: How to Automate Rigor With AI
This episode discusses how improvising in data analysis can lead to bias and p-hacking. It introduces an AI prompt designed to enforce methodological discipline and a pre-commitment strategy, acting as a roadmap to ensure research validity…
- Minimum Incident Lineage (MIL): A Run-Level Evidence Standard for Reproducible Data Incidents
The episode introduces Minimum Incident Lineage (MIL), a standard for capturing run-level evidence to make data incidents reproducible and auditable. MIL enables faster triage and resolution without storing raw data, focusing on essential…
- 5 Ways Spark 4.1 Moves Data Engineering From Manual Pipelines to Intent-Driven Design
Apache Spark 4.1 enhances data engineering by simplifying Change Data Capture and lifecycle management. It shifts from manual pipelines to intent-driven design, potentially reducing development time by up to 90% and addressing data stalene…
- Beyond Prediction: Econometric Data Science for Measuring True Business Impact
This episode explores econometric data science, detailing how its methodologies model counterfactual consequences to predict outcomes without intervention. This approach is vital for accurately measuring business impact, determining ROI, a…
- Designing Economic Intelligence: Econometrics-First Approaches in Data Science
This episode discusses designing economic intelligence by embedding structured reasoning into decision systems. It highlights econometrics as a logical foundation, viewing decisions as interventions within an economic context and emphasizi…
- From Forecasting to BI: Inside Shravanthi Ashwin Kumar’s Data-Driven Finance Playbook
This episode details Shravanthi Ashwin Kumar’s data-driven finance playbook, focusing on analytics, forecasting, and tech-powered decision-making. Her expertise includes financial modeling, BI tools like SQL and Python, and delivering meas…
- Causal Thinking in the Age of Big Data: Modern Econometrics for Data Scientists
This episode discusses the limitations of predictive models in data science, highlighting the need for causal thinking and modern econometrics as data scientists influence policy and strategy.
- Data Pipeline Testing: The 3 Levels Most Teams Miss
This episode discusses the importance of data pipeline testing, highlighting three critical levels: schema, business logic, and contracts. It explains how many data teams fail to test data itself, leading to issues like inaccurate dashboar…
- HSM: The Original Tiering Engine Behind Mainframes, Cloud, and S3
This episode details the history and mechanics of Hierarchical Storage Management (HSM), the original data tiering engine. It explains how HSM, with its five key components, has evolved from mainframes to cloud storage, managing data lifec…
- Navigating Architectural Trade-offs at Scale to Meet AI Goals in 2026
This Data Science Tech Brief episode discusses navigating architectural trade-offs to meet AI goals in 2026. Success requires clarity on data infrastructure, including auto-scaling compute and workload isolation for a stable and secure fou…
- Will AI Take Your Job? The Data Tells a Very Different Story
This episode discusses the common anxiety surrounding AI and job displacement. Referencing historical technological revolutions, it suggests that long-term outcomes often present a more optimistic narrative than initial fears indicate.
- You Don’t Need an API for Everything (Sometimes Scraping Is Enough)
This episode explores how web scraping can be a practical and efficient alternative to APIs for automating data collection from public web pages. It highlights that for repetitive browsing tasks, scraping structured data can save time and…
- How to Use Propensity Score Matching to Measure Down Stream Causal Impact of an Event
This episode explains propensity score matching (PSM) as a statistical method to measure the downstream causal impact of events like advertising. It addresses the challenge of non-random ad exposure and hidden biases by creating comparable…
- How to Analyze Call Sentiment With Open-Source NLP Libraries
This episode explores call sentiment analysis using open-source NLP libraries. It details how analyzing customer emotions, polarity, intensity, and temporal shifts across large call volumes can reveal systemic trends and improve customer s…
- How Bayesian Tail-Risk Modeling can save your Retail Business Marketing Budget
This episode discusses how Bayesian tail-risk modeling can protect retail marketing budgets. It explains that average ROI is often misleading due to "fat tails" or rare, extreme negative events that conventional models underestimate.
- Architecting Trustworthy Healthcare Data Platforms Using Declarative Pipelines
This episode discusses the critical role of data quality in digital healthcare data platforms, emphasizing it as a mandatory requirement. It explores the architecture of trustworthy platforms using declarative pipelines, covering aspects r…
- When A/B Tests Aren’t Possible, Causal Inference Can Still Measure Marketing Impact
This episode of Data Science Tech Brief discusses measuring marketing impact when A/B tests are not feasible. It explores causal inference techniques such as Diff-in-Diff, Synthetic Control, and Meta's GeoLift, providing guidance on data p…
- Why Data Quality Is Becoming a Core Developer Experience Metric
This episode discusses how bad data quality negatively impacts developer productivity by causing bugs and requiring defensive coding. It explains that treating data quality as core infrastructure through real-time validation APIs can creat…
- Why “Accuracy” Fails for Uplift Models (and What to Use Instead)
This episode of Data Science Tech Brief discusses the limitations of traditional accuracy metrics for uplift models. It suggests that alternative performance evaluation methods are necessary for this specific machine learning task.
- Turning Your Data Swamp into Gold: A Developer’s Guide to NLP on Legacy Logs
This episode details a developer's guide to using NLP on legacy maintenance logs. It covers a practical pipeline for cleaning logs through normalization, TF-IDF, and cosine similarity to enhance data quality and detect fraud, using Python…
- Data Monetization Strategies in Government Digital Platforms
This episode discusses strategies for monetizing data within government digital platforms. It emphasizes that government data is a strategic asset that can drive innovation, trust, transparency, and economic value, while also highlighting…
- Why Partner Data Became My Toughest Engineering Problem
The episode discusses how inconsistent data definitions in partner portals can cause system slowdowns. The author shares their experience of fixing data lineage, which reduced deal registration time significantly and improved overall stabi…
- PBIX Is Not Going Away - But PowerBI Will Never Work the Same Again
Power BI is moving from the PBIX format to PBIR. The new PBIR format replaces the single binary file with a structured, project-based layout, which enables better collaboration, explicit change tracking, and governance for Power BI report…
- Smart Fire Protection: How AI Is Changing Preventive Maintenance Forever
This episode discusses the transformation of fire protection maintenance through AI and IoT. It highlights how predictive monitoring, AI-driven analytics, and digital tools are reducing failures, improving compliance, and enabling self-mon…