Big Data Processing

Big data collection, storage and processing technologies

data-science, data-analysis, data-engineering, big-data, pandas, statistics, data, analytics, spark, hadoop
microsoft
ML-For-Beginners
microsoft
76.9k

Microsoft Azure cloud advocates are pleased to offer a 12-week, 26-lesson machine learning course. In this course, you will learn what is sometimes called classical machine learning, using Scikit-learn as the library, avoiding deep learning, which will be covered in our upcoming "Beginner AI" course. Pair these courses with our "Beginner Data Science" course!

grafana
grafana
grafana
69.6k

Grafana is a tool for monitoring, metric analysis, and dashboards, with support for data sources such as Graphite, InfluxDB, and Prometheus.

apache
superset
apache
67.8k

A data visualization and data exploration platform. It provides a range of visualization templates and interactive dashboards for clearer data presentation, a built-in SQL IDE for working with data directly, and an open, flexible API that allows extensive customization.

binhnguyennus
awesome-scalability
binhnguyennus
64.8k

A project dedicated to large-scale system design, which gathers the patterns and best practices of scalable, reliable and high-performance systems. It provides developers with rich resources and references to help them design and implement efficient large-scale systems.

scikit-learn
scikit-learn
scikit-learn
63.2k

scikit-learn is a Python module for machine learning built on top of SciPy.

mendableai
firecrawl
mendableai
53.1k

Asabeneh
30-Days-Of-Python
Asabeneh
48.9k

A Python tutorial for beginners. Over 30 days of coding exercises, it teaches basic programming knowledge and more advanced Python skills such as web crawling, data analysis, statistical analysis, virtual environment setup, and API construction.

coollabsio
coolify
coollabsio
44.7k

An open-source, self-hosted alternative to Heroku and Netlify. It supports reverse proxying, free SSL certificate configuration, several common database configurations, and one-click installation and upgrade of projects. Coolify aims to give developers a flexible self-hosted way to deploy and manage their applications.

run-llama
llama_index
run-llama
43.9k

A data framework for LLM (large language model) applications. It provides a solution for data storage and management for LLM applications, helping users build and manage LLM applications more efficiently.

metabase
metabase
metabase
43.4k

A quick data analysis and visualization tool that provides users with a friendly user experience and integration capabilities. It helps companies easily explore and understand their own data without the need for complex data queries and analytical skills. For enterprises and data analysts who need to quickly obtain data insights, Metabase is a powerful and easy-to-use BI tool.

ClickHouse
ClickHouse
ClickHouse
42.6k

A free big data analysis database management system (DBMS) designed for handling massive amounts of data. It provides powerful analytical functions that can be used for real-time queries and analysis of large-scale data sets, helping users quickly extract valuable information from massive data.

apache
spark
apache
41.8k

Apache Spark is a fast, general-purpose cluster computing system for big data. It provides high-level APIs in Scala, Java, Python, and R, as well as an optimized engine that supports general computation graphs for data analysis.

apache
airflow
apache
41.8k

A scheduled task management platform, which manages and schedules various offline scheduled tasks with a built-in web management interface. When the number of scheduled tasks reaches hundreds, it becomes impossible to effectively and conveniently manage these tasks using crontab. This project was born to solve this problem.

streamlit
streamlit
streamlit
41.1k

Streamlit is an open-source Python library that makes it easy to create and share beautiful custom web applications for machine learning and data science. Streamlit converts data scripts into sharable web applications in minutes. It's all written in pure Python. No front-end experience is required, so you can build and share data applications faster than ever before.

gradio-app
gradio
gradio-app
39.7k

Gradio is an open-source project that can generate a simple, elegant UI for a machine learning model in just a few minutes, so that you can demonstrate your project in the browser. Through this interface, you can drag and drop images to upload them, paste text, record audio, and view the model's output.

DataExpert-io
data-engineer-handbook
DataExpert-io
37.3k

A learning guide for data engineers covering books, courses, interview materials, excellent blogs, communities and bloggers worth following.

mindsdb
mindsdb
mindsdb
35.4k

An innovative platform that integrates machine learning into databases through SQL. It treats models as virtual tables (AI-tables), allowing users to directly use SQL queries for time series, regression, and classification predictions without the need for complex data preparation and preprocessing steps. This greatly simplifies the machine learning development process. MindsDB provides developers with a simple and efficient way to accomplish machine learning tasks.

DataTalksClub
data-engineering-zoomcamp
DataTalksClub
32.5k

Data Engineering Zoomcamp (DataTalksClub/data-engineering-zoomcamp) offers a free data engineering course designed to help learners master the basic concepts and skills of data engineering. Whether it's data stream processing, data warehouse construction, or ETL process design, this course provides valuable learning resources for those aspiring to enter the field of data engineering.

microsoft
Data-Science-For-Beginners
microsoft
30.5k

Microsoft's Azure cloud advocates are happy to offer a 10-week, 20-lesson course on data science. Each lesson includes pre- and post-lesson quizzes, written instructions for completing the course, solutions, and assignments. Our project-based teaching method allows you to learn while building, which is a proven way to "stick" with a new skill.

AMAI-GmbH
AI-Expert-Roadmap
AMAI-GmbH
30.2k

An AI technology roadmap initiated by the German software company AMAI GmbH. It lays out the key topics in the AI field, each accompanied by detailed documentation.

PostHog
posthog
PostHog
28.7k

PostHog provides open-source product analytics, session recording, feature flagging, and A/B testing that you can self-host.

eugeneyan
applied-ml
eugeneyan
28.3k

A selection of papers, technical articles and well-known blogs related to data science and machine learning, covering 24 technical directions such as data engineering, natural language processing, computer vision, reinforcement learning, etc. Most of the articles come from world-renowned universities and enterprises.

getredash
redash
getredash
27.7k

An open-source BI tool that provides web-based database querying and data visualization.

d2l-ai
d2l-en
d2l-ai
26.7k

An interactive deep learning book that provides code, math, and discussions across multiple frameworks. This project has been adopted at over 500 universities in 70 countries around the world, including Stanford University, Massachusetts Institute of Technology, Harvard University, Cambridge University, etc. It provides rich resources and an interactive learning experience for learning deep learning.

apache
flink
apache
25.2k

Apache Flink is an open-source stream processing framework with powerful streaming and batch processing capabilities.

fastai
fastbook
fastai
23.7k

The non-profit technology organization fast.ai has released a new version of its deep learning course.

sinaptik-ai
pandas-ai
sinaptik-ai
21.9k

ml-tooling
best-of-ml-python
ml-tooling
21.8k

A collection of practical machine learning and Python open-source projects and tools. It lists more than 900 projects covering data visualization, natural language processing, text and image data, web crawling, and more.

dataease
dataease
dataease
21.5k

An open-source data visualization analysis tool that helps users quickly analyze data and gain insights into business trends, thereby achieving business improvement and optimization.

PrefectHQ
prefect
PrefectHQ
20.2k

A data workflow orchestration platform for Python. If the programs that acquire, clean, and process data are treated as individual tasks, this project can combine those tasks into a workflow and then deploy, schedule, and monitor it from a web platform.
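The idea of chaining acquire, clean, and process tasks into one workflow can be sketched with the Python standard library alone. This is a minimal illustration, not Prefect's API: the task functions, the `workflow` dictionary, and the `run` helper are all hypothetical names invented for the example, and dependency ordering is delegated to `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical tasks standing in for real acquire/clean/process programs.
def acquire():
    return [" 3", "1 ", "2", ""]

def clean(raw):
    # Strip whitespace and drop empty records before converting to int.
    return [int(x) for x in (s.strip() for s in raw) if x]

def process(values):
    return sum(values)

# Task name -> (function, names of upstream tasks whose results it needs).
workflow = {
    "acquire": (acquire, []),
    "clean": (clean, ["acquire"]),
    "process": (process, ["clean"]),
}

def run(workflow):
    """Run every task after its upstream tasks, passing results along."""
    graph = {name: set(deps) for name, (_, deps) in workflow.items()}
    results, order = {}, []
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func(*(results[d] for d in deps))
        order.append(name)
    return order, results

order, results = run(workflow)
print(order, results["process"])
```

A real orchestrator adds what this sketch leaves out, such as scheduling, retries, and a monitoring interface.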

airbytehq
airbyte
airbytehq
19.3k

An open-source data integration platform that can complete data integration in just a few minutes through APIs, applications, command-line tools, and other methods for subsequent use and management.

Avaiga
taipy
Avaiga
18.6k

A framework for quickly building data-driven web applications. It is based on Python and Flask, combined with front-end technologies such as React, and simplifies data processing, API development, and user interface construction. Data scientists, machine learning engineers, and web developers can all use Taipy to go from prototype to finished web application. Shared by @Liu Sanfei.

Tencent
APIJSON
Tencent
18.1k

A framework for quickly developing API services. It provides fully automated APIs for simple create, read, update, and delete operations as well as complex queries and simple transactions. With APIJSON, users no longer need to write API endpoints and documentation by hand, which greatly improves development efficiency.

dair-ai
ML-YouTube-Courses
dair-ai
16.8k

The ML YouTube Courses project is dedicated to providing users with the latest machine learning and artificial intelligence courses, all of which can be found on YouTube. By aggregating various educational resources, this project offers learners and practitioners a convenient platform to easily browse, filter, and select course content that suits their learning needs. Whether you are a beginner or a professional, ML YouTube Courses is an ideal choice for discovering quality machine learning educational resources.

heibaiying
BigData-Notes
heibaiying
16.6k

A Big Data Primer

openobserve
openobserve
openobserve
16.4k

OpenObserve is a cloud-native observability platform for logs, metrics, traces, and analytics, engineered for petabyte scale. The project positions itself as a high-performance alternative to Elasticsearch, Splunk, and Datadog, and claims to be 10 times easier to use with 140 times lower storage costs.

bharathgs
Awesome-pytorch-list
bharathgs
16.1k

A list of open-source libraries related to PyTorch on GitHub, containing learning tutorials, examples, etc.

argoproj
argo-workflows
argoproj
16.0k

FavioVazquez
ds-cheatsheets
FavioVazquez
15.7k

Data Science Cheat Sheet

marimo-team
marimo
marimo-team
15.6k

An innovative reactive Python notebook. When you interact with a UI element, marimo automatically re-executes the code cells that depend on it, which keeps code and output consistent. Notebooks are stored as pure Python files, making them easy to manage and run, and they can be executed as scripts or deployed as interactive web applications.
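The "re-run the dependent cells" behavior can be illustrated with a toy model. The sketch below is not how marimo is implemented and uses none of its API: the `Notebook` class and its `cell` and `set` methods are hypothetical, and the propagation is deliberately naive (a cell reachable along two paths would run twice).

```python
class Notebook:
    """Toy reactive notebook: each cell computes one variable from others."""

    def __init__(self):
        self.cells = {}   # variable name -> (function, input variable names)
        self.values = {}  # variable name -> current value

    def cell(self, name, inputs, func):
        """Register a cell and run it once."""
        self.cells[name] = (func, inputs)
        self._run(name)

    def set(self, name, value):
        """Simulate a UI interaction that changes a value."""
        self.values[name] = value
        self._rerun_dependents(name)

    def _run(self, name):
        func, inputs = self.cells[name]
        self.values[name] = func(*(self.values[i] for i in inputs))
        self._rerun_dependents(name)

    def _rerun_dependents(self, changed):
        # Re-execute every cell that reads the changed variable.
        for name, (_, inputs) in self.cells.items():
            if changed in inputs:
                self._run(name)

nb = Notebook()
nb.set("slider", 2)
nb.cell("squared", ["slider"], lambda s: s * s)
nb.cell("label", ["squared"], lambda sq: f"value = {sq}")
before = nb.values["label"]

nb.set("slider", 5)  # both downstream cells update automatically
after = nb.values["label"]
print(before, "->", after)
```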

GaiZhenbiao
ChuanhuChatGPT
GaiZhenbiao
15.4k

ChuanhuChatGPT is an open-source chatbot project based on Transformers that provides dialogue generation capabilities and a variety of pre-trained models. Developers can use it to quickly build interactive, natural-sounding chatbots for a range of applications.

apache
hadoop
apache
15.2k

Apache Hadoop enables the distributed processing of large data sets across clusters of computers using a simple programming model.
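The programming model most closely associated with Hadoop is MapReduce. As a rough, single-process sketch of that model (it does not use Hadoop's API, and the function names are chosen only for this example), a word count splits into a map phase, a grouping step, and a reduce phase:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all counts emitted for one word."""
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = word_count(["big data big clusters", "data pipelines"])
print(counts)
```

On a cluster, the map and reduce calls run in parallel on different machines and the framework handles the grouping step; here everything runs in one process purely to show the shape of the computation.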

Kanaries
pygwalker
Kanaries
15.1k

A Python library that simplifies the data analysis and data visualization workflow in Jupyter Notebook.

bbfamily
abu
bbfamily
14.6k

A free and open-source quantitative trading and investment system based on Python, supporting stocks, futures, foreign exchange, digital currencies (BTC/ETH/LTC/ETC/BCC), etc.

aalansehaiyang
technology-talk
aalansehaiyang
14.6k

A summary of Java ecosystem common technology frameworks, open source middleware, system architecture, project management, classic architecture cases, databases, commonly used third-party libraries, online operation and maintenance, etc.

andkret
Cookbook
andkret
14.5k

Provides practical guidance and best practices for data engineers on data processing, analysis, and management. This project collects the knowledge shared by experienced experts to help data engineers better address challenges in the data field.

microsoft
nni
microsoft
14.3k

A lightweight but powerful toolkit that helps users automate feature engineering, neural architecture search, hyperparameter tuning, and model compression.

virgili0
Virgilio
virgili0
14.2k

A machine learning guide that acts as a mentor, offering a complete learning path for getting to know the tools and building up your skills.

apache
doris
apache
14.2k

A high-performance, real-time analytical database based on MPP architecture, which performs excellently in scenarios with massive data and high concurrency. Currently, it is widely used in many well-known companies to build applications such as user analysis, log retrieval analysis, and user profiling.

datastacktv
data-engineer-roadmap
datastacktv
12.7k

A 2020 learning roadmap for data engineers, covering CS fundamentals, database fundamentals, relational databases, cluster computing fundamentals, data processing, data pipeline monitoring, data security and privacy, etc.

trinodb
trino
trinodb
11.8k

vesoft-inc
nebula
vesoft-inc
11.6k

Nebula Graph is an open-source graph database that excels at handling ultra-large-scale datasets with billions of vertices and trillions of edges.

ludwig-ai
ludwig
ludwig-ai
11.6k

A low-code framework designed for building custom deep learning models, neural networks, and other AI models. The project aims to lower the development barrier for AI applications, enabling developers to create and deploy custom AI models more easily without requiring expertise in deep learning.

OpenRefine
OpenRefine
OpenRefine
11.5k

A desktop tool for data cleaning, which analyzes and organizes data through visualization. It supports multiple platforms, including Windows, Linux, and Mac operating systems. The tool has functions such as querying, filtering, deduplication, and analysis, allowing users to organize messy data into "clean" spreadsheets in a simple and intuitive way. Without the need for programming and SQL knowledge, OpenRefine provides users with a powerful and user-friendly data cleaning experience.

pwxcoo
chinese-xinhua
pwxcoo
11.3k

A database of the Chinese Xinhua Dictionary, including common idioms, proverbs, words, and characters.

wandb
wandb
wandb
10.3k

A lightweight machine learning visualization tool. It is used for visualizing and tracking machine learning experiments, allowing experiments to be tracked, compared, and visualized with just a few lines of code. For machine learning engineers and data scientists, this tool provides a convenient and efficient way to manage experiments and results.

wangzhiwubigdata
God-Of-BigData
wangzhiwubigdata
10.2k

A big data interview preparation guide divided into three major chapters: Big Data Development Foundation, Framework Learning, and Practical Advanced. It includes frequently asked interview questions on topics such as high concurrency, distributed systems, Hadoop, Spark, Flink, and Kafka.

microsoft
computervision-recipes
microsoft
9.7k

A computer vision guide, "Computer Vision Recipes," provides code examples and best practices for building computer vision systems.

alexeygrigorev
data-science-interviews
alexeygrigorev
9.5k

A collection of data science interview questions divided into two parts: theory (linear regression, neural networks, decision trees, text classification, etc.) and technical application (SQL, Python, algorithms, etc.).

oceanbase
oceanbase
oceanbase
9.4k

OceanBase is a distributed relational database developed by Ant Group. It is based on the Paxos protocol and a distributed architecture, which realizes high availability and linear scalability. The OceanBase database can run on common server clusters without relying on special hardware architectures. This project aims to provide a reliable relational database solution for enterprise-level applications.

© 2025 GitHub Fun. All rights reserved.