Explanation of 75 core terms in the field of big data!

Author: Zhang Yue Time: 2020/09/18 Reads: 3783

Recently, Ramesh Dontha published two articles on DataConomy that briefly but comprehensively introduce 75 core big data terms. The list is good introductory material for big data beginners, and it also helps more experienced practitioners fill in gaps. This article is divided into part one (25 terms) and part two (50 terms).

Part one (25 terms)

If you are new to big data, you may find the field hard to grasp and not know where to begin. This list of 25 big data terms is a good place to start, so let's get going.

 

Algorithm: An algorithm can be understood as a mathematical formula or statistical process used to analyze data. So what does "algorithm" have to do with big data? Although the word itself is a general term, in this era of widespread big data analysis, algorithms are mentioned constantly and have become more and more popular.

 

Analytics: Let's imagine a very likely situation. Your credit card company sends you a statement listing every transfer made on your card over the past year. What happens if you sit down with that list and seriously study what percentage of your money went to food, clothing, entertainment, and so on? You are doing analytics: you are digging useful information out of raw data, information that can help you decide how to spend in the coming year. Now, what if you applied the same approach to the Twitter and Facebook posts of people across an entire city? In that case we can call it big data analytics. Big data analytics is, in essence, reasoning over large amounts of data and deriving useful information from it. There are three different types of analytics, and we will go through them one by one.

 

Descriptive Analytics: If you report only that last year's credit card spending was 25% on food, 35% on clothing, 20% on entertainment, and the remaining 20% on miscellaneous expenses, that is descriptive analytics. Of course, you can also dig into much more detail.
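As a concrete illustration of the credit-card example above, here is a minimal descriptive-analytics sketch in Python; the spending figures are invented purely for illustration.

```python
# A minimal sketch of descriptive analytics: summarizing last year's
# credit-card spend by category. The amounts are made up for illustration.
spend = {"food": 2500, "clothing": 3500, "entertainment": 2000, "misc": 2000}

total = sum(spend.values())
for category, amount in spend.items():
    share = amount / total * 100
    print(f"{category}: {share:.0f}% of total spend")
```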

 

Predictive Analytics: If you analyze your credit card history for the past five years and find that annual spending follows a fairly consistent trend, you can predict with high probability that next year's spending will be similar to previous years. This does not mean we are foretelling the future; rather, we are "predicting with probabilities" what may happen. In big data predictive analytics, data scientists may use advanced techniques such as machine learning and sophisticated statistical methods (discussed later) to predict weather conditions, economic changes, and so on.
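To make the "trend" idea concrete, here is a rough sketch that fits a straight line to five years of hypothetical annual spending and projects the next year; the figures and years are invented for illustration.

```python
# A rough sketch of predictive analytics: fit a linear trend to five years
# of (hypothetical) annual spend and project the following year.
import numpy as np

years = np.array([2012, 2013, 2014, 2015, 2016])
annual_spend = np.array([9500, 9800, 10200, 10400, 10900])

slope, intercept = np.polyfit(years, annual_spend, deg=1)  # straight-line fit
forecast = slope * 2017 + intercept
print(f"Projected spend for 2017: {forecast:.0f}")
```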

 

Prescriptive Analytics: Let's stick with the credit card example. Suppose you want to find out which categories of spending (food, entertainment, clothing, and so on) have the biggest impact on your overall spending. Prescriptive analytics builds on predictive analytics by introducing "actions" (such as cutting back on food, clothing, or entertainment) and analyzing the resulting outcomes in order to define the best course of action for reducing your overall spending. You can extend this to big data and imagine an executive making so-called "data-driven" decisions by observing the impact of multiple actions laid out in front of them.

 

Batch processing: Although batch data processing has existed since the mainframe era, it has taken on new significance in the big data era, with its very large volumes of data to process. Batch processing is an efficient way to handle large amounts of data, such as a set of transactions collected over a period of time. Distributed computing (Hadoop), discussed later, is a method that specializes in processing data in batches.

 

Cassandra: A popular open source database management system managed by the Apache Software Foundation. Apache maintains many big data processing technologies, and Cassandra is the system they designed specifically to handle large amounts of data across distributed servers.

 

Cloud computing: The term is now a household word and hardly needs explaining, but for the sake of completeness it is included here. Essentially, when software or data is processed on remote servers and those resources can be accessed from anywhere on the network, it is called cloud computing.

 

Cluster computing: A descriptive term for computing that pools the rich resources of multiple servers into a cluster. A more technical view is that, in the context of cluster processing, we may talk about nodes, the cluster management layer, load balancing, parallel processing, and so on.

 

Dark data: This is a coined term that, in the author's view, exists mainly to scare people and make senior management sound mysterious. Basically, dark data is all the data that companies collect and process but never actually use; in that sense it is "dark" and may never be analyzed at all. It can be information from social networks, call center logs, meeting minutes, and so on. Many estimates put 60% to 90% of all company data in the dark data category, but in truth no one really knows.

 

Data lake: When the author first heard this term, I honestly thought it was an April Fools' joke, but it is a real term. A data lake is a company-wide repository of data kept in large quantities in its raw format. While we are here, let's also introduce the data warehouse. A data warehouse is a similar concept, but the difference is that it stores structured data that has been cleaned and integrated with other sources. Data warehouses are often used for general-purpose data (though not necessarily). It is generally believed that a data lake makes it easier to reach the data you really need, and also makes it more convenient to process and use that data effectively.

 

Data mining: Data mining is the process of finding meaningful patterns and deriving insights from large data sets using sophisticated pattern recognition techniques. It is closely related to the "analytics" discussed above: in data mining you first mine the data, then analyze the results. To obtain meaningful patterns, data miners use statistics (the classic old method), machine learning algorithms, and artificial intelligence.

 

Data scientist: A very sexy profession these days. It refers to people who can derive insights by extracting, understanding, and processing raw data (the stuff we earlier called the data lake). Some of the skills required of a data scientist seem to belong only to superhumans: analytics, statistics, computer science, creativity, storytelling, and an understanding of business context. No wonder these people are so well paid.

 

Distributed File System: The amount of big data is too large to be stored in a single system. A distributed file system is a file system that can store large amounts of data on multiple storage devices. It can reduce the cost and complexity of storing large amounts of data.

ETL: ETL stands for Extract, Transform, and Load. It refers to the process of "extracting" raw data, "transforming" it into a usable form through cleaning and enrichment, and "loading" it into the appropriate store for the system to use. Although ETL originated with data warehouses, the process is also used when ingesting data into big data systems, for example from external sources.
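Here is a toy ETL pipeline in Python to make the three steps concrete. The input file raw_transactions.csv, its column names, and the SQLite table are all hypothetical; this is a sketch of the pattern, not of any particular system.

```python
# Toy ETL: extract rows from a (hypothetical) CSV, transform them by
# cleaning and normalizing, then load them into a local SQLite table.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Cleaning/enrichment: drop rows with no amount, normalize the name.
    return [
        {"customer": r["customer"].strip().title(), "amount": float(r["amount"])}
        for r in rows
        if r.get("amount")
    ]

def load(rows, db_path="transactions.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS tx (customer TEXT, amount REAL)")
    con.executemany("INSERT INTO tx VALUES (:customer, :amount)", rows)
    con.commit()
    con.close()

load(transform(extract("raw_transactions.csv")))  # hypothetical input file
```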

Hadoop: When people think about big data, they immediately think of Hadoop. Hadoop is an open source software framework (its logo is a cute elephant) built around the Hadoop Distributed File System (HDFS); it allows big data to be stored, abstracted, and analyzed across distributed hardware. If you really want to impress someone, tell them about YARN (Yet Another Resource Negotiator), which, as the name suggests, is a resource scheduler. Hats off to the people who come up with these names. The Apache Foundation, which developed Hadoop, is also behind Pig, Hive, and Spark (yes, these are all names of software). Don't these names surprise you?

In-memory computing: It is generally believed that any computation that avoids I/O access will be faster. In-memory computing is a technique that moves the entire working data set into the cluster's collective memory, avoiding writing intermediate results to disk during computation. Apache Spark is an in-memory computing system, which gives it a great advantage over I/O-bound systems such as MapReduce.

Internet of Things (IoT): The latest buzzword is the Internet of Things (IoT). IoT is the interconnection, via the Internet, of computing devices embedded in everyday objects (sensors, wearables, cars, refrigerators, and so on), allowing them to send and receive data. The Internet of Things generates massive amounts of data, which brings many opportunities for big data analysis.

Machine Learning: Machine learning is a method of designing systems that can learn, adjust, and improve based on the data fed to them. Using programmed prediction and statistical algorithms, these systems move ever closer to the "correct" behavior, and they improve further as more data flows into them.

MapReduce: MapReduce can be a bit difficult to grasp, so let me try to explain it. MapReduce is a programming model, and the best way to understand it is to notice that Map and Reduce are two separate steps. In MapReduce, the model first splits a large data set into small pieces (technically these are called "tuples", but I will try to avoid obscure jargon), then distributes those pieces to computers in different locations (that is, the cluster described earlier); this happens in the Map step. The model then collects the individual results and "reduces" them into one whole. MapReduce's data processing model goes hand in hand with the Hadoop distributed file system.
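The idea is easiest to see in a single-machine word count. The sketch below only mimics the programming model in plain Python; a real MapReduce system would spread the map and reduce phases across a cluster.

```python
# Word count in the MapReduce style: "map" emits (word, 1) pairs,
# a shuffle groups them by key, and "reduce" sums the counts per word.
from collections import defaultdict

documents = ["big data is big", "data lakes hold raw data"]

# Map phase: each document independently emits (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)  # e.g. {'big': 2, 'data': 3, ...}
```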

Non-relational database (NoSQL): The name sounds almost like the opposite of "SQL, Structured Query Language", which traditional relational database management systems (RDBMS) require, but NoSQL is better read as "not only SQL". NoSQL actually refers to database management systems designed to handle large amounts of data that have no structure (or "schema"). NoSQL suits big data systems because large-scale unstructured databases need the flexibility and distributed-first design that NoSQL provides.

R language: Could anyone give a programming language a worse name? R is such a language, yet it is a language that works very well for statistical work. If you don't know R, don't call yourself a data scientist, because R is one of the most popular programming languages in data science.

Spark (Apache Spark): Apache Spark is a fast, in-memory data processing engine that can efficiently execute streaming, machine learning, and SQL workloads that require fast iterative access to data sets. Spark is usually much faster than the MapReduce we discussed earlier.
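For a feel of what Spark code looks like, here is a minimal PySpark sketch, assuming pyspark is installed and a local Spark runtime is available; the data set and column names are invented.

```python
# Minimal PySpark example: build a tiny DataFrame and aggregate it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

df = spark.createDataFrame(
    [("food", 120.0), ("clothing", 80.0), ("food", 45.5)],
    ["category", "amount"],
)
df.groupBy("category").sum("amount").show()  # total spend per category

spark.stop()
```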

Stream processing: Stream processing is designed for continuous processing of streaming data. Combined with streaming analytics (the ability to continuously compute numerical and statistical analyses), stream processing methods are particularly capable of processing large-scale data in real time.

Structured v Unstructured Data: This is one of the comparisons in big data. Structured data is basically any data that can be placed in a relational database, organized in such a way that it can be related to other data through tables. Unstructured data refers to any data that cannot be placed in a relational database, such as email messages, statuses on social media, and human speech.

Part two (50 terms)

This part is a continuation of the previous one. Given the overwhelming response to the first part, I decided to introduce 50 more related terms. Here is a brief recap of the terms covered in part one: algorithm, analytics, descriptive analytics, prescriptive analytics, predictive analytics, batch processing, Cassandra (a large-scale distributed data storage system), cloud computing, cluster computing, dark data, data lake, data mining, data scientist, distributed file system, ETL, Hadoop (a software platform for developing and running large-scale data processing), in-memory computing, Internet of Things, machine learning, MapReduce (a core component of Hadoop 1), NoSQL (non-relational databases), R, Spark (a computing engine), stream processing, and structured vs. unstructured data.

Let's move on to another 50 big data terms.

Apache Software Foundation (ASF): Provides many open source projects for big data, currently more than 350 of them. Explaining all of these projects would take a great deal of time, so I have chosen to explain only a few popular terms.

Apache Kafka: Named after the Czech writer Franz Kafka, Kafka is used for building real-time data pipelines and streaming applications. What makes it popular is its ability to store, manage, and process data streams in a fault-tolerant way, and it is said to be very "fast". Given how heavily social networking environments revolve around processing data streams, Kafka is currently very popular.

Apache Mahout: Mahout provides a library of pre-made algorithms for machine learning and data mining, and can also be used as an environment for creating additional algorithms. In other words, the best environment for machine learning geeks.

Apache Oozie: In any programming environment, you need a workflow system to schedule and run jobs in a predefined way with defined dependencies. Oozie provides exactly that for big data jobs written in languages such as Pig, MapReduce, and Hive.

Apache Drill, Apache Impala, Apache Spark SQL: These three open source projects all provide fast, interactive SQL, for example over Apache Hadoop data. They are useful if you already know SQL and your data is stored in big data formats (e.g., HBase or HDFS). Sorry, that last sentence is a little strange.

Apache Hive: Do you know SQL? If you know it, you will be ready to get started with Hive. Hive facilitates reading, writing, and managing large data sets residing in distributed storage using SQL.

Apache Pig: Pig is a platform for creating, querying, and executing routines over large distributed data sets. Its scripting language is called Pig Latin (I'm not making that up, trust me). Pig is said to be easy to understand and learn. But I wonder how much there really is to learn.

Apache Sqoop: A tool for moving data from Hadoop to non-Hadoop data stores such as data warehouses and relational databases.

Apache Storm: A free and open source real-time distributed computing system. It makes it easier to process unstructured data continuously in real time, while Hadoop handles batch processing.

Artificial Intelligence (AI): Why does AI appear here? Isn't it a separate field, you may ask? All of these technology trends are closely connected, so we'd best sit tight and keep learning, right? AI develops intelligent machines and software through a combination of hardware and software that can perceive the environment, take the necessary actions when needed, and keep learning from those actions. Sounds a lot like machine learning, doesn't it? Join me in being "confused".

Behavioral Analytics: Have you ever wondered how Google manages to serve ads for exactly the products and services you need? Behavioral analytics focuses on understanding what consumers and applications do, and how and why they act in certain ways. It involves understanding our browsing patterns, social media interactions, and online shopping activity (shopping carts and so on), connecting these seemingly unrelated data points, and trying to predict outcomes. As an example, after I found a hotel and emptied my shopping cart, I received a call from the resort's vacation line. Need I say more?

Brontobytes: A 1 followed by 27 zeros, a unit of storage for the digital world of the future. Around here we talk about terabytes, petabytes, exabytes, zettabytes, yottabytes, and brontobytes. You must read this article to understand these terms in depth.

Business Intelligence: I will reuse Gartner's definition of BI because it explains it well. Business intelligence is an umbrella term that includes applications, infrastructure, tools, and best practices for accessing and analyzing information to improve and optimize decision-making and performance.

Biometrics: This is the combination of James Bond-style technology with analytics to identify people by one or more of their physical characteristics, such as facial recognition, iris recognition, fingerprint recognition, and so on.

Clickstream analytics: Used to analyze online click data when users browse on the Internet. Ever wonder why certain Google ads linger even when switching websites? Because the guys at Google know what you’re clicking on.

Cluster Analysis: An exploratory analysis that attempts to identify structure within the data; it is also known as segmentation analysis or classification analysis. More specifically, it tries to identify homogeneous groups of cases (i.e., observations, participants, respondents). When the groups are not known in advance, cluster analysis is used to discover them. Because it is exploratory, it does not distinguish between dependent and independent variables. The different cluster analysis methods provided by SPSS can handle binary, nominal, ordinal, and scale (interval or ratio) data.
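As a small illustration, here is a k-means clustering sketch using scikit-learn, one of many possible clustering methods (the SPSS procedures mentioned above are another route); the two-dimensional points are invented for illustration.

```python
# K-means cluster analysis on a handful of made-up 2-D observations.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # which group each observation falls into
print(kmeans.cluster_centers_)  # the centre of each discovered group
```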

Comparative Analytics: Because the key to big data lies in analysis, I will explain the meaning of analysis in depth in this article. As the name suggests, comparative analysis is the use of statistical techniques such as pattern analysis, filtering, and decision tree analysis to compare multiple processes, data sets, or other objects. I know it's getting less and less technical, but I still can't completely avoid the jargon. Comparative analysis can be used in the healthcare field to give more effective and accurate medical diagnoses by comparing large amounts of medical records, documents, images, etc.

Connection Analytics: You have probably seen spider-web-like graphs connecting people to topics in order to identify the influencers on a specific subject. Connection analytics helps discover the connections and influence between people, products, and systems within a network, or even between combined data across multiple networks.

Data Analyst: Data analyst is a very important and popular job, responsible for collecting, editing, and analyzing data as well as preparing reports. I will write a more detailed article about data analysts.

Data Cleansing: As the name suggests, data cleansing involves detecting and then correcting or removing inaccurate data or records from a database; the offending records are the "dirty data" described below. With the help of automated or manual tools and algorithms, data analysts can correct and further enrich data to improve its quality. Remember, dirty data leads to faulty analysis and poor decision-making.
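A brief sketch of what cleansing looks like in practice, using pandas; the tiny table, its column names, and the cleaning rules are all invented for illustration.

```python
# Data cleansing with pandas: normalize names, drop impossible values,
# and remove duplicate records from a tiny made-up table.
import pandas as pd

df = pd.DataFrame({
    "customer": ["Alice", "alice ", "Bob", "Bob", "Carol"],
    "age": [34, 34, -1, 41, 29],
})

df["customer"] = df["customer"].str.strip().str.title()  # fix inconsistent names
df = df[df["age"] > 0]                                    # drop impossible ages
df = df.drop_duplicates()                                 # remove exact repeats
print(df)
```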

Data as a Service (DaaS): We have Software as a Service (SaaS), Platform as a Service (PaaS), and now we have DaaS, which means: Data as a Service. By providing users with on-demand access to cloud data, DaaS providers can help us obtain high-quality data quickly.

Data virtualization: This is a data management method that allows an application to extract and manipulate data without knowing the technical details (such as where the data is stored and in what format). For example, social networks use this method to store our photos.

Dirty Data: Since big data is so attractive, people have also begun to add other adjectives to data to form new terms, such as dark data, dirty data, small data, and now Smart data. Dirty data is data that is not clean, in other words, data that is inaccurate, duplicated, and inconsistent. Obviously, you don't want to mess around with dirty data. So, fix it as soon as possible.

Fuzzy logic: How often are we 100% certain about anything? Very rarely! Our brains aggregate data into partial truths, which are further abstracted into thresholds that drive our decisions. Fuzzy logic is a kind of computing that, unlike the hard "0"s and "1"s of Boolean algebra, aims to mimic the human brain by working with partial degrees of truth.
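A tiny sketch of the idea: instead of a hard yes/no, a membership function assigns a degree of truth between 0 and 1. The thresholds below are arbitrary and purely illustrative.

```python
# Fuzzy membership: the degree (0..1) to which a temperature counts as "warm".
def warm_membership(temp_c):
    if temp_c <= 10:
        return 0.0
    if temp_c >= 25:
        return 1.0
    return (temp_c - 10) / 15  # linear ramp between the two thresholds

for t in (5, 15, 20, 30):
    print(t, "C ->", round(warm_membership(t), 2))
```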

Gamification: In a typical game you have elements such as scores, competition with others, and clear rules of play. Gamification in big data uses these concepts to create incentives for collecting and analyzing data, or for motivating participants.

Graph Databases: Graph databases use concepts such as nodes and edges to represent people and businesses and the relationships between them, and they are used to mine data from social media. Ever marveled at how Amazon tells you what other people bought when you are buying a product? Yes, that is a graph database at work.

Hadoop User Experience (Hue): Hue is an open source interface that makes working with Apache Hadoop easier. It is a web-based application with a file browser for the distributed file system, a job designer for MapReduce, Oozie (a framework for scheduling workflows), a shell, user interfaces for Impala and Hive, and a set of Hadoop APIs.

High-Performance Analytic Appliance (HANA): A software and hardware in-memory platform designed by SAP for high-volume data transactions and analytics.

HBase: A distributed, column-oriented database. It uses HDFS as its underlying storage and supports both batch-style computation using MapReduce and transactional interactive queries.

Load balancing: Distributing workload across multiple computers or servers in order to achieve optimal results and full utilization of the system.

Metadata: Metadata is data that can describe other data. Metadata summarizes basic information about the data, which makes it easier to find and use specific instances of the data. For example, author, creation date, modification date, and size of the data are basic document metadata. In addition to document files, metadata is used for images, videos, spreadsheets, and web pages.
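As a quick illustration of "data about data", the Python snippet below asks the file system for the metadata of a document rather than its contents; the file name report.pdf is hypothetical.

```python
# Reading basic file metadata (size, last-modified time) for a document.
import os
from datetime import datetime

path = "report.pdf"  # hypothetical file
info = os.stat(path)

print("size in bytes:", info.st_size)
print("last modified:", datetime.fromtimestamp(info.st_mtime))
```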

MongoDB: MongoDB is a cross-platform, open source database built around a document-oriented data model rather than the traditional table-based relational model. This kind of database structure is designed mainly to make integrating structured and unstructured data faster and easier in certain types of applications.
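Here is a short sketch using pymongo, assuming a MongoDB server is running locally; the database name, collection, and documents are invented. Note how two documents with different shapes can sit in the same collection.

```python
# Inserting and querying schema-flexible documents with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["demo_db"]["users"]

users.insert_one({"name": "Alice", "interests": ["hiking", "ml"]})
users.insert_one({"name": "Bob", "city": "Beijing"})  # a different shape

print(users.find_one({"name": "Alice"}))
client.close()
```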

Mashup: Fortunately, this term has a similar meaning to the word "mashup" we use in daily life, which means mix and match. Essentially, a mashup is a method of merging disparate data sets into a single application (e.g., combining real estate data with geolocation data, demographic data). This really makes visualization cool.

Multi-Dimensional Databases: A database optimized for online analytical processing (OLAP) and data warehousing. If you don't know what a data warehouse is, it is nothing more than a centralized store for data from multiple data sources.

MultiValue Databases: A MultiValue database is a non-relational database that can understand three-dimensional data directly, which makes it good at manipulating HTML and XML strings directly.

Natural Language Processing: Natural language processing refers to software algorithms designed to let computers understand everyday human language more accurately, allowing humans to interact with computers more naturally and effectively.

Neural Network: According to this description (http://neuralnetworksanddeeplearning.com/), neural networks are a very beautiful programming paradigm inspired by biology that enable computers to learn from observed data. It's been a long time since anyone called a programming paradigm beautiful. In fact, neural networks are models inspired by real-life brain biology... A term closely associated with neural networks is deep learning. Deep learning is a collection of learning techniques in neural networks.

Pattern Recognition: Pattern recognition occurs when an algorithm identifies regularities or recurring patterns within a large data set or across different data sets. It is closely related to machine learning and data mining, and is even considered synonymous with them. This visibility can help researchers discover deep patterns or reach conclusions that would otherwise remain hidden.

Radio Frequency Identification/RFID: Radio frequency identification is a type of sensor that uses non-contact wireless radio frequency electromagnetic fields to transmit data. With the development of the Internet of Things, RFID tags can be embedded in any possible "thing", which can generate a lot of data that needs to be analyzed. Welcome to the world of data.

Software as a Service (SaaS): Software as a Service allows service providers to host applications on the Internet. SaaS providers provide services in the cloud.

Semi-structured data: Semi-structured data is data that is not captured or formatted in conventional ways, such as the fields of a traditional database or a commonly used data model. It is neither completely raw nor completely unstructured; it may contain tables, tags, or other structural elements. Examples of semi-structured data are graphs, tables, XML documents, and emails. Semi-structured data is very common on the World Wide Web and can often be found in object-oriented databases.
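A quick illustration: the JSON record below has tags and nesting but no fixed relational schema, which is what makes it semi-structured; the record itself is invented.

```python
# Parsing a semi-structured JSON record: tagged and nested, but schema-free.
import json

raw = '{"from": "alice@example.com", "subject": "Q3 report", "tags": ["finance", "quarterly"]}'

message = json.loads(raw)
print(message["subject"])
print(", ".join(message["tags"]))
```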

Sentiment Analysis: Sentiment analysis involves capturing, tracking, and analyzing the feelings, emotions, and opinions expressed by consumers in the many kinds of interactions and documents found in social media, customer service phone calls, and surveys. Text analysis and natural language processing are typical techniques in the sentiment analysis process. The goal of sentiment analysis is to identify or assess the attitudes or emotions held toward a company, product, service, person, or event.
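To show the basic idea, here is a deliberately naive lexicon-based scorer; the word lists are tiny and made up, and real systems rely on text analysis and NLP models rather than hand-written word lists.

```python
# Naive lexicon-based sentiment scoring: count positive vs. negative words.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "slow"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this product, the support was excellent"))
print(sentiment("Terrible delivery, really slow service"))
```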

Spatial analysis: Spatial analysis refers to the analysis of spatial data to identify or understand the patterns and rules of data distributed in geometric space. This type of data includes geometric data and topological data.

Stream processing: Stream processing is designed to run real-time, "continuous" queries over streaming data. With the clear need to run real-time numerical computations and statistical analyses over very large volumes of fast-moving data, stream processing is the obvious choice for streaming data from social networks.

Smart Data: It is data that is useful and actionable after some algorithmic processing.

Terabyte: This is a relatively large unit of digital data; 1 TB equals 1,000 GB. It is estimated that 10 TB could hold all the printed material of the Library of Congress, and 1 TB could hold the entire Encyclopaedia Britannica.

Visualization: With good visualizations, raw data can be put to work. Of course, visualization here means more than simple charts; it means complex charts that can incorporate many variables of the data while still remaining readable and understandable.
Yottabytes: Close to 1000 Zettabytes, or 2500 trillion DVDs. All digital storage today is about 1 Yottabyte, and this number doubles every 18 months.
Zettabytes: Close to 1000 Exabytes, or 1 billion Terabytes. 
