Big Data Analysis (taught in English) - Beijing Institute of Technology - XuetangX
1. True/False (2 points)
The item-based collaborative filtering algorithm calculates item similarity according to item features. ( )
2. True/False (2 points)
A database provides the physical storage structure. ( )
3. True/False (2 points)
We can find one kind of tool to deal with all the data management problems of a database. ( )
4. True/False (2 points)
We can find one kind of tool to deal with all the data management problems of big data. ( )
5. True/False (2 points)
The data in HDFS is immutable. ( )
6. True/False (2 points)
In HDFS, each stored file is first divided into multiple data blocks of flexible length according to the data size. ( )
7. True/False (2 points)
Hadoop is the only big data architecture. ( )
8. True/False (2 points)
HDFS supports batch read, write, and update operations. ( )
9. True/False (2 points)
A DFS (distributed file system) provides the logical storage structure of the data. ( )
10. True/False (2 points)
In customer-based collaborative filtering, similar users are defined based on the common items they purchased. ( )
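As an illustration of the idea in question 10, the sketch below scores user similarity from the overlap of purchased items. It assumes Jaccard similarity, which is only one possible measure (cosine or Pearson similarity are common alternatives), and the user names and purchases are hypothetical.

```python
def jaccard(items_a, items_b):
    """Overlap of two purchase sets: |common items| / |items either user bought|."""
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


def most_similar_user(target, purchases):
    """Return the other user whose purchases overlap most with the target's."""
    scores = {
        other: jaccard(purchases[target], items)
        for other, items in purchases.items()
        if other != target
    }
    return max(scores, key=scores.get)


# Hypothetical purchase histories, for illustration only.
purchases = {
    "alice": {"book", "lamp", "pen"},
    "bob": {"book", "pen", "mug"},
    "carol": {"sofa"},
}

nearest = most_similar_user("alice", purchases)
```

Here alice and bob share two of the four items either of them bought, so bob is alice's nearest neighbor.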
11. Open-ended question (10 points)
Recommendation System – matrix decomposition Task:
You are given a dataset collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. It includes 100,000 ratings (1-5) from 943 users on 1682 movies.
You need to use the matrix decomposition method to predict the missing values of the rating matrix to complete a recommendation system so that you can recommend movies to a user based on the predicted ratings.
Submission Requirement:
Submit the following 3 screenshots of the 3-fold cross-validation results of rating prediction in a single PDF file named ID_NAME.PDF
matrix decomposition DATA.rar
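One possible shape for the task above is sketched here: a small matrix factorization trained with stochastic gradient descent and evaluated with 3-fold cross-validation. This is a minimal pure-Python sketch under stated assumptions, not the course's reference solution. The ratings are toy values, loading the provided data archive is omitted, and the hyperparameters (`k`, `lr`, `reg`, `epochs`) are illustrative guesses.

```python
import math
import random


def train_mf(ratings, n_users, n_items, k=2, lr=0.01, reg=0.05, epochs=200, seed=0):
    """Fit rating ~ mu + P[u] . Q[i] by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(pf * qf for pf, qf in zip(P[u], Q[i])))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q


def predict(model, u, i):
    """Predicted rating, clipped to the 1-5 scale."""
    mu, P, Q = model
    raw = mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))
    return min(5.0, max(1.0, raw))


def rmse(model, ratings):
    """Root mean squared error of the model on held-out ratings."""
    sq = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(sq / len(ratings))


def k_fold(ratings, k=3, seed=0):
    """Yield (train, test) splits; each rating is tested exactly once."""
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[j::k] for j in range(k)]
    for j in range(k):
        train = [x for m, fold in enumerate(folds) if m != j for x in fold]
        yield train, folds[j]


# Toy (user, item, rating) triples standing in for the real data set.
ratings = [
    (0, 0, 5), (0, 1, 3), (0, 3, 1), (1, 0, 4), (1, 2, 2), (1, 3, 1),
    (2, 0, 1), (2, 1, 1), (2, 3, 5), (3, 0, 1), (3, 2, 5), (3, 3, 4),
]

scores = []
for train, test in k_fold(ratings, k=3):
    model = train_mf(train, n_users=4, n_items=4)
    scores.append(rmse(model, test))
```

With the real data, `n_users` and `n_items` would be 943 and 1682, and the three values in `scores` are the per-fold RMSE figures that the screenshots would report.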
12. Single choice (2 points)
The two main components of big data are ( ) and ( ). ( )
A. Distributed Storage, Distributed Processing
B. Distributed Collection, Distributed Processing
C. Distributed Collection, Distributed Storage
D. Distributed Collection, Distributed Application
13. Single choice (2 points)
MPP (Massively Parallel Processing) improves performance through ( ) parallelism. ( ) coordinates work with ( ), and ( ) coordinates work with one or more ( ). ( ) process queries in parallel. ( ) have their own CPU, disk, and memory in a shared-nothing architecture. A high-speed interconnect supports continuous pipelining of data processing. ( )
A. segment hosts, Master, segment host, Segment host, segment instances, Segment instances, Segment hosts
B. segment instance, Master, segment host, Segment host, segment instances, Segment instances, Segment hosts
C. segment instance, Master, segment host, Segment host, segment instances, Segment instances, Segment instances
D. segment hosts, Master, segment host, Segment host, segment instances, Segment hosts, Segment hosts
14. Single choice (2 points)
Which of the following descriptions of the search interface of the deep web is NOT correct? ( )
A. has complex interfaces
B. supports queries on several attributes
C. extracts contents from databases
D. easy to find
15. Single choice (2 points)
The historical progression of harnessing data is: ( )
1) ( ): reporting and human analysis can be performed on historical data
2) ( ): current data can be analyzed to improve business transactions
3) ( ): real-time analytics processing supports real-time decisions and improves real-time business response
A. OLAP: Online Analytical Processing; OLTP: Online Transaction Processing; RTAP: Real-Time Analytics Processing
B. OLTP: Online Transaction Processing; OLAP: Online Analytical Processing; RTAP: Real-Time Analytics Processing
C. OLAP: Online Analytical Processing; RTAP: Real-Time Analytics Processing; OLTP: Online Transaction Processing
D. OLTP: Online Transaction Processing; RTAP: Real-Time Analytics Processing; OLAP: Online Analytical Processing
16. Single choice (2 points)
Based on the requirements, we can build the business model, which includes ( ) and ( ). ( )
A. Conceptual model, Logical model
B. Logical model, Physical model
C. Process model, Data model
D. Process model, Logical model
17. Single choice (2 points)
Which of the following big data characteristics best describes Data in Many Forms? ( )
A. Volume
B. Variety
C. Veracity
D. Velocity
18. Single choice (2 points)
The most often used internal data acquisition tool is ( )
A. Data warehouse
B. ETL (Extract, Transform, Load)
C. Data trigger
D. Incremental data extraction
19. Single choice (2 points)
Deep web content includes ( )
1. Pages that search engines do not reference because they lack direct links.
2. Non-web files accessible on the web, such as picture files, PDF and Word documents, etc.
3. Dynamic pages obtained by querying a back-end online database through a filled-in form.
4. Content that requires registration or is otherwise restricted.
A. 1234
B. 124
C. 123
D. 234
20. Single choice (2 points)
TensorFlow allows developers to create ( ), structures that describe how data moves through a ( ), or a series of processing nodes. Each node in the graph represents a ( ), and each connection or edge between nodes is a ( ). ( )
A. Dataflow graphs, Graph (DAG), multidimensional data array or tensor, mathematical operation
B. Graph (DAG), Dataflow graphs, mathematical operation, multidimensional data array or tensor
C. Dataflow graphs, Graph (DAG), mathematical operation, multidimensional data array or tensor
D. Graph (DAG), Dataflow graphs, mathematical operation, multidimensional data array or tensor
21. Single choice (2 points)
Database connection programming interfaces such as ( ) can support SQL access by applications to the database, but they cannot provide complex functions such as transaction management, concurrent scheduling, buffer management, heterogeneous database conversion and inheritance in a distributed computing environment. This introduces the ( ). It is a layer of software that provides data exchange functions on top of the database. When the system is extended and needs to access cross-platform heterogeneous databases, the OS could be UNIX, Linux, or Windows; the data could take the form of mails, XML documents, EJB components, Web services, images, audio/video files, or other unstructured data; and the technologies of the big data application layer are also diverse and follow various standards. The design of the ( ) needs to be compatible with various standard technologies and products, which introduces the ( ). ( )
A. ODBC and JDBC; DAL (data access layer); Unified data access interface; Unified data access interface
B. ODBC and JDBC; DAL (data access layer); DAL (data access layer); Unified data access interface
C. DAL (data access layer); ODBC and JDBC; DAL (data access layer); Unified data access interface
D. ODBC and JDBC; DAL (data access layer); Unified data access interface; DAL (data access layer)
22. Single choice (2 points)
Which of the following is NOT a dimensionality reduction method? ( )
A. Wavelet transformation
B. Attribute subset selection
C. Principal component analysis
D. Data cube aggregation
23. Single choice (2 points)
The ( ) annotation transparently translates your Python programs into TensorFlow graphs. ( )
A. tf.keras
B. tf.function
C. Premade Estimators
D. tf.data
24. Single choice (2 points)
The web crawler crawling process is (B)
a) Start from a list of uniform resource addresses called seed URLs and use it as the link entry for crawling. When the crawler visits these seed URLs, it identifies all the needed links on the page and adds them to the queue to be crawled.
b) Put the already downloaded URLs into the crawled URL list.
c) Extract the new URLs and put them in the to-be-crawled URL queue according to the crawling strategy.
d) Take webpage links out of the queue to be crawled, read each URL, do the DNS resolution, and download the web pages into the downloaded web library.
e) The whole process ends when the queue for crawling is empty.
A. abcde
B. adbce
C. acbde
D. adcbe
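The crawl loop described in question 24 can be sketched as follows. This is a minimal illustration that assumes a breadth-first (FIFO) strategy and replaces real downloading and DNS resolution with a lookup in a hypothetical in-memory link graph, `WEB`; the `fetch_links` callback is likewise an assumption made for the example.

```python
from collections import deque


def crawl(seed_urls, fetch_links):
    """Visit every URL reachable from the seeds, each exactly once."""
    to_crawl = deque(seed_urls)      # (a) seed URLs are the entry points
    seen = set(seed_urls)            # URLs already queued or crawled
    crawled = []
    while to_crawl:                  # (e) stop when the queue is empty
        url = to_crawl.popleft()     # (d) take the next URL and "download" it
        links = fetch_links(url)
        crawled.append(url)          # (b) record the URL as crawled
        for link in links:           # (c) queue newly discovered URLs
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled


# Hypothetical link graph standing in for the web: page -> outgoing links.
WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}

order = crawl(["a"], lambda url: WEB.get(url, []))
```

Swapping `popleft()` for `pop()` would turn the same loop into a depth-first crawl, which is how the choice of strategy enters step (c).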
25. Single choice (2 points)
Which of the following statements about data reduction is NOT right? ( )
A. Data reduction technology is used to help obtain a condensed data set from the original huge data set while keeping the integrity of the original data set.
B. Data analysis on the condensed data set is obviously more efficient, and the results of the analysis are basically the same as those obtained by using the original data set.
C. The time spent on data reduction could exceed or "offset" the time saved by analysis on the reduced data.
D. The data obtained by the reduction is much smaller than the original data, but can produce the same or almost the same analysis results.
26. Single choice (2 points)
The execution model is based on the BSP (Bulk Synchronous Parallel) model. In this model, there are multiple processing units proceeding in parallel in a sequence of "supersteps". Within each "superstep", the processing sequence will be ( )
a) each processing unit first receives all messages delivered to it from the preceding "superstep",
b) when all the processing units finish the message delivery (hence the synchronization point),
c) it may queue up the messages that it intends to send to other processing units,
d) the queued-up messages will be delivered to the destined processing units but won't be seen until the next "superstep",
e) it manipulates its local data,
f) the next superstep can be started,
g) the cycle repeats until the termination condition has been reached.
A. aedcbfg
B. aecdbfg
C. acedbfg
D. adecbfg
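To make the superstep idea in question 26 concrete, here is a small single-process simulation in which every unit ends up holding the largest value in a connected graph. It is a sketch only: the graph, the values, and the termination rule (stop when no messages are in flight) are assumptions chosen for the example, and a real BSP system would run the units in parallel.

```python
def bsp_max(values, neighbors):
    """Simulate BSP supersteps until every unit holds the global maximum."""
    values = dict(values)
    inbox = {unit: [] for unit in values}   # messages visible in this superstep
    superstep = 0
    while True:
        outbox = {unit: [] for unit in values}
        for unit in values:
            received = inbox[unit]          # messages sent in the previous superstep
            new_value = max([values[unit]] + received)
            changed = new_value != values[unit]
            values[unit] = new_value        # update local data
            if superstep == 0 or changed:   # queue messages for the neighbors
                for other in neighbors[unit]:
                    outbox[other].append(new_value)
        # Barrier: all units are done, so queued messages are delivered and
        # become visible only in the next superstep.
        inbox = outbox
        superstep += 1
        if not any(inbox.values()):         # termination: no messages in flight
            return values, superstep


# Hypothetical chain of four units: 0 - 1 - 2 - 3.
start = {0: 3, 1: 6, 2: 2, 3: 1}
links = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

final, steps = bsp_max(start, links)
```

The key property the sketch preserves is that a message queued in one superstep is never read in the same superstep, only after the barrier.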
27. Single choice (2 points)
( ) extracts new or modified data in the database since the last extraction; at the same time, it normally does not have a big impact on the running business system. ( )
A. Incremental data extraction
B. Full extraction
C. Timestamp extraction
D. Trigger
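Extracting only what changed since the last run, as described in question 27, can be sketched with a stored watermark. The example assumes each row carries an `updated_at` value, which is one common mechanism but not the only one; the `orders` table and its fields are hypothetical.

```python
def extract_incremental(rows, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    changed = [row for row in rows if row["updated_at"] > last_watermark]
    new_watermark = max(
        (row["updated_at"] for row in changed), default=last_watermark
    )
    return changed, new_watermark


# Hypothetical source table; updated_at is a plain integer clock for simplicity.
orders = [
    {"id": 1, "amount": 10, "updated_at": 100},
    {"id": 2, "amount": 25, "updated_at": 105},
    {"id": 3, "amount": 40, "updated_at": 112},
]

first_batch, watermark = extract_incremental(orders, last_watermark=100)

# A later change to row 1 is picked up by the next run; untouched rows are skipped.
orders[0].update(amount=12, updated_at=120)
second_batch, watermark = extract_incremental(orders, watermark)
```

Each run reads only the rows newer than the stored watermark, which is why this style of extraction puts little load on the running source system.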
28. Single choice (2 points)
HANA improved data analysis performance in the data warehouse, but NOT because ( )
A. It eliminates unnecessary complexity and latency.
B. It accelerates through simplification.
C. Leveraging the power of in-memory computing allows HANA to bring OLTP (transaction processing) and OLAP (data analytics) back together in one database.
D. Specialized data warehouses for reporting and analytics required the moving, transformation, and pre-processing of transactional data, which introduces huge complexity: sometimes an enterprise may hold three different copies of the same data.
29. Single choice (2 points)
The process begins with the ( ) issuing a query that is then passed to the ( ). The ( ) contains information, such as the data dictionary and session information, which it uses to generate an ( ) designed to retrieve the needed information from each underlying node. Parallel execution represents the implementation of the ( ) through the parallel computing of Node 1 to Node n. The query results are then returned to the master node. ( )
A. Client, Master Node, Master Node, execution plan, execution plan
B. Master Node, Client, Master Node, execution plan, storing plan
C. Client, Master Node, Client, execution plan, execution plan
D. Master Node, Client, Master Node, execution plan, storing plan
30. Single choice (2 points)
Among the many components of Spark, which is designed for machine learning? ( )
A. Spark SQL
B. Spark Streaming
C. MLlib
D. GraphX
31. Single choice (2 points)
Which description of Jim Gray is NOT correct? ( )
A. Relational database founder
B. Nautical sport enthusiast
C. Divided scientific research into four types of paradigms
D. Big data scientist
32. Single choice (2 points)
The right order of reading data in HDFS is ( )
a) DistributedFileSystem makes an RPC call to the namenode to determine the locations of the datanodes where the file is stored in the form of blocks. For each block, the namenode returns the addresses of the datanodes (metadata of blocks and datanodes) that have a copy of the block. Datanodes are sorted according to proximity (depending on network topology information).
b) The client opens the file by calling the open() method on DistributedFileSystem.
c) The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file.
d) DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
e) Data is streamed from the datanode back to the client (in the form of packets), and read() is repeatedly called on the stream by the client.
f) When the client has finished reading, it calls close() on the FSDataInputStream.
g) When the end of the block is reached, DFSInputStream closes the connection to the datanode, then finds the best datanode for the next block.
4-2-2.jpg
A. ABDCEGF
B. BADCEGF
C. BADCEFG
D. BACDEGF
33. Single choice (2 points)
Which of the following stages is the main cause of big data? ( )
A. Operation and business system
B. User-generated content
C. Perception stage
D. Social media
34. Single choice (2 points)
Regarding the descriptions of data modeling design levels, which matching is correct? ( C )
1) Based on the user's data and functional requirements, the functions and association relationships are obtained, along with the entity classes corresponding to the business elements and functions.
2) More details of the data entities, including primary keys, foreign keys, attributes, indexes, relationships, constraints, and even views, described in forms such as data tables, data columns, value ranges, object-oriented classes, and XML tags.
3) The storage implementation of the data, including data partitions, data table spaces, and data integration.
A. 1: Conceptual model design; 2: Physical model design; 3: Logical model design
B. 1: Logical model design; 2: Physical model design; 3: Conceptual model design
C. 1: Conceptual model design; 2: Logical model design; 3: Physical model design
D. 1: Physical model design; 2: Conceptual model design; 3: Logical model design
35. Single choice (2 points)
Spark has several components to facilitate different types of computing tasks, such as streaming and graph processing. The components include ( )
1) Spark Core API  2) Resilient distributed dataset (RDD)
3) Spark SQL  4) Spark topology
5) Spark Streaming  6) MLlib (Machine Learning Library)
7) GraphX  8) Sklearn
A. 12345
B. 13456
C. 13567
D. 13578
36. Single choice (2 points)
The correct big data lifecycle is ( )
A. data governance, data collecting, data storing and data analyzing
B. data collecting, data governance, data storing and data analyzing
C. data collecting, data storing, data governance and data analyzing
D. data collecting, data storing, data analyzing and data governance
37. Single choice (2 points)
Data cleaning technology does NOT include ( )
A. Data transformation
B. Cleaning of missing data
C. Deduplication of data
D. Anomaly detection on the data set
38. Single choice (2 points)
Which of the following is NOT a numerosity reduction method? ( )
A. Principal component analysis
B. Data cube aggregation
C. Clustering
D. Sampling
39. Single choice (2 points)
In the execution of graph parallel computing, which of the following describe the roles of the master? ( )
1) coordinates the execution of supersteps in sequence
2) signals the beginning of a new superstep to all workers after knowing that all of them have completed the previous one
3) pings each worker to learn its processing status
4) periodically issues a "checkpoint" command to all workers, which then save their partitions to a persistent graph store
A. 123
B. 134
C. 124
D. 1234
40. Single choice (2 points)
Which of the following is about idea, learning, notion, concept, synthesized, compared, thought-out, discussed? ( )
A. Data
B. Information
C. Wisdom
D. Knowledge
41. Single choice (2 points)
Which of the following is a shared-nothing architecture? ( )
A. SMP
B. NUMA
C. MPP
D. None of them
42. Single choice (2 points)
There are only 2 kinds of operations on an RDD (Resilient Distributed Dataset): ( ). In ( ), data can be filtered, joined, mapped, and reduced, but no calculation is executed; only in ( ) is the calculation done and the result value generated. ( )
A. map and reduce, map, reduce
B. transformations and actions, action, transformation
C. transformations and actions, transformation, action
D. map and reduce, reduce, map
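The deferred execution that question 42 refers to can be mimicked in plain Python. The toy `LazyDataset` class below is hypothetical and is not the Spark API: it only records the requested steps and runs them when a result is asked for, which is the behavior the question asks about.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy stand-in for an RDD: steps are recorded first and run on demand."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Recording steps: each call returns a new dataset and computes nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Producing results: these calls execute the recorded steps.
    def collect(self):
        return list(self._run())

    def reduce(self, fn):
        return fold(fn, self._run())


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


pipeline = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x % 2 == 0)
calls_before = list(calls)   # still empty: nothing has been computed yet
evens = pipeline.collect()   # runs the recorded steps
total = pipeline.reduce(lambda a, b: a + b)
```

Building `pipeline` touches no data at all; `square` runs only once `collect()` or `reduce()` is called, and it runs again for each such call because nothing is cached in this sketch.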
43. Single choice (2 points)
Which attribute subset selection method is shown in the diagram? ( )
A. Forward stepwise attribute subset selection
B. Backward stepwise attribute subset selection
C. Combination of forward selection and backward deletion
D. Decision tree induction
44. Single choice (2 points)
Relational databases and NoSQL databases each have their own advantages and disadvantages and cannot replace each other. ( ) application scenarios: key business systems in telecommunications, banking, and other fields that need to ensure strong transaction consistency. ( ) application scenarios: non-critical business (such as data analysis) of Internet companies and traditional companies. ( )
A. NoSQL database; Relational database
B. Relational database; NoSQL database
C. NoSQL database; NoSQL database
D. Relational database; Relational database
45. Single choice (2 points)
( ) is a user-friendly API standard for machine learning and will be the central high-level API used to build and train models. ( )
A. SavedModel
B. TensorFlow Hub
C. Premade Estimators
D. tf.keras
46. Single choice (2 points)
Which of the following are attribute subset selection methods? ( C )
1) Forward stepwise attribute subset selection
2) Backward stepwise attribute subset selection
3) Combination of forward selection and backward deletion
4) Principal component analysis
5) Reduction based on statistical analysis
6) Decision tree induction
A. 12346
B. 12345
C. 12356
D. 123456
47. Single choice (2 points)
According to Gartner, an estimated 20% of an organization's data is ( ) data, and the remaining majority is ( ) data. ( )
A. structured, unstructured
B. unstructured, structured
C. structured, semi-structured
D. unstructured, semi-structured
48. Single choice (2 points)
In the following picture, what are the right terms for each number? ( )
test 1-6.jpg
A. Data sources, Data storage, Data collection, Data processing, Data visualization, Report monitoring
B. Data sources, Data collection, Data storage, Data visualization, Data processing, Report monitoring
C. Data sources, Data collection, Data storage, Data processing, Data visualization, Report monitoring
D. Data sources, Data collection, Data storage, Data processing, Report monitoring, Data visualization
49. Single choice (2 points)
How to deal with the fan-out URLs of seed URLs, that is, the links of a link, is a matter of web crawler crawling strategy. Which one is NOT an often-used crawling strategy? ( )
A. Depth first
B. Breadth first
C. First In, First Out
D. Partial PageRank strategy
50. Single choice (2 points)
( ) is responsible for resource monitoring and job scheduling. ( ) monitors the health status of all ( ) and jobs, and if it finds a failure, it transfers the corresponding tasks to other nodes. ( ) tracks the task execution progress, resource usage, and other information, and informs the ( ), and the ( ) selects the appropriate task to use these resources when resources become free. ( )
A. JobTracker, JobTracker, TaskTrackers, JobTracker, TaskScheduler, TaskScheduler
B. JobTracker, TaskTrackers, JobTracker, JobTracker, TaskScheduler, TaskScheduler
C. JobTracker, JobTracker, JobTracker, TaskTrackers, TaskScheduler, TaskScheduler
D. JobTracker, JobTracker, TaskTrackers, TaskScheduler, JobTracker, TaskScheduler
51. Single choice (2 points)
Which of the following is NOT a data transformation component? ( )
A. Field mapping
B. Data calculation
C. Data split
D. Eliminate duplication