Big Data Management and Data Quality: Countermeasures in the U.S. Financial Industry
汪时奇 (Ph.D.)
Processing speed, capacity limits, data quality

Overview
    Data = information (not merely a collection of numbers)
    Data science ≈ information science
    Why study big data?
        Prices of related products (e.g. hard disks, memory, CPUs) are falling exponentially
        Information is exploding
        Big data creates many new problems
    Big data research is multidisciplinary (IT, DM, BI, BA, ...)
    Industry countermeasures for big data problems (see below)

1. Database strategy
    1.1 Database (DB) performance
    1.2 DB space

1.1 DB performance
    Auditing 2 tables: a small active )

1.2 DB space
    Space arrangement for even distribution (e.g. 1 huge table uses a few data files)
    Cleaning procedure with defragmentation
    Partition design with a cleaning plan

2. Applications (software) (Java example)
    Use an advanced language (e.g. Java or C#)
    2.1 Memory
    2.2 Disk/network space
    2.3 Performance
    2.4 Maintainability

2.1 Memory
    Minimize creation and coexistence of big objects
    GC (garbage collection), or null big objects once they are out of scope
        Choose an appropriate GC type
        gc()
    Try to split one big object into small objects
    Use a mutable class for frequently changed big objects (e.g. StringBuilder instead of String)

2.2 Disk/network space
    Smart clean and archive processes
        e.g. archive zipped old or unused files to low-speed network space and delete very old files from that space
    Smart logging settings
        e.g. log4j size-based rolling
        e.g. avoid duplicated or trivial logging info
    Monitor space usage
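The 2.1 advice to prefer a mutable class such as StringBuilder over String can be illustrated with a minimal sketch. The class name `RowJoiner` and the method `join` are hypothetical names chosen for this example and do not come from the slides. Because String is immutable, each concatenation inside a loop creates a new object; a single StringBuilder grows one buffer instead.

```java
import java.util.List;

public class RowJoiner {
    /**
     * Joins fields into one delimited row using a single mutable buffer.
     * Hypothetical helper for illustration only.
     */
    public static String join(List<String> fields, char delimiter) {
        // One buffer is reused for every append; no intermediate String
        // objects are created per loop iteration.
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                row.append(delimiter);
            }
            row.append(fields.get(i));
        }
        // A single immutable String is produced at the very end.
        return row.toString();
    }
}
```

A call such as `RowJoiner.join(List.of("a", "b", "c"), ',')` builds the row in one buffer. The saving matters most when the loop is large and the resulting text is big.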
2.3 Performance
    Avoid redundant processing (in big loops)
    Maximize reuse
    Multi-threading
    DB access
    Logging: avoid slow options (e.g. line numbers)

2.4 Maintainability
    SOA principles
        Loose coupling, reusability, granularity, modularity, composability, componentization, interoperability, ...
    JEE patterns (DAO, DTO, Business Delegate, ...)
    Design patterns (23) and MVC
        Creational
        Structural
        Behavioral (e.g. Visitor)
    OOP principles
        Abstraction, encapsulation, polymorphism, Open/Closed

3. Data quality control
    3.1 Business
    3.2 Process
        A. Failover & DR (Disaster Recovery)
        B. QA (Quality Assurance) (see for details)
        C. UAT (User Acceptance Test)
    3.3 Technology

3.1 Business
    A. Reduce manual work; increase automation
    B. Complete the approval system for manual work
        e.g. raise 1-level approval to 2- or 3-level approval
    C. Extend viewpoints to confirm data quality
    D. Reduce redundant systems (e.g. due to mergers, due to vendors)
    E. Schedule Cleansing (see details)
    F. Enhance Reconciliation (see details)
    G. Build Trust level (see details)
    H. Try to cover all rare cases

3.1.E Cleansing
    When
        At system merge
        At major change
    How
        Develop detection applications
        Deliver mismatch reports to IT and business
        Find solutions on both the IT and business sides

3.1.F Reconciliation
    Where
        1+ subsystems have data for the same content
        1+ subsystems have independent data-change functionality
    What
        Run and improve the reconciliation application routinely
        Categorize reports by urgency
        Analyze reports
        Debug, adjust business rules, or apply Cleansing

3.1.G Trust level
    When
        At 1+ fixed data inputs
        Inputs are independent
        Must decide final details from inputs
    How (based on)
        Provider level (for a detailed data group)
        Data history
    Samples: Bloomberg, Reuters, Telekurs, DTCC, ...; Moody's, S&P, Fitch

3.2.A Failover & DR
    Failover
        DB: 2+ at different locations; real-time replication
        App
            Active-Active: cluster with load balancing
            Active-Passive
                Auto (via SAN)
                Manual + Auto
    DR
        DB: e.g. daily, hourly, or real-time replication
        App: manual switch

3.3 Technology
    DB design
        Constraint check (for sensitive table values)
        Normalization (to reduce duplication)
        Validation processes (to find conflicting data)
    Application design
        Data integrity check
            e.g. cryptographic signature
            e.g. CRC check
        Data display (e.g. Excel dropping leading zeros, dates shown as numbers)
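The CRC check listed under application design can be sketched as follows. This is a minimal illustration, assuming records travel as UTF-8 text and the sender ships a CRC-32 value alongside each record; the class `RecordChecksum` and its methods are hypothetical names, not part of the original slides. It relies only on the standard `java.util.zip.CRC32` class.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class RecordChecksum {
    /** Computes the CRC-32 of a text record (assumed to be UTF-8 encoded). */
    public static long crc32Of(String payload) {
        CRC32 crc = new CRC32();
        crc.update(payload.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Returns true when the payload still matches the checksum sent with it. */
    public static boolean isIntact(String payload, long expectedCrc) {
        return crc32Of(payload) == expectedCrc;
    }
}
```

A receiver would call `isIntact(record, crcFromSender)` before loading the record. A CRC only detects accidental corruption; protection against deliberate tampering needs the cryptographic signature option instead.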