Operation and Monitoring Guide



Audience

This document provides operation and monitoring instructions for the Microsoft CityNext Big Data Solution Accelerator. The audience for this document includes IT administrators and operators who are responsible for operating and maintaining the Microsoft CityNext Big Data Solution Accelerator.

Note: The sections “Daily Action Checklist” and “Monthly Action Checklist” list all the actions required to operate the Microsoft CityNext Big Data Solution Accelerator. The “Monitoring Manual”, “Troubleshooting Manual”, and “Maintenance Manual” sections provide step-by-step instructions for the daily and monthly action checklists.


Daily Action Checklist

3 Daily Action List.png


Monthly Action Checklist

4 Monthly Action List.png

Monitoring Manual

An operator is required to monitor the system every day. The solution accelerator provides three ways for an operator to monitor system health.
  1. System Health Dashboard
  2. Email Notification
  3. Historical Report

System Health Dashboard

The system health dashboard is based on the SCOM distribution diagram. It uses a tree diagram to illustrate the overall health of the solution accelerator services, as well as other relevant services.

Access

  • Log on to inscom01 via platadmin01
  • Open the System Center Operations Console
  • Monitoring -> CityNextMPMonitoring -> OverallStatus_Tree
5.1.1 Entry.png

Characteristics

  • The system health dashboard is a tree-based diagram.
  • Every node has two possible status states: healthy (green) and error (red).
  • The leaves of the tree (bottom level) are the actual automated monitoring tests executed by SCOM agents across all the virtual machines.
  • All parent nodes, up to the root, are logical monitors whose status depends on their child nodes. If one or more child nodes report an error, the parent node reports an error as well (see the sketch below).
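
For illustration only, the parent/child rollup rule can be sketched in PowerShell (the node object shape below is an assumption; SCOM computes this rollup internally):

    function Get-NodeHealth($node) {
        # Leaf nodes take their state directly from the SCOM monitoring test result.
        if (-not $node.Children) { return $node.OwnState }
        # A parent node is in error as soon as any child node is in error.
        foreach ($child in $node.Children) {
            if ((Get-NodeHealth $child) -eq 'Error') { return 'Error' }
        }
        return 'Healthy'
    }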

System health dashboard steps

Determining the failed service

  • Check the root node “citynext”.
  • If it is green:
    • The system is fully healthy and the operator is done with the daily monitoring action.
  • If it is red:
    • Expand the node and check which of the three category nodes generated an error: “CAM.citynext”, “CSM.citynext”, “DI.citynext”
5.1.3.1 Figure out Fault Service in System Health Dashboard.png
  • If “DI.citynext” is red:
    • Expand this node to check a specific service:
      • Push Injection web service (PISWebSrv.DI.citynext)
      • Pull controller windows service (PullControllerWinSrv.DI.citynext)
      • Data source management web service (DSMWebSrv.DI.citynext)
      • Data source management Database (DSMDB.DI.citynext)
5.1.3.1 Figure out Fault Service in System Health Dashboard 2.png
  • If “CAM.citynext” is red:
    • Expand this node to check a specific service:
      • Blob web service (BlobWebSrv.CAM.citynext)
      • CAM Agent windows service (CAMAgentWinSrv.CAM.citynext)
      • Hadoop (Hadoop.CAM.citynext)
      • Index Database (IndexDB.CAM.citynext)
      • Index Engine Windows Service (IndexEngineWinSrv.CAM.citynext)
      • Index web service (IndexWebSrv.CAM.citynext)
      • MDM web service (MDMWebSrv.CAM.citynext)
      • OData Web service (ODataWebSrv.CAM.citynext)
5.1.3.1 Figure out Fault Service in System Health Dashboard 3.png
  • If “CSM.citynext” is red:
    • Expand this node to check a specific service status:
      • Configuration web service (ConfigWebSrv.CSM.citynext)
      • CSM web service (CSMWebSrv.CSM.citynext)
      • Core Database (CoreDB.CSM.citynext)
      • Configuration Database (ConfigDB.CSM.citynext)
      • DQS web service (DQSWebSrv.CSM.citynext)
5.1.3.1 Figure out Fault Service in System Health Dashboard 4.png

Determining the error instance for the failed service

  • In section "System health dashboard steps", the operator should already have identified which service node(s) are shown in red.
  • Expand the service node and identify which leaf node is red.
    • The priority is: instance node > NLB/Listener
    • Service Node (<ServiceName>.citynext)
      • The priority is: DependsOn > Hardware > Software > instance
      • NLB/listener VIP (<servicename>.citynext)
        • The SCOM health check which hit this NLB/listener
      • First Instance in NLB (“N01.<ServiceName>.citynext”)
        • “N01.<ServiceName>.citynext”
          • The SCOM health check which hit the current service instance API
      • Depends on
        • Other service nodes that the current service instance depends on
        • Hardware
          • Host name of the VM which is hosting the current service instance
            • Hard disk
              • Hard disk usage for each disk of the VM which is hosting the current service instance
            • OS
              • CPU, RAM usage of the VM which is hosting the current service instance
        • Software
          • The runtimes, software, and middleware which the current service instance depends on
      • Second instance in NLB (“N02.<ServiceName>.citynext”)
        • Similar to first instance in NLB

Determining the root cause of the error instance

  • Two different types of monitoring
    • Solution Accelerator component monitoring
      • Drill down the failed service until you reach the instance layer
      • For public entry point failures check the configuration settings
      • Continue expanding the error instances
      • If an error occurs on the service instance node, follow the solution accelerator distribution to find the related host, then remote to that host and check the service log for further investigative details (refer to section 5.4 for more information)
    • Software and hardware monitoring
      • Click the faulty node; then click “Health Explorer” in the right panel
      • Clear “Scope is only unhealthy child monitors” to view the full list (optional)
      • Drill down to the bottom level for the alert; then click the “State Change Events” tab
      • Use this information to find a resolution and fix the error

5.1.3.3 Figure out Root Cause of Fault Instance.png
  • Which VM is hosting the error instance?
    • Expand “Hardware” to retrieve the host name of the VM
5.1.3.3 Figure out Root Cause of Fault Instance 2.png
  • Where is the binary location of the error instance? Where is the log?
    • Refer to the Service Deployment Topology section to find the location of the binary and the log of the error instance on that VM.
    • Go to the log folder; sort by “date modified”
    • Open the most recently modified log file and search for entries around the error time (see the sketch after this list)
    • Check the exception message and stack trace
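
A minimal PowerShell sketch of this log lookup; the log folder path and the timestamp pattern below are placeholders, not actual values from the solution:

    # Placeholder path; substitute the log folder of the failed service instance (under c:\citynextbdp).
    $logFolder = "C:\citynextbdp\<ServiceName>\log"
    # Pick the most recently modified log file.
    $latest = Get-ChildItem -Path $logFolder -File |
        Sort-Object LastWriteTime -Descending |
        Select-Object -First 1
    # Search around the error time reported by SCOM (the timestamp format is an assumption).
    Select-String -Path $latest.FullName -Pattern "2015-02-16 05:4" -Context 5,20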

Multiple failure priority

Component group level priority

DI.citynext>CAM.citynext>CSM.citynext

Component level priority

All solution accelerator components are separated into four priority levels:
  • Level 1
    • Citynext.Configuration DB, Hadoop
  • Level 2
    • Configuration service, Citynext.Core DB, Citynext.DsmDB, CamIndexDB
  • Level 3
    • MDM service, DSM service, CSM service, Index service, Index Engine Windows service, CAM Agent Windows service
  • Level 4
    • PIS service, Pull Controller Windows service, DQS service, OData service


Email Notification

Email notification is used to actively notify the operator about any system failure in real-time.

Email format

  • Title
  • Body
    • Summary Information
      • Severity
      • Source ($FailureItem)
      • Path ($HostServer)
      • Created (date the failure occurred)
    • Knowledge
      • Summary (basic solution)
      • Alert ID (unique GUID for the failure)

Email notification steps

  • Open the recipient mailbox
    • To set the recipient mailbox, please refer to the configuration and administration guide
  • If there is no email for the current day:
    • Done
  • If there are emails for the current day:
    • Log on to inscom01
    • Copy the value of the “Source” field from the email to the search box; then select “Alerts” to execute the search (this lookup can also be scripted; see the sketch after this list)
    • Select the top alert to check its current status; if it has already auto-resolved, skip this alert
    • If the alert is still active, verify whether it comes from a solution accelerator component or from hardware/software
    • Refer to the steps in section 4.1.3.3
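
A hedged sketch of the same alert lookup from the Operations Manager Shell on inscom01 (the source value below is a placeholder copied from the email):

    Import-Module OperationsManager
    # Placeholder: the value of the "Source" field from the notification email.
    $source = "PISWebSrv.DI.citynext"
    # List unresolved alerts (resolution state 0 = New) related to that source.
    Get-SCOMAlert -ResolutionState 0 |
        Where-Object { $_.MonitoringObjectDisplayName -like "*$source*" } |
        Select-Object Name, MonitoringObjectDisplayName, TimeRaised, Severity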


Historical Report

The solution accelerator generates a health report daily and stores those reports for historical analysis in order to identify stabilization issues and capacity extension requirements.

Historical report steps

  • Log on to inop01
  • Open IE, http://inmgr:8100/SCOMReport
  • Select one from the table; click “detail” on the bottom right
  • The report lists all monitoring items, each categorized with a status:
    • Error
      • Error indicates a real failure. Because the report is not a real-time result, the failure may no longer exist; the administrator has to review each error one by one.
    • Warning
      • Warning signals that a monitor may be OK for now but might be close to failing. The administrator should review each warning one by one.
    • Healthy
      • Healthy means that everything is OK.
  • Find the detailed error description for each error
    • Open SCOM console
    • Copy the value of the “AlertContent” column to the search box; then select “Alert”
    • Filter the correct one by the “Host” column and the “Alert time” column
    • Click the source content in the alert details panel
    • Click the item in the “Name” column; then click “Health Explorer”
    • Clear “Scope is only unhealthy child monitors” to view the full list (optional)
    • Drill down to the bottom level for the alert; then click the “State change Events” tab button
    • Based on the “Created” time in the email click the correct failure
  • If the same error appears consistently, log on to the hosting VM and check:
    • Stabilization status of the failed service instance
      • Whether there is a virus
      • Whether there is a memory leak, thread dead lock, etc.
    • Remaining disk space, memory usage, CPU
      • Whether more instances are required to scale out
      • Whether more H/W needs to be plugged in to scale up
    • To locate the issued service instance, please refer to section "Determining the root cause of the error instance"
Note: This report shows the historical error status at the moment the error was reported, not the current status. It is only used to analyze frequently failing services and capacity extension requirements; it is not an up-to-date status report.


Troubleshooting Manual

Besides receiving monitoring alerts, operators will often be contacted by customers, data administrators, and/or Management Studio users about specific issues. As a general troubleshooting strategy, operators should ask themselves the same set of questions to help debug an issue, whether it comes from the system monitoring tool or from a user. This chapter helps the operator determine where to go from there.

Identify the Failure Scope

Ask yourself several questions in order to identify the scope of the failure:
  • Can the user reproduce this issue on their side?
    • If no -> not a failure (retrieve the log for further investigation if possible)
    • If yes -> go to the next question
  • Can another user account reproduce this issue?
    • If no -> it is a user-specific issue
    • If yes -> go to the next question
  • Is specific data (tiles/queries/ingestions/entities) unavailable (returns no data, HTTP error) for all users, but the rest of the data for the same feature still works for all users?
    • If no -> go to the next question
    • If yes -> it is a data-specific issue
  • Is a specific feature (tiles/queries/ingestions/entities) unavailable for all users, but the rest of the features still work?
    • If no -> it is an environment-specific issue
    • If yes -> it is a service-specific issue
Note: Errors displayed on the SCOM system health dashboard are most likely service-specific or environment-specific issues.

User-specific Issue

Scenario: A user has no permission to log in to the dashboard or Management Studio

Steps:
  • Log on to inad01; open user and computer management
  • Confirm whether the user is registered as a domain user
  • Confirm whether the user account is joined to the correct security group (membership can be checked with the sketch after this list)
    • For customizable dashboard login, the user should be in citynext\dashboardusers
    • For inop01 login, the user should be in citynext\platadmusrgrp
    • For Management Studio, the user should be in citynext\platadmusrgrp
    • For a specific data query/tile, the user should be in the ACL group for the specific entity schema
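
A hedged sketch for checking these memberships with the ActiveDirectory PowerShell module on inad01 (the sample account name is hypothetical):

    Import-Module ActiveDirectory
    $user = "someuser"   # hypothetical account reported by the caller
    # Confirm the account exists as a domain user.
    Get-ADUser -Identity $user
    # List its group memberships; expect e.g. dashboardusers or platadmusrgrp.
    Get-ADPrincipalGroupMembership -Identity $user | Select-Object -ExpandProperty Name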

Scenario: Cannot open the dashboard

  • Steps:
    • Confirm whether they can connect to the Internet
    • Confirm whether they installed the correct version/build of the client. Is it the first time going through the setup process or are they re-deploying in order to overwrite the old version?
    • Confirm whether they configured the correct backend URL

Data-specific Issue

Scenario: One of the tiles cannot retrieve data or one DQS query cannot retrieve data, but the rest of the tiles work

  • Log in to the client
    • Open the properties of the tile and identify which DQS query it relies on
    • Open IE, http://<fearrinternetaddress>/dqs/query/<dqs_queryname>
    • If it returns the expected data, it is a client-side issue
  • Log on to inop01
    • Open Management Studio
    • Go to City Services Management -> Business Entity Management
    • Verify whether there is a DQS definition with the name that the tile requires
    • If the query type is OData
      • Record the query statement
      • Go to City Artifacts Management -> Entity Schema Management
      • Verify whether there is a valid entity schema definition bound to DQS
      • Verify whether there is enough permission for the user
        • Citynext\platsvcacct is always required
        • The user account used to log in to the client is also required
      • Click edit for that entity schema; record the binding object schema name and mapping column
      • Go to City Artifacts Management -> Object Schema Management
      • Verify there is a valid object schema, and the row count is not 0
      • Verify the mapping cell contains a valid value
    • If the query type is Hive
      • Record the query statement
      • Log on to dbhdp01
      • Open Hive shell
      • use default_1
      • show tables
        • The DQS required Hive table should exist
      • select * from <table_name>
        • The data should exist in the DQS required Hive table

Scenario: A particular object schema does not show data in Management Studio

  • Open Management Studio -> City Artifacts Management -> Object Schema Management
    • Confirm whether the row count = 0
    • Confirm whether there is a valid diagnostic status and whether it passed
    • Click “data”, confirm whether data is returned
  • Open Management Studio -> Data Ingestion -> pull/push source management
    • Check the source name, which is the same as the object schema name
    • Check run status
    • If it failed, click it to see the detail exception message and stack trace
  • Further investigation
    • Get source ID of that particular object schema/data source name
    • Log on to dbsql01 and launch SSMS; connect to sqllistener
6.3.2 Further investigation script.png
  • Confirm whether there is data in HBase
    • Log on to dbhdp01
    • Pad the source ID with leading zeros (“0”) until the string is 9 characters long; this forms the HBase row key prefix (see the sketch after this list). Example:
      • Source ID = 134
      • Add six zeros (“0”) before it
      • HBase row key = 000000134
    • F:\hdp\hadoop\hbase-0.98.0.2.1.1.0-1621-hadoop2\bin\hbase shell
    • scan 'Object', {STARTROW => '<hbase_row_key>|000000000000000000000000000', ENDROW => '<hbase_row_key>||ZZZZZZZZZZZZZZZZZZZZZZZZZZZ'}
    • If data is returned, then HBase contains the data and it is an MDM service issue
    • If there is no data, data ingestion failed to ingest the data into CAM
  • Debug with detail logging for data ingestion
    • Log on to dbsql01
    • Launch SSMS; connect to sqllistener
    • Run the query for detailed data ingestion logging
  • Check latest 1000 events for data ingestion
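
The row key padding described above can be sketched in PowerShell (the source ID is the example value from the list):

    $sourceId = 134                               # example source ID from the DSM database
    $rowKeyPrefix = $sourceId.ToString().PadLeft(9, '0')   # -> 000000134
    # Emit the HBase shell scan command for that prefix (run it inside the hbase shell).
    "scan 'Object', {STARTROW => '$rowKeyPrefix|000000000000000000000000000', ENDROW => '$rowKeyPrefix||ZZZZZZZZZZZZZZZZZZZZZZZZZZZ'}"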

Service/Feature-specific Issue

A service/feature-specific issue can often be caught by the monitoring tools mentioned in section "Monitoring Manual".

Scenario: A single service displays an error in the system health dashboard

Steps:
  • Go down to the leaf node of the failed service node (icon is a pair of glasses); open Health Explorer to check the detailed error information provided by SCOM
  • On the “knowledge” tab, there are some common solutions provided. Try them first.
  • For a web service, it is always worth trying to restart the website and application pool
  • For a Windows service, it is always worth trying a restart
  • Further investigation
    • Log on to the affected VM; get the service log
    • Sort by date modified; open the most recently modified log and search for what happened at the error timestamp
    • If the exception indicates that the host was not found or the connection was actively refused, record the URL that was called and perform the same investigation on the downstream service that the current service sent the request to.
    • If it looks like a feature exception, email the support contact; also attach the SCOM alert description and service log.

Scenario: Multiple services display an error in the system health dashboard

Steps:
  • Open the health status dashboard
  • Follow the prioritized strategy: select the highest-priority failed node and fix it as described in the previous section.
  • Wait 5-10 minutes for the health status dashboard to refresh. Often, some error nodes will turn green because the highest-priority node was fixed.
  • Staying with the prioritized strategy, select the next highest-priority failed node and fix it as described in the previous section.
  • Repeat until all nodes display green
The prioritization is:
  • Under CityNext root:
    • CAM.citynext > CSM.citynext > DI.citynext
  • Under CAM.citynext
    • Hadoop.CAM.citynext > IndexDB.CAM.citynext > MDM.CAM.citynext > OData.CAM.citynext > others
  • Under specific service node
    • Service instance node > NLB node
  • Under specific service instance node
    • Hardware > Software > DependOn > Service Monitor

Scenario: All tiles/queries cannot retrieve data from a client

Steps:
  • Open the system health dashboard and resolve all of the errors there
  • Check the client status
    • Open the properties of the tile and identify which DQS query it relies on
    • Open IE, http://<fearraddress>/dqs/query/<dqsqueryname>
    • If it returns the expected data, it is a client-side issue
  • Log on to inop01
    • Check the ARR status (these endpoint checks can be scripted; see the sketch after this list)
      • http://fearr/dqs/query/<dqs_queryname>
      • http://fearr/odata/model/default/v1/<entityschemaname>set
    • Check the DQS status
      • NLB: http://apsvc/dqs/query/<dqs_queryname>
      • Instance1: http://apsvc01/dqs/query/<dqs_queryname>
      • Instance2: http://apsvc02/dqs/query/<dqs_queryname>
    • Check the OData status
      • NLB: http://apsvc/odata/model/default/v1/<entityschemaname>set
      • Instance1: http://apsvc01/odata/model/default/v1/<entityschemaname>set
      • Instance2: http://apsvc02/odata/model/default/v1/<entityschemaname>set
    • Check the status of HBase and Hive
      • Log on to indbhdq01
      • All HDP services should run:
        • F:\hdp\hadoop\hbase-0.98.0.2.1.1.0-1621-hadoop2\bin\hbase shell
        • F:\hdp\hadoop\hive-0.13.0.2.1.1.0-1621\bin\hive
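
The ARR, DQS, and OData endpoint checks in this list can be scripted; a minimal sketch using Invoke-WebRequest (the <dqs_queryname> and <entityschemaname> placeholders are kept as-is and must be replaced with your actual names):

    $urls = @(
        "http://fearr/dqs/query/<dqs_queryname>",
        "http://apsvc/dqs/query/<dqs_queryname>",
        "http://apsvc01/dqs/query/<dqs_queryname>",
        "http://apsvc02/dqs/query/<dqs_queryname>",
        "http://apsvc/odata/model/default/v1/<entityschemaname>set"
    )
    foreach ($url in $urls) {
        try {
            # Probe the endpoint and report the HTTP status code.
            $r = Invoke-WebRequest -Uri $url -UseBasicParsing -UseDefaultCredentials -TimeoutSec 30
            "{0} -> {1}" -f $url, $r.StatusCode
        } catch {
            "{0} -> FAILED ({1})" -f $url, $_.Exception.Message
        }
    }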

Scenario: Hive data cannot be accessed by DQS or HOH

Steps:
  • Open the system health dashboard and resolve all of the errors there
    • The “Hive” service should be shown as red
  • Log on to apmgr01
    • Open the 64-bit Hive ODBC data source
    • Click “Test”
    • If it passes, then this is a false-positive alert from the SCOM health dashboard
    • If it produces an exception like:
      • System.Data.Odbc.OdbcException (0x80131937): ERROR HY000 HortonworksHiveODBC (35) Error from Hive: error code: '0' error message: 'ExecuteStatement finished with operation state: ERROR_STATE'.
        • Log on to dbhdp01
        • Manually restart hiveserver2, metastore, and derby (see the restart sketch after this list)
        • Open Hive ODBC; test it again
    • If it still fails:
      • Log on to dbhdp01
      • cd c:\deploy\scripts
      • .\StartStopHDPRemote.ps1 -stop
      • .\StartStopHDPRemote.ps1 -start
      • Wait for 5 minutes
      • Test Hive ODBC again
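
A hedged restart sketch for the Hive-related Windows services on dbhdp01 (the exact HDP service names may differ on your installation; adjust the name pattern accordingly):

    # Match the Hive server, metastore, and Derby services by name pattern and restart them.
    Get-Service | Where-Object { $_.Name -match 'hiveserver2|metastore|derby' } |
        ForEach-Object { Restart-Service -Name $_.Name -Force }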

Environment-specific Issue

An environment-specific issue can often be caught by the monitoring tools mentioned in section "Monitoring Manual".

Scenario: The SQL database cannot retrieve data

Steps:
  • Log on to dbsql01
    • Open SSMS and check whether the 6 databases within the SQL AG show the status “Synchronized” (a query sketch follows)
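
A hedged sketch of the same check as a query against the availability group DMVs, run from PowerShell with the SQL Server module on dbsql01:

    Import-Module SQLPS -DisableNameChecking
    # List each local AG database and its synchronization state; all 6 should report SYNCHRONIZED.
    $query = "SELECT DB_NAME(database_id) AS DbName, synchronization_state_desc " +
             "FROM sys.dm_hadr_database_replica_states WHERE is_local = 1;"
    Invoke-Sqlcmd -ServerInstance "sqllistener" -Query $query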

Scenario: The HBase/Hive cannot retrieve data

  • Log on to indbhdq01
  • All HDP services should run
  • http://dbhdp01:50070/dfshealth.jsp
    • No dead node
    • Live node = 3
  • http://dbhdp01:60010/master-status
    • dbhdp03/04/05: the 3 region servers and their regions should be online
  • http://dbhdp01:8089/cluster
    • dbhdp03/04/05: the 3 nodes should be online
    • Applications: there are jobs running or continuously running
  • F:\hdp\hadoop\hbase-0.98.0.2.1.1.0-1621-hadoop2\bin\hbase shell
  • list
  • scan 'ObjectSchema'
    • The data should be returned without any region errors
  • Log on to dbhdp01
  • F:\hdp\hadoop\hive-0.13.0.2.1.1.0-1621\bin\hive
  • use default_1
  • show tables
    • All tables should be listed
  • For HDP services, it is always worth trying a restart
    • Log on to dbhdp01/02/03/04/05
    • F:\hdp\hadoop\stoplocalservice.cmd
    • Wait until all local services are stopped on all 5 VMs
    • F:\hdp\hadoop\startlocalservice.cmd
  • If data is broken, you need to restore the HBase database (Please refer to section "Solution Accelerator Data Backup/Restore")

Scenario: The web service is not responding

Steps:
  • Refer to section Service Deployment Topology to get the host VM
  • Log on to the corresponding host VM
  • Open Windows Firewall and check whether the port is allowed (a connectivity check sketch follows this list)
  • If it is an Internet facing service, open the Azure management portal; check to see if the endpoint for the port is opened
  • Open IIS, check whether the port is bound to the service
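
A minimal connectivity check for the steps above (PowerShell 4.0 or later; the host name and port are placeholders taken from the service deployment topology):

    # Replace the host name and port with the values for the affected web service.
    Test-NetConnection -ComputerName "apsvc01" -Port 80 |
        Select-Object ComputerName, RemotePort, TcpTestSucceeded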

Scenario: IIS failover is not working (DSM, MDM, CSM web services start and then stop immediately)

Steps:
  • Log on to apmgr02
    • Open IIS
    • Stop the 3 websites
    • Stop the 3 application pools
    • Run an IIS reset from an administrative PowerShell command line (see the sketch after this list)
  • Log on to apmgr01
    • Open IIS
    • Start 3 websites
    • Start 3 application pools
    • Run an IIS reset with the administrative PowerShell command line
    • Open the Failover Cluster Manager -> Role
    • Move all 3 roles to apmgr01
    • Bring all roles online
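
A hedged PowerShell sketch of the stop/start steps using the WebAdministration module; the website and application pool names below are assumptions, so substitute the three actual names shown in IIS:

    Import-Module WebAdministration
    # Hypothetical site/app pool names; replace with the actual names from IIS.
    $names = @("DSMWebSrv", "MDMWebSrv", "CSMWebSrv")
    # On apmgr02: stop the websites and application pools, then reset IIS.
    foreach ($n in $names) {
        Stop-Website -Name $n
        Stop-WebAppPool -Name $n
    }
    iisreset
    # On apmgr01: start them again the same way using Start-Website / Start-WebAppPool, then reset IIS.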

Scenario: The caching service is not working

Steps:
  • Log on to apsvc01, apsvc02
  • Open the caching PowerShell with administrator privileges
  • Run Get-CacheClusterHealth; the 2 nodes hosting the cache named “DataQueryResultCache” should be displayed
  • If one or both caching services are down, try running (see the sketch after this list):
    • Stop-CacheCluster
    • Start-CacheCluster
  • Check the log from event viewer
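
A combined sketch of these commands (run in the Caching Administration PowerShell console, or import the AppFabric caching module as shown):

    Import-Module DistributedCacheAdministration
    Use-CacheCluster                 # connect to the locally configured cache cluster
    Get-CacheClusterHealth           # both hosts serving "DataQueryResultCache" should be listed
    # If one or both cache hosts are down:
    Stop-CacheCluster
    Start-CacheCluster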

Scenario: Cannot connect to the VM

Steps:
  • Open the Azure management portal, check whether the VM is running
  • Use RDP to connect to the VM, check if you have remote access to the server
  • Use RDP to connect to another available VM; ping the suspect VM
  • If the VM is shut down, restart it using the Azure management portal
  • If the VM is running, but disconnected, try a restart or resize from the Azure management portal
  • If the VM is still disconnected from the client/another VM, call Azure technical support

Scenario: There is a Windows Azure data center outage

Steps:
  • Open the Azure management portal, there will be a notification on the bottom right to describe the affected scope
  • Open http://azure.microsoft.com/en-us/status/ to check the service status of your data center
  • Call Azure technical support if you cannot recover the VM after the data center failure is recovered

Scenario: All VMs are stopped and deprecated due to a subscription issue (Azure)

Steps:
  • Re-enable the Azure subscription
  • Open the Azure portal; start all VMs
  • Log on to dbsql01
    • Open the SSMS and select AlwaysOn High Availability; right-click Availability Group Listener and click Add Listener.
      • Enter the listener name as "SQLListener"; the port is “1433”
      • Select “Static IP” as Network Mode and add the static IP 192.168.1.190
      • Open PowerShell as an administrator and then run the following script to change the listener IP address (replace the highlighted variables):
5.5.8 Scenario All VMs are stopped and deprecated due to a subscription issue Azure.png
5.5.8 Scenario All VMs are stopped and deprecated due to a subscription issue Azure 2.png
  • Update the allowed IP list in the Azure endpoint AC for SQL-AG-NLB
    • Log in to the Azure management portal
      • Click the Endpoints tab for DBSQL01
      • Select “SQLAG1433” in the endpoints list; click the “MANAGE ACL” button
      • Update all Cloud Service Public IP addresses in the permit rule to the current public IPs for the other VMs
  • Update all internal load balance IP addresses for the DNS server if needed
  • Update the SQL listener IP address in the firewall-outbound-allow local IP pools for apmgr01/02, apsvc01/02, fedi01/02, inmgr01/02, inop01, and inscom01

Integrity Verification (Troubleshooting)

After recovery, you can run a set of verification BVTs to confirm that major functionalities are back to normal:
  • Open management studio -> Diagnostics -> functional self-test
  • Click new on the bottom left
  • Wait until the progress is completed (approximately 1 hour)
  • Click view detail on the bottom right; check whether all 38 cases passed

Further Support

If you cannot resolve the issue by debugging, send an email to the CityNext support mailbox with the information below:
  • Error description
  • A screenshot
  • The SCOM alert information
  • The service log


Maintenance Manual

Change Management

A change management tool is required to triage and track every action against the production environment. It also provides a rollback possibility for any failed execution.

As a minimum requirement, a form is used to track the actions. See an example below:
7.1 Change Management.png

Solution Accelerator Data Backup/Restore

Data backup

The solution accelerator executes an automatic data backup daily. Operators should check whether it was successfully generated and uploaded to Azure storage.

SQL backup:
  • Scope:
    • System DB
    • Index DB
    • Configuration DB
    • Core DB
    • DSM DB
    • Verification Service History DB (RubberDuck DB)
    • Monitoring DB
  • Execution Plan:
    • Monday ~ Saturday: Partial Backup
    • Sunday: Full Backup
HBase backup:
  • Scope:
    • Blob
    • Data Model
    • Entity Schema
    • Object
    • Object Schema
    • System
  • Execution Plan:
    • Monday ~ Saturday: Skipped
    • Sunday: Full Backup
Hive Backup:
Due to the HOH mechanism, the Hive tables and views can be re-synced as long as the HBase data is recovered, so a separate Hive backup is not required.

Log on to inop01
  • Open the Azure storage that was configured as the target storage for backup files; check whether there are new zip files named “SQL<timestamp>.zip” and “HBase<timestamp>.zip” (see the sketch after this list).
    • The backup script is located at c:\deploy\scripts\databackupremote.ps1
  • Open c:\deploy\scripts\DataBackup_<timestamp>.log; check whether there is an exception there
  • Open the task scheduler; check whether the latest schedule was successfully completed
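
A hedged sketch of the Azure storage check using the classic Azure PowerShell module; the storage account name, key, and container name are assumptions, so use the values configured for the backup script:

    # Assumed values; replace with the storage account and container configured for backups.
    $ctx = New-AzureStorageContext -StorageAccountName "<account>" -StorageAccountKey "<key>"
    # List today's SQL/HBase backup zip files in the container.
    Get-AzureStorageBlob -Container "backup" -Context $ctx |
        Where-Object { $_.Name -match '^(SQL|HBase).*\.zip$' -and $_.LastModified.Date -eq (Get-Date).Date } |
        Select-Object Name, LastModified, Length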

Data restore

Send maintenance outage notifications to all affected users and administrators. A data restore is only required when the data is broken and you need to roll back the data to a previous restore point.

SQL Restore:
  • Manually trigger a full backup of current data
    • Use SSMS to connect to sqllistener
    • Open maintenance plan -> fullbackup
    • Execute it
    • The SQL data will be backed up to \\inop01\backup\sql
  • Connect to current primary node
    • Select the target DB; remove it from AG group
    • Connect to current secondary node
    • Remove the target DB from the local drive
  • Connect to current primary node
    • Download and unzip the SQL backup file from Azure storage
      • Confirm that the latest full backup (previous Sunday) is downloaded
      • Confirm that the latest partial backup (day before) is downloaded
    • Select the target DB; right click -> task -> Restore -> database -> device
    • Add the latest full backup .bak file and the partial backup .bak file
    • Select Options -> overwrite existing database
    • Select Options -> close existing connections for the current database
    • Click OK
    • Open administrator PowerShell prompt on the current primary node
    • If the current primary node is dbsql01, run: c:\deploy\scripts\createsqlag.ps1 -dbname {restored db name}
    • If the current primary node is dbsql02, run: c:\deploy\scripts\createsqlag.ps1 -dbname {restored db name} -sqlnode1 dbsql02 -sqlnode2 dbsql01
HBase Restore:
  • Manually trigger a full backup of the current data
    • Log on to dbhdp01
    • Run: c:\deploy\scripts\databackup.ps1 -hbasebackup -hbasebackupfolder f:\tmp
    • The HBase data will be backed up to f:\tmp
  • Download and unzip the latest HBase backup file from Azure storage to inop01
  • Run: c:\deploy\scripts\databackup.ps1 -hbaserestore -hbaserestorefolder <backupfilefolder>
    • Note: This script will truncate the current table, so make sure you perform a manual backup as the first step for this section
Hive Restore:
  • Perform the HBase restore first
  • Open Management Studio -> City Artifact Management -> Entity Schema Management
  • Create a temporary entity schema without any binding
  • The solution accelerator will trigger a full HOH and automatically generate Hive tables and views based on the current HBase data.
  • Delete the temporary entity schema

Log Archiving

The solution accelerator executes an automatic log archive daily. Operators should check whether it was successfully generated and uploaded to Azure storage.
  • Scope:
    • Every VM
      • Service Log (under c:\citynextbdp)
      • IIS log
      • Windows Event Log
  • Log on to inop01
    • Open the Azure storage that was configured as the target storage for the backup files; check whether there is a new zip file named “<VMHostName>_<timestamp>.zip”.
      • The log archive script is located at c:\deploy\scripts\logarchiveremote.ps1
    • Open c:\deploy\scripts\LogArchive_<timestamp>.log; check whether there is an exception inside
    • Open the task scheduler; check whether the latest log archive schedule was successfully completed

Account Administration

Please refer to sections Software Deployment Topology and Service Accounts and Groups to see security groups and accounts configuration.

Password expiration duration:
  • User accounts: passwords expire after 90 days
  • Service accounts: passwords never expire
The solution accelerator will notify the operator about the accounts whose passwords are close to expiring via SCOM email notification. See an example below:
7.4 Account Administration.png

An operator can log on to inad01 and open Active Directory Users and Computers to reset the password for users.
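
A hedged sketch of the same reset with the ActiveDirectory PowerShell module (the account name is hypothetical):

    Import-Module ActiveDirectory
    # Hypothetical account name; replace with the user whose password must be reset.
    Set-ADAccountPassword -Identity "someuser" -Reset -NewPassword (Read-Host -AsSecureString "New password")
    Set-ADUser -Identity "someuser" -ChangePasswordAtLogon $true   # optional: force a change at next logon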

Windows Update

Some necessary security patching for Microsoft products may lead to a VM/service restarting, and may also affect the availability of the CityNext Big Data Solution Accelerator. Security patching must be centrally managed and scheduled to avoid unexpected down time.

If you follow the appropriate guides, the solution accelerator will disable auto updates for each machine to avoid randomized restarts. Operators can select an off-peak time to apply patches to all VMs.

To apply an update on-demand, the operator should:

Anti-virus Scan and Update

A scheduled scan

To apply an update on-demand, the operator should:

Capacity Extension

SCOM monitoring will notify the operator about a system overload in terms of CPU, memory, and hard disk. Combined with the data in the SCOM historical report, the operator can identify whether it is a temporary rush-hour phenomenon or a consistent capacity shortfall. As a result, the operator should scale up existing VMs or scale out new service instances to enlarge service capacity on demand.

Scale Up:
  • Send a maintenance outage notification to all affected users and administrators
  • Disable the SCOM alert notification
    • Log on to inop01
    • Open the SCOM operation console
    • Administration -> Subscriptions -> CityNext Incident Mail Notification -> Disable
  • Shutdown the scaled VM
  • Increase the CPU or memory:
    • Open the Azure Management Portal -> Virtual Machine -> Configuration -> Select a larger scale from the drop down list; click save
    • Wait until Azure completes the update
    • Start the VM
  • Increase the hard disk
    • Open the Azure Management Portal -> Virtual Machine -> Attach empty disk; select a larger empty HD to attach
    • Wait until Azure completes the update
    • Start the VM
    • Open disk management; format the new disk; assign a volume
    • Copy all files from the small HD volume to the new one
    • Open the Azure Management Portal -> virtual machine -> detach small disk
    • In the VM, change the drive letter of the larger HD to the letter previously used by the small disk
  • Run the verification service on Management Studio (refer to section "Integrity Verification (Troubleshooting)")
  • Check the SCOM health status dashboard for everything shown in green
  • Restore the SCOM notification
  • Send a notification to all affected users and administrators

Service Scale Out:
  • Open the Azure Management Portal -> Virtual Machine
  • Create a new VM; join the corresponding cloud service and corresponding availability set (refer to section "Infrastructure and IP Allocation")
  • Setup the static IP and Azure internal/external load balance with binding port (refer to section "Infrastructure and IP Allocation", as well as the admin guide)
  • Join the domain and assign a hostname (refer to the admin guide)
  • Install the necessary software (refer to section "Software Deployment Topology", as well as the admin guide)
  • Log on to the new VM
  • Copy c:\deploy\scripts from inop01 to local
  • Edit c:\deploy\scripts\gamultiboxtopologystage1.ps1 and gamultiboxtopologystage2.ps1
  • Create a new row for the new machine (refer to the similar service instance VM)
  • Run:
    • C:\deploy\scripts\DeployMultibox.ps1 -Topology gamultiboxtopologystage1.ps1
    • C:\deploy\scripts\DeployMultibox.ps1 -Topology gamultiboxtopologystage2.ps1
  • Add a node for the SCOM health dashboard (Please refer to the admin guide)
  • Open the SCOM health dashboard, and verify it is healthy (green)
  • Run the verification service, and verify all have passed
Note: These scale-out instructions do NOT apply to SQL or the Hadoop cluster

Certificate and Key Management

Certificate expiration will lead to the associated services being disabled. Certificate management, in terms of expiration alerting, key protection, and lifecycle management, is mandatory. For the solution accelerator, the certificates below are involved:
  • Azure Certificate
  • HTTPS Certificate (optional)
  • Single-Sign On ADFS Certificate (optional)
To ensure data is not affected by malicious modification, encryption key management is required as well. In scope:
  • SQL DB Encryption Key (optional)
  • HBase DB Encryption Key (optional)
The solution accelerator does not provide a tool to manage certificates and keys. This section is a reminder for operators to manage the relevant certificates and keys using their preferred approach.

Azure Subscription Management

Verify that the Azure subscription is not close to expiring (https://account.windowsazure.com/Subscriptions).

Hotfix

  • Send a maintenance outage notification to all affected users and administrators
  • Disable the SCOM alert notification
    • Log on to inop01
      • Open the SCOM operation console
      • Administration -> Subscriptions -> CityNext Incident Mail Notification -> Disable
  • Backup the affected data and service binaries
    • Manually trigger a full backup for SQL and HBase
      • Log on to inop01
      • Edit c:\deploy\scripts\databackupremote.ps1
        • Change the full backup day for SQL and HBase to the current day
      • Run c:\deploy\scripts\databackupremote.ps1
  • Backup the affected service binaries (Refer to section "Service Deployment and Topology")
    • Backup the affected service binaries with a timestamp
  • Follow the hotfix guide provided by the CityNext engineering team to deploy the hotfix
  • Run the integrity verification (refer to section "Integrity Verification (Maintenance)" to run the BVT)
  • Verify that the SCOM health dashboard is shown as being healthy
  • Restore the SCOM notification
  • Send a notification to all affected users and administrators

Integrity Verification (Maintenance)

The following scenarios require running verifications:
  • Troubleshooting and recovery
  • Data restore
  • Hotfix and upgrade
You can run a set of verification BVTs to confirm that major functionalities are back to normal:
  • Open management studio -> Diagnostics -> functional self-test
  • Click new on the bottom left
  • Wait until the progress is completed (approximately 1 hour)
  • Click view detail on the bottom right; check whether all 38 cases passed


Appendix

Infrastructure and IP Allocation

8.1 Infrastructure & IP Allocation.png

8.1 Infrastructure & IP Allocation 2.png

Service Deployment Topology

8.2  Service Deployment Topology.png
8.2 Service Deployment Topology 2.png
8.2 Service Deployment Topology 3.png
8.2 Service Deployment Topology 4.png

Software Deployment Topology

8.3 Software Deployment Topology.png

Service Accounts and Groups

8.4 Service Accounts & Groups.png

User Accounts and Groups

8.5 User Accounts & Groups.png


Contacts and Support

For any issue, please send an email to mscitynextbigdata_sp@microsoft.com.


