Senior Big Data Engineer Resume
Boise, ID
SUMMARY
- 8+ years of experience across the Big Data/Hadoop ecosystem in ingestion, storage, querying, processing, and analysis of big data.
- Experience with Apache Hadoop components such as HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Spark, and Flume for big data and big data analytics.
- Hands-on experience with Python programming for data processing and for handling data integration between on-prem and cloud databases or data warehouses.
- Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and other services of the AWS family.
- Experience in developing MapReduce programs using Apache Hadoop to analyze big data per requirements.
- Experience in utilizing SAS procedures, macros, and other SAS applications for data extraction from Oracle and Teradata.
- Involved in writing data transformations and data cleansing using Pig operations, with good experience in data retrieval and processing using Hive.
- Hands-on experience in installing and configuring Hadoop ecosystem components such as HDFS, MapReduce, YARN, Pig, Hive, HBase, Oozie, Sqoop, Flume, and Kafka.
- Configured Spark Streaming to receive real-time data from Kafka, store the stream data to HDFS, and process it using Spark and Scala (a minimal sketch follows this summary).
- Expertise with Python, Scala, and Java in the design, development, administration, and support of large-scale distributed systems.
- Expertise in Business Intelligence, Data warehousing technologies, ETL and Big Data technologies.
- Experience in writing PL/SQL statements: stored procedures, functions, triggers, and packages.
- Extensive knowledge of reporting objects in Tableau, including facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups, and parameters.
- Experience building data pipelines using Azure Data Factory and Azure Databricks, and loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse while controlling and granting database access.
- Good knowledge in implementing various data processing techniques using Apache HBase for handling the data and formatting it as required.
- Experience in designing star schema, Snowflake schema for Data Warehouse, ODS architecture.
- Good knowledge of Data Marts, OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (star schema and snowflake modeling for fact and dimension tables) using Analysis Services.
- Hands-on experience analyzing SAS ETL and implementing data integration in Informatica using XML, web services, SAP ABAP, and SAP IDoc.
- Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying on public or private clouds.
- Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
- Proficient in data analysis, cleansing, transformation, migration, integration, import, and export through the use of ETL tools such as Informatica.
- Analyzed data and provided insights with R programming and Python pandas.
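A minimal PySpark sketch of the Kafka-to-HDFS streaming pattern referenced above. The broker address, topic name, and HDFS paths are placeholders, not values from any actual engagement.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and land
# the records on HDFS as Parquet. Requires the spark-sql-kafka connector package.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-sketch")
         .getOrCreate())

# Read the stream; Kafka delivers key/value as binary columns.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
          .option("subscribe", "consumer-events")             # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(col("value").cast("string").alias("raw_event"),
                  col("timestamp")))

# Write micro-batches to HDFS; the checkpoint directory makes the job restartable.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/consumer_events")             # placeholder path
         .option("checkpointLocation", "hdfs:///checkpoints/consumer_events")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```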
TECHNICAL SKILLS
Big Data Tools/ Hadoop Ecosystem: Map Reduce, Spark, Airflow, Nifi, HBase, Hive, Pig, Sqoop, Kafka, Oozie, Hadoop
Databases: Oracle 12c/11g/10g, Teradata R15/R14, MySQL, SQL Server, NoSQL (MongoDB, Cassandra, HBase), Snowflake
ETL/Data warehouse Tools: Informatica and Tableau.
BI Tools: SSIS, SSRS, SSAS.
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX shell scripting
Cloud Platform: Amazon Web Services (AWS), Microsoft Azure
Cloud Management: Amazon Web Services (AWS): EC2, EMR, S3, Redshift, Lambda, Athena; Microsoft Azure: Data Lake, Data Storage, Databricks, Data Factory
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
Operating System: Windows, Unix, Sun Solaris
Methodologies: System Development Life Cycle (SDLC), Agile
PROFESSIONAL EXPERIENCE
Confidential, Boise, ID
Senior Big Data Engineer
Responsibilities:
- Handled billions of log lines coming from several clients and analyzed them using big data technologies like Hadoop (HDFS), Apache Kafka, and Apache Storm.
- Worked with architects, stakeholders, and the business to design the information architecture of a Smart Data Platform for multistate deployment in a Kubernetes cluster.
- Created Hive tables per requirements as internal or external tables, defined with appropriate static/dynamic partitions and bucketing for efficiency.
- Created data pipelines for ingestion and aggregation events, loading consumer response data from an AWS S3 bucket into Hive external tables on HDFS to serve as the feed for Tableau dashboards.
- Developed Python scripts and UDFs using both DataFrames/SQL and RDDs/MapReduce in Spark for data aggregation and queries, and wrote data back into the RDBMS through Sqoop.
- Worked extensively on Spark and MLlib to develop a regression model for cancer data.
- Hands on design and development of an application using Hive (UDF).
- Developed simple to complex MapReduce streaming jobs using Python that were integrated with Hive and Pig.
- Loaded and transformed large sets of structured and semi-structured data using Hive.
- Migrated an existing on-premises application to AWS, used AWS services like EC2 and S3 for small data set processing and storage, and maintained the Hadoop cluster on AWS EMR.
- Used the Kafka streaming platform to load data onto the Hadoop file system and move the same data into a Cassandra NoSQL database.
- Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
- Worked in an AWS environment for development and deployment of custom Hadoop applications.
- Created and maintained various DevOps tools for the team, such as provisioning scripts, deployment tools, and development and staging environments on AWS, Rackspace, and other cloud environments.
- Experienced with AWS services to smoothly manage applications in the cloud and to create or modify instances.
- Designed AWS CloudFormation templates to create VPCs, subnets, and NAT to ensure successful deployment of web applications and database templates.
- Created S3 buckets, managed S3 bucket policies, and utilized S3 and Glacier for storage and backup on AWS.
- Wrote MapReduce code in Python to remediate certain security issues in the data.
- Synchronized both unstructured and structured data using Pig and Hive per business requirements.
- Used Pig Latin on the client-side cluster and HiveQL on the server-side cluster.
- Imported the complete data from the RDBMS to the HDFS cluster using Sqoop.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS.
- Migrated existing MapReduce programs to Spark using Scala and Python.
- Implemented Spark SQL to connect to Hive, read the data, and distribute processing to make it highly scalable.
- Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access (a minimal sketch follows this list).
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau.
- Consumed XML messages using Kafka and processed the XML files using Spark Streaming to capture UI updates.
- Integrated Cassandra as a distributed persistent metadata store to provide metadata resolution for network entities on the network.
- Involved in implementing and integrating various NoSQL databases such as HBase and Cassandra.
- Installed and configured the OpenShift platform for managing Docker containers and Kubernetes clusters.
- Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Created Hive base scripts for analyzing requirements and processing data, designing the cluster to handle huge amounts of data and to cross-examine data loaded through Hive and MapReduce jobs.
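A minimal sketch of the Hive partitioning pattern mentioned in this list, driven through Spark SQL in Python. Database, table, column names, and paths are illustrative assumptions, and the example focuses on the dynamic-partition insert rather than bucketed writes.

```python
# Illustrative only: create a partitioned external Hive table and load it with
# dynamic partitioning through Spark SQL. Names and locations are made up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Allow fully dynamic partition columns on insert.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# External table partitioned by event_date (placeholder schema).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.consumer_response (
        customer_id BIGINT,
        channel     STRING,
        score       DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
    LOCATION 'hdfs:///warehouse/analytics/consumer_response'
""")

# Dynamic-partition insert: each row is routed to its event_date partition.
spark.sql("""
    INSERT OVERWRITE TABLE analytics.consumer_response
    PARTITION (event_date)
    SELECT customer_id, channel, score, event_date
    FROM staging.consumer_response_raw
""")
```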
Environment: HDFS, Hive, Scala, Sqoop, Spark, Tableau, YARN, Cloudera, SQL, Terraform, Splunk, RDBMS, Elasticsearch, Kerberos, AWS (EC2, S3, EMR, Redshift, ECS, Glue, VPC, RDS, etc.), Ranger, Git, Kafka, OpenShift, CI/CD (Jenkins), Kubernetes, Confluence, Shell/Perl Scripting, Zookeeper, Jira.
Confidential, San Antonio, TX
Big Data Engineer
Responsibilities:
- Created a custom logging framework for ELT pipeline logging using Append Variable activities in Data Factory.
- Enabled monitoring and Azure Log Analytics to alert the support team on usage and statistics of the daily runs.
- Took proof-of-concept project ideas from the business, then led, developed, and created production pipelines that deliver business value using Azure Data Factory.
- Developed and designed data integration and migration solutions in Azure.
- Kept our data separated and secure across national boundaries through multiple data centers and regions.
- Implemented continuous integration/continuous delivery (CI/CD) best practices using Azure DevOps, ensuring code versioning.
- Developed various Oracle SQL scripts, PL/SQL packages, procedures, functions, and Java code for data processing.
- Strong knowledge of the architecture and components of Tealeaf, and efficient in working with Spark Core and Spark SQL. Designed and developed RDD seeds using Scala and Cascading, and streamed data to Spark Streaming using Kafka.
- Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Exposed transformed data in the Azure Databricks platform in Parquet format for efficient data storage.
- Created and maintained optimal data pipeline architecture in the Microsoft Azure cloud using Data Factory and Azure Databricks.
- Created Data Factory pipelines that can bulk copy multiple tables at once from relational databases to Azure Data Lake Gen2.
- Developed Hive UDFs to incorporate external business logic into Hive scripts and developed join data set scripts using Hive join operations.
- Worked with Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark SQL, PySpark, Impala, Tealeaf, pair RDDs, DevOps, and Spark on YARN.
- Built a real-time pipeline for streaming data using Event Hubs/Microsoft Azure Queue storage and Spark Streaming.
- Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
- Migrated data into the RV data pipeline using Databricks, Spark SQL, and Scala.
- Used Databricks for encrypting data using server-side encryption.
- Used Airflow to monitor and schedule workflows.
- Involved in creating an HDInsight cluster in the Microsoft Azure Portal; also created Event Hubs and Azure SQL databases.
- Used Delta Lake, an open-source data storage layer that delivers reliability to data lakes.
- Configured Spark Streaming to receive ongoing information from Kafka and store the stream information to HDFS.
- Extracted and updated the data into HDFS using Sqoop import and export.
- Utilized Ansible playbooks for code pipeline deployment.
- Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
- Delivered denormalized data to Power BI consumers for modeling and visualization from the produce zone in the data lake.
- Responsible for managing data coming from different sources through Kafka.
- Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds. Implemented data quality checks in the ETL tool Talend, with good knowledge of data warehousing. Developed Apache Spark applications for data processing from various streaming sources.
- Created Spark vectorized pandas user-defined functions for data manipulation and wrangling.
- Transferred data in logical stages from systems of record to the raw zone, refined zone, and produce zone for easy translation and denormalization.
- Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to enable scalable and optimized support for business users' analytical requirements in Azure.
- Worked on a clustered Hadoop setup for Windows Azure using HDInsight and the Hortonworks Data Platform for Windows.
- Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation.
- Implemented Kafka producers, created custom partitions, configured brokers, and implemented high-level consumers to build the data platform.
- Implemented IoT streaming with Databricks Delta tables and Delta Lake to enable ACID transaction logging (a minimal sketch follows this list).
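A minimal sketch of streaming device events into a Delta Lake table with ACID guarantees, as referenced above. The endpoint, topic, schema, and storage paths are illustrative placeholders, and authentication options are omitted.

```python
# Sketch: stream telemetry into a Delta Lake table; each micro-batch commits as
# an ACID transaction in the Delta log. All names and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("iot-delta-sketch").getOrCreate()

payload_schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Event Hubs exposes a Kafka-compatible endpoint, so the Kafka source can be
# used here (SASL/auth options omitted for brevity).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "NAMESPACE.servicebus.windows.net:9093")  # placeholder
       .option("subscribe", "iot-telemetry")                                        # placeholder
       .load())

parsed = (raw.select(from_json(col("value").cast("string"), payload_schema).alias("e"))
          .select("e.*"))

(parsed.writeStream
       .format("delta")
       .outputMode("append")
       .option("checkpointLocation", "/mnt/datalake/checkpoints/iot_telemetry")  # placeholder
       .start("/mnt/datalake/delta/iot_telemetry"))                              # placeholder
```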
Environment: Hadoop, Spark, MapReduce, Kafka, Scala, Java, Azure Data Factory, Data Lake, Databricks, Azure DevOps, PySpark, Agile, Power BI, Python, R, PL/SQL, Oracle 12c, SQL, NoSQL, HBase, scaled Agile team environment.
Confidential, Sterling, VA
Hadoop Developer/Data Engineer
Responsibilities:
- Worked with HBase and Hive scripts to extract, transform, and load data into HBase and Hive.
- Implemented a multi-datacenter, multi-rack Cassandra cluster.
- Monitoring systems and services, architecture design and implementation of Hadoop deployment, configuration management, backup, and disaster recovery systems and procedures.
- Extensively used Map Reduce component of Hadoop
- Responsible for writing Pig scripts to process the data in the integration environment
- Responsible for setting up HBASE and storing data into HBASE
- Worked closely with the application/business and database teams to better understand the functionality and develop performance-oriented reports.
- Wrote MapReduce code to process and parse data from various sources and store the parsed data into HBase and Hive using HBase-Hive integration. Worked on YUM configuration and package installation.
- Responsible for importing and exporting data into HDFS and Hive.
- Analyzed data using Hadoop components Hive and Pig.
- Deployed Kibana with Ansible and connected it to the Elasticsearch cluster. Tested Kibana and ELK by creating a test index and injecting sample data into it.
- Successfully tested Kafka ACLs with anonymous users and with different hostnames.
- Worked on moving all log files generated from various sources to HDFS for further processing.
- Developed workflows using custom MapReduce, Pig, Hive, and Sqoop.
- Worked with network and Linux system engineers to define optimal network configurations, server hardware, and operating systems.
- Worked on extending Hive and Pig core functionality by writing custom UDFs using Java
- Involved in importing data from MS SQL Server, MySQL, and Teradata into HDFS using Sqoop (a validation sketch follows this list).
- Played a key role in dynamic partitioning and bucketing of the data managed through the Hive metastore.
- Good experience in writing Spark applications using Python and Scala.
- Designed and developed automation test scripts using Python.
- Performed dimensional data modelling using Erwin to support data warehouse design and ETL development.
- Designed and worked with Cassandra Query Language (CQL), with knowledge of Cassandra read and write paths and internal architecture.
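A sketch of the kind of automated post-Sqoop validation referenced above: compare source and target row counts in Python. It assumes the pymysql and PyHive client libraries; hosts, credentials, and the table name are placeholders.

```python
# Sketch: compare row counts between a MySQL source table and its Sqoop-imported
# Hive copy. Connection details and names below are placeholders.
import pymysql
from pyhive import hive


def mysql_count(table: str) -> int:
    conn = pymysql.connect(host="mysql-host", user="etl_user",
                           password="***", database="sales")  # placeholders
    try:
        with conn.cursor() as cur:
            cur.execute(f"SELECT COUNT(*) FROM {table}")
            return cur.fetchone()[0]
    finally:
        conn.close()


def hive_count(table: str) -> int:
    conn = hive.Connection(host="hiveserver2-host", port=10000, database="staging")  # placeholders
    try:
        cur = conn.cursor()
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    table = "orders"  # placeholder table imported by Sqoop
    src, tgt = mysql_count(table), hive_count(table)
    status = "OK" if src == tgt else "MISMATCH"
    print(f"{table}: source={src} target={tgt} -> {status}")
```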
Environment: Hadoop, HDFS, Map Reduce, Kafka, Python, Hive, Cassandra, Ansible, AWS, Git.
Confidential, Hamilton, NJ
Hadoop Developer
Responsibilities:
- Developed a data pipeline using Flume, Sqoop, and Pig to extract data from weblogs and store it in HDFS.
- Used Sqoop to import and export data from HDFS to RDBMS and vice-versa.
- Exported the analyzed data to the relational database MySQL using Sqoop for visualization and to generate reports.
- Created HBase tables to load large sets of structured data.
- Managed and reviewed Hadoop log files.
- Worked extensively with Hive DDLs and Hive Query Language (HQL).
- Analyzed data using MapReduce, Pig, and Hive and produced summary results from Hadoop for downstream systems (a streaming-job sketch follows this list).
- Implemented Sqoop for large dataset transfers between Hadoop and RDBMSs.
- Processed data into HDFS by developing solutions.
- Created MapReduce jobs to convert periodic batches of XML messages into partitioned Avro data.
- Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Used Pig as an ETL tool for transformations, event joins, and some pre-aggregations before storing the data in HDFS.
- Used AWS Glue for data transformation, validation, and cleansing.
- Used Sqoop widely in order to import data from various systems/sources (like MySQL) into HDFS.
- Created components like Hive UDFs for missing functionality in HIVE for analytics.
- Developed scripts and batch jobs to schedule a bundle (a group of Oozie coordinators) consisting of various coordinator jobs.
- Used different file formats like Text files, Sequence Files, Avro.
- Provided cluster coordination services through ZooKeeper.
- Developed UDF, UDAF, and UDTF functions and implemented them in Hive queries.
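A minimal sketch of the MapReduce streaming pattern referenced above: a Python mapper and reducer that count weblog lines per HTTP status code. The log layout (status code as the ninth whitespace-delimited field) is an assumption.

```python
# mapper.py - minimal Hadoop Streaming mapper: emit one (status_code, 1) pair per
# weblog line. The field position is an assumption about the log layout.
import sys

for line in sys.stdin:
    fields = line.strip().split()
    if len(fields) > 8:                      # assumed: HTTP status is the 9th field
        print(f"{fields[8]}\t1")
```

```python
# reducer.py - minimal Hadoop Streaming reducer: sum counts per status code.
# Hadoop Streaming delivers the mapper output sorted by key.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0
    total += int(value)

if current_key is not None:
    print(f"{current_key}\t{total}")
```

These would typically be wired together with the hadoop-streaming jar (passing -mapper mapper.py and -reducer reducer.py), with exact jar and input/output paths depending on the cluster.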
Environment: Hadoop, HDFS, Map Reduce, AWS, Hive, Pig, Sqoop, HBase, Shell Scripting, Oozie, Oracle 11g, Ad-hoc Queries, MS Excel, Windows
Confidential
SQL Developer
Responsibilities:
- Participated in testing of procedures and data using PL/SQL to ensure the integrity and quality of data in the data warehouse.
- Worked on SQL Server Integration Services (SSIS) to integrate and analyze data from multiple heterogeneous information sources.
- Worked to ensure high levels of Data consistency between diverse source systems including flat files, XML and SQL Database.
- Developed and ran ad-hoc data queries against multiple database types to identify systems of record, data inconsistencies, and data quality issues (a minimal sketch follows this list).
- Performed Tableau administration using Tableau admin commands.
- Involved in defining the source to target Data mappings, business rules and Data definitions.
- Ensured the compliance of the extracts to the Data Quality Center initiatives
- Produced metrics reporting, data mining, and trend analysis in a helpdesk environment using Access.
- Developed complex SQL statements to extract data and to package/encrypt data for delivery to customers.
- Provided business intelligence analysis to decision-makers using an interactive OLAP tool
- Created T-SQL statements (SELECT, INSERT, UPDATE, DELETE) and stored procedures.
- Created Informatica mappings using various Transformations like Joiner, Aggregate, Expression, Filter and Update Strategy.
- Built reports and report models using SSRS to enable end user report builder usage.
- Created columnstore indexes on dimension and fact tables in the OLTP database to enhance read operations.
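A minimal sketch of the ad-hoc data quality checks referenced above, run from Python against SQL Server. It assumes the pyodbc library and the ODBC Driver 17 for SQL Server; the server, database, credentials, and table/column names are placeholders.

```python
# Sketch: run ad-hoc data quality queries against SQL Server, counting NULL and
# duplicate business keys. Connection details and object names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sql-server-host;DATABASE=SalesDW;UID=report_user;PWD=***"  # placeholders
)

quality_checks = {
    "null_customer_keys":
        "SELECT COUNT(*) FROM dbo.FactSales WHERE CustomerKey IS NULL",
    "duplicate_order_numbers":
        "SELECT COUNT(*) FROM (SELECT OrderNumber FROM dbo.FactSales "
        "GROUP BY OrderNumber HAVING COUNT(*) > 1) d",
}

cursor = conn.cursor()
for name, sql in quality_checks.items():
    cursor.execute(sql)
    bad_rows = cursor.fetchone()[0]
    print(f"{name}: {bad_rows}")

conn.close()
```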
Environment: SQL, PL/SQL, T-SQL, XML, Informatica, Tableau, OLAP, SSIS, SSRS, Excel, OLTP.