
Sr. Big Data Architect/Lead Engineer Resume


SUMMARY

  • Around 10 years of strong IT experience in the design, development, deployment, support, and implementation of enterprise applications and BI/DWH solutions.
  • Experienced in implementing Big Data technologies - the Hadoop ecosystem (HDFS, MapReduce framework), Spark with Python (PySpark), Impala, Sqoop, Oozie, Storm, Kafka, Cassandra, ZooKeeper, and the Hive data warehousing tool.
  • Created workflows using AWS Lambda, S3, EMR, and Kinesis (see the sketch after this list).
  • Used Attunity and Airflow to ingest data from MySQL, Oracle, and microservices into an S3 data lake.
  • Excellent knowledge of Hadoop ecosystem components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
  • More than 7 years of cloud experience (AWS) with a working and conceptual understanding of AWS managed services such as S3, EMR, EC2, and Lambda.
  • Extensively used RDBMS like Oracle and SQL Server for developing different applications.
  • Good experience using TDCH to import and export data efficiently between Teradata and the Hadoop ecosystem.
  • Strong experience with Hadoop distributions such as Cloudera and Hortonworks.
  • Experience working with DStreams in Spark Streaming, accumulators, broadcast variables, the various levels of caching, and optimization techniques in Spark.
  • Worked on real-time data integration using Kafka, Spark Streaming, and Cassandra.
  • Exposure to setting up data import and export tools such as Sqoop (Teradata to HDFS) and Flume for real-time streaming.
  • Extensive exposure to ETL tools such as Talend Open Studio.
  • Worked with the Teradata RDBMS and utilities such as FastLoad, MultiLoad, TPump, and FastExport.
  • Good experience in classic Hadoop administration and development and the YARN architecture, along with the various Hadoop daemons such as JobTracker, TaskTracker, NameNode, DataNode, ResourceManager, NodeManager, and ApplicationMaster.
  • Experience in writing Impala and Hive queries for processing and analyzing large volumes of data.
  • Strong experience with Pig, Hive, and Impala analytical functions, and in extending Hive, Impala, and Pig core functionality by writing custom UDFs.
  • Experience with Apache Spark, Spark Streaming, and DataFrames using Python.
  • Experience managing Hadoop clusters using Cloudera Manager.
  • Experience building Spark SQL and Spark Streaming applications.
  • Expertise in writing Python and Unix shell scripts.
  • Worked with onshore, offshore, and international client, business, product, and technical teams.
  • Great team player and team builder; highly motivated, willing to lead, a fast learner, able to pick up new technologies quickly, and able to manage workload seamlessly to meet deadlines.
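
For illustration, a minimal sketch of the Lambda/S3/EMR workflow pattern mentioned above, assuming an S3-triggered Lambda function that submits a Spark step to a running EMR cluster via boto3; the cluster ID, bucket, script path, and handler name are placeholders, not details from the original projects.

```python
# Hypothetical S3-triggered Lambda: submit a Spark step to an EMR cluster.
import os
import boto3

emr = boto3.client("emr")
CLUSTER_ID = os.environ.get("EMR_CLUSTER_ID", "j-XXXXXXXXXXXXX")  # placeholder

def handler(event, context):
    # Each record describes an object that landed in the S3 bucket.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        emr.add_job_flow_steps(
            JobFlowId=CLUSTER_ID,
            Steps=[{
                "Name": f"process {key}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "s3://example-bucket/scripts/transform.py",  # placeholder script
                        f"s3://{bucket}/{key}",
                    ],
                },
            }],
        )
    return {"status": "submitted"}
```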

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, Oozie, Hive, Pig, Sqoop, Flume, ZooKeeper, HBase, Spark, Spark SQL, Impala, MapR-DB, Oracle Big Data Discovery, Kafka, NiFi

Hadoop Distributions: MapR, Cloudera, AWS EMR, Hortonworks.

Servers: Application Servers (WAS, Tomcat), Web Servers (IIS 6/7, IHS).

Operating Systems: Windows 2003 Enterprise Server, XP, 2000, UNIX, Red Hat Enterprise Linux Server release 6.7

Databases: SQL Server 2005/2008, Oracle 9i/10g, DB2, MS Access 2003, Teradata.

Languages: C, C++, Java, XML, JSP/Servlets, Struts, Spring, HTML, Python, PHP, JavaScript, jQuery, web services.

Data Modeling: Star-Schema and Snowflake-schema.

ETL Tools: Working knowledge of Informatica, IBM DataStage 8.1, and SSIS.

PROFESSIONAL EXPERIENCE

Confidential

Sr. Big Data Architect/Lead Engineer

Responsibilities:

  • Worked on large data files in Parquet format using PySpark.
  • Developed and built scripts in Python and shell.
  • Launched AWS EMR clusters and EC2 instances, with an understanding of AWS IAM roles and permissions.
  • Selected the right node counts and instance types, within the account limits, to improve ETL performance.
  • Stored data in S3 buckets and performed reads, transformations, and actions on the S3 data using Spark DataFrames and the Spark SQL context in PySpark (see the sketch after this list).
  • Used Spark's in-memory processing to handle billions of records.
  • Extracted data from HDFS using Hive and Presto, and performed data analysis using Spark with Python.
  • Created Hive external tables, with a good understanding of partitioning and bucketing techniques, and performed joins on Hive tables.
  • Involved in designing Hive schemas and applying performance-tuning techniques such as partitioning and bucketing.
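
For illustration, a minimal PySpark sketch of the S3 Parquet read/transform/write pattern described above; the bucket, paths, column names, and aggregation are placeholders, not details from the original project.

```python
# Hypothetical PySpark job: read Parquet from S3, transform, query, write back.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-parquet-etl").getOrCreate()

# Read Parquet files directly from S3 (EMR provides the s3:// connector).
orders = spark.read.parquet("s3://example-bucket/raw/orders/")

# DataFrame transformations: filter, derive a column, aggregate.
daily_totals = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"))
)

# The same logic expressed through Spark SQL on a temporary view.
orders.createOrReplaceTempView("orders")
daily_totals_sql = spark.sql("""
    SELECT to_date(order_ts) AS order_date, SUM(amount) AS total_amount
    FROM orders
    WHERE status = 'COMPLETED'
    GROUP BY to_date(order_ts)
""")

# Write the result back to S3 as Parquet, partitioned by date.
daily_totals.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-bucket/curated/daily_totals/"
)
```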

Environment: AWS, Airflow, Jenkins, SQL, Python, Apache Spark, Impala, Hive.

Confidential

Sr. Data Engineer

Responsibilities:

  • Implemented solutions for ingesting data from various sources and processing the data at rest using Big Data technologies such as Hadoop, the MapReduce framework, HBase, and Hive.
  • Developed Spark RDD transformations and actions, DataFrames, case classes, and Datasets for the required input data, and performed the data transformations using Spark Core.
  • Responsible for building scalable distributed data solutions using Hadoop; involved in job management using the Fair Scheduler and developed job processing scripts using Oozie workflows.
  • Performed data analysis, feature selection, and feature extraction using Apache Spark machine learning and streaming libraries in Python.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive, and developed shell, Perl, and Python scripts to automate and provide control flow to Pig scripts.
  • Involved in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and memory tuning.
  • Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant users fine-grained access to AWS resources.
  • Evaluated deep learning algorithms for text summarization using Python, TensorFlow, and Theano on a Cloudera Hadoop system.
  • Developed data pipeline programs with the Spark Python APIs, data aggregations with Hive, and JSON formatting of data for visualization, generating Highcharts views such as outlier, data distribution, and correlation/comparison charts.
  • Involved in handling large datasets using partitions, Spark's in-memory capabilities, broadcasts, and effective and efficient joins and transformations during the ingestion process itself.
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
  • Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment, with both traditional and non-traditional source systems as well as RDBMS and NoSQL data stores, for data access and analysis.
  • Worked on a POC comparing the processing time of Impala against Apache Hive for batch applications, leading to adoption of Impala in the project.
  • Responsible for developing a data pipeline on AWS to extract data from web logs and store it in HDFS, and worked extensively with Sqoop to import metadata from Oracle.
  • Involved in creating Hive tables, loading and analysing data using Hive queries, and developing Hive queries to process the data and generate data cubes for visualization.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra (see the sketch after this list).
  • Implemented schema extraction for Parquet and Avro file formats in Hive, and worked with Talend Open Studio to design ETL jobs for data processing.
  • Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
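
For illustration, a minimal sketch of the Kafka-to-Cassandra streaming pattern described above, written here with Structured Streaming rather than the DStream API; the topic, schema, keyspace, table, and checkpoint path are placeholders, and it assumes the spark-sql-kafka and spark-cassandra-connector packages are on the classpath.

```python
# Hypothetical streaming job: Kafka -> Spark Structured Streaming -> Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()

# Assumed JSON payload of a learner event (placeholder schema).
event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("score", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Read events from Kafka and parse the JSON payload.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "learner-events")                # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

def write_to_cassandra(batch_df, batch_id):
    # Persist each micro-batch into Cassandra via the DataStax connector.
    (batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(keyspace="analytics", table="learner_events")  # placeholders
        .save())

query = (
    events.writeStream
    .foreachBatch(write_to_cassandra)
    .option("checkpointLocation", "/tmp/kafka-to-cassandra-chk")
    .start()
)
query.awaitTermination()
```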

Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Python, Kafka, Hive, Sqoop, AWS, Elastic Search, Impala, Cassandra, Tableau, Talend, Oozie, Jenkins, ETL, Data Warehousing, SQL, Nifi, Cloudera, Oracle 12c, Linux.

Confidential

BigData Lead Engineer

Responsibilities:

  • Developing and maintaining a Data Lake containing regulatory data for federal reporting using big data technologies such as the Hadoop Distributed File System (HDFS), Apache Impala, Apache Hive, and the Cloudera distribution.
  • Developing ETL jobs to extract data from different sources such as Oracle and Microsoft SQL Server, transform the extracted data using Hive Query Language (HQL), and load it into the Hadoop Distributed File System (HDFS).
  • Involved in importing data from different sources into HDFS using Sqoop, applying transformations using Hive and Spark, and then loading the data into Hive tables.
  • Fixing data related issues within the Data Lake.
  • Primarily involved in the data migration process on AWS, integrating with a GitHub repository and Jenkins.
  • Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment, with both traditional and non-traditional source systems as well as RDBMS and NoSQL data stores, for data access and analysis.
  • Primarily responsible for designing, implementing, testing, and maintaining the database solution on AWS.
  • Worked with Spark Streaming and divided data into different branches for batch processing through the Spark engine.
  • Implementing new functionality in the Data Lake using big data technologies such as Hadoop Distributed File System (HDFS), Apache Impala and Apache Hive based on the requirements provided by the client.
  • Communicating regularly with the business teams along with the project manager to ensure that any gaps between the client’s requirements and project’s technical requirements are resolved.
  • Developing Python scripts that use Hadoop Distributed File System APIs to generate curl commands for migrating data and preparing different environments within the project (see the sketch after this list).
  • Coordinating production releases with the change management team using the Remedy tool.
  • Communicating effectively with team members and conducting code reviews.
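
For illustration, a minimal sketch of the kind of Python script described above, assuming the WebHDFS REST API is the HDFS API in question; hostnames, ports, paths, and the copy strategy are placeholders, not the original script.

```python
# Hypothetical script: list files under an HDFS path via WebHDFS and emit
# curl commands that copy each file to a target cluster.
import requests

SOURCE_NAMENODE = "http://source-namenode:50070"   # placeholder
TARGET_NAMENODE = "http://target-namenode:50070"   # placeholder
SOURCE_DIR = "/data/regulatory/reports"            # placeholder
TARGET_DIR = "/data/regulatory/reports"            # placeholder
USER = "hdfs"

def list_hdfs_files(base_url, path, user):
    """Return file names under an HDFS directory using WebHDFS LISTSTATUS."""
    resp = requests.get(
        f"{base_url}/webhdfs/v1{path}",
        params={"op": "LISTSTATUS", "user.name": user},
        timeout=30,
    )
    resp.raise_for_status()
    statuses = resp.json()["FileStatuses"]["FileStatus"]
    return [s["pathSuffix"] for s in statuses if s["type"] == "FILE"]

for name in list_hdfs_files(SOURCE_NAMENODE, SOURCE_DIR, USER):
    # OPEN streams the file from the source cluster; CREATE writes it to the
    # target cluster (-L follows WebHDFS redirects to the datanodes).
    read_cmd = (f"curl -L '{SOURCE_NAMENODE}/webhdfs/v1{SOURCE_DIR}/{name}"
                f"?op=OPEN&user.name={USER}' -o {name}")
    write_cmd = (f"curl -L -X PUT -T {name} '{TARGET_NAMENODE}/webhdfs/v1"
                 f"{TARGET_DIR}/{name}?op=CREATE&user.name={USER}&overwrite=true'")
    print(f"{read_cmd} && {write_cmd}")
```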

Environment: Hadoop, Data Lake, AWS, Python, Spark, Hive, Cassandra, ETL Informatica, Cloudera, Oracle 10g, Microsoft SQL Server, Control-M, Linux.

Confidential

Big Data Developer

Responsibilities:

  • Developed Spark applications in Scala utilizing DataFrames and the Spark SQL API for faster data processing.
  • Developed highly optimized Spark applications to perform various data cleansing, validation, transformation, and summarization activities according to the requirements.
  • Built a data pipeline consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyse operational data.
  • Developed Spark jobs and Hive Jobs to summarize and transform data.
  • Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for huge volumes of data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
  • Analysed the SQL scripts and designed the solution for implementation in Scala.
  • Built real-time data pipelines by developing Kafka producers and Spark Streaming consumer applications (see the sketch after this list).
  • Ingested syslog messages, parsed them, and streamed the data to Kafka.
  • Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the transformed data back into HDFS.
  • Exported the analysed data to relational databases using Sqoop for further visualization and report generation by the BI team.
  • Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
  • Analysed the data by performing Hive queries (HiveQL) to study customer behaviour.
  • Used Hive to analyse the partitioned and bucketed data and compute various metrics for reporting.
  • Developed Hive scripts in HiveQL to de-normalize and aggregate the data.
  • Scheduled and executed workflows in Oozie to run various jobs.
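
For illustration, a minimal sketch of the syslog-to-Kafka producer pattern described above. The project code was written in Scala/Java; this Python version (using the kafka-python package) only illustrates the flow, and the broker, port, topic, and parsing regex are placeholders.

```python
# Hypothetical producer: receive UDP syslog messages, parse them, publish to Kafka.
import json
import re
import socketserver

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker1:9092",                       # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Very small RFC 3164-style pattern: "<PRI>MMM dd HH:MM:SS host program: message"
SYSLOG_RE = re.compile(r"<(\d+)>(\w{3}\s+\d+\s[\d:]+)\s(\S+)\s(\S+?):\s(.*)")

class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        raw = self.request[0].decode("utf-8", errors="replace").strip()
        match = SYSLOG_RE.match(raw)
        if not match:
            return  # drop lines that do not look like syslog
        pri, timestamp, host, program, message = match.groups()
        producer.send("syslog-events", {                    # placeholder topic
            "priority": int(pri),
            "timestamp": timestamp,
            "host": host,
            "program": program,
            "message": message,
        })

if __name__ == "__main__":
    # Listen for UDP syslog datagrams on port 5140 and forward them to Kafka.
    with socketserver.UDPServer(("0.0.0.0", 5140), SyslogHandler) as server:
        server.serve_forever()
```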

Environment: Hadoop, HDFS, HBase, Spark, Scala, Hive, MapReduce, Sqoop, ETL, Java.

Confidential

System Engineer

Responsibilities:

  • Provide 1st level support by logging, investigating, monitoring, resolving and closure of reported incidents and requests, meeting all SLA’s and achieving customer satisfaction.
  • Providing assistance with installing and uninstalling standard applications via Software Center, which are deployed via SCCM.
  • Liaising with other support groups to resolve service requests or incidents.
  • Providing knowledge updates to the team and documentation updates to improve the support process.
  • Actively involved in using Active Directory and Exchange for user account provisioning, email access, email distribution lists, shared mailboxes and contacts, and shared folder access, and troubleshooting issues caused by access.
  • Handling calls from Toll’s external customers.
  • Participating in 24x7 shift rotations and supporting Toll's global employees worldwide.
  • Involved in training and mentoring new Service Desk Operators joining the team.
  • Working closely with the Operation Bridge and Incident Management teams and handling escalations from the business as required.
  • Familiarity with and adherence to core ITIL processes employed at Toll while delivering IT services to Toll businesses.
  • Update and maintain corporate database records in accordance with Toll IT processes and policies, ensuring the quality of the information entered.
  • Delivered troubleshooting solutions on Windows XP and Windows 7 and remote connections to terminal servers, assisting global users.
  • Supported network printer issues.
  • Installed and resolved issues with MS Office, Office 365, SharePoint, Lync, and Skype.
